OPTIMAL LEARNING OF DEPLOYMENT AND SEARCH STRATEGIES FOR ROBOTIC TEAMS

By Lai Wei

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering – Doctor of Philosophy

2021

ABSTRACT

OPTIMAL LEARNING OF DEPLOYMENT AND SEARCH STRATEGIES FOR ROBOTIC TEAMS

By Lai Wei

In the problem of optimal learning, the dilemma of exploration and exploitation stems from the fact that gathering information and exploiting it are, in many cases, two mutually exclusive activities. The key to optimal learning is to strike a balance between exploration and exploitation. The Multi-Armed Bandit (MAB) problem is a prototypical example of such an explore-exploit tradeoff, in which a decision-maker sequentially allocates a single resource by repeatedly choosing one among a set of options that provide stochastic rewards. The MAB setup has been applied in many robotics problems such as foraging, surveillance, and target search, wherein the task of robots can be modeled as collecting stochastic rewards. The theoretical work of this dissertation is based on the MAB setup, and three problem variations, namely heavy-tailed bandits, nonstationary bandits, and multi-player bandits, are studied. The first two variations capture two key features of stochastic feedback in complex and uncertain environments: heavy-tailed distributions and nonstationarity; while the last one addresses the problem of achieving coordination in uncertain environments. We design several algorithms that are robust to heavy-tailed distributions and nonstationary environments. In addition, two distributed policies that require no communication among agents are designed for the multi-player stochastic bandits in a piece-wise stationary environment.

The MAB problems provide a natural framework to study robotic search problems. The above variations of the MAB problems directly map to robotic search tasks in which a robot team searches for a target from a fixed set of view-points (arms). We further focus on the class of search problems involving the search for an unknown number of targets in a large or continuous space. We view the multi-target search problem as a hot-spots identification problem in which, instead of the global maximum of the field, all locations with a value greater than a threshold need to be identified. We consider a robot moving in 3D space with a downward-facing camera sensor. We model the robot's sensing output using a multi-fidelity Gaussian Process (GP) that systematically describes the sensing information available at different altitudes from the floor. Based on the sensing model, we design a novel algorithm that (i) addresses the coverage-accuracy tradeoff: sampling at a location farther from the floor provides a wider field of view but less accurate measurements, (ii) computes an occupancy map of the floor within a prescribed accuracy and quickly eliminates unoccupied regions from the search space, and (iii) travels efficiently to collect the required samples for target detection. We rigorously analyze the algorithm and establish formal guarantees on the target detection accuracy and the detection time.

An approach to extend the single robot search policy to multiple robots is to partition the environment into multiple regions such that the workload is equitably distributed among all regions and then assign a robot to each region.
The coverage control focuses on such equitable partitioning, and the workload is equivalent to the so-called service demands in the coverage control literature. In particular, we study the adaptive coverage control problem, in which the demands of robotic service within the environment are modeled as a GP. To optimize the coverage of service demands in the environment, the team of robots aims to partition the environment and achieve a configuration that minimizes the coverage cost, which is a measure of the average distance of a service demand from the nearest robot. The robots need to address the explore-exploit tradeoff: to minimize coverage cost, they need to gather information about demands within the environment, whereas information gathering deviates them from maintaining a good coverage configuration. We propose an algorithm that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret, we analyze the algorithm and characterize its coverage performance over a finite time horizon.

Copyright by LAI WEI 2021

ACKNOWLEDGEMENTS

Over the course of my PhD, I have received encouragement and support from many people. First and foremost, I would like to express my deep and most sincere gratitude to my advisor, Prof Vaibhav Srivastava. He guided me into the fascinating world of control and learning research and led my pathway to this dissertation. I really enjoyed my research experience with him. He inspired me with his vision and expertise in every stage of my PhD study. I'll always remember his great mentorship, constant encouragement, and impeccable support.

I thank all my committee members, Prof Xiaobo Tan, Prof Ranjan Mukherjee, and Prof Shaunak Bopardikar, for their great support and advice during my PhD study. Also, I thank them for teaching me adaptive control, nonlinear control, and game theory. It has been a really nice experience to attend our robotics reading group and the National Robotics Initiative project meetings together. I thank all my teachers at Michigan State University for their detailed explanations and quick responses to my questions. I particularly thank Prof Hassan Khalil for teaching me my first PhD course, linear systems and control, and for igniting my interest in control theory. I'm also grateful to him for advising me before Dr. Srivastava joined MSU. I thank every member of Prof Srivastava's, Prof Bopardikar's, and Prof Tan's labs for their friendship, for their research presentations, and for all our brainstorming sessions. Thank you Andrew McDonald for working together with me on coverage control. Thank you Pearce Reickert and Eric Gaskell for helping me design the robotic underwater target search experiment.

My stay at MSU has truly been a fun experience. I'm blessed to have made so many friends here. I thank Prof Ning Xi for taking me to MSU. Thank you Liangliang Chen for all the help when I first came to the U.S. and all the get-togethers at your home. Thank you Hongyang Shi for helping me move. I'm so fortunate to have met my girlfriend Kemeng Wang at MSU. Thank you so much for your constant love and encouragement. Thank you for being tolerant and supportive. Last, but not least, I'd like to thank my mother and father for their unconditional love and endless support enabling me to pursue my dream.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1  INTRODUCTION
  1.1 Background and Literature Synopsis
  1.2 Contribution and Organization
CHAPTER 2  STATIONARY STOCHASTIC BANDITS
  2.1 Lower Bounds on Regrets
  2.2 Upper Confidence Bound Strategies
  2.3 Heavy-tailed Stochastic MAB
    2.3.1 A Robust Minimax Policy: Robust MOSS
    2.3.2 Analysis of Robust MOSS
    2.3.3 Numerical Illustration of Robust MOSS
  2.4 Summary
  2.5 Bibliographic Remarks
CHAPTER 3  PIECE-WISE STATIONARY STOCHASTIC BANDITS
  3.1 Preliminaries
  3.2 The LM-DSEE Algorithm
  3.3 Analysis of the LM-DSEE Algorithm
  3.4 The SW-UCB# Algorithm
  3.5 Analysis of the SW-UCB# Algorithm
  3.6 Numerical Illustration
  3.7 Summary
  3.8 Bibliographic Remarks
CHAPTER 4  MULTI-PLAYER PIECEWISE STATIONARY STOCHASTIC BANDITS
  4.1 The RR-SW-UCB# Algorithm
  4.2 Analysis of the RR-SW-UCB# Algorithm
  4.3 The SW-DLP Algorithm
  4.4 Analysis of the SW-DLP Algorithm
  4.5 Numerical Illustration
  4.6 Summary
  4.7 Bibliographic Remarks
CHAPTER 5  GENERAL NONSTATIONARY BANDITS WITH VARIATION BUDGET
  5.1 Lower Bound on Minimax Regret in Nonstationary Environment
  5.2 UCB Algorithms for Sub-Gaussian Nonstationary Stochastic Bandits
    5.2.1 Resetting MOSS Algorithm
    5.2.2 Sliding-Window MOSS Algorithm
    5.2.3 Discounted UCB Algorithm
  5.3 UCB Policies for Heavy-tailed Nonstationary Stochastic MAB Problems
    5.3.1 Resetting Robust MOSS for the Non-stationary Heavy-tailed MAB Problem
    5.3.2 SW-RMOSS for the Non-stationary Heavy-tailed MAB Problem
  5.4 Numerical Experiments
    5.4.1 Bernoulli Nonstationary Stochastic MAB Experiment
    5.4.2 Heavy-tailed Nonstationary Stochastic MAB Experiment
  5.5 Summary
  5.6 Bibliographic Remarks
CHAPTER 6  MULTI-TARGET SEARCH VIA MULTI-FIDELITY GAUSSIAN PROCESSES
  6.1 Multi-target Search Problem Description
    6.1.1 Multi-fidelity Sensing Model
    6.1.2 Objective of the Multi-target Search Algorithm
  6.2 Expedited Multi-target Search Algorithm
    6.2.1 Inference Algorithm for Multi-fidelity GPs
    6.2.2 Multi-fidelity Sampling & Path Planning
    6.2.3 Classification and Region Elimination
  6.3 An Illustrative Example
  6.4 Analysis of the EMTS Algorithm
    6.4.1 Analysis of the Classification Algorithm
    6.4.2 Analysis of the Sampling and Fidelity Planner
    6.4.3 Analysis of Expected Detection Time
  6.5 Summary
  6.6 Bibliographic Remarks
CHAPTER 7  ONLINE ESTIMATION AND COVERAGE CONTROL
  7.1 Online Estimation and Coverage Problem
    7.1.1 Graph Representation of Environment
    7.1.2 Nonparametric Estimation
    7.1.3 Voronoi Partition and Coverage Problem
    7.1.4 Coverage Performance Evaluation
  7.2 Deterministic Sequencing of Learning and Coverage Algorithm
    7.2.1 Estimation Phase
    7.2.2 Information Propagation Phase
    7.2.3 Coverage Phase
  7.3 Analysis of the DSLC Algorithm
    7.3.1 Mutual Information and Uncertainty Reduction
    7.3.2 Convergence within Coverage Phase
    7.3.3 An Upper Bound on Expected Coverage Regret
  7.4 Simulation Results
  7.5 Summary
  7.6 Bibliographic Remarks
CHAPTER 8  CONCLUSIONS AND FUTURE DIRECTIONS
BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1: Comparison of Robust MOSS with MOSS and other Robust UCB algorithms.
Figure 3.1: Comparison of LM-DSEE and SW-UCB#.
Figure 4.1: Simulation of RR-SW-UCB# and SW-DLP in a piecewise stationary environment.
Figure 5.1: Comparison of different policies.
Figure 5.2: Performances with heavy-tailed rewards.
Figure 6.1: Architecture of EMTS.
Figure 6.2: Underwater victim search simulation setups.
Figure 6.3: Simulation result of EMTS.
Figure 6.4: Uncertainty reduction results.
Figure 7.1: Distributed implementation of DSLC.
Figure 7.2: Comparison of DSLC, Todescato and Cortes.

LIST OF ALGORITHMS

1 The LM-DSEE Algorithm
2 The SW-UCB# Algorithm
3 The RR-SW-UCB# Algorithm
4 The SW-DLP Algorithm
5 The R-MOSS Algorithm
6 The SW-MOSS Algorithm
7 The D-UCB Algorithm
8 Deterministic Sequencing of Learning and Coverage (DSLC)

CHAPTER 1  INTRODUCTION

Decision-making in the face of uncertainty is one of the most fundamental problems in robotics research, which requires the system to actively learn the environment while completing assigned tasks. The exploration versus exploitation tradeoff is a key challenge in such problems. Here, exploration means learning the environment to reduce the uncertainty, while exploitation means taking the best actions according to the existing information. In applications such as robotic deployment and search, where efficiency is of major concern, it is imperative to strike a good balance between exploration and exploitation. This encourages an investigation into algorithms that enable the robotic system to address this tradeoff in uncertain environments and achieve good finite-time performance.

The problems of interest in this dissertation include both theoretical investigations of the exploration versus exploitation tradeoff and its applications to robotic systems. The theoretical work is based on the Multi-Armed Bandit (MAB) setup [1], which is a classic mathematical formulation featuring the exploration versus exploitation tradeoff. With no prior information, a decision-maker sequentially chooses one among a set of stochastic arms (options) to achieve the maximum cumulative reward. An efficient adaptive arm selection rule keeps a good balance between learning the expected mean rewards and picking the empirically most profitable option. The high-level ideas embodied in the MAB problem extend naturally to many applications dealing with resource allocation in the face of uncertainty. Two prototypical examples in robotics research are target search and multi-robot deployment, and they are studied in this work.
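To make the sequential interaction just described concrete, the following minimal Python sketch (added here for illustration; it is not part of the original text) simulates a decision-maker repeatedly choosing among stochastic arms. The Gaussian reward model and the epsilon-greedy selection rule are arbitrary placeholder choices, not the algorithms developed in this dissertation.

```python
# Minimal sketch of the MAB interaction loop: pick an arm, observe a reward,
# update the estimate, and keep track of the regret incurred.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])       # unknown mean rewards (hypothetical)
K, T, eps = len(mu), 10_000, 0.05

counts = np.zeros(K)
means = np.zeros(K)                  # empirical mean reward of each arm
regret = 0.0

for t in range(T):
    if rng.random() < eps:           # explore: try a random arm
        arm = int(rng.integers(K))
    else:                            # exploit: pick the best-looking arm
        arm = int(np.argmax(means))
    reward = mu[arm] + rng.normal(0.0, 0.1)
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
    regret += mu.max() - mu[arm]     # expected loss of this choice

print(f"cumulative regret after T={T}: {regret:.1f}")
```

The balance between the two branches of the if-statement is exactly the explore-exploit tradeoff that the policies studied in later chapters handle in a more principled way.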
In a target search problem, to quickly and accurately locate the targets of interest, the autonomous vehicle needs to effectively learn the likelihood of target positions through sensing and spend more effort at locations that are more likely to contain the target. The multi-robot deployment problem deals with the optimal allocation of a team of robots such that the robot team configuration matches the demand for services within the environment. For example, more firefighting robots should be assigned to areas with a larger probability of wildfire breakouts to ensure a shorter response time.

In this dissertation, we study three variations of the MAB problem, namely heavy-tailed bandits, nonstationary bandits, and multi-player bandits. Though the classical stochastic MAB problem has a concise formulation and well-established theoretical guarantees, it fails to capture many key properties of stochastic processes in real-world problems. The heavy-tailed MAB relaxes the assumption that rewards from each arm are bounded or sub-Gaussian. This extension is motivated by applications such as social networks [2] and financial markets [3], wherein certain variables of interest exhibit heavy-tailed distributions. In the nonstationary MAB problem, in addition to being unknown, the reward distributions are assumed to be time-varying. This formulation characterizes the drift of physical processes in unknown dynamic working environments, e.g., the frequency of certain biological activities changes with the light conditions over the course of a day. The multi-player MAB problem involves maximizing the total group reward such that no two decision-makers select the same arm. This requires achieving a coordinated behavior of multiple decision-makers facing uncertainty. An important aspect of the multi-player MAB problems is the collision model, under which the reward is either shared or not received at all if multiple agents select the same option. This formulation arises naturally in cognitive radio [4–6], wherein a transceiver can intelligently detect and use vacant communication channels, and provides a rich modeling framework for other interesting domains such as animal and robotic foraging [7, 8], autonomous surveillance [9], and acoustic relay positioning for underwater communication [10].

A canonical example of the exploration-exploitation tradeoff in robotics applications is the target search problem. In many scenarios, the search task is to find the locations that emit large signals [11]. For example, the human body is relatively hot compared with its surroundings and emits more infrared radiation, which can be used to sense the existence of victims in a search and rescue scenario. The MAB problem provides a natural framework to study robotic search problems. In particular, the class of robotic search problems in which a robot team searches for a target from a set of view-points (arms), or monitors an environment from a set of viewpoints, directly maps to the above variations of the MAB problems. We next focus on the class of search problems involving the search for an unknown number of targets in a large or continuous space that is not constrained to be observed from a small set of viewpoints. We consider a robot moving in a 3D continuous environment to search for targets located on the 2D floor. For a given location of the robot, the sensors on the robot provide a score indicating the likelihood of the presence of a victim within the sensing footprint of the robot.
We refer to these scores at all locations in the environment at a given altitude as the sensing field. The group of sensing fields at different altitudes from the floor is modeled as a multi-fidelity Gaussian process (GP) [12], which is an auto-regressive model that captures the following fact: sensing at locations farther from the floor provides a wider field of view but less accurate measurements. The objective is to design a search policy that schedules the altitude and locations of the points at which the robot should collect samples such that the search time is minimized while ensuring a desired search accuracy.

An approach to extend the single robot search policy to $N$ robots is to partition the environment into $N$ regions such that some measure of "search load" is equitably distributed across all regions. Coverage control focuses on such equitable partitioning, and the search load is typically referred to as the service demand in the coverage control literature. In particular, coverage control [13] concerns the deployment of a multi-agent team in order to meet "servicing" demands in an environment. The team of agents aims to minimize the coverage cost, which is a function of the demand distribution, agent locations in the environment, and the allocation of the set of points within the environment to robots. In a standard coverage problem [13], the demand function is assumed to be known, while in this dissertation, we assume it to be unknown and model it as a realization of a GP. Similar to the MAB problem, the coverage problem with unknown demands exhibits the exploration versus exploitation tradeoff: gathering samples to learn the demands requires deployments that are inefficient with respect to the true demands, whereas exploiting the current demand estimate may keep the team in a poor coverage configuration.
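One simple instance of the coverage cost described above is the demand-weighted average distance from each point of a discretized environment to its nearest robot. The following minimal Python sketch (added for illustration; it is not part of the original text) evaluates such a cost; the demand field, grid, and robot positions are arbitrary stand-ins.

```python
# Minimal sketch of a discretized coverage cost: each grid point is assigned
# to its nearest robot, and the cost is the demand-weighted average distance
# to that robot. The demand field below is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
n = 50                                           # grid resolution
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
points = np.stack([xs.ravel(), ys.ravel()], axis=1)
demand = np.exp(-10 * ((points[:, 0] - 0.7) ** 2 + (points[:, 1] - 0.3) ** 2))
demand /= demand.sum()                           # normalize to a distribution

def coverage_cost(robots: np.ndarray) -> float:
    """Demand-weighted average distance from each point to its nearest robot."""
    dists = np.linalg.norm(points[:, None, :] - robots[None, :, :], axis=2)
    return float(np.sum(demand * dists.min(axis=1)))

robots = rng.random((4, 2))                      # 4 robot positions in [0,1]^2
print(f"coverage cost: {coverage_cost(robots):.4f}")
```

With an unknown demand function, a policy can only evaluate such a cost against its current estimate of the demand, which is the source of the explore-exploit tension discussed above.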
Their work allows the MAB model to be used in applications such as social networks [2] and financial markets [3] wherein certain variables of interest are inherently heavy-tailed. The nonstationary MAB problem captures the dynamic aspect of the environment and has received some interest. In [19], the authors studied the bandit problem in which an adversary, rather than well-behaved stochastic arms, controls the payoffs. The performance of a policy is evaluated using the weak regret, which is the difference in the cumulated reward of a policy compared against the best single action policy. While being able to capture nonstationarity, the generality of the reward model in adversarial MAB makes the investigation of globally optimal policies very challenging. The nonstationary stochastic MAB can be viewed as a compromise between stationary stochastic MAB and adversarial MAB. It maintains the stochastic nature of the 4 reward sequence while allowing some degree of nonstationarity in reward distributions. Instead of the weak regret analyzed in adversarial MAB, a strong notion of regret defined with respect to the best arm at each time step is studied in these problems. A broadly studied nonstationary problem is piecewise stationary MAB [20], wherein the reward distributions are piecewise stationary. A more general nonstationary problem is studied in [21], wherein the cumulative maximum variation in mean rewards is subject to a variation budget. In some decision-making problems, multiple agents could get involved and the choice made by one agent could influence the selections of other agents. Most multi-player MAB studies deal with a stationary environment and the task is to maximize the total rewards collected by all the agents. As in the single-player case, the performance of the entire group of agents can be characterized using group regret, which is defined as the loss in expected total rewards caused by the agents failing to select the best set of arms every time. In [22], a lower bound on the group regret for a centralized policy is derived and algorithms that asymptotically achieve this lower bound are designed. Distributed multi-player MAB problem with no communication among players has been studied in [4, 5, 23–25]. In [26–28], distributed cooperative agents communicate through a communication network to improve their estimates of mean rewards and arm selections. The MAB problem has been applied in many scientific and technological areas. For example, it is used for opportunistic spectrum access in communication networks, wherein multiple secondary users actively detect and use vacant channels [5, 29]. The arm models the availability of a channel and the reward from an arm is eliminated if it is selected by multiple users. In the MAB formulation of online learning for demand response [30, 31], an aggregator calls upon a subset of users (arms) who have an unknown response to the request to reduce their loads. Besides, contextual bandits are widely used in recommender systems [32, 33], wherein the acceptance of a recommendation corresponds to the rewards from an arm. The MAB setup has also been used in robotic research, which is one of the major topics in this dissertation. In many robotic applications such as foraging, surveillance [7–9, 34] and acoustic relay po- sitioning for underwater communication [10], the task of robots can be modeled to be collecting 5 stochastic rewards [35, 36]. 
These rewards may correspond to, for example, the likelihood of an anomaly at a spatial location, the concentration of a certain type of algae in the ocean, the communication quality of a specific location, etc. Algorithms for the MAB problems have been extended to these problems. It needs to be noted that in robotic applications, motion is normally energy-consuming and time-demanding, while switching between arms comes at no cost in MAB models. By introducing block-allocation strategies, the exploration versus exploitation tradeoff can be balanced with a sufficiently small number of arm switches [37–39]. In [9], the robotic surveillance problem is studied in an environment that is abruptly changing due to the arrival of unknown spatial events. To solve the problem, a block-allocation strategy [20] is adapted to the piece-wise stationary MAB setting. Other algorithms that benefit path planning in robotic applications include the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithms [25, 40, 41], due to their deterministic and predictable structures.

Unlike the MAB problems, in which rewards from different arms are commonly assumed to be independent, the feedback in robotic applications, such as sensing information from different locations, is usually correlated. GPs are powerful tools to capture spatially correlated information, and they are widely used to characterize spatiotemporal sensing fields [42, 43]. Sung et al. [11] study the hot-spot identification problem in an environment within the framework of GP MAB [38, 44]. GPs have also been used in robotic inspection and search. Hollinger et al. [45] study an inspection problem in which the robot needs to classify the underwater surface. They use a combination of GP-implicit surface modeling and sequential hypothesis testing to classify surfaces.

In autonomous robotic target search, the vehicle is required to quickly and accurately locate the targets of interest in an unknown and uncertain environment. Examples include victim search and rescue, mineral exploration, and tracking natural phenomena. There have been some efforts to address target search within the context of informative path planning [9, 11, 36, 45–58], which deals with path planning for the robots to maximize the utility of data collection. For example, Meera et al. [52] model the target occupancy map as a GP and design a heuristic algorithm for target detection that handles tradeoffs among information gain, field coverage, sensor performance, and collision avoidance. In this dissertation, our multi-target search solution is inspired by successive elimination ideas from MAB research in [59, 60]. The robot sequentially collects new sensing information and removes regions unlikely to contain a target from the search task. The proposed algorithm combines informative path planning with Bayesian confidence interval estimation and enables the robot to efficiently collect information and concentrate measurements in promising areas.

Coverage control [13] is another interesting topic in robotics. It arises naturally in multi-robot systems when a team of agents is assigned to deploy themselves over an environment according to a particular demand function, which specifies the degree to which a robot is needed at each location. The objective is to minimize the coverage cost, which is determined by the demand distribution, robot locations, and task allocations.
Example applications of coverage control range from autonomous wildfire fighting, to smart agriculture, to ecological surveying, to environmental cleaning. The classical coverage control problem [13, 61–63] assumes that the demand function is known, while recent works have focused on the adaptive coverage problem, in which agents are not assumed to have knowledge of the demand function a priori. In [64–69], the demand function defined on the working environment is modeled as a realization of a GP, which can be learned by taking samples at different locations. The exploration versus exploitation dilemma in adaptive coverage control is due to the conflict between collecting samples to refine the demand estimate and maintaining a good configuration to reduce the coverage cost. As with the MAB setup, adaptive coverage also solves a stochastic optimization problem, though its task is more complicated than selecting the most rewarding arm. In the same spirit as the DSEE algorithm for MAB problems, the adaptive coverage policy proposed in this dissertation deterministically shifts its emphasis from exploration to exploitation.

1.2 Contribution and Organization

In this section, the organization of the chapters in this dissertation is outlined and the contributions in each chapter are discussed in detail.

Chapter 2. We first review the stationary stochastic MAB problem and corresponding concepts such as regret, worst-case regret, and lower bounds on regret. We then study the heavy-tailed bandits, in which the sub-Gaussian assumption on reward distributions is relaxed. Instead, the reward distributions admit moments of order $1+\epsilon$ for some $\epsilon > 0$, similarly as in [18]. We modify the MOSS [70] algorithm for the sub-Gaussian reward distribution by using a saturated empirical mean to design a new algorithm called Robust MOSS. By analyzing Robust MOSS, we show that it achieves a worst-case regret matching the lower bound while maintaining a distribution-dependent logarithmic regret. To the best of our knowledge, Robust MOSS is the first algorithm to achieve order-optimal worst-case regret for heavy-tailed bandits. This is the major contribution of this chapter. Numerical illustrations are provided to verify the robustness of the proposed algorithm against heavy-tailed rewards. We close the chapter with comments and bibliographic remarks on the classic MAB problem.

Chapter 3. In this chapter, we study a special nonstationary stochastic MAB problem called piece-wise stationary bandits. We assume the mean rewards from arms switch to unknown values at unknown time instants and the reward distribution remains stationary between consecutive switches. The main contribution of this chapter is the design of two generic algorithms, namely, the Limited Memory DSEE (LM-DSEE) and the Sliding-Window UCB# (SW-UCB#). LM-DSEE inherits the structure of the DSEE algorithm [25, 40] for the stationary bandit and comprises interleaving blocks of exploration and exploitation. In the exploitation epochs, an arm is selected based on the most recent exploration. This avoids a large bias in reward estimation in a nonstationary environment. SW-UCB# is a modification of the SW-UCB algorithm from [20] that relaxes the assumption that the horizon length is known. We rigorously show these algorithms incur sublinear regret, i.e., the time average of the regret asymptotically converges to zero. A comparison of both algorithms is made and discussed based on the simulation results.

Chapter 4.
We study the multi-player stochastic bandits in a piece-wise stationary environment. We consider a collision model in which a player receives a reward at an arm only if it is the only player to select the arm. This problem features achieving coordination in an uncertain and nonstationary environment. The contribution of this chapter is the design of two novel distributed algorithms that require no communication between agents, namely the Round-Robin SW-UCB# (RR-SW-UCB#) and the Sliding-Window Distributed Learning with Prioritization (SW-DLP). For both algorithms, it is shown that the group regret is upper bounded by a sublinear function of time even with the collision model, i.e., both algorithms achieve coordination while efficiently learning a nonstationary environment.

Chapter 5. In this chapter, we study a more general nonstationary stochastic MAB problem proposed in [21], in which the cumulative maximum variation in mean rewards is restricted to a variation budget. There is no restriction on how the reward distributions change; for example, they may change abruptly as in the piece-wise stationary bandits, or they may drift slowly between subsequent abrupt changes. The performance of a policy is measured by comparing its cumulative expected rewards with that of an oracle that selects the arm with the maximum mean reward at each time, and it is characterized using the worst-case regret, which is the regret for the worst choice of reward distribution sequences that satisfies the variation budget. We extend UCB-based policies with three different approaches, namely, periodic resetting, a sliding observation window, and a discount factor, and show that they achieve order-optimal worst-case performance. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees. The major contributions of this work are threefold. First, we extend MOSS [70] to design Resetting MOSS (R-MOSS) and Sliding-Window MOSS (SW-MOSS). Also, we show that Discounted UCB (D-UCB) [20] can be tuned to solve the problem. Second, with rigorous analysis, we show that R-MOSS and SW-MOSS achieve the exact order-optimal worst-case performance and D-UCB is near-optimal. Finally, we relax the bounded or sub-Gaussian assumption on the rewards required by these algorithms and design policies robust to heavy-tailed rewards. We show that the theoretical guarantees on the worst-case regret can be maintained by the robust policies.

Chapter 6. We consider a scenario in which an autonomous vehicle equipped with a downward-facing camera operates in a 3D environment and is tasked with searching for an unknown number of stationary targets on the 2D floor of the environment. The key challenge is to design a search policy that minimizes the search time while ensuring a high target detection accuracy. We model the sensing field using a multi-fidelity GP that systematically describes the sensing information available at different altitudes from the floor.
Based on the sensing model, we design a novel algorithm called Expedited Multi-Target Search (EMTS) that (i) addresses the coverage-accuracy tradeoff: sampling at a location farther from the floor provides a wider field of view but less accurate measurements, (ii) computes an occupancy map of the floor within a prescribed accuracy and quickly eliminates unoccupied regions from the search space, and (iii) travels efficiently to collect the required samples for target detection. We rigorously analyze the algorithm and establish formal guarantees on the target detection accuracy and the expected detection time. We illustrate the algorithm using a simulated multi-target search scenario. The primary contribution of this chapter is the extension of the classical informative path planning approach for the single-fidelity GP to the multi-fidelity GP setting. This novel extension allows for jointly planning the sampling locations and associated fidelity levels, and thus addresses the fidelity-coverage tradeoff for expedited target search. The EMTS algorithm is proposed and illustrated in an underwater victim search scenario using the Unmanned Underwater Vehicle Simulator. The algorithm is analyzed in terms of its accuracy and expected detection time.

Chapter 7. We study the problem of distributed multi-robot coverage over an unknown, nonuniform sensory field, which is a deployment problem with uncertain demands. Modeling the sensory field as a realization of a GP and using Bayesian techniques, we devise a policy that aims to balance the tradeoff between learning the sensory function and covering the environment. We propose an adaptive coverage algorithm called Deterministic Sequencing of Learning and Coverage (DSLC) that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret, which characterizes the overall coverage performance of a multi-robot team over a time horizon $T$, we analyze DSLC to provide an upper bound on the expected cumulative coverage regret. Finally, we illustrate the empirical performance of the algorithm through simulations of the coverage task over an unknown distribution of wildfires. The most important contribution of this chapter is the definition of coverage regret, which enables the finite-time analysis of the online estimation and coverage algorithm. Existing works evaluate the algorithm performance by assuming that the coverage algorithm attains global optimality and comparing the performance with a globally optimal result. Since the coverage problem itself is NP-hard, this assumption is too strong, especially with distributed deployment being considered. The coverage regret is defined with respect to locally optimal solutions, which relaxes the assumption of achieving a globally optimal coverage and characterizes the convergence properties of a policy.

Chapter 8. In this chapter, we conclude the dissertation with a summary of contributions and future directions. For the MAB problems addressed in this work, we discuss their potential applications in robotic patrolling. For the robotic target search and adaptive coverage control, the problem generalizations and possible solutions are illustrated in detail.

CHAPTER 2  STATIONARY STOCHASTIC BANDITS

In a stationary stochastic MAB problem, an agent chooses an arm $\varphi_t$ from the set of $K$ arms $\{1, \dots, K\}$ at each time $t \in \{1, \dots, T\}$ and receives the associated random reward $X_t^{\varphi_t}$.
The reward at each arm $k$ is drawn from an unknown probability distribution $f_k$ with unknown mean $\mu_k$. The problem being stationary means that the reward distribution at each arm does not change with time. Normally, it is assumed that the reward distributions are sub-Gaussian.

Definition 2.1 (Sub-Gaussian reward). For any arm $k \in \{1, \dots, K\}$, the probability distribution $f_k$ is $1/2$ sub-Gaussian, i.e., if $X \sim f_k$,
\[
\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\big[\exp(\lambda (X - \mu_k))\big] \le \exp\Big(\frac{\lambda^2}{8}\Big).
\]

The main example of sub-Gaussian rewards is random rewards with bounded support $[0,1]$, which are commonly used in the MAB literature.

The objective in the stochastic MAB problem is to maximize the expected value of the cumulative reward $S_T = \sum_{t=1}^{T} X_t^{\varphi_t}$. We assume that $\varphi_t$ is selected based upon past observations $\{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}$ following some policy $\rho$. Specifically, $\rho$ determines the conditional distribution
\[
\mathbb{P}_\rho\Big(\varphi_t = k \,\Big|\, \{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}\Big)
\]
at each time $t \in \{1, \dots, T-1\}$. If $\mathbb{P}_\rho(\cdot)$ takes binary values, we call $\rho$ deterministic; otherwise, it is called stochastic. Let the maximum mean reward among all arms be $\mu^*$. We use $\Delta_k = \mu^* - \mu_k$ to measure the suboptimality of arm $k$. For a policy $\rho$, maximizing the expected cumulative reward $\mathbb{E}[S_T]$ is equivalent to minimizing the regret defined by
\[
R_T^{\rho} := \mathbb{E}\Big[\sum_{t=1}^{T} \big(\mu^* - X_t^{\varphi_t}\big)\Big] = \sum_{k=1}^{K} \Delta_k\, \mathbb{E}_\rho[n_k(T)],
\]
where $n_k(T)$ is the total number of times arm $k$ has been chosen until time $T$, and the second expectation is taken over different realizations of the arm selections. It needs to be noted that $R_T^{\rho}$ can be viewed as the difference between the expected cumulative reward obtained by always selecting the arm with the maximum mean reward $\mu^*$ and that obtained by selecting arms $\varphi_1, \dots, \varphi_T$.

2.1 Lower Bounds on Regrets

The objective of regret minimization was originally formulated by Robbins [14]. A logarithmic problem-dependent asymptotic lower bound on the number of times a suboptimal arm is selected by a uniformly good policy was established later in [15, 71]. Here, a policy $\rho$ being uniformly good means that, for any possible set of reward distributions $\{f_1, \dots, f_K\}$,
\[
\mathbb{E}[R_T^{\rho}] = o(T^a) \quad \text{for every } a > 0,
\]
which means $\lim_{T \to \infty} \mathbb{E}[R_T^{\rho}]/T^a = 0$ for any $a > 0$.

Lemma 2.1 (Lai and Robbins' lower bound [15, 71]). Suppose there is a unique best arm with reward distribution $f^*$ and a uniformly good policy $\rho$ is applied. For any suboptimal arm $k$ and every $\epsilon > 0$,
\[
\lim_{T \to \infty} \mathbb{P}\bigg(n_k(T) \ge \frac{(1-\epsilon)\log T}{D_{\mathrm{KL}}(f_k \,\|\, f^*)}\bigg) = 1,
\]
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence between two distributions. Hence,
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[n_k(T)]}{\log T} \ge \frac{1}{D_{\mathrm{KL}}(f_k \,\|\, f^*)}.
\]

The above result indicates that a suboptimal arm needs to be selected at least a logarithmic number of times, resulting in the logarithmic lower bound for a stochastic MAB problem. It can also be seen that the regret $R_T^{\rho}$ is implicitly determined by the reward distributions $\{f_1, \dots, f_K\}$ as well as the policy $\rho$. So, $R_T^{\rho}$ is also called the distribution-dependent regret. In contrast, the distribution-independent regret, also known as the worst-case regret, is defined by taking the supremum over all possible combinations of reward distributions.

Definition 2.2 (Worst-case regret). The worst-case regret is the regret for the worst possible choice of reward distributions, and it can be expressed as
\[
R_T^{\mathrm{worst}\,\rho} = \sup_{\{f_1, \dots, f_K\}} R_T^{\rho}.
\]

The regret associated with the policy that minimizes the above worst-case regret is called the minimax regret. According to [72], the minimax regret also has a lower bound $\tfrac{1}{20}\sqrt{KT}$.
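For concreteness, the following worked example (added here for illustration; it is not part of the original text) instantiates Lemma 2.1 for Bernoulli rewards. If arm $k$ and the best arm are Bernoulli with means $\mu_k < \mu^*$, then
\[
D_{\mathrm{KL}}(f_k \,\|\, f^*) = \mu_k \ln\frac{\mu_k}{\mu^*} + (1-\mu_k)\ln\frac{1-\mu_k}{1-\mu^*},
\]
so, for example, $\mu_k = 0.4$ and $\mu^* = 0.5$ give $D_{\mathrm{KL}} \approx 0.020$, and any uniformly good policy must sample arm $k$ roughly $\log T / 0.020 \approx 50 \log T$ times.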
The minimax lower bound above concerns finite-time performance and can be derived by selecting a set of reward distributions that present challenges to the allocation policy. Consider a scenario in which there is a unique best arm and all other arms have identical mean rewards such that the gap between the optimal and suboptimal mean rewards is $\Delta$. For such a problem, it has been shown in [73] that, for any policy $\rho$,
\[
R_T^{\rho} \ge C_1 \frac{K}{\Delta} \ln\Big(\frac{T\Delta^2}{K}\Big) + C_2 \frac{K}{\Delta}, \tag{2.1}
\]
where $C_1$ and $C_2$ are some positive constants. It needs to be noted that for $\Delta = \sqrt{K/T}$, the above lower bound becomes $C_2 \sqrt{KT}$, which matches the lower bound $\tfrac{1}{20}\sqrt{KT}$.

2.2 Upper Confidence Bound Strategies

The family of Upper Confidence Bound (UCB) strategies uses the principle called optimism in the face of uncertainty. At each time slot, a UCB index, which is a statistical index composed of both a mean reward estimate and the associated uncertainty measure, is computed at each arm, and the arm with the maximum UCB index is picked. Within the family of UCB algorithms, two state-of-the-art algorithms for the stationary stochastic MAB problem are UCB1 [16] and MOSS [70]. With arm $k$ being sampled $n_k(t)$ times before time $t$, let $\hat\mu_{k, n_k(t)}$ be the associated empirical mean. Then, UCB1 computes the UCB index for each arm $k$ at time $t$ as
\[
g_{k,t}^{\mathrm{UCB1}} = \hat\mu_{k, n_k(t)} + \sqrt{\frac{2 \ln t}{n_k(t)}}.
\]
The finite-time performance guarantee of UCB1, which is stronger than an asymptotic property, has been proved in [16].

Lemma 2.2 (Regret upper bound for UCB1). For the stationary stochastic MAB problem, if the reward distributions have bounded support $[0,1]$, the regret of UCB1 after any time $T$ satisfies
\[
R_T^{\mathrm{UCB1}} \le 8 \sum_{k:\,\Delta_k > 0} \frac{\ln T}{\Delta_k} + \Big(1 + \frac{\pi^2}{3}\Big) \sum_{k=1}^{K} \Delta_k.
\]

Notice that the above upper bound matches the order of Lai and Robbins' logarithmic lower bound in Lemma 2.1. As shown in [70], the worst-case regret of UCB1 can be derived by selecting values of $\Delta_k$ that maximize the upper bound, resulting in
\[
R_T^{\mathrm{worst}\,\mathrm{UCB1}} \le 10 \sqrt{(K-1)\, T \ln T}.
\]
Comparing this result with the lower bound on the minimax regret $\tfrac{1}{20}\sqrt{KT}$, there exists an extra factor $\sqrt{\ln T}$. This issue has been resolved by the algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) [70], which is the first algorithm that enjoys both a logarithmic distribution-dependent bound and an order-optimal distribution-independent bound. With prior knowledge of the horizon length $T$, the UCB index for MOSS is expressed as
\[
g_{k,t}^{\mathrm{MOSS}} = \hat\mu_{k, n_k(t)} + \sqrt{\frac{\max\big(\ln\frac{T}{K n_k(t)},\, 0\big)}{n_k(t)}}.
\]
We now recall the worst-case regret upper bound for MOSS.

Lemma 2.3 (Worst-case regret upper bound for MOSS [70]). For the stationary stochastic MAB problem, the worst-case regret of the MOSS algorithm satisfies
\[
R_T^{\mathrm{worst}\,\mathrm{MOSS}} \le 49 \sqrt{KT}.
\]
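The two indices differ only in their exploration terms. As a complement (not part of the original text), the following is a minimal Python sketch of how the UCB1 and MOSS indices defined above could be computed and used in a simulation; the Bernoulli reward model and all parameter values are illustrative placeholders.

```python
# Minimal sketch of the UCB1 and MOSS indices from Section 2.2.
# The Bernoulli reward model is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.4, 0.5, 0.6])       # unknown mean rewards (hypothetical)
K, T = len(mu), 20_000

def run(index_fn):
    counts = np.ones(K)              # each arm is pulled once to initialize
    sums = rng.binomial(1, mu).astype(float)
    regret = 0.0
    for t in range(K + 1, T + 1):
        means = sums / counts
        arm = int(np.argmax(index_fn(means, counts, t)))
        sums[arm] += rng.binomial(1, mu[arm])
        counts[arm] += 1
        regret += mu.max() - mu[arm]
    return regret

# g^UCB1 = mean + sqrt(2 ln t / n);  g^MOSS = mean + sqrt(max(ln(T/(K n)), 0) / n)
ucb1 = lambda means, n, t: means + np.sqrt(2 * np.log(t) / n)
moss = lambda means, n, t: means + np.sqrt(np.maximum(np.log(T / (K * n)), 0) / n)

print(f"UCB1 regret: {run(ucb1):.1f}, MOSS regret: {run(moss):.1f}")
```

Note that MOSS uses the horizon $T$ inside its exploration term, which is exactly the prior knowledge assumed in Lemma 2.3.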
2.3 Heavy-tailed Stochastic MAB

This section is a slightly modified version of our published work on heavy-tailed bandits, and it is reproduced here with the permission of the copyright holder (©2018 IEEE; reprinted with permission from [74]).

The rewards being bounded or sub-Gaussian is a common assumption that gives the sample mean an exponential convergence rate and simplifies the MAB problem. However, in many applications, such as social networks [2] and financial markets [3], the rewards are heavy-tailed. Bubeck et al. [18] relax the sub-Gaussian assumption by only assuming the rewards to have finite moments of order $1+\epsilon$ for some $\epsilon \in (0,1]$. They present the Robust UCB algorithm and show that it attains a logarithmic distribution-dependent upper bound on the regret that is within a constant factor of the lower bound in the heavy-tailed setting. However, the solutions provided in [18] are not able to provably achieve an order-optimal worst-case regret. Specifically, the factor of optimality is a poly-logarithmic function of the time horizon.

The heavy-tailed stochastic MAB problem studied in this section is the stochastic MAB problem with the following assumptions.

Assumption 2.1. Let $X$ be a random reward drawn from any arm $k \in \{1, \dots, K\}$. There exists a constant $u \in \mathbb{R}_{>0}$ such that $\mathbb{E}\big[|X|^{1+\epsilon}\big] \le u^{1+\epsilon}$ for some $\epsilon \in (0,1]$.

Assumption 2.2. Parameters $T$, $K$, $u$ and $\epsilon$ are known.

We now recall the lower bound on the minimax regret for the heavy-tailed bandit problem derived in [18].

Theorem 2.4 ([18, Th. 2]). For any fixed time horizon $T$ and the stochastic MAB problem under Assumptions 2.1 and 2.2 with $u = 1$, the worst-case regret for a uniformly good policy $\rho$ satisfies
\[
R_T^{\mathrm{worst}\,\rho} \ge 0.01\, K^{\frac{\epsilon}{1+\epsilon}}\, T^{\frac{1}{1+\epsilon}}.
\]

Remark 2.1. Since $R_T$ scales with $u$, the lower bound for the heavy-tailed bandit is $\Omega\big(u K^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}\big)$. This lower bound also indicates that, within a finite horizon $T$, it is almost impossible to differentiate the optimal arm from arm $k$ if $\Delta_k \in O\big(u (K/T)^{\frac{\epsilon}{1+\epsilon}}\big)$. As a special case, rewards with bounded support $[0,1]$ correspond to $\epsilon = 1$ and $u = 1$. Then, the lower bound $\Omega(\sqrt{KT})$ is matched by the regret upper bound achieved by MOSS.

2.3.1 A Robust Minimax Policy: Robust MOSS

In Robust MOSS, to deal with heavy-tailed reward distributions, we replace the empirical mean with a saturated empirical mean. Although the saturated empirical mean is a biased estimator, it has better convergence properties. The formal definition is given later in this section. We construct a novel UCB index to evaluate the arms, and at each time slot, the arm with the maximum UCB index is picked. Let $n_k(t)$ be the number of times that arm $k$ has been selected until time $t-1$. At time $t$, let $\hat\mu_{n_k(t)}^{k}$ be the saturated empirical mean reward computed from the $n_k(t)$ samples at arm $k$. Robust MOSS initializes by selecting each arm once and subsequently, at each time $t$, selects the arm that maximizes the following UCB index:
\[
g_{n_k(t)}^{k} = \hat\mu_{n_k(t)}^{k} + (1+\eta)\, c_{n_k(t)},
\]
where $\eta > 0$ is an appropriate constant, $c_{n_k(t)} = u \cdot \phi(n_k(t))^{\frac{\epsilon}{1+\epsilon}}$ and
\[
\phi(n) = \frac{\ln_+\!\big(\frac{T}{Kn}\big)}{n},
\]
where $\ln_+(x) := \max(\ln x, 1)$. Note that both $\phi(n)$ and $c_n$ are monotonically decreasing in $n$.

The robust saturated empirical mean is similar to the truncated empirical mean used in [18], which is employed to extend UCB1 to achieve logarithmic distribution-dependent regret for the heavy-tailed MAB problem. Let $\{X_i\}_{i \in \{1,\dots,m\}}$ be a sequence of i.i.d. random variables with mean $\mu$ and $\mathbb{E}\big[|X_i|^{1+\epsilon}\big] \le u^{1+\epsilon}$, where $u > 0$. Pick $a > 1$ and let $h(m) = a^{\lfloor \log_a(m) \rfloor + 1}$ such that $h(m) \ge m$. Define the saturation point $B_m$ by
\[
B_m := u \cdot \phi\big(h(m)\big)^{-\frac{1}{1+\epsilon}}.
\]
Then, the saturated empirical mean estimator is defined by
\[
\hat\mu_m := \frac{1}{m} \sum_{i=1}^{m} \mathrm{sat}(X_i, B_m), \tag{2.2}
\]
where $\mathrm{sat}(X_i, B_m) := \mathrm{sign}(X_i) \min\big(|X_i|, B_m\big)$. Define $d_i := \mathrm{sat}(X_i, B_m) - \mathbb{E}[\mathrm{sat}(X_i, B_m)]$, which has the following properties.

Lemma 2.5. For any $i \in \{1, \dots, m\}$, $d_i$ satisfies (i) $|d_i| \le 2 B_m$, and (ii) $\mathbb{E}[d_i^2] \le u^{1+\epsilon} B_m^{1-\epsilon}$.

Proof. Property (i) follows immediately from the definition of $d_i$, and property (ii) follows from
\[
\mathbb{E}[d_i^2] \le \mathbb{E}\big[\mathrm{sat}^2(X_i, B_m)\big] \le \mathbb{E}\big[|X_i|^{1+\epsilon}\big]\, B_m^{1-\epsilon}.
\]

Lemma 2.6 below examines the estimator bias and provides an upper bound on the error of the saturated empirical mean.
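To make the estimator concrete, here is a minimal Python sketch (not from the dissertation) of the saturated empirical mean (2.2) and the resulting Robust MOSS index for a single arm, following the definitions of $\phi$, $h(m)$, $B_m$ and $c_m$ above. The sample distribution and all parameter values are illustrative assumptions.

```python
# Minimal sketch of the saturated empirical mean (2.2) and the Robust MOSS
# index for one arm, following the definitions of phi, h, B_m and c_n above.
import numpy as np

def phi(n, T, K):
    return max(np.log(T / (K * n)), 1.0) / n          # ln_+(T/(K n)) / n

def robust_moss_index(samples, T, K, u=1.0, eps=1.0, a=1.1, eta=2.2):
    m = len(samples)
    h = a ** (np.floor(np.log(m) / np.log(a)) + 1)     # h(m) = a^(floor(log_a m)+1)
    B = u * phi(h, T, K) ** (-1.0 / (1.0 + eps))       # saturation point B_m
    sat = np.sign(samples) * np.minimum(np.abs(samples), B)
    mu_hat = sat.mean()                                # saturated empirical mean
    c = u * phi(m, T, K) ** (eps / (1.0 + eps))        # confidence width c_m
    return mu_hat + (1.0 + eta) * c

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)      # heavy-tailed samples; E[X^2] = 3, so u = 2 works
print(f"Robust MOSS index: {robust_moss_index(x, T=10_000, K=3, u=2.0):.3f}")
```

The values $a = 1.1$ and $\eta = 2.2$ mirror the parameter choices used in the numerical illustration of Section 2.3.3.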
Lemma 2.6 (Bias of saturated empirical mean). For an i.i.d. sequence of random variables   {𝑋𝑖 }𝑖∈{1,...,𝑚} such that E [𝑋𝑖 ] = 𝜇 and E 𝑋𝑖1+𝜖 ≤ 𝑢 1+𝜖 , the saturated empirical mean (2.2) satisfies 𝑚 1 Õ 𝑢 1+𝜖 𝜇ˆ 𝑚 − 𝜇 − 𝑑𝑖 ≤ 𝜖 . 𝑚 𝑖=1 𝐵𝑚 h i Proof. Since 𝜇 = E 𝑋𝑖 1{|𝑋𝑖 |≤𝐵𝑚 } + 1{|𝑋𝑖 |>𝐵𝑚 } , the error of estimator 𝜇ˆ 𝑚 satisfies 𝑚 𝑚 𝑚 1 Õ  1 Õ 1 Õ  𝜇ˆ 𝑚 − 𝜇 = sat(𝑋𝑖 , 𝐵𝑚 ) − 𝜇 = 𝑑𝑖 + E [sat(𝑋𝑖 , 𝐵𝑚 )] − 𝜇 , 𝑚 𝑖=1 𝑚 𝑖=1 𝑚 𝑖=1 where the second term is the bias of 𝜇ˆ 𝑚 . We now compute an upper bound on the bias. " # 1+𝜖 h i |𝑋𝑖 | 𝑢 1+𝜖 E [sat(𝑋𝑖 , 𝐵𝑚 )] − 𝜇 ≤ E |𝑋𝑖 | 1{|𝑋𝑖 |>𝐵𝑚 } ≤ E ≤ , (𝐵𝑚 ) 𝜖 (𝐵𝑚 ) 𝜖 which concludes the proof.  2.3.2 Analysis of Robust MOSS In this section, we analyze Robust MOSS to provide both distribution-free and distribution- dependent regret bounds.To derive the concentration property of saturated empirical mean, we use a maximal Bennett type inequality as shown in Lemma 2.7. Lemma 2.7 (Maximal Bennett’s inequality [75]). Let {𝑋𝑖 }𝑖∈{1,...,𝑛} be a sequence of bounded random variables with support [−𝐵, 𝐵], where 𝐵 ≥ 0. Suppose that E [𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] = 𝜇𝑖 and Í𝑚 Var[𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] ≤ 𝑣. Let 𝑆 𝑚 = 𝑖=1 (𝑋𝑖 − 𝜇𝑖 ) for any 𝑚 ∈ {1, . . . , 𝑛}. Then, for any 𝛿 ≥ 0  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≥ 𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≤ −𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣 where 𝜓(𝑥) = (1 + 1/𝑥) ln(1 + 𝑥) − 1. 18 Remark 2.2. For 𝑥 ∈ (0, ∞), function 𝜓(𝑥) is monotonically increasing in 𝑥. Now, we establish an upper bound on the probability that the UCB underestimates the mean at arm 𝑘 by an amount 𝑥. Lemma 2.8. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 } and 𝑥 > 0, if 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎,  the probability of event 𝑔𝑛𝑘𝑘 (𝑡) ≤ 𝜇 𝑘 − 𝑥 is no greater than    ! − 1+𝜖 𝜖 𝐾 𝑎 1 𝜓 2𝜂/𝑎 𝑥 Γ +2 . 𝑇 ln(𝑎) 𝜖 2𝑎 𝑢 Proof. It follows from Lemma 2.6 that     P 𝑔𝑛𝑘𝑘 (𝑡) ≤ 𝜇 𝑘 − 𝑥 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝜇ˆ 𝑚𝑘 + (1 + 𝜂)𝑐 𝑚 ≤ 𝜇 𝑘 − 𝑥 𝑚 𝑑𝑖𝑘   Õ 𝑢 1+𝜖 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : ≤ 𝜖 − (1 + 𝜂)𝑐 𝑚 − 𝑥 𝑖=1 𝑚 𝐵𝑚  𝑚  1 Õ 𝑘 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝑑 ≤ −𝑥 − 𝜂𝑐 𝑚 , 𝑚 𝑖=1 𝑖 where 𝑑𝑖𝑘 is defined similarly to 𝑑𝑖 for i.i.d. reward sequence at arm 𝑘 and the last inequality is due to 𝑢 1+𝜖   𝜖   𝜖 𝜖 = 𝑢 𝜙 ℎ(𝑚) 1+𝜖 ≤ 𝑢 𝜙(𝑚) 1+𝜖 = 𝑐 𝑚 . (2.3) 𝐵𝑚 Recall 𝑎 > 1. We apply a peeling argument [76, Sec 2.2] with geometric grid 𝑎 𝑠 ≤ 𝑚 < 𝑎 𝑠+1 over time interval {1, . . . , 𝑇 }. Since 𝑐 𝑚 is monotonically decreasing with 𝑚,  𝑚  1 Õ 𝑘 P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝑑 ≤ −𝑥 − 𝜂𝑐 𝑚 𝑚 𝑖=1 𝑖 Õ  Õ 𝑚  𝑠 𝑠+1 𝑘 𝑠  ≤ P ∃𝑚 ∈ [𝑎 , 𝑎 ) : 𝑑𝑖 ≤ −𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 . 𝑠≥0 𝑖=1 Also notice that 𝐵𝑚 = 𝐵𝑎 𝑠 for all 𝑚 ∈ [𝑎 𝑠 , 𝑎 𝑠+1 ). Then with properties in Lemma 2.5, we apply 19 Lemma 2.7 to get Õ  Õ 𝑚  𝑠 𝑠+1 𝑘 𝑠  P ∃𝑚 ∈ [𝑎 , 𝑎 ) : 𝑑𝑖 ≤ −𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 𝑠≥0 𝑖=1  ! © 𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 2𝐵𝑎 𝑠 𝑥 + 𝜂𝑐 𝑎 𝑠+1 Õ 𝑠 ≤ exp ­− ª 𝜓 ® 𝑠≥0 2𝐵𝑎 𝑠 𝑎𝑢 1+𝜖 𝐵1−𝜖𝑎𝑠 « ¬  𝜓(𝑥) is monotonically increasing   ! Õ 𝑎 𝑠 𝑥 + 𝜂𝑐 𝑎 𝑠+1 2𝜂𝐵𝜖𝑎 𝑠 𝑐 𝑎 𝑠+1 ≤ exp − 𝜓 𝑠≥0 2𝐵 𝑎 𝑠 𝑎𝑢 1+𝜖 plug in 𝑐 𝑎 𝑠+1 , 𝐵𝑎 𝑠 and use ℎ(𝑎 𝑠 ) = 𝑎 𝑠+1 )   ! Õ 𝑥 𝜓 2𝜂/𝑎 = exp −𝑎 𝑠 + 𝜂𝜙(𝑎 𝑠 ) . 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 Plugging in 𝜙(𝑎 𝑠 ), with 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 and ln+ (𝑦) ≥ ln(𝑦), we have   ! Õ ! Õ 𝑥 𝜓 2𝜂/𝑎 𝑥 𝜓 2𝜂/𝑎 𝐾 𝑠 exp −𝑎 𝑠 + 𝜂𝜙(𝑎 𝑠 ) ≤ exp −𝑎 𝑠 𝑎 . 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 𝑇  𝑠 Let 𝑏 = 𝑥𝜓 2𝜂/𝑎 /(2𝑎𝑢). Since 𝐵𝑎 𝑠−1 ≤ 𝑢𝑎 1+𝜖 , we have ! Õ 𝑠 𝑥 𝜓 2𝜂/𝑎 𝐾 𝑠 𝐾Õ 𝑠  𝜖𝑠  exp −𝑎 𝑎 ≤ 𝑎 exp −𝑏𝑎 1+𝜖 𝑠≥1 𝐵𝑎 𝑠−1 2𝑎 𝑇 𝑇 𝑠≥1 ∫ +∞ 𝐾 ( 𝑦−1) 𝜖  ≤ 𝑎 𝑦 exp − 𝑏𝑎 1+𝜖 𝑑𝑦 𝑇 1 ∫ +∞ 𝐾 𝑦𝜖  = 𝑎 𝑎 𝑦 exp − 𝑏𝑎 1+𝜖 𝑑𝑦 𝑇 0  𝑦𝜖  where we set 𝑧 = 𝑏𝑎 1+𝜖 ∫ +∞ 𝐾 𝑎 1 + 𝜖 − 1+𝜖 1+𝜖 𝑧 𝜖 −1 exp − 𝑧 𝑑𝑧  = 𝑏 𝜖 𝑇 ln(𝑎) 𝜖   𝑏 𝐾 𝑎 1 1+𝜖 ≤ Γ + 2 𝑏− 𝜖 , 𝑇 ln(𝑎) 𝜖 which concludes the proof.  
The following is a straightforward corollary of Lemma 2.8. Corollary 2.9. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 } and 𝑥 > 0, if 𝜂𝜓(2𝜂/𝑎) ≥  2𝑎, the probability of event 𝑔𝑛𝑘𝑘 (𝑡) − 2(1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) ≥ 𝜇 𝑘 + 𝑥} shares the same bound in Lemma 2.8. 20 The distribution-free upper bound for Robust MOSS, which is the main result for the paper, is presented in this section. We show that the algorithm achieves order optimal worst-case regret. Theorem 2.10. For the heavy-tailed stochastic MAB problem with 𝐾 arms and time horizon 𝑇, if 𝜂 and 𝑎 are selected such that 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎, then Robust MOSS satisfies 𝜖 1 worst Robust MOSS 𝑅𝑇 ≤ 𝐶𝑢𝐾 1+𝜖 (𝑇/𝑒) 1+𝜖 + 2𝑢𝐾, where   1    1+𝜖 𝐶 =Γ 1/𝜖 + 2 𝑎/ 6 + 3𝜂 𝜖 3/𝜓 6 + 3𝜂 𝜖  − 1   1+𝜖  −𝜖  + 𝜖Γ 1/𝜖 + 2 6 + 3𝜂 𝜖 6𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎) + 6 + 3𝜂 𝑒 + (1 + 𝜖)𝑒 1+𝜖 . Remark 2.3. Parameter 𝑎 and 𝜂 as inputs to Robust MOSS can be selected by minimizing the leading constant 𝐶 in the upper bound on the regret in Theorem 2.10. We have found that selecting 𝑎 slightly larger than 1 and selecting smallest 𝜂 that satisfies 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 yields good performance. Proof. Since both the UCB and the regret scales with 𝑢 defined in Assumption 2.1, to simplify the expressions, we assume 𝑢 = 1. Also notice that Assumption 2.1 indicates 𝜇 𝑘 ≤ 𝑢, so Δ 𝑘 ≤ 2 for any 𝑘 ∈ {1, . . . , 𝐾 }. In the following, any terms with superscript or subscript “∗" and “𝑘" are with respect to the best and the 𝑘-th arm, respectively. The proof is divided into 4 steps. Step 1: We follow a decoupling technique inspired by the proof of regret upper bound in MOSS [70]. Take the set of 𝛿-bad arms as B𝛿 as B𝛿 := {𝑘 ∈ {1, . . . , 𝐾 } | Δ 𝑘 > 𝛿}, (2.4)   𝜖 where we assign 𝛿 = 6 + 3𝜂 𝑒𝐾/𝑇 1+𝜖 . Thus, 𝐾 " 𝑇 # Õ Õ  𝑅𝑇 ≤ 𝑇 𝛿 + Δ𝑘 + E 1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 𝑡=1 𝑡=𝐾+1 " 𝑇 # Õ  ≤ 𝑇 𝛿 + 2𝐾 + E 1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 . (2.5) 𝑡=𝐾+1 21 Furthermore, we make the following decomposition 𝑇 𝑇   Õ Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗   1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 = 1 𝜑𝑡 ∈ ≤𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 (2.6) 𝑡=𝐾+1 𝑡=𝐾+1 3 𝑇   Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗  + 1 𝜑𝑡 ∈ >𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 . 𝑡=𝐾+1 3 Notice that the first summand (2.6) describes regret from underestimating optimal arm ∗. For the ≥ 𝑔𝑛∗∗ (𝑡) and 𝜇∗ = 𝜇 𝜑𝑡 + Δ𝜑𝑡 , 𝜑 second summand, since 𝑔𝑛 𝑡 (𝑡) 𝜑𝑡 𝑇   Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗  1 𝜑𝑡 ∈ >𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 𝑡=𝐾+1 3 𝑇   Õ 𝜑 2Δ𝜑𝑡 ≤ 1 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛 𝑡 (𝑡) > 𝜇 𝜑𝑡 + Δ𝜑𝑡 𝑡=𝐾+1 𝜑𝑡 3 𝑇   Õ Õ 𝑘 2Δ 𝑘 = 1 𝜑𝑡 = 𝑘, 𝑔𝑛 𝑘 (𝑡) > 𝜇 𝑘 + Δ𝑘 , (2.7) 𝑘∈B 𝑡=𝐾+1 3 𝛿 which characterizes the regret caused by overestimating 𝛿-bad arms. n o Step 2: In this step, we bound the expectation of (2.6). When event 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛∗ (𝑡) ≤ 𝜇 − Δ𝜑𝑡 /3 ∗ ∗ happens, we know 𝛿 Δ𝜑 ≤ 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) and 𝑔𝑛∗∗ (𝑡) < 𝜇∗ − . 3 Thus, we get     Δ𝜑𝑡 ∗ 𝛿 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗ ∗ × 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) − 𝛿 := 𝑌𝑡  1 𝜑𝑡 ∈ ≤𝜇 − (Δ𝜑𝑡 − 𝛿) ≤ 1 𝑔𝑛∗ (𝑡) < 𝜇 − 3 3 Since 𝑌𝑡 is a positive random variable, its expected value can be computed involving only its cumulative density function: ∫ +∞ ∫ +∞   E [𝑌𝑡 ] = P (𝑌𝑡 > 𝑥) 𝑑𝑥 ≤ P 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) − 𝛿 > 𝑥 𝑑𝑥 ∫0 +∞  0  𝑥 = P 𝜇∗ − 𝑔𝑛∗∗ (𝑡) > 𝑑𝑥. 𝛿 3 Then we apply Lemma 2.8 at optimal arm ∗ to get ∫ +∞ 𝐾𝐶1 1 − 1+𝜖 𝐾𝐶1 E [𝑌𝑡 ] ≤ 𝑥 𝜖 𝑑𝑥 = 1 𝑇 𝛿 𝜖 𝑇𝛿 𝜖 22   1+𝜖 where 𝐶1 = 𝜖Γ 1/𝜖 + 2 6𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎). We conclude this step by 𝑇   𝑇 h Õ Δ𝜑 i Õ 1 E 1 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛∗∗ (𝑡) ≤𝜇 − 𝑡∗ Δ𝜑𝑡 − 𝛿 ≤ 𝑌𝑡 ≤ 𝐶1 𝐾𝛿− 𝜖 . 𝑡=𝐾+1 3 𝑡=𝐾+1 Step 3: In this step, we bound the expectation of (2.7). 
For each arm 𝑘 ∈ B𝛿 , 𝑇   Õ 2Δ 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝜇𝑘 + 𝑡=𝐾+1 3 𝑇 Õ 𝑡−𝐾   Õ  2Δ 𝑘 = 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) = 𝑚 1 𝑔𝑚𝑘 ≥ 𝜇𝑘 + 𝑡=𝐾+1 𝑚=1 3 𝑇−𝐾 Õ   𝑇 𝑘 2Δ 𝑘 Õ  = 1 𝑔𝑚 ≥ 𝜇 𝑘 + 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) =𝑚 𝑚=1 3 𝑡=𝑚+𝐾 𝑇   Õ 𝑘 2Δ 𝑘 ≤ 1 𝑔𝑚 ≥ 𝜇 𝑘 + 𝑚=1 3 𝑇  Õ 𝑚  Õ 1 𝑘 2Δ 𝑘 ≤ 1 𝑑𝑖 ≥ − (2 + 𝜂)𝑐 𝑚 , (2.8) 𝑚=1 𝑚 𝑖=1 3 where in the last inequality we apply Lemma 2.6 and use the fact that 𝑢 1+𝜖 /𝐵𝑚 𝜖 ≤ 𝑐 in (2.3). We 𝑚 set !   1+𝜖   1+𝜖  6 + 3𝜂 𝜖 𝑇 Δ 𝑘 𝜖  𝑙 𝑘 =  ln .  Δ𝑘 𝐾 6 + 3𝜂     With Δ 𝑘 ≥ 𝛿, we get 𝑙 𝑘 is no less than   1+𝜖    1+𝜖    1+𝜖 6 + 3𝜂 𝜖 𝑇 𝛿 𝜖 6 + 3𝜂 𝜖 ln = . Δ𝑘 𝐾 6 + 3𝜂 Δ𝑘 Furthermore, since 𝑐 𝑚 is monotonically decreasing with 𝑚, for 𝑚 ≥ 𝑙 𝑘 , " 𝑇 Δ𝑘  1+𝜖 𝜖 𝜖  # 1+𝜖 ln+ 𝐾 6+3𝜂 Δ𝑘 𝑐𝑚 ≤ 𝑐𝑙𝑘 ≤ ≤ . (2.9) 𝑙𝑘 6 + 3𝜂 With this result and 𝑙 𝑘 ≥ 1, we continue from (2.8) to get Õ𝑇  Õ 𝑚  𝑇  Õ 𝑚  1 𝑘 2Δ 𝑘 Õ 1 𝑘 2Δ 𝑘 E 1 𝑑 ≥ − (2 + 𝜂)𝑐 𝑚 ≤𝑙 𝑘 − 1 + P 𝑑 ≥ − (2 + 𝜂)𝑐 𝑚 𝑚=1 𝑚 𝑖=1 𝑖 3 𝑚=𝑙 𝑘 𝑚 𝑖=1 𝑖 3 𝑇  Õ 𝑚  Õ 1 𝑘 Δ𝑘 ≤𝑙 𝑘 − 1 + P 𝑑𝑖 ≥ (2.10) 𝑚=𝑙 𝑚 𝑖=1 3 𝑘 23 Therefore by using Lemma 2.7 together with statement (ii) from Lemma 2.5, we get 𝑇 ( 𝑚 ) 𝑇   Õ 𝑇   Õ 1 Õ 𝑘 Δ𝑘 Õ 𝑚Δ 𝑘 𝜖  𝑚Δ 𝑘  P 𝑑 ≥ ≤ exp − 𝜓 𝐵𝑚 Δ 𝑘 ≤ exp − 𝜓 6 + 3𝜂 , 𝑚=𝑙 𝑚 𝑖=1 𝑖 3 𝑚=𝑙 3𝐵𝑚 𝑚=𝑙 3𝐵𝑚 𝑘 𝑘 𝑘 where the last step is due to that 𝜓(𝑥) is monotonically increasing and 𝐵𝑚 𝜖 Δ ≥ (6+3𝜂)𝐵 𝜖 𝑐 ≥ 6+3𝜂 𝑘 𝑚 𝑚 1  − 1+𝜖 1 1 from (2.9) and (2.3). Since 𝐵𝑚 = 𝜙 ℎ(𝑚) ≤ 𝜙(𝑎𝑚) − 1+𝜖 ≤ (𝑎𝑚) 1+𝜖 , we have 𝑇   Õ 𝑇   Õ 𝑚Δ 𝑘  𝜖 1 − 1+𝜖  Δ𝑘 exp − 𝜓 6 + 3𝜂 ≤ exp −𝑚 𝑎 1+𝜖 𝜓 6 + 3𝜂 . 𝑚=𝑙 𝑘 3𝐵𝑚 𝑚=1 3 ∫ +∞   𝜖 ≤ exp −𝛽𝑦 1+𝜖 𝑑𝑦 0 1 𝜖 where we set 𝛽 = 𝑎 − 1+𝜖 𝜓 6 + 3𝜂 Δ 𝑘 /3. Taking 𝑧 = 𝛽𝑦 1+𝜖 , we obtain  ∫ +∞ ∫ +∞    𝜖  1 + 𝜖 − 1+𝜖 1+𝜖 1 1+𝜖 exp −𝛽𝑦 1+𝜖 𝑑𝑦 = 𝛽 𝜖 𝑧 𝜖 −1 exp (−𝑧) 𝑑𝑦 = Γ + 2 𝛽− 𝜖 . 0 𝜖 0 𝜖 Plugging it into (2.10), Õ 𝑇  Õ 𝑚   𝑇 1+𝜖  1 𝑘 2Δ 𝑘 − 1+𝜖 − 1+𝜖 E 1 𝑑𝑖 ≥ − (2 + 𝜂)𝑐 𝑚 ≤ 𝐶2 Δ 𝑘 𝜖 + 𝐶3 Δ 𝑘 𝜖 ln Δ𝑘 𝜖 𝑚=1 𝑚 𝑖=1 3 𝐾𝐶3  1    1+𝜖  1+𝜖 where 𝐶2 = Γ 1/𝜖 + 2 𝑎 𝜖 3/𝜓 6 + 3𝜂 𝜖 and 𝐶3 = 6 + 3𝜂 𝜖 . Putting it together with Δ 𝑘 ≥ 𝛿 for all 𝑘 ∈ B𝛿 , the expectation of (2.7) is no greater than   −1 −1 𝑇 1+𝜖 −𝜖 Õ 1 1 𝐶2 Δ 𝑘 𝜖 + 𝐶3 Δ 𝑘 𝜖 ln Δ𝑘 𝜖 ≤ 𝐶2 𝐾𝛿− 𝜖 + (1 + 𝜖)𝑒 1+𝜖 𝐶3 𝐾𝛿− 𝜖 , 𝑘∈B 𝛿 𝐾𝐶3 1 1+𝜖 where we use the fact that 𝑥 − 𝜖 ln 𝑇𝑥  𝜖 /(𝐾𝐶3 ) takes its maximum at 𝑥 = 𝛿 exp(𝜖 2 /(1 + 𝜖)). Step 4: Plugging the results in step 2 and step 3 into (2.5), h −𝜖 i 1 worst Robust MOSS 𝑅𝑇 ≤ 𝑇 𝛿 + 𝐶1 + 𝐶2 + (1 + 𝜖)𝑒 1+𝜖 𝐶3 𝐾𝛿− 𝜖 + 2𝐾. Straightforward calculation concludes the proof.  We now show that robust MOSS also preserves a logarithmic upper bound on the distribution- dependent regret. 24 Theorem 2.11. For the heavy-tailed stochastic MAB problem with 𝐾 arms and time horizon 𝑇, if 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎, the regret 𝑅𝑇 for Robust MOSS is no greater than "   # Õ  𝑢 1+𝜖  𝜖1 𝑇  Δ 𝑘  1+𝜖𝜖 𝐶1 ln + 𝐶2 𝐾 + Δ 𝑘 , 𝑘:Δ >0 Δ 𝑘 𝐾𝐶 1 𝑢 𝑘  1+𝜖   1+𝜖  where 𝐶1 = 4 + 4𝜂 𝜖 and 𝐶2 = max 𝑒𝐶1 , 2Γ(1/𝜖 + 2) 8𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎) .   𝜖 Proof. Let 𝛿 = 4 + 4𝜂 𝑒𝐾/𝑇 1+𝜖 and define B𝛿 the same as (2.4). Since Δ 𝑘 ≤ 𝛿 for all 𝑘 ∉ B𝛿 , the regret satisfies Õ Õ 𝑇 𝑅𝑇Robust MOSS ≤ 𝑇Δ 𝑘 + 1{𝜑𝑡 ∈ B𝛿 }Δ𝜑𝑡 𝑘∉B 𝛿 𝑡=1   1+𝜖 𝑇 Õ 4 + 4𝜂 𝜖 Õ Õ ≤ 𝑒𝐾 Δ𝑘 + 1{𝜑𝑡 = 𝑘 }Δ 𝑘 . (2.11) 𝑘∉B 𝛿 Δ𝑘 𝑘∈B 𝛿 𝑡=1 Pick arbitrary 𝑙 𝑘 ∈ Z+ , thus Õ 𝑇 Õ 𝑇  1{𝜑𝑡 = 𝑘 } ≤ 𝑙 𝑘 + 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) ≥ 𝑙 𝑘 𝑡=1 𝑡=𝐾+1 Õ 𝑇 n o ≤ 𝑙𝑘 + 1 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) , 𝑛 𝑘 (𝑡) ≥ 𝑙𝑘 . 𝑡=𝐾+1 Observe that 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) implies at least one of the following is true 𝑔𝑛∗∗ (𝑡) ≤ 𝜇∗ − Δ 𝑘 /4, (2.12) 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝜇 𝑘 + Δ 𝑘 /4 + 2(1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) , (2.13) (1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) > Δ 𝑘 /4. (2.14) We select !   1+𝜖   1+𝜖  4 + 4𝜂 𝜖 𝑇 Δ𝑘 𝜖  𝑙 𝑘 =  ln .  Δ𝑘 𝐾 4 + 4𝜂     Similarly as (2.9), 𝑛 𝑘 (𝑡) ≥ 𝑙 𝑘 indicates 𝑐 𝑛 𝑘 (𝑡) ≤ Δ 𝑘 /(4 + 4𝜂), so (2.14) is false. 
Then we apply Lemma 2.8 and Corollary 2.9, n o 𝐶20 𝐾 − 1+𝜖 P 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) , 𝑛 𝑘 (𝑡) ≥ 𝑙𝑘 ≤ P ((2.12) or (2.13) is true ) ≤ Δ 𝜖 , 𝑇 𝑘 25  1+𝜖 where 𝐶20 = 2Γ 1/𝜖 + 2 8𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎). Substituting it into (2.11),    Õ 𝑒𝐶1 𝐾 Õ  𝐶1  𝑇 1+𝜖  𝐶 0𝐾  2 𝑅𝑇Robust MOSS ≤ + Δ𝑘 + 1 + Δ𝑘  . 𝜖  1  1 ln 𝐾𝐶1 Δ 𝑘𝜖 𝑘∈B 𝛿  Δ 𝜖 Δ 𝑘𝜖   𝑘∉B 𝛿   𝑘  Considering the scaling factor 𝑢, the proof can be concluded with easy computation.  2.3.3 Numerical Illustration of Robust MOSS In this section, we compare Robust MOSS with MOSS and Robust UCB (with truncated empirical mean or Catoni’s estimator) [18] in a 3-armed heavy-tailed bandit setting. The mean rewards are 𝜇1 = −0.3, 𝜇2 = 0 and 𝜇3 = 0.3 and sampling at each arm 𝑘 returns a random reward equals to 𝜇 𝑘 added by sampling noise 𝜈, where |𝜈| is a generalized Pareto random variable and the sign of 𝜈 has equal probability to be positive and negative. The PDF of reward at arm 𝑘 is ! − 𝜉1 −1 1 𝜉 𝑥 − 𝜇𝑘 𝑓 𝑘 (𝑥) = 1+ for 𝑥 ∈ (−∞, +∞), 2𝜎 𝜎 where we select 𝜉 = 0.33 and 𝜎 = 0.32. Thus, for a random reward 𝑋 from any arm, we know E [𝑋 2 ] ≤ 1, which means 𝜖 = 1 and 𝑢 = 1. We select parameters 𝑎 = 1.1 and 𝜂 = 2.2 for Robust MOSS so that condition 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 is met. Figure.2.1 shows the mean regret together with quantiles of regret distribution as a function of time, which are computed using 200 simulations of each policy. On each graph, the bold curve is the emprical mean regret while light shaded and dark shaded regions correspond respectively to upper 5% and lower 95% quantile cumulative regrets. The simulation result shows that there is a chance MOSS loses stability in heavy-tailed MAB and suffers linear regret while other algorithms work consistently and maintain sub-linear regrets. Robust MOSS slightly outperforms Robust UCB in this specific problem. 2.4 Summary We reviewed the stationary stochastic MAB problem and concepts including regret and regret lower bound. Specially, we studied the heavy-tailed bandit problem and proposed the Robust 26 Figure 2.1: Comparison of Robust MOSS with MOSS and other Robust UCB algorithms. MOSS algorithm. We evaluate it by deriving upper bounds on the associated distribution-free and distribution-dependent regrets. Our analysis shows that Robust MOSS achieves order optimal performance in both scenarios. It can be noticed that the saturated mean estimator centers at zero so that the algorithm is not translation invariant. Exploration of translation invariant robust mean estimator in this context remains an open problem. 2.5 Bibliographic Remarks Since the seminal work by Lai and Robbins [15], several subsequent works design simpler al- gorithms by assuming that the rewards are bounded or more generally, sub-Gaussian. By using Kullback-Leibler(KL) divergence-based uncertainty estimates, Garivier and Cappé [17] designed KL-UCB and proved that it strictly dominates UCB1 [16], which uses Hoeffding inequality-based uncertainty estimates. Aside from the nonBayesian policies mentioned above, Bayesian strategies have also been proved to be effective for MAB problems. The Bayes-UCB algorithm by Kaufmann et al. [77] is the first Bayesian algorithm proved to be asymptotic optimal. Thompson sampling [1], 27 proposed in 1933, has long been shown to perform very well in practice. Very recently, asymp- totic and finite-time performance guarantees that are very close to the optimal have been proved for Thompson sampling [78, 79]. 
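As a companion to the numerical study in Section 2.3.3, the following minimal sketch checks the condition ηψ(2η/a) ≥ 2a for the choice a = 1.1 and η = 2.2, and draws rewards from the signed generalized Pareto model used there. The helper names are mine, and the inverse-CDF sampler is a standard construction rather than code from the original experiments.

```python
import math
import numpy as np

def psi(x):
    # psi(x) = (1 + 1/x) * ln(1 + x) - 1, as in Lemma 2.7.
    return (1.0 + 1.0 / x) * math.log(1.0 + x) - 1.0

a, eta = 1.1, 2.2
print("eta * psi(2*eta/a) >= 2*a:", eta * psi(2 * eta / a) >= 2 * a)

def heavy_tailed_reward(mu_k, size, xi=0.33, sigma=0.32, rng=None):
    # Reward = mu_k + nu, where |nu| is generalized Pareto(xi, sigma) and the sign
    # of nu is +1 or -1 with equal probability (the Section 2.3.3 setup).
    rng = rng or np.random.default_rng()
    U = rng.random(size)
    gpd = sigma / xi * ((1.0 - U) ** (-xi) - 1.0)   # inverse-CDF sample of the GPD
    signs = rng.choice([-1.0, 1.0], size=size)
    return mu_k + signs * gpd

X = heavy_tailed_reward(0.3, 500_000, rng=np.random.default_rng(1))
print("empirical E[X^2]:", np.mean(X ** 2))          # stays below u^2 = 1
```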
Other effective policies include 𝜖-greedy [16]and deterministic sequencing of exploration and exploitation [25, 40, 41]. However, both these classes of algorithms require the knowledge of a lower bound on the minimum gap in mean rewards. In the context of minimizing the worst-case regret, Ménard and Garivier [80] adapted the MOSS algorithm with KL divergence-based uncertainty estimates and proposed a minimax algorithm kl- UCB++ that improves the algorithm performance. It also needs to be noticed that MOSS requires knowing horizon length, so it is not an anytime algorithm. Degenne and Perchet [81] extend MOSS to an any-time version called MOSS-anytime that can adapt to different horizon lengths. 28 CHAPTER 3 PIECE-WISE STATIONARY STOCHASTIC BANDITS In nonstationary stochastic MAB problem, the reward sequence {𝑋𝑡𝑘 }𝑇𝑡=1 at each arm 𝑘 ∈ {1, . . . , 𝐾 } is composed of independent samples from time-varying reward distributions { 𝑓𝑡𝑘 }𝑇𝑡=1 . Piece-wise stationary MAB is a special type of the nonstationary bandit problem in which 𝑓𝑡𝑘 switches at unknown time instants referred as breakpoints. Between consecutive breakpoints, 𝑓𝑡𝑘 remains the same for any 𝑘 ∈ {1, . . . , 𝐾 }. In this chapter, we assume each 𝑓𝑡𝑘 has bounded support [0, 1], and the total number of breakpoints until time 𝑇 is Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ), where 𝜈 ∈ [0, 1) and is known a priori. Similarly as the stationary Stochastic MAB problem, the decision-maker’s objective is to select  Í𝑇 𝜑𝑡  arms 𝜑𝑡 , . . . , 𝜑𝑇 that maximizes the expected of cumulative reward E 𝑡=1 𝑡 . Let the time- 𝑋 varying mean reward associated with arm 𝑘 be 𝜇𝑡𝑘 at time 𝑡 ∈ {1, . . . , 𝑇 }. Then, for the nonstationary MAB problem, the regret for a policy 𝜌 can be defined by 𝑇 " 𝑇 # " 𝑇 # Õ Õ Õ 𝜇𝑡∗ − E 𝜇𝑡∗ − 𝜇𝑡 𝑡 , 𝜌 𝜑 𝜑 𝑅𝑇 := 𝑋𝑡 𝑡 = E 𝜌 (3.1) 𝑡=1 𝑡=1 𝑡=1 where 𝜇𝑡∗ = max 𝑘∈{1,...,𝐾 } 𝜇𝑡𝑘 and the expectation is with respect to different realization of {𝜑𝑡 }𝑇𝑡=1 that depends on obtained rewards through policy 𝜌. This chapter is a slightly modified version of our published work on piece-wise stationary stochastic bandits, and it is reproduced here with the permission of the copyright holder1. In the following sections, two generic algorithms, namely Limited-Memory DSEE (LM-DSEE) algo- rithm and the Sliding-Window UCB# (SW-UCB#) algorithm, are presented and analyzed. These algorithms require parameters to be tuned based on environment characteristics. 3.1 Preliminaries We first recall two MAB policies since algorithms proposed in this chapter are developed based upon them. The first algorithm is Deterministic Sequencing of Exploration and Exploitation (DSEE) [40]. 1 ©2018 IEEE. Reprinted with permission from [82]. 29 It divides the set of natural numbers N into interleaving blocks of exploration and exploitation. In the exploration block all arms are played in a round-robin fashion, while in the exploitation block, the arm with the maximum statistical mean reward is played. For an appropriately defined 𝑤 ∈ R>0 , the DSEE algorithm at time 𝑡 exploits if the number of exploration steps until time 𝑡 − 1 are greater than or equal to 𝐾 d𝑤 log 𝑡e, otherwise it starts a new exploration block. Vakili et al. showed that the DSEE algorithm achieves efficient performance. It should be noted that tuning 𝑤 requires knowledge of a lower bound on the gap between the mean reward from the best arm and the second-best arm. This requirement can be relaxed at the cost of degraded performance. The second policy is Sliding-Window UCB (SW-UCB) [20]. 
It is a variation of UCB1 [16] that intends to solve the piecewise-stationary bandits. A sliding observation window is used to erase the outdated sampling history, and the UCB index is computed within it. Since the size of the sliding observation window in SW-UCB depends on the horizon length, it requires knowledge of the horizon length of the problem. The SW-UCB# proposed in this chapter intends to relax this assumption and enable the policy to adapt to different horizon lengths. 3.2 The LM-DSEE Algorithm The LM-DSEE algorithm comprises interleaving blocks of exploration and exploitation. In the 𝑛-th exploration epoch, each arm is sampled 𝐿 (𝑛) = d𝛾 ln(𝑛 𝜚 𝑙𝑏)e number of times. In the 𝑛-th exploitation epoch, the arm with the highest sample mean in the 𝑛-th exploration epoch is sampled d𝑎𝑛 𝜚 𝑙e − 𝐾 𝐿 (𝑛) times. Here, the parameters 𝜚, 𝛾, 𝑎, 𝑏, and 𝑙 are tuned based on the environment characteristics (see Algorithm 1 for details). In the following, we will set 𝑎 and 𝑏 to unity for the purposes of analysis. The LM-DSEE algorithm is similar in spirit to the DSEE algorithms [25, 40], wherein the length of the exploitation epoch increases exponentially with the epoch number, and all the data collected in the previous exploration epochs are used to estimate mean rewards. However, in a non-stationary environment using all the rewards from the previous exploration epochs may lead to a heavily biased estimate of mean rewards. Furthermore, an exponentially increasing exploitation 30 Algorithm 1: The LM-DSEE Algorithm Input : 𝜈 ∈ [0, 1),Δmin ∈ (0, 1), 𝑇 ∈ N,𝑎 ∈ R>0 , 𝑏 ∈ (0, 1]; Set : 𝛾 ≥ Δ22 , 𝑎𝑙 ∈ {𝐾 d𝛾 ln 𝑙𝑏e, . . . , +∞}, and 𝜚 = 1−𝜈 1+𝜈 ; min Output : sequence of arm selection; % Initialization: 1 Set batch index 𝑛 ← 1 and 𝑡 ← 1; 2 while 𝑡 ≤ 𝑇 do % Exploration 3 for 𝑘 ∈ {1, . . . , 𝐾 } do Pick arm 𝑘, 𝐿(𝑛) ← d𝛾 ln(𝑛 𝜚 𝑙𝑏)e times ; collect rewards {𝑋𝑖𝑘 (𝑛)}𝑖 ∈ {1,...,𝐿 (𝑛) } ; epch 1 Í 𝐿 (𝑛) compute sample mean 𝜇¯ 𝑘 (𝑛) ← 𝐿 (𝑛) 𝑖=1 𝑋𝑖𝑘 (𝑛); % Exploitation epch epch 4 Select the best arm 𝜑 𝑛 = arg max 𝑘 ∈ {1,...,𝐾 } 𝜇¯ 𝑘 (𝑛) ; epch 5 Pick arm 𝜑 𝑛 , d𝑎𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) times ; 6 Update 𝑡 ← 𝑡 + d𝑎𝑛 𝜚 𝑙e and batch index 𝑛 ← 𝑛 + 1; epoch length may lead to excessive exploitation based on an outdated estimate of the mean rewards. To address these issues, we modify the DSEE algorithm by using only the rewards from the current exploration epoch to estimate the mean rewards, and we increase the length of the exploitation epoch using a power law instead of an exponential function. 3.3 Analysis of the LM-DSEE Algorithm Before we analyze the LM-DSEE algorithm, we introduce the following notation for the piece-wise stationary environment. Let Δ 𝑘 = max{𝜇𝑡∗ − 𝜇𝑡𝑘 | 𝑡 ∈ {1, . . . , 𝑇 }}, Δmax = max{Δ 𝑘 | 𝑘 ∈ {1, . . . , 𝐾 }}, and Δmin = min{𝜇𝑡∗ − 𝜇𝑡𝑘 | 𝑡 ∈ {1, . . . , 𝑇 }, 𝑘 ∈ {1, . . . , 𝐾 }, 𝜇𝑡∗ − 𝜇𝑡𝑘 > 0}. Theorem 3.1 (Regret Upper Bound for LM-DSEE). For piece-wise stationary environment with 31 number of breakpoints Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ) and 𝜈 ∈ [0, 1), the regret for the LM-DSEE algorithm satisfies 1+𝜈 𝑅𝑇LM-DSEE ∈ 𝑂 (𝑇 2 ln 𝑇). Proof. Let 𝑁 be the index of the epoch containing the time-instant 𝑇, then the length of each epoch is at most d𝑁 𝜚 𝑙e. Since breakpoints are located in at most Υ𝑇 epochs, we can upper bound the regret from epochs containing breakpoints by 𝑅b ≤ Υ𝑇 d𝑁 𝜚 𝑙eΔmax . In the epochs containing no breakpoint, let 𝑅e and 𝑅i denote, respectively, the regret from exploration and exploitation epochs. Note that in such epochs, the mean reward from each arm does not change. 
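As a side note to Algorithm 1 above, a minimal single-run sketch of LM-DSEE in Python is given below. It assumes a reward oracle pull(k, t) for the nonstationary environment, sets a and b to unity by default as in the analysis, and adds a small guard on the logarithm for early epochs; the default value of l is illustrative and should in general satisfy the admissibility condition from the pseudo-code.

```python
import math
import numpy as np

def lm_dsee(pull, K, T, nu, delta_min, l=10, a=1.0, b=1.0):
    # Sketch of Algorithm 1. pull(k, t) returns the reward of arm k at time t.
    # gamma >= 2 / delta_min^2 and rho = (1 - nu) / (1 + nu) follow the pseudo-code;
    # l should satisfy a * l >= K * ceil(gamma * ln(l * b)).
    gamma = 2.0 / delta_min ** 2
    rho = (1.0 - nu) / (1.0 + nu)
    t, n, choices = 1, 1, []
    while t <= T:
        L_n = max(math.ceil(gamma * math.log(n ** rho * l * b)), 1)
        means = np.zeros(K)
        for k in range(K):                       # exploration: sample every arm L(n) times
            rewards = []
            for _ in range(L_n):
                if t > T:
                    return choices
                rewards.append(pull(k, t)); choices.append(k); t += 1
            means[k] = np.mean(rewards)
        best = int(np.argmax(means))             # exploitation: play the epoch's best arm
        for _ in range(math.ceil(a * n ** rho * l) - K * L_n):
            if t > T:
                return choices
            pull(best, t); choices.append(best); t += 1
        n += 1
    return choices
```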
In the 𝑛-th epoch with no breakpoint, we denote the maximum mean reward by ∗ 𝜇no-break (𝑛) and the set of arms with maximum mean reward by Kno-break ∗ (𝑛). Then, the regret in exploration epochs 𝑅e satisfies, Õ 𝑁 Õ 𝐾 Õ 𝐾 𝜚 𝜚 𝑅e ≤ d𝛾 ln(𝑛 𝑙)eΔ 𝑘 ≤ 𝑁 d𝛾 ln(𝑁 𝑙)e Δ𝑘 . 𝑛=1 𝑘=1 𝑘=1 In exploitation epochs, regret is incurred if a sub-optimal arm is selected, and consequently, the regret in exploitation epochs 𝑅i satisfies Õ𝑁 Õ 𝐾 ∗   epch 𝑅i ≤ d𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) P(𝜑𝑛 = 𝑘 ∉ Kno-break (𝑛))Δ 𝑘 , (3.2) 𝑛=1 𝑘=1 epch where 𝜑𝑛 is the arm selected in the 𝑛-th exploitation epoch. It follows from the Chernoff-Hoeffding inequality [83, Theorem 1] that epch epch  epch epch  P 𝜇¯ 𝑘 (𝑛) ≥ 𝜇 𝑘 (𝑛) + 𝛿 = P 𝜇¯ 𝑘 (𝑛) ≤ 𝜇 𝑘 (𝑛) − 𝛿 = exp(−2𝛿2 𝐿 (𝑛)), epch where 𝜇 𝑘 (𝑛) is the mean reward of arm 𝑘 in the 𝑛-th epoch and 𝐿 (𝑛) is the number of times an arm is selected in the 𝑛-th exploration epoch. Thus, we take 𝑗 ∗ ∈ Kno-break ∗ (𝑛) and get epch ∗  P 𝜑𝑛 = 𝑘 ∉ Kno-break (𝑛) epch epch Δmin  epch ∗ Δmin  ≤P 𝜇¯ 𝑘 (𝑛) ≥ 𝜇 𝑘 (𝑛) + + P 𝜇¯ 𝑗 ∗ (𝑛) ≤ 𝜇no-break (𝑘) − 2 2  Δ2min  ≤2 exp − 𝛾 ln(𝑛 𝜚 𝑙) . 2 32 epch ∗ (𝑛) ≤ 2(𝑛 𝜚 𝑙) −1 . Substituting it into (3.2), we have  Since 𝛾 ≥ Δ22 , P 𝜑𝑛 = 𝑘 ∉ Kno-break min Í 𝑅i ≤ 2𝑁 𝐾𝑘=1 Δ 𝑘 since d𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) < 𝑛 𝜚 𝑙. Furthermore, it can be seen that 𝑙 𝑙 (𝑁 − 1) 1+𝜚 − 𝑁 ≤ 𝑇 ≤ (𝑁 + 1) 1+𝜚 + 𝑁, 1+ 𝜚 1+ 𝜚 1 and consequently 𝑁 ∈ 𝑂 (𝑇 1+ 𝜚 ). Therefore, it follows that Õ𝐾 LM-DSEE 𝜚 𝜚 𝑅 (𝑇) = 𝑅b + 𝑅e + 𝑅i ≤ Υ𝑇 𝑁 𝑙Δmax + 𝑁 (d𝛾 ln(𝑁 𝑙)e + 2) Δ𝑘 . 𝑘=1 1+𝜈 Thus, the regret 𝑅 LM-DSEE (𝑇) ∈ 𝑂 (𝑇 2 ln 𝑇), and this establishes the theorem.  3.4 The SW-UCB# Algorithm The SW-UCB# algorithm is an adaptation of the SW-UCB algorithm proposed and studied in [20]. At time 𝑡, SW-UCB# maintains an estimate of the mean reward 𝜇¯ 𝑘 (𝑡, 𝛼) at each arm 𝑘, using only the rewards collected within a sliding-window of observations. Let the width of the sliding-window at time 𝑡 ∈ {1, . . . , 𝑇 } be 𝜏(𝑡, 𝛼) = min{d𝜆𝑡 𝛼 e, 𝑡}, where parameters 𝛼 ∈ (0, 1], 𝜉 ∈ (1, 2] and 𝜆 ∈ R≥0 ∪{+∞} are tuned based on environment characteristics. Let Õ𝑡 𝑛 𝑘 (𝑡, 𝛼) = 1{𝜑 𝑠 = 𝑘 } 𝑠=𝑡−𝜏(𝑡,𝛼)+1 be the number of times arm 𝑘 has been selected in the time-window at time 𝑡, then we have 𝑡 1 Õ 𝜑 𝜇¯ 𝑘 (𝑡, 𝛼) = 𝑋𝑠 𝑠 1{𝜑 𝑠 = 𝑘 }. 𝑛 𝑘 (𝑡, 𝛼) 𝑠=𝑡−𝜏(𝑡,𝛼)+1 Based on the above estimate, the SW-UCB# algorithm at each time selects the arm 𝜑𝑡 = arg max 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼), (3.3) 𝑘 ∈{1,...,𝐾 } p where 𝑐 𝑘 (𝑡, 𝛼) = 𝜉 ln(𝑡)/𝑛 𝑘 (𝑡, 𝛼). The details of the algorithm are presented in Algorithm 2. In contrast to the SW-UCB algorithm [20], the SW-UCB# algorithm employs a time-varying width of the sliding-window. The tuning of the fixed window width in [20] requires a priori knowledge of the time horizon 𝑇 which is no longer needed for the SW-UCB# algorithm. 33 Algorithm 2: The SW-UCB# Algorithm Input : 𝜈 ∈ [0, 1), Δmin ∈ (0, 1), 𝜆 ∈ R>0 & 𝑇 ∈ N; Set : 𝛼 = 1−𝜈 2 Output : sequence of arm selection; % Initialization: 1 while 𝑡 ≤ 𝑇 do 2 if 𝑡 ∈ {1, . . . , 𝑁 } then Pick arm 𝜑𝑡 = 𝑡; 3 else Pick arm 𝜑𝑡 defined in (3.3) ; 3.5 Analysis of the SW-UCB# Algorithm We analyze the performance of the SW-UCB# algorithm (Algorithm 2) to get the following result. Theorem 3.2 (Regret Upper Boudn for SW-UCB#). For the piece-wise stationary environment with number of breakpoints Υ𝑇 = 𝑂 (𝑇 𝜈 ) and 𝜈 ∈ [0, 1), the regret for the SW-UCB# algorithm satisfies 1+𝜈 𝑅𝑇SW-UCB# ∈ 𝑂 (𝑇 2 ln 𝑇). Proof. We define set T̂ such that for all 𝑡 ∈ T̂ , 𝑡 is either a breakpoint or there exists a break point in its sliding-window of observations {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡 − 1}. 
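For reference, a minimal sketch of SW-UCB# (Algorithm 2) is given below, again assuming a reward oracle pull(k, t). The window width and the choice ξ = 1 + α follow the text, and the default λ is the value used later in Section 3.6; all helper names are mine.

```python
import math
import numpy as np

def sw_ucb_sharp(pull, K, T, nu, lam=12.3):
    # Sketch of Algorithm 2. Window width tau(t, alpha) = min(ceil(lam * t^alpha), t),
    # alpha = (1 - nu) / 2, and the exploration bonus uses xi = 1 + alpha.
    alpha = (1.0 - nu) / 2.0
    xi = 1.0 + alpha
    arms, rewards = [], []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                                   # initialization: play each arm once
        else:
            tau = min(math.ceil(lam * (t - 1) ** alpha), t - 1)
            a = np.array(arms[-tau:]); r = np.array(rewards[-tau:])
            ucb = np.full(K, np.inf)                    # arms unseen in the window get priority
            for j in range(K):
                n_j = int(np.sum(a == j))
                if n_j > 0:
                    ucb[j] = r[a == j].mean() + math.sqrt(xi * math.log(t - 1) / n_j)
            k = int(np.argmax(ucb))
        arms.append(k)
        rewards.append(pull(k, t))
    return arms
```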
For 𝑡 ∈ T̂ , the statistical means are corrupted. Since the maximum sliding-window width is d𝜆(𝑇 − 1) 𝛼 e, it can be shown that |T | ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e. Then, the regret can be upper bounded as follows. Õ𝐾 𝑅𝑇SW-UCB# 𝛼 ≤ Υ𝑇 d𝜆(𝑇 − 1) eΔmax + E[ 𝑁˜ 𝑘 (𝑇)]Δ 𝑘 , (3.4) 𝑘=1 where 𝑁˜ 𝑘 (𝑇) := 𝑇𝑡=1 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ }, and K𝑡∗ is the set of arms with maximum mean Í 34 reward at 𝑡. It can be seen that Õ 𝑇 𝑁˜ 𝑘 (𝑇) ≤ 1 + 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 𝑡=𝐾+1 𝑇 (3.5) Õ + 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1)}, 𝑡=𝐾+1 where 𝐴(𝑡) = 4𝜉 ln 𝑡/Δ2min . We first bound the second term on the right side of inequality (3.5). Let 𝐺 ∈ N be such that 1 1 [𝜆(1 − 𝛼)(𝐺 − 1)] 1−𝛼 < 𝑇 ≤ [𝜆(1 − 𝛼)𝐺] 1−𝛼 . (3.6) Then, consider the following partition of time indices n  1   1  o 1 + [𝜆(1 − 𝛼)(𝑔 − 1)] 1−𝛼 , . . . , [𝜆(1 − 𝛼)𝑔] 1−𝛼 . (3.7) 𝑔∈{1,...,𝐺} In the 𝑔-th epoch in the partition, either Õ 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} = 0, 𝑡∈𝑔-th epoch or there exist at least one time instant 𝑡 that 𝜑𝑡 = 𝑘 ∉ K𝑡∗ and 𝑛 𝑗 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1). Let the last time instant satisfying these conditions in the 𝑔-th epoch be 𝑡 𝑘 (𝑔) = max 𝑡 ∈ 𝑔-th epoch | 𝜑𝑡 = 𝑘 ∉ K𝑡∗ and 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡) .  We will now show that there exists at most one time index in the 𝑔-th epoch until 𝑡 𝑘 (𝑔) − 1 that is not covered by the time-window at 𝑡 𝑘 (𝑔). Towards this end, consider the increasing convex function 1 𝑓 (𝑥) = 𝑥 1−𝛼 with 𝛼 ∈ (0, 1). It follows that 𝑓 (𝑥 2 ) − 𝑓 (𝑥1 ) ≤ 𝑓 0 (𝑥2 )(𝑥 2 − 𝑥 1 ) if 𝑥2 ≥ 𝑥 1 . Let 𝑡˜ be 𝑡˜1−𝛼 a time index in the 𝑔-th epoch, and set 𝑥 1 = 𝑔 − 1 and 𝑥 2 = 𝜆(1−𝛼) . Then, substituting 𝑥1 and 𝑥 2 in the above inequality and simplifying, we get 1  𝑡˜1−𝛼  𝑡˜ − (𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 ≤ 𝜆𝑡˜𝛼 −𝑔+1 . (3.8) 𝜆(1 − 𝛼) 𝑡˜1−𝛼 Since by definition of the 𝑔-th epoch, 𝜆(1−𝛼) ≤ 𝑔, we have 1 𝑡˜ − b(𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 c ≤ min{𝑡˜ + 1, 𝜆d𝑡˜𝛼 e + 1} = 𝜏( 𝑡˜, 𝛼) + 1. (3.9) 35 Setting 𝑡˜ = 𝑡 𝑘 (𝑔) − 1 in (3.9), we obtain  1 𝑡 𝑘 (𝑔) − 𝜏 𝑡 𝑘 (𝑔) − 1, 𝛼 ≤ 2 + b𝜆(1 − 𝛼)(𝑔 − 1) 1−𝛼 c, i.e., the first time instant in the sliding-window at 𝑡 𝑗 (𝑔) is located at or to the left of the second time instant of the 𝑔-th epoch in the partition (3.7). Therefore, it follows that 1 b𝜆(1−𝛼)𝑔 Õ 1−𝛼 c 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 1 𝑡=1+b𝜆(1−𝛼) (𝑔−1) 1−𝛼 c ≤ 𝑛 𝑘 (𝑡 − 1, 𝛼) + 2 ≤ 𝐴(𝑡 𝑘 (𝑔) − 1) + 2. Now we have Õ 𝑇 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 𝑡=𝐾+1 𝐺  Õ 4𝜉 ln 𝑇  ≤2𝐺 + 𝐴(𝑡 𝑘 (𝑔) − 1) ≤ 𝐺 2 + 2 . (3.10) 𝑔=1 Δmin Next, we upper-bound the expectation of the last term on the right-hand side of inequality (3.5). Taking 𝑗𝑡∗ ∈ K𝑡∗ , it can be shown that 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝜏) ≥ 𝐴(𝑡 − 1)} Õ e d𝜆(𝑡−1) d𝜆(𝑡−1) Õ e 𝛼 𝛼 ≤ 1{𝑛 𝑘 (𝑡 − 1, 𝛼) = 𝑠, 𝑛 𝑗𝑡∗ (𝑡 − 1, 𝛼) = 𝑠∗ } (3.11) 𝑠∗ =1 𝑠=𝐴(𝑡−1) × 1{ 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) > 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) + 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼), 𝑡 ∉ T̂ }. When 𝑡 ∉ T̂ , for each arm 𝑘 ∈ {1, . . . , 𝐾 }, 𝜇 𝑘 (𝑠) is a constant for all 𝑠 ∈ {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡}. Note that if 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) > 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) + 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼) is true, at least one of the following inequalities holds. 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) ≥ 𝜇𝑡𝑘 + 𝑐 𝑘 (𝑡 − 1, 𝛼), (3.12) 𝑗∗ 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) ≤ 𝜇𝑡 𝑡 − 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼), (3.13) 𝜇𝑡∗ − 𝜇𝑡𝑘 < 2𝑐 𝑘 (𝑡 − 1, 𝛼). (3.14) 36 Since 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1), (3.14) does not hold. Applying Chernoff-Hoeffding inequality [83, Theorem 1] to bound the probability of events (3.12) and (3.13), we obtain P( 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) ≥ 𝜇 𝑘 (𝑡) + 𝑐 𝑘 (𝑡 − 1, 𝛼)) ≤ (𝑡 − 1) −2𝜉 , P( 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) ≤ 𝜇 𝑗𝑡∗ (𝑡) − 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼)) ≤ (𝑡 − 1) −2𝜉 , where 𝜉 = 1 + 𝛼. 
Applying both probability inequalities in conjuncture with (3.11), we get h Õ 𝑇 i E 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1)} 𝑡=𝐾+1 Õ𝑇 ≤ 2(𝑡 − 1) −2𝜉 [𝜆(𝑡 − 1) 𝛼 + 1] 2 𝑡=𝐾+1 ∞ Õ (𝜆 + 1) 2 𝜋 2 ≤ 2(𝜆 + 1) 2 𝑡 −2 = . (3.15) 𝑡=1 3 Therefore, it follows from (3.4), (3.5), (3.10), and (3.15) that 𝐾    SW-UCB# 𝛼 Õ 4𝜉 ln 𝑇  (𝜆 + 1) 2 𝜋 2 𝑅𝑇 ≤ Υ𝑇 d𝜆(𝑇 − 1) eΔmax + 𝐺 2+ 2 +1+ Δ𝑘 . 𝑘=1 Δmin 3 1+𝜈 From (3.6), we have 𝐺 = 𝑂 (𝑇 1−𝛼 ), and this yields 𝑅𝑇SW-UCB# ∈ 𝑂 (𝑇 2 ln 𝑇).  3.6 Numerical Illustration In this section, we present simulation results for the SW-UCB# and LM-DSEE algorithms. For each simulation, we consider a 10-armed bandit in which the reward at each arm is generated using Beta distribution. The breakpoints are introduced at time instants where the next element of the sequence {b𝑡 𝜈 c}𝑡∈{1,...,𝑇 } is different from the current element. At each breakpoint, the mean rewards at each arm were randomly selected from the set {0.05, 0.12, 0.19, 0.26, 0.33, 0.39, 0.46, 0.53, 0.6, 0.9}. We select the parameters (𝑎, 𝑏) equal to (1, 0.25) for LM-DSEE in Algorithm 1. For SW-UCB# in Algorithm 2, we select 𝜆 = 12.3. The parameters 𝜈 that describe characteristics of nonstationarity are varied to evaluate the performance of algorithms. Figure. 3.1 shows that both SW-UCB# and LM-DSEE are effective in the piece-wise stationary environment. 37 Figure 3.1: Comparison of LM-DSEE and SW-UCB#. It can be seen in Figs. 3.1 that for both algorithms, as expected, the ratio of the empirical regret to the order of the regret established in Sections 3.3 and 3.5 is upper bounded by a constant. The regret for the SW-UCB# is relatively smoother than the regret for the LM-DSEE algorithm. The saw-tooth behavior of the regret for LM-DSEE is attributed to the fixed exploration-exploitation structure, wherein the regret is certainly incurred during the exploration epochs. 3.7 Summary We studied the stochastic MAB problem in the piece-wise stationary environment and designed two novel algorithms, the LM-DSEE and the SW-UCB# for these problems. We analyzed these algorithms to show that these algorithms incur sublinear regret, i.e., the time average of the regret asymptotically converges to zero. The theoretical results are verified with numerical illustrations. While both the algorithms incur the same order of regret, compared with LM-DSEE, SW- UCB# has a better leading constant. This illustrates the cost of constraining the algorithm to have a deterministic structure. On the other hand, this deterministic structure can be very useful, for example, in the context of planning trajectories for a mobile robot performing surveillance or search using a MAB framework. Though both algorithms can balance the explore-exploit tradeoff, they are reactive in the sense that they select only one arm at a time, i.e., they only provide information about the next location to be visited by the robot. Certain motion constraints on the robots such as non- holonomicity may make such movements energetically demanding. Therefore, the deterministic and predictable structure of LM-DSEE can be leveraged to design a tour for the robot which can 38 be efficiently traversed even under motion constraints. There are several possible extensions of this work. In the next chapter, we’ll study the multiple decision-maker version of the problem in this chapter. Besides, extensions of the methodology developed in this paper to other classes on MAB problems such as the Markovian MAB problem [84] and the restless bandits [85] are also of interest. 
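To connect the simulation setup of Section 3.6 with the regret definition in (3.1), the following minimal sketch builds the piecewise-stationary environment (breakpoints where ⌊t^ν⌋ changes, with means redrawn from a fixed set) and evaluates the corresponding dynamic regret. The Beta parameterization used to draw rewards is an illustrative choice, since the text only states that rewards are Beta distributed.

```python
import numpy as np

def piecewise_means(T, K, nu, mean_set, rng=None):
    # Mean-reward matrix (T x K): a breakpoint occurs whenever floor(t^nu) changes,
    # and at every breakpoint the mean of each arm is redrawn from mean_set.
    rng = rng or np.random.default_rng()
    mu = np.zeros((T, K))
    current = rng.choice(mean_set, size=K)
    for t in range(1, T + 1):
        if t > 1 and int(t ** nu) != int((t - 1) ** nu):
            current = rng.choice(mean_set, size=K)
        mu[t - 1] = current
    return mu

def beta_reward(mean, rng, c=5.0):
    # Bounded reward in [0, 1] with the requested mean; the concentration c is an
    # assumed choice (the text only specifies a Beta distribution).
    return rng.beta(c * mean, c * (1.0 - mean))

def dynamic_regret(mu, choices):
    # Empirical counterpart of (3.1): sum over t of (max_k mu[t, k] - mu[t, choice_t]).
    mu = np.asarray(mu)
    picked = mu[np.arange(len(choices)), choices]
    return float(np.sum(mu.max(axis=1) - picked))
```

Together with the policy sketches given earlier, a call such as dynamic_regret(mu, sw_ucb_sharp(lambda k, t: beta_reward(mu[t - 1, k], rng), K, T, nu)) produces regret trajectories of the kind plotted in Figure 3.1.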
3.8 Bibliographic Remarks

In a non-stationary environment, achieving logarithmic expected cumulative regret may not be feasible, and the focus is on the design of algorithms that achieve sub-linear regret. Indeed, the lower bound for the piece-wise stationary stochastic bandit has been shown to be Ω(√(𝐾Υ_𝑇𝑇)) in [20]. Thus, both of the algorithms proposed in this chapter are near-optimal.

The approaches to handling nonstationary environments can be classified into active approaches and passive approaches. The former actively detect the breakpoints and accordingly remove the old sampling results, while the latter follow a predetermined rule and disregard information about breakpoints.

One of the passive approaches is proposed by Kocsis and Szepesvári [86], which uses a discounting factor to compute the UCB index. In subsequent work, Garivier and Moulines [20] provide a formal analysis of Discounted UCB (D-UCB) and propose SW-UCB, which is also a passive approach. They point out that if the number of change points Υ_𝑇 is available, both algorithms can be tuned to achieve a regret close to the Ω(√(𝐾Υ_𝑇𝑇)) regret lower bound.

The active approaches handle the change of reward distributions in an adaptive manner. Hartland et al. [87] actively detect the change point with the Page-Hinkley test and design two restarting strategies to prevent false alarms, namely 𝛾-Restart and Meta-Bandit. The 𝛾-Restart strategy discounts the sampling history when a breakpoint is detected, while the Meta-Bandit strategy models preserving or discarding old information as a new 2-armed bandit. Other change point detection techniques, such as the cumulative sum (CUSUM) and Generalized Likelihood Ratio (GLR) tests, are used in subsequent work to design CUSUM-UCB [88], GLR-klUCB [89], and M-UCB [90]. Change point detection has also been implemented together with non-UCB methods to design EXP3.R [91] and Change-Point Thompson sampling [92]. These policies either need the knowledge of Υ_𝑇 to tune their parameters or have regret upper bounds with a strong dependence on Υ_𝑇. Very recently, two parameter-free policies, AdSwitch [93] and Ada-ILTCB+ [94] for contextual MAB, have been proved to achieve Õ(√(𝐾Υ_𝑇𝑇)) regret.

CHAPTER 4
MULTI-PLAYER PIECEWISE STATIONARY STOCHASTIC BANDITS

In a variety of applications, including robotic swarming, opportunistic spectrum access, and the Internet of Things, achieving coordinated behavior of multiple decision-makers in unknown, uncertain, and non-stationary environments without any explicit communication among them is of immense interest. This multi-player decision-making in the face of uncertainty is embodied by the multi-player MAB problem, in which several decision-makers simultaneously play the bandit game in a decentralized fashion. For such a problem, we follow a common convention and assume a collision model: the reward from an arm is eliminated or shared when it is selected by multiple agents. The work in this chapter is slightly modified from our published paper on multi-player piecewise stationary stochastic bandits, and it is reproduced here with the permission of the copyright holder1.

To formally formulate the problem, we consider a multi-player MAB problem with 𝐾 arms and 𝑀 ∈ {1, . . . , 𝐾} players. Similarly to the single-player case in the last chapter, at each time 𝑡, there is a random reward 𝑋_𝑡^𝑘 ∈ [0, 1] associated with each arm 𝑘 ∈ {1, . . . , 𝐾}, and every agent 𝑗 ∈ {1, . . . , 𝑀} picks a particular arm 𝜑_𝑡(𝑗) ∈ {1, . . . , 𝐾} and observes 𝑋_𝑡^{𝜑_𝑡(𝑗)}.
We assume no communication between agents, so that 𝜑𝑡 ( 𝑗) is selected based only on agent 𝑗’s own  𝜑 ( 𝑗) 𝑡−1 observation and decision-making history 𝑋𝑠 𝑠 , 𝜑 𝑠 ( 𝑗) 𝑠=1 . 𝜑𝑡 ( 𝑗) With collision model M that eliminate rewards, agent 𝑗 receives the reward 𝑋𝑡 from arm 𝜑𝑡 ( 𝑗) if it is the only player to select arm 𝜑𝑡 ( 𝑗) at time 𝑡. Then, the group reward till time 𝑇 is Õ 𝑇 Õ 𝐾 𝑆𝑇 = 𝑋𝑡𝑘 O𝑡𝑘 , 𝑡=1 𝑘=1 where O𝑡𝑘 = 1 if arm 𝑘 is selected by only one player at time 𝑡 and is zero otherwise. If the collision model allows the reward to be shared, O𝑡𝑘 = 1 if arm 𝑘 is selected by a player. Since the algorithm design and analysis are similar in both cases, the discussion will only be made based on the collision model M in this chapter. 1 ©2018 IEEE. Reprinted with permission from [95]. 41 We assume the minimum difference in mean rewards between any pair of arms at any time is lower bounded by Δmin > 0. Let 𝜎𝑡 be a permutation of {1, . . . , 𝐾 } at time 𝑡 such that the mean rewards satisfy 𝜇𝑡𝜎𝑡 (1) > . . . > 𝜇𝑡𝜎𝑡 (𝐾) . Then, the group regret for a policy 𝜌 till time 𝑇 is defined by Õ𝑇 Õ 𝑀 Õ 𝑇 Õ 𝑀 Õ𝑇 Õ 𝐾  𝜇𝑡𝜎𝑡 (𝑘) 𝜇𝑡𝜎𝑡 (𝑘) 𝜌   𝑅𝑇 (M) = −E 𝜌 𝑆𝑇 = −E 𝜌 𝜇𝑡𝑘 O𝑡𝑘 , 𝑡=1 𝑘=1 𝑡=1 𝑘=1 𝑡=1 𝑘=1 where the second expectation is computed over different realizations of O𝑡𝑘 under policy 𝜌. Our 𝜌 main purpose here is to design a multi-player policy 𝜌 that minimizes 𝑅𝑇 (M). Like the last chapter, we study the above MAB problem in a piecewise stationary environment with the number of breakpoints until time 𝑇 to be Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ), where 𝜈 ∈ [0, 1) is known a priori. 4.1 The RR-SW-UCB# Algorithm The Round Robin SW-UCB# (RR-SW-UCB#) algorithm is designed based upon SW-UCB# pre- sented in the last chapter. In the RR-SW-UCB# algorithm, at each time 𝑡, every agent 𝑗 maintains 𝑗 an estimate of the mean reward 𝜇¯ 𝑘 (𝑡, 𝛼) at each arm 𝑘, using only the rewards collected within a sliding-window of width 𝜏(𝑡, 𝛼) = min{d𝜆𝑡 𝛼 e, 𝑡}, where parameter 𝛼 ∈ (0, 1]. The number of times arm 𝑘 has been selected within the time-window at time 𝑡 is Õ 𝑡 𝑗 𝑛 𝑘 (𝑡, 𝛼) = 1{𝜑 𝑠 ( 𝑗) = 𝑘 }. 𝑠=𝑡−𝜏(𝑡,𝛼)+1 𝑗 Then, 𝜇¯ 𝑘 (𝑡, 𝛼) can be computed by 𝑡 𝑗 1 Õ 𝜑 ( 𝑗) 𝜇¯ 𝑘 (𝑡, 𝛼) = 𝑋𝑠 𝑠 1{𝜑 𝑠 ( 𝑗) = 𝑘 }. 𝑛 𝑘 (𝑡, 𝛼) 𝑠=𝑡−𝜏(𝑡,𝛼)+1 Using its own observations, each agent 𝑗 computes upper confidence bounds on the mean rewards 𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼), ∀𝑘 ∈ {1, . . . , 𝐾 }, q 𝑗 𝑗 where 𝑐 𝑘 (𝑡 − 1, 𝛼) = (1 + 𝛼) ln 𝑡/𝑛 𝑘 (𝑡 − 1, 𝛼). For initial 𝐾 iterations, i.e., 𝑡 ∈ {1, . . . , 𝐾 }, the player 𝑗 selects each arm once. Then, at time instants {𝐾 + 𝜂𝑀 + 1}𝜂∈Z ≥0 , it computes the set Ω 𝑗 42 Algorithm 3: The RR-SW-UCB# Algorithm Input : 𝜈 ∈ [0, 1), Δmin ∈ (0, 1), 𝜆 ∈ R>0 , 𝑇 ∈ N and player number 𝑗; Set : 𝛼 = 1−𝜈 2 ; Output : sequence of arm selections for each player 𝑗; % Initialization: 1 Set Ω 𝑗 ← ∅, ordered set G 𝑗 ← (), and 𝑡 ← 1; 2 while 𝑡 ≤ 𝑇 do % round-robin selection of each arm starting at arm 𝑗 3 if 𝑡 ∈ {1, . . . , 𝐾 } then Pick arm 𝜑𝑡 ( 𝑗) = mod(𝑡 + 𝑗 − 2, 𝐾) + 1; 4 else Compute Ω 𝑗 containing 𝑀 arms with 𝑀 largest values in  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } ; Ascending sort the arm indices in Ω 𝑗 , G 𝑗 ← sort↑ (Ω 𝑗 ); % round-robin selection of arms in G 𝑗 starting at G 𝑗 ( 𝑗) for round ∈ {1, . . . , 𝑀 } do Pick arm 𝜑𝑡 ( 𝑗) = G 𝑗 (mod(𝑡 − 𝐾 + 𝑗 − 2, 𝑀) + 1); 𝑡 ← 𝑡 + 1; containing 𝑀 arms with 𝑀 largest values in the set  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } . 
Let G 𝑗 be the ordered set that contains arms in Ω 𝑗 sorted in ascending value of their indices (not using the upper confidence bounds), and let G 𝑗 (𝑖) denote the 𝑖-th element in G 𝑗 . The player 𝑗 selects 𝑗 arms in G 𝑗 in a round-robin fashion starting with the arm G𝑡 ( 𝑗). It will be shown in the following section that the estimated set of 𝑀 best arms, denoted by Ω 𝑗 , will be the same for each player with high probability. Details of RR-SW-UCB# are shown in Algorithm 3. The free parameter 𝜆 in the algorithm can be used to refine the finite-time performance of the algorithm. 4.2 Analysis of the RR-SW-UCB# Algorithm Before the analysis, we introduce the following notation. Let Ω∗𝑀 (𝑡) denote the set of 𝑀 arms with the 𝑀 largest mean rewards at time 𝑡. Then, the total number of times Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡) until time 𝑇 43 can be defined as Õ 𝑇 N 𝑗 (𝑇) := 1{Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡)}. 𝑡=1 We now upper bound N 𝑗 (𝑇) in the following lemma. Lemma 4.1. For the RR-SW-UCB# algorithm and the multi-player MAB problem with 𝐾 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), the total number of times Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡) until time 𝑇 for any player 𝑘 satisfies h 𝑇 1−𝛼  4𝑀 (1 + 𝛼) ln 𝑇  𝜋 2  𝜆 + 𝑀 + 1  2 i N 𝑗 (𝑇) ≤(𝐾 − 𝑀) +1 1+ + 𝜆(1 − 𝛼) Δ2min 3 𝑀  + Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e + 𝑀 − 1 + 𝐾. Proof. We begin by separately analyzing windows with and without breakpoints. For the ease of 𝑗 𝑗 𝑗 notation, in the following, superscript 𝑗 is omitted in 𝜇¯ 𝑘 (𝑡, 𝛼), 𝑛 𝑘 (𝑡, 𝛼) and 𝑐 𝑘 (𝑡, 𝛼). Step 1: Let set T̂ such that for all 𝑡 ∈ T̂ , 𝑡 is either a breakpoint or there exists a break point in its sliding-window of observations {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡 − 1}. For 𝑡 ∈ T̂ , the statistical means are biased. It follows that | T̂ | ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e. Consequently, N 𝑗 (𝑇) can be upper-bounded as   N 𝑗 (𝑇) ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e + 𝑀 − 1 + Ñ 𝑗 (𝑇), (4.1) Í𝑇 where Ñ 𝑗 (𝑇) := 𝑡=1 1{Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ }. The term 𝑀 − 1 in (4.1) is due to the fact that Ω 𝑗 is computed every 𝑀 steps. In the following steps, we will bound Ñ 𝑗 (𝑇). Step 2: If Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡), there exists at least one arm 𝑖 such that 𝑖 ∈ Ω 𝑗 (𝑡) and 𝑖 ∉ Ω∗𝑀 (𝑡). Then, it follows that Õ𝑇 Õ 𝐾 Ñ 𝑗 (𝑇) ≤𝐾 + 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝐾+1 𝑖=1 𝑇 𝐾 (4.2) Õ Õ + 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)}, 𝑡=𝐾+1 𝑖=1 44 where we choose 𝑙 (𝑡, 𝛼) = 4(1 + 𝛼) ln 𝑡/Δ2min . We begin with bounding the second term on the right-hand side of inequality (4.2). First, we partition time instants into 𝐺 epochs. Let 𝐺 ∈ N be such that 1 1 [𝜆(1 − 𝛼)(𝐺 − 1)] 1−𝛼 < 𝑇 ≤ [𝜆(1 − 𝛼)𝐺] 1−𝛼 . (4.3) Then, we have the following epochs {1 + 𝜙(𝑔 − 1), . . . , 𝜙(𝑔)}𝑔∈{1,...,𝐺} , (4.4)  1  where 𝜙(𝑔) = [𝜆(1 − 𝛼)𝑔] 1−𝛼 . Let 𝑡˜ be any time instant other than the first instant in the 𝑔-th epoch. We will now show that all but one of the time instants in the 𝑔-th epoch until 𝑡˜ must be contained in the time-window at 𝑡˜. Towards this end, consider the increasing convex function 1 𝑓 (𝑥) = 𝑥 1−𝛼 with 𝛼 ∈ (0, 1). It follows that 𝑓 (𝑥 2 ) − 𝑓 (𝑥 1 ) ≤ 𝑓 0 (𝑥 2 )(𝑥 2 − 𝑥 1 ) if 𝑥2 ≥ 𝑥 1 . Then, 𝑡˜1−𝛼 substituting 𝑥 1 = 𝑔 − 1 and 𝑥 2 = 𝜆(1−𝛼) in the above inequality and simplifying, we get 1  𝑡˜1−𝛼  𝑡˜ − (𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 ≤ 𝜆𝑡˜𝛼 −𝑔+1 . 𝜆(1 − 𝛼) 𝑡˜1−𝛼 Since by definition of the 𝑔-th epoch, 𝜆(1−𝛼) ≤ 𝑔, we have 1 𝑡˜ − b(𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 c ≤ min{𝑡˜ + 1, 𝜆d𝑡˜𝛼 e + 1} = 𝜏( 𝑡˜, 𝛼) + 1. The only time instant in the 𝑔-th epoch that is possibly not contained in the time window at 𝑡˜ is 1 + 𝜙(𝑔 − 1). 
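For reference, a minimal sketch of a single RR-SW-UCB# player (Algorithm 3) is given below, assuming a reward oracle pull(k, t) for that player's own observations. The index bookkeeping follows the pseudo-code with zero-based arm indices; collisions and the other players are outside the scope of this snippet, and the default λ is the value used in the numerical study of Section 4.5.

```python
import math
import numpy as np

def rr_sw_ucb_sharp(pull, K, M, j, T, nu, lam=12.3):
    # Sketch of Algorithm 3 for player j (1-based player index). pull(k, t) returns
    # player j's reward observation for arm k (0-based) at time t.
    alpha = (1.0 - nu) / 2.0
    arms, rewards = [], []
    t = 1
    while t <= T:
        if t <= K:                                    # round-robin initialization
            k = (t + j - 2) % K
            arms.append(k); rewards.append(pull(k, t)); t += 1
            continue
        tau = min(math.ceil(lam * (t - 1) ** alpha), t - 1)
        a = np.array(arms[-tau:]); r = np.array(rewards[-tau:])
        ucb = np.full(K, np.inf)
        for i in range(K):
            n_i = int(np.sum(a == i))
            if n_i > 0:
                ucb[i] = r[a == i].mean() + math.sqrt((1 + alpha) * math.log(t) / n_i)
        G = np.sort(np.argsort(ucb)[-M:])             # M largest UCBs, sorted by arm index
        for _ in range(M):                            # round-robin over G, offset by j
            if t > T:
                break
            k = int(G[(t - K + j - 2) % M])
            arms.append(k); rewards.append(pull(k, t)); t += 1
    return arms
```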
Then for any arm 𝑖 ∈ {1, . . . , 𝐾 }, Õ 𝑡˜ 1{𝑖 ∈ Ω 𝑗 (𝑡)} ≤ 𝑀𝑛𝑖 ( 𝑡˜, 𝛼). (4.5) 2+𝜙(𝑔−1) Furthermore, in the 𝑔-th epoch in the partition, either Õ Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} = 0, 𝑡∈𝑔-th epoch 𝑖=1 or there exist at least one time-instant 𝑡 in the 𝑔-th epoch such that Õ𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} > 0. 𝑖=1 45 Let the last time instant satisfying this condition in the 𝑔-th epoch be  Õ𝐾  𝑡 (𝑔) = max 𝑡 ∈ 𝑔-th epoch 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} > 0 . 𝑖=1 Note that 𝑡 (𝑔) ∉ T̂ indicates, for each 𝑖 ∈ {1, . . . , 𝐾 }, 𝜇𝑖 (𝑠) is a constant for all 𝑠 ∈ {𝑡 (𝑔) − 𝜏(𝑡 (𝑔) − 1, 𝛼), . . . , 𝑡 (𝑔)}. Then, it follows from (4.5) that Õ Õ𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡∈𝑔-th epoch 𝑖=1 Õ𝑡 (𝑔) Õ𝐾 ≤𝐾−𝑀+ 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝜙(𝑔−1)+2 𝑖=1 Õ ≤𝐾−𝑀+ 𝑀𝑙 (𝑡𝑖 (𝑔), 𝛼) 𝑖∉Ω∗𝑀 (𝑡 (𝑔))  4𝑀 (1 + 𝛼) ln 𝑇  ≤ (𝐾 − 𝑀) 1 + , (4.6) Δ2min where 𝑡𝑖 (𝑔) = max{𝑡 ∈ 𝑔-th epoch | 𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} and 𝑡𝑖 (𝑔) ≤ 𝑡 (𝑔) for all 𝑖 ∈ {1, . . . , 𝐾 }. Therefore, from (4.4) and (4.6), we have Õ𝑇 Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗ (𝑡), 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝑁+1 𝑖=1  4𝑀 (1 + 𝛼) ln 𝑇  ≤𝐺 (𝐾 − 𝑀) 1 + . (4.7) Δ2min Step 3: In this step, we bound the expectation of the last term in (4.2). It can be shown that Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)} 𝑖=1 Õ Õ Õℎ(𝑡) Õ ℎ(𝑡) ≤ 1{𝑛 𝜁 (𝑡 − 1, 𝛼) = 𝑠 𝜁 , 𝑛𝑖 (𝑡 − 1, 𝛼) = 𝑠𝑖 , 𝑡 ∉ T̂ } (4.8) 𝑖∉Ω∗𝑀 (𝑡) 𝜁 ∈Ω∗𝑀 (𝑡) 𝑠 𝜁 =1 𝑠𝑖 =𝑙 (𝑡,𝛼) × 1{ 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) + 𝑐 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇¯𝑖 (𝑡 − 1, 𝛼) + 𝑐𝑖 (𝑡 − 1, 𝛼), 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)},   where ℎ(𝑡) := d𝜆(𝑡 − 1) 𝛼 e/𝑀 is the maximum number of times an arm can be selected within the time window at 𝑡 − 1. Note that 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) + 𝑐 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇¯𝑖 (𝑡 − 1, 𝛼) + 𝑐𝑖 (𝑡 − 1, 𝛼) means 46 at least one of the following holds. 𝜇¯𝑖 (𝑡 − 1, 𝛼) ≥ 𝜇𝑖 (𝑡) + 𝑐𝑖 (𝑡 − 1, 𝛼), (4.9) 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇 𝜁 (𝑡) − 𝑐 𝜁 (𝑡 − 1, 𝛼), (4.10) 𝜇 𝜁 (𝑡) − 𝜇𝑖 (𝑡) < 2𝑐𝑖 (𝑡 − 1, 𝛼). (4.11) Since 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼), (4.11) does not hold. Applying Chernoff-Hoeffding inequality [83, Theorem 1] to bound the probability of events (4.9) and (4.10), we obtain P( 𝜇¯𝑖 (𝑡 − 1, 𝛼) ≥ 𝜇𝑖 (𝑡) + 𝑐𝑖 (𝑡 − 1, 𝛼)) ≤ 𝑡 −2(1+𝛼) , (4.12) P( 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇 𝜁 (𝑡) − 𝑐 𝜁 (𝑡 − 1, 𝛼)) ≤ 𝑡 −2(1+𝛼) . (4.13) Since Ω 𝑗 is only computed at time instants {𝐾 + 𝜂𝑀 + 1}𝜂∈Z ≥0 , it follows from (4.8), (4.12) and (4.13) that h Õ𝑇 Õ 𝐾 i E 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)} 𝑡=𝐾+1 𝑖=1 𝑇 −𝑁 ℎ(Õ 𝑓 (𝜂)) ℎ(Õ𝑓 (𝜂)) d Õ 𝑀 e ≤ (𝐾 − 𝑀)𝑀 2𝑀 𝑓 (𝜂) −2(1+𝛼) 𝑠 𝜁 =1 𝑠𝑖 =𝑙 (𝑡,𝛼) 𝜂=0 −𝑁 d 𝑇Õ 𝑀 e ≤ (𝐾 − 𝑀)𝑀 2 2 𝑓 (𝜂) −2(1+𝛼) ℎ( 𝑓 (𝜂)) 2 𝜂=0  𝜆 + 𝑀 + 1 2 Õ ∞ ≤ (𝐾 − 𝑀) 2𝜂−2 𝑀 𝜂=1 𝜋2  𝜆 + 𝑀 + 1 2 = (𝐾 − 𝑀) , (4.14) 3 𝑀 where 𝑓 (𝜂) := 𝐾 + 𝜂𝑀 + 1. Therefore, it follows from (4.1), (4.2), (4.7), and (4.14) that h  4𝑀 (1 + 𝛼) ln 𝑇  𝜋 2  𝜆 + 𝑀 + 1  2 i 𝛼  N 𝑗 (𝑇) ≤ 𝐾 + (𝐾 − 𝑀) 𝐺 1 + + + Υ𝑇 d𝜆(𝑇 − 1) e + 𝑀 − 1 . Δ2min 3 𝑀 From (4.3), we have 𝐺 ≤ 𝑇 1−𝛼 /(𝜆 − 𝜆𝛼) + 1, and this yields the desired result.  Based on Lemma 4.1, we now establish the order of expected cumulative group regret of RR-SW-UCB# in the abruptly changing environment. 47 Theorem 4.2. For the RR-SW-UCB# algorithm and the multi-player MAB problem with 𝑁 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), under collision model M, the expected cumulative group regret satisfies 1+𝜈  𝑅𝑇RR-SW-UCB# (M) ∈ 𝑂 𝑇 2 ln 𝑇 . Proof. 
If all player identify Ω∗𝑀 (𝑡) correctly at time 𝑡, no expected regret is accrued. It follows 1+𝜈 from Lemma 4.1 that N 𝑗 (𝑇) ∈ 𝑂 (𝑇 2 ln 𝑇) for all 𝑗 ∈ {1, . . . , 𝑀 }. The total number of times Í that any player misidentifies Ω∗𝑀 (𝑡) until time 𝑇 can be upper bounded by 𝑀 𝑗=1 N 𝑗 (𝑇). Thus, we conclude the proof.  4.3 The SW-DLP Algorithm Distributed Learning with Prioritization (DLP) [23] is designed for the multi-player stochastic MAB problem in a stationary environment. The idea of DLP is to assign player 𝑗 to collect rewards from 𝑗-th best arm for most of circumstances. In a piecewise stationary environment, we extend the DLP algorithm to design SW-DLP using a sliding observation window. The upper confidence bounds on the mean rewards in SW-DLP are computed the same as SW-UCB#. SW-DLP employes an identical allocation rule as DLP, i.e., at each time instant 𝑡, player 𝑗 computes a set 𝐴 𝑗 (𝑡) containing 𝑗 arms with 𝑗 largest values in the set  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } , and selects arm 𝑗 𝑗 𝜑𝑡 ( 𝑗) = arg min { 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼)}. 𝑘∈𝐴 𝑗 (𝑡) Details of the SW-DLP is shown in Algorithm 4. The parameters in the SW-DLP algorithm are the 𝑗 𝑗 same as in the RR-SW-UCB# algorithm. In the following, we will refer to 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼) as the lower confidence bound on the estimate reward from arm 𝑘. 4.4 Analysis of the SW-DLP Algorithm We analyze the performance of the SW-DLP algorithm (Algorithm 4) to get the following result. 48 Algorithm 4: The SW-DLP Algorithm Input, output, and parameters are the same as RR-SW-UCB# 1 while 𝑡 ≤ 𝑇 do 2 if 𝑡 ∈ {1, . . . , 𝐾 } then Pick arm 𝜑𝑡 ( 𝑗) = mod(𝑡 + 𝑗 − 2, 𝑁) + 1; 3 else Compute 𝐴 𝑗 (𝑡) containing 𝑗 arms with 𝑗 largest values in  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } ; Pick arm 𝑗 𝑗 𝜑𝑡 ( 𝑗) = arg min { 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼)}; 𝑘 ∈𝐴𝑗 Theorem 4.3. For the SW-DLP algorithm and the multi-player MAB problem with 𝐾 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), under collision model M, the expected cumulative group regret satisfies 1+𝜈  𝑅𝑇SW-DLP (M) ∈ 𝑂 𝑇 2 ln 𝑇 . Proof. The proof is similar to the proof of Theorem 4.2 and we only present a sketch. Let 𝜃 𝑡 ( 𝑗) be the 𝑗-th best arm at time 𝑡. The total number of time instants that 𝜃 𝑡 ( 𝑗) is not selected by player 𝑗 with SW-DLP satisfies Õ𝑇 𝑇  𝛼  Õ N̂ 𝑗 = 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗)} ≤ Υ𝑇 𝜆(𝑇 − 1) + 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝑡 ∉ T̂ }. (4.15) 𝑡=1 𝑡=1 We partition the time horizon as in (4.4). Then, similarly to (4.5) in the proof of Lemma 4.1, it can be shown that Õ 𝑡˜ 𝑗 1{𝜑𝑡 ( 𝑗) = 𝑖} ≤ 𝑛𝑖 ( 𝑡˜, 𝛼), (4.16) 𝑡=2+𝜙(𝑔−1) for any arm 𝑖 ∈ {1, . . . , 𝐾 } and 𝑡˜ ∈ 𝑔-th epoch. We study the event that player 𝑗 does not select arm 𝜃 𝑡 ( 𝑗) at time 𝑡 under two scenarios: (i) 𝑗 𝐴 𝑗 (𝑡) ≠ Ω∗ (𝑡), and (ii) 𝐴 𝑗 (𝑡) = Ω∗𝑘 (𝑡), where Ω∗𝑘 (𝑡) is the set with 𝑘 best arms at time 𝑡. Then, we 49 have Õ𝑇 Õ𝑇 𝑗 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝑡 ∉ T̂ } ≤ 1{𝐴 𝑗 (𝑡) ≠ Ω∗ (𝑡), 𝑡 ∉ T̂ } 𝑡=1 𝑡=1 Õ𝑇 + 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝐴 𝑗 (𝑡) = Ω∗𝑘 (𝑡), 𝑡 ∉ T̂ }. (4.17) 𝑡=1 Note that unlike RR-SW-UCB#, in SW-DLP, after initialization, 𝐴 𝑘 (𝑡) is computed every time instead of only at time instants {𝑁 + 𝜂𝑀 + 1}𝜂∈Z ≥0 . However, this difference does not change the order of the total number of times that Ω∗𝑘 (𝑡) is misidentified. Therefore, using (4.16), it follows similarly to the proof of Lemma 4.1 that Õ 𝑇 𝑗 1+𝜈  1{𝐴 𝑗 ≠ Ω∗ (𝑡)} ∈ 𝑂 𝑇 2 ln 𝑇 . 
(4.18)

The Chernoff-Hoeffding inequality is symmetric about the estimated mean, and the upper tail bound is identical to the lower tail bound. Hence, the second term on the right-hand side of inequality (4.17), which involves selecting 𝜑_𝑡(𝑗) using lower confidence bounds, can be bounded similarly to the first term. Thus, we have

$$\sum_{t=1}^{T} \mathbb{1}\{\varphi_t(j) \neq \theta_t(j),\; A_j(t) = \Omega_j^*(t),\; t \notin \hat{\mathcal{T}}\} \in O\big(T^{\frac{1+\nu}{2}} \ln T\big). \tag{4.19}$$

Substituting (4.18) and (4.19) into (4.17), and substituting (4.17) into (4.15), we conclude that $\hat{\mathcal{N}}_j \in O\big(T^{\frac{1+\nu}{2}} \ln T\big)$.

The number of times the group does not receive a reward from arm 𝜃_𝑡(𝑗) is upper bounded by the number of times player 𝑗 does not receive a reward from arm 𝜃_𝑡(𝑗). Player 𝑗 does not receive a reward from arm 𝜃_𝑡(𝑗) if one of the following conditions is true: (i) arm 𝜃_𝑡(𝑗) is not selected by player 𝑗, or (ii) arm 𝜃_𝑡(𝑗) is selected by another player 𝑗′ ≠ 𝑗. The total number of times either of these events occurs at any arm 𝜃_𝑡(𝑗), for all 𝑗 ∈ {1, . . . , 𝑀}, can be upper bounded by $\sum_{j=1}^{M} 2\hat{\mathcal{N}}_j$. Since $\hat{\mathcal{N}}_j \in O\big(T^{\frac{1+\nu}{2}} \ln T\big)$ for all 𝑗 ∈ {1, . . . , 𝑀}, we conclude the proof.

Remark 4.1 (Comparison of RR-SW-UCB# and SW-DLP). In multi-player MAB algorithms, the assignment of a player to a targeted arm is crucial to avoid collisions. In RR-SW-UCB#, the indices of arms and the indices of players are employed for this assignment. A round-robin policy ensures that all players select the 𝑀 best arms persistently and accurately estimate the associated mean rewards. In SW-DLP, by contrast, such accurate estimation by all players is driven by the lower-confidence-bound-based assignment of players to the arms.

Figure 4.1: Simulation of RR-SW-UCB# and SW-DLP in a piecewise stationary environment, for 𝜈 ∈ {0.15, 0.30, 0.45}. (a) RR-SW-UCB#; (b) SW-DLP and RR-SW-UCB#.

4.5 Numerical Illustration

In this section, we present simulation results for RR-SW-UCB# and SW-DLP in abruptly changing environments. In the simulations, we consider a multi-player MAB problem with 6 arms and 3 players. We consider three different values {0.15, 0.3, 0.45} of the parameter 𝜈, which determines the number of breakpoints, to evaluate the performance of both algorithms. The breakpoints are introduced at time instants where the next element of the sequence {⌊𝑡^𝜈⌋}_{𝑡∈{1,...,𝑇}} is different from the current element. We pick them at these time instants to make the number of breakpoints Υ_𝑡 ∈ 𝑂(𝑡^𝜈) uniformly for all 𝑡 ∈ {1, . . . , 𝑇}. At each breakpoint, the mean reward at each arm is randomly selected from {0.05, 0.22, 0.39, 0.56, 0.73, 0.90}. In both algorithms, we select 𝜆 = 12.3.

As shown in Figure 4.1, with either algorithm, the ratio of the empirical cumulative group regret to the order 𝑡^{(1+𝜈)/2} ln 𝑡 is upper bounded by a constant. The dashed lines in Figure 4.1(b) are taken directly from (a). The comparison shows that the cumulative regret of RR-SW-UCB# is much lower than that of SW-DLP. However, if the cost of switching between arms is considered, then the round-robin structure of RR-SW-UCB# would incur significant cost, and in such a scenario SW-DLP might be preferred.

4.6 Summary

We studied the multi-player stochastic MAB problem in abruptly changing environments under a collision model in which a player receives a reward by selecting an arm only if it is the only player to select that arm. We designed two novel algorithms, RR-SW-UCB# and SW-DLP, to solve this problem.
We analyzed these algorithms and characterized their performance in terms of group regret. In particular, we showed that these algorithms incur sublinear expected cumulative regret, i.e., the time average of the regret asymptotically converges to zero. It would be of interest to extend this work to a more general nonstationary environment in which the reward distributions can change at each time step. Another avenue of future research is the extension of these algorithms to the multi-player Markovian MAB problem. 4.7 Bibliographic Remarks Most of the studies on the multi-player MAB problem deal with a stationary environment. In [22], a lower bound on the expected cumulative group regret for a centralized policy is derived and algorithms that asymptotically achieve this lower bound are designed. Some works assume no communication among players in [4, 5, 23–25], whereas other works allow agents to communicate to improve their arm selection in [26–28]. One of the major generalizations in the multi-player MAB problem is to consider player- dependent rewards, i.e., an arm has different mean rewards for different players [24]. The optimal allocation of the players to arms can be computed using approaches for a famous combinatorial optimization problem known as the assignment problem [96]. To achieve a sublinear regret in a distributed manner, a distributed solution to the assignment problem is required [97]. Assuming collision results in no reward, implicit communication can be generated through collision. In [24], the distributed MAB problem is solved using distributed auction [97] and collision-based implicit 52 communication. The idea of implicit communication is used broadly in different distributed protocols for multi-player MAB problems [98, 99]. More recently, game-theoretic techniques have been used to design fully distributed multi-player MAB algorithms without implicit communication [100]. Specifically, using the payoff dynamics introduced in [101], the authors in [100] design an algorithm that plays, for a sufficiently large portion of time, a strategy profile that optimizes the sum of player-specific mean rewards. 53 CHAPTER 5 GENERAL NONSTATIONARY BANDITS WITH VARIATION BUDGET In this chapter, we study a more general non-stationary stochastic MAB problem proposed in [21]. The reward distributions are allowed to either change abruptly like the piece-wise stationary bandits or drift slowly. The nonstationarity of the environment is characterized by the cumulative maximum variation in mean rewards, which subjects to a variation budget. In order to minimize clutter, we denote the set of arms as K := {1, . . . , 𝐾 } and the sequence of time slots as T := {1, . . . , 𝑇 }. The reward sequence {𝑋𝑡𝑘 }𝑡∈T for each arm 𝑘 ∈ K is composed of independent samples from potentially time-varying probability distribution function sequence 𝑓T𝑘 := { 𝑓𝑡𝑘 (𝑥)}𝑡∈T . We refer to the set F𝑇K = { 𝑓T𝑘 | 𝑘 ∈ K} containing reward distribution sequences at all arms as the environment. Then, the total variation of mean rewards in F𝑇K is defined by 𝑇−1 Õ F𝑇K  𝑘 𝑣 := max 𝜇𝑡+1 − 𝜇𝑡𝑘 , (5.1) 𝑘∈K 𝑡=1 which captures the non-stationarity of the environment. We focus on the class of non-stationary environments that have the total variation within a variation budget 𝑉𝑇 ≥ 0 which is defined by E (𝑉𝑇 , 𝑇, 𝐾) := F𝑇K | 𝑣 F𝑇K ≤ 𝑉𝑇 .   The objective is still to design a policy 𝜌 to minimize the regret in a nonstationary environment 𝑅𝑇 defined in (3.1). 
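Before proceeding, the variation measure in (5.1) can be made concrete with a small helper; the T-by-K array layout for the time-varying mean rewards is my convention.

```python
import numpy as np

def total_variation(mu):
    # Total variation (5.1): sum over t of max_k |mu[t+1, k] - mu[t, k]|.
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(np.max(np.abs(np.diff(mu, axis=0)), axis=1)))
```

An environment then lies in E(𝑉_𝑇, 𝑇, 𝐾) exactly when total_variation(mu) does not exceed the budget 𝑉_𝑇.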
Note that the performance of a policy 𝜌 differs with different F𝑇K ∈ E (𝑉𝑇 , 𝑇, 𝐾). 𝜌 For a fixed variation budget 𝑉𝑇 and a policy 𝜌, the worst-case regret is the regret with respect to the worst possible choice of environment, i.e., 𝜌 𝜌 𝑅worst (𝑉𝑇 , 𝑇, 𝐾) = sup 𝑅𝑇 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) In this work, we aim at designing policies to minimize the worst-case regret. The optimal worst-case regret achieved by any policy is called the minimax regret, and is defined by 𝜌 inf sup 𝑅𝑇 . 𝜌 F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 54 We study the nonstationary MAB problem under the following two classes of reward distributions: Assumption 5.1 (Sub-Gaussian reward). For any 𝑘 ∈ K and any 𝑡 ∈ T , distribution 𝑓𝑡𝑘 (𝑥) is 1/2 sub-Gaussian, i.e., ! h i 𝜆 2 ∀𝜆 ∈ R : E exp(𝜆(𝑋𝑡𝑘 − 𝜇)) ≤ exp . 8   Moreover, for any arm 𝑘 ∈ K and any time 𝑡 ∈ T , E 𝑋𝑡𝑘 ∈ [𝑎, 𝑎 + 𝑏], where 𝑎 ∈ R and 𝑏 > 0.   Assumption 5.2 (Heavy-tailed reward). For any arm 𝑘 ∈ K and any time 𝑡 ∈ T , E (𝑋𝑡𝑘 ) 2 ≤ 1. 5.1 Lower Bound on Minimax Regret in Nonstationary Environment In this section, we review existing minimax regret lower bounds and minimax policies from literature. These results apply to both sub-Gaussian and heavy-tailed rewards. When 𝑉𝑇 = 0, the minimax regret lower bound is the same as the one for stochastic stationary bandit (2.1). We show how the minimax regret lower bound for 𝑉𝑇 = 0 can be extended to establish the minimax regret lower bound for 𝑉𝑇 > 0. In the later sections, we design a variety of policies that match with the minimax regret lower bound for 𝑉𝑇 > 0. In the setting of 𝑉𝑇 > 0, we recall here the minimax regret lower bound for nonstationary stochastic MAB problems. Lemma 5.1 (Minimax Lower Bound: 𝑉𝑇 > 0 [21]). For the non-stationary MAB problem with 𝐾 arms, time horizon 𝑇 and variation budget 𝑉𝑇 ∈ [1/𝐾, 𝑇/𝐾], 𝜌 1 2 inf sup 𝑅𝑇 ≥ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 , 𝜌 F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) where 𝐶 ∈ R>0 is some constant. To understand this lower bound, consider the following non-stationary environment. The  1 2 horizon T is partitioned into epochs of length 𝜏 = 𝐾 3 (𝑇/𝑉𝑇 ) 3 . In each epoch, the reward distribution sequences are stationary and all the arms have identical mean rewards except for the p unique best arm. Let the gap in the mean be Δ = 𝐾/𝜏. The index of the best arm switches at 55 the end of each epoch following some unknown rule. So, the total variation is no greater than Δ𝑇/𝜏, which satisfies the variation budget 𝑉𝑇 . Besides, for any policy 𝜌, we know from (2.1) that √ worst-case regret in each epoch is no less than 𝐶2 𝐾𝜏. Summing up the regret over all the epochs, √ minimax regret is lower bounded by 𝑇/𝜏 × 𝐶2 𝐾𝜏, which is consistent with Lemma 5.1. 5.2 UCB Algorithms for Sub-Gaussian Nonstationary Stochastic Bandits In this section, we extend UCB1 and MOSS to design nonstationary UCB policies for scenarios with 𝑉𝑇 > 0. Three different techniques are employed, namely periodic resetting, sliding observation window and discount factor, to deal with the remembering-forgetting tradeoff. The proposed algorithms are analyzed to provide guarantees on the worst-case regret. We show their performances match closely with the lower bound in Lemma 5.1.   The following notations are used in later discussions. Let 𝑁 = 𝑇/𝜏 , for some 𝜏 ∈ {1, . . . , 𝑇 }, and let {T1 , . . . , T𝑁 } be a partition of time slots T , where each epoch T𝑖 has length 𝜏 except possibly T𝑁 . In particular, n o T𝑖 = 1 + (𝑖 − 1)𝜏 , . . . , min (𝑖𝜏, 𝑇) , 𝑖 ∈ {1, . . . , 𝑁 }. Let the maximum mean reward within T𝑖 be achieved at time 𝜏𝑖 ∈ T𝑖 and arm 𝜅𝑖 , i.e., 𝜇𝜏𝜅𝑖𝑖 = max𝑡∈T𝑖 𝜇𝑡∗ . 
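The hard environment behind the lower bound can be written out explicitly, as in the sketch below; the base mean of 1/2 and the random choice of the switching best arm are illustrative assumptions, since the text only requires that the best arm changes at epoch boundaries by an unknown rule.

```python
import math
import numpy as np

def hard_instance(T, K, V_T, rng=None):
    # Epochs of length tau = ceil(K^(1/3) * (T / V_T)^(2/3)); within an epoch all
    # arms share a base mean except one best arm, which is Delta = sqrt(K / tau)
    # better and switches at every epoch boundary.
    rng = rng or np.random.default_rng()
    tau = math.ceil(K ** (1.0 / 3.0) * (T / V_T) ** (2.0 / 3.0))
    delta = math.sqrt(K / tau)
    mu = np.full((T, K), 0.5)
    for start in range(0, T, tau):
        best = int(rng.integers(K))
        mu[start:start + tau, best] += delta
    return mu
```

Its total variation is at most Δ𝑇/𝜏, which stays within the budget 𝑉_𝑇, matching the construction described above.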
We define the variation within T𝑖 as Õ 𝑘 𝑣 𝑖 := max 𝜇𝑡+1 − 𝜇𝑡𝑘 , 𝑘∈K 𝑡∈T𝑖 where we trivially assign 𝜇𝑇+1 𝑘 = 𝜇𝑇𝑘 for all 𝑘 ∈ K. Let 1 {·} denote the indicator function and |·| denote the cardinality of the set, if its argument is a set, and the absolute value if its argument is a real number. 5.2.1 Resetting MOSS Algorithm Periodic resetting is an effective technique to preserve the freshness and authenticity of the informa- tion history. It has been employed in [21] to modify Exp3 to design Rexp3 policy for nonstationary 56 Algorithm 5: The R-MOSS Algorithm Input : 𝑉𝑇 ∈l R ≥0 and 𝑇 ∈m N 1 2 Set : 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 Output : sequence of arm selection 1 while 𝑡 ≤ 𝑇 do 2 if mod (𝑡, 𝜏) = 0 then 3 Restart the MOSS policy; stochastic MAB problems. We extend this approach to MOSS and propose a nonstationary policy Resetting MOSS (R-MOSS). In R-MOSS, after every 𝜏 time slots, the sampling history is erased and MOSS is restarted. The pseudo-code is provided in Algorithm 5 and the performance in terms of the worst-case regret is established below. Theorem 5.2. For the sub-Gaussian nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, l 1 2m variation budget 𝑉𝑇 > 0, and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , the worst case regret of R-MOSS satisfies 1 2 sup 𝑅𝑇R-MOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Sketch of the proof. Note that one run of MOSS takes place in each epoch. For epoch T𝑖 , define the set of bad arms for R-MOSS by B𝑖R := {𝑘 ∈ K | 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 ≥ 2𝑣 𝑖 }. (5.2) Notice that for any 𝑡1 , 𝑡2 ∈ T𝑖 , 𝜇𝑡𝑘1 − 𝜇𝑡𝑘2 ≤ 𝑣 𝑖 , ∀𝑘 ∈ K. (5.3) Therefore, for any 𝑡 ∈ T𝑖 , we have 𝜇𝑡∗ − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 . 𝜑 𝜑 𝜑 Then, the regret from T𝑖 can be bounded as the following, Õ  Õ  𝜇𝑡∗ 𝜑 𝜑 E − 𝜇𝑡 𝑡 ≤ |T𝑖 | 𝑣 𝑖 + E 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 ≤ 3|T𝑖 | 𝑣 𝑖 + 𝑆𝑖 , (5.4) 𝑡∈T𝑖 𝑡∈T𝑖 57 Õ Õ    𝜑 where 𝑆𝑖 = E 1 𝜑𝑡 = 𝑘 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 − 2𝑣 𝑖 . 𝑡∈T𝑖 𝑘∈B R 𝑖 Now, we have decoupled the problem, enabling us to generalize the analysis of MOSS in the stationary environment [70] to bound 𝑆𝑖 . We will only specify the generalization steps and skip the details for brevity. First notice inequality (5.3) indicates that for any 𝑘 ∈ B𝑖R and any 𝑡 ∈ T𝑖 , 𝜇𝑡𝜅𝑖 ≥ 𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 and 𝜇𝑡𝑘 ≤ 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 . So, at any 𝑡 ∈ T𝑖 , 𝜇ˆ 𝜅𝑖 ,𝑛 𝜅𝑖 (𝑡) concentrate around a value no smaller than 𝜇𝜏𝜅𝑖𝑖 −𝑣 𝑖 , and 𝜇ˆ 𝑘,𝑛 𝑘 (𝑡) concentrate around a value no greater than 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 for any 𝑘 ∈ 𝐵𝑖R . Also 𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 ≥ 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 due to the definition in (5.2). In the analysis of MOSS in stationary environment [70], the UCB of each suboptimal arm is compared with the best arm and each selection of suboptimal arm 𝑘 contribute Δ 𝑘 in regret. Here, we can apply a similar analysis by comparing the UCB of each arm 𝑘 ∈ 𝐵𝑖R with 𝜅𝑖 and each selection of arm 𝑘 ∈ 𝐵𝑖R contributes (𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 ) − (𝜇𝜏𝑘𝑖 + 𝑣 𝑖 ) in 𝑆𝑖 . Accordingly, we borrow the upper p bound in Lemma 2.3 to get 𝑆𝑖 ≤ 49 𝐾 |T𝑖 |. Substituting the upper bound on 𝑆𝑖 into (5.4) and summarizing over all the epochs, we conclude that Õ 𝑁 √ sup 𝑅𝑇R-MOSS ≤ 3𝜏𝑉𝑇 + 49 𝐾𝜏, F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 𝑖=1 which implies the theorem.  The upper bound in Theorem 5.2 is in the same order as the lower bound in Lemma 5.1. So, the worst-case regret for R-MOSS is order optimal. 5.2.2 Sliding-Window MOSS Algorithm We have shown that periodic resetting coarsely adapts the stationary policy to a nonstationary setting. However, it is inefficient to entirely remove the sampling history at the restarting points and the regret accumulates quickly close to these points. 
In [20], a sliding observation window 58 Algorithm 6: The SW-MOSS Algorithm Input : 𝑉𝑇 ∈l R>0 , 𝑇 ∈ N mand 𝜂 > 1/2 1 2 Set : 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 Output : sequence of arm selection 1 Pick each arm once. 2 while 𝑡 ≤ 𝑇 do  Compute statistics within W𝑡 = min(1, 𝑡 − 𝜏), . . . , 𝑡 − 1 : 1 Õ Õ 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) = 𝑋𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛 𝑘 (𝑡) = 1{𝜑 𝑠 = 𝑘 } 𝑛 𝑘 (𝑡) 𝑠 ∈W𝑡 𝑠 ∈W𝑡 v u t     max ln 𝐾 𝑛𝜏𝑘 (𝑡) , 0 Pick arm 𝜑𝑡 = arg max 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) + 𝜂 ; 𝑘 ∈K 𝑛 𝑘 (𝑡) is used to erase the outdated information smoothly and more efficiently utilize the information history. The authors proposed the SW-UCB algorithm that intends to solve the MAB problem with piece-wise stationary mean rewards. We show that a similar approach can also deal with the general nonstationary environment with a variation budget. In contrast to SW-UCB, we integrate the sliding window technique with MOSS instead of UCB1 and achieve the order optimal worst-case regret.  Let the sliding observation window at time 𝑡 be W𝑡 := min(1, 𝑡 − 𝜏), . . . , 𝑡 − 1 . Then, the associated mean estimator is given by 1 Õ Õ 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) = 𝑋𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛 𝑘 (𝑡) = 1{𝜑 𝑠 = 𝑘 }. 𝑛 𝑘 (𝑡) 𝑠∈W𝑡 𝑠∈W𝑡 For each arm 𝑘 ∈ K, define the UCB index for SW-MOSS by v u t     max ln 𝐾𝑛𝜏𝑘 (𝑡) , 0 𝑔𝑡𝑘 = 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) + 𝑐 𝑛 𝑘 (𝑘) , 𝑐 𝑛 𝑘 (𝑡) = 𝜂 , 𝑛 𝑘 (𝑡) where 𝜂 > 1/2 is a tunable parameter. With these notations, SW-MOSS is defined in Algorithm 6. To analyze it, we will use the following concentration bound for sub-Gaussian random variables. Fact 5.1 (Maximal Hoeffding inequality[83]). Let 𝑋1 , . . . , 𝑋𝑛 be a sequence of independent 1/2 59 sub-Gaussian random variables. Define 𝑑𝑖 := 𝑋𝑖 − 𝜇𝑖 , then for any 𝛿 > 0,  Õ𝑚    P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑑𝑖 ≥ 𝛿 ≤ exp −2𝛿2 /𝑛 , 𝑖=1  Õ 𝑚    and P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑑𝑖 ≤ −𝛿 ≤ exp −2𝛿2 /𝑛 . 𝑖=1 At time 𝑡, for each arm 𝑘 ∈ K define 1 Õ 𝑘 𝑀𝑡𝑘 := 𝜇 𝑠 1{𝜑𝑠 =𝑘 } . 𝑛 𝑘 (𝑡) 𝑠∈W𝑡 Now, we are ready to present concentration bounds for the sliding window empirical mean 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) . Lemma 5.3. For any arm 𝑘 ∈ K and any time 𝑡 ∈ T , if 𝜂 > 1/2, for any 𝑥 > 0 and 𝑙 ≥ 1, the  probability of event 𝐴 := 𝜇ˆ 𝑛𝑘 (𝑡) + 𝑐 𝑛 𝑘 (𝑡) ≤ 𝑀𝑡𝑘 − 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 is no greater than 𝑘 3 (2𝜂) 2 𝐾  2  exp −𝑥 𝑙/𝜂 . (5.5) ln(2𝜂) 𝜏𝑥 2  The probability of event 𝐵 := 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) − 𝑐 𝑛 𝑘 (𝑡) ≥ 𝑀𝑡𝑘 + 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 is also upper bounded by (5.5). Proof. For any 𝑡 ∈ T , let 𝑢𝑖𝑘𝑡 be the 𝑖-th time slot when arm 𝑘 is selected within W𝑡 and let 𝑑𝑖𝑘𝑡 = 𝑋 𝑘𝑘𝑡 − 𝜇 𝑘 𝑘𝑡 . Note that 𝑢𝑖 𝑢𝑖  𝑚  1 Õ 𝑘𝑡 P ( 𝐴) ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝑑 ≤ −𝑥 − 𝑐 𝑚 , 𝑚 𝑖=1 𝑖 p Let 𝑎 = 2𝜂 such that 𝑎 > 1. We now apply a peeling argument [76, Sec 2.2] with geometric grid 𝑎 𝑠 𝑙 < 𝑚 ≤ 𝑎 𝑠+1 𝑙 over {𝑙, . . . , 𝜏}. Since 𝑐 𝑚 is monotonically decreasing in 𝑚,  𝑚  1 Õ 𝑘𝑡 P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝑑 ≤ −𝑥 − 𝑐 𝑚 𝑚 𝑖=1 𝑖 Õ  Õ𝑚  𝑠 𝑠+1 𝑘𝑡 𝑠  ≤ P ∃𝑚 ∈ [𝑎 𝑙, 𝑎 𝑙) : 𝑑𝑖 ≤ −𝑎 𝑙 𝑥 + 𝑐 𝑎 𝑠+1 𝑙 . 𝑠≥0 𝑖=1 60 According to Fact 5.1, the above summand is no greater than Õ  Õ𝑚  𝑠+1 𝑘𝑡 𝑠  P ∃𝑚 ∈ [1, 𝑎 𝑙) : 𝑑𝑖 ≤ −𝑎 𝑙 𝑥 + 𝑐 𝑎 𝑠+1 𝑙 𝑠≥0 𝑖=1 ! Õ 𝑎 2𝑠 𝑙 2   ≤ exp −2   𝑥 2 + 𝑐2𝑎 𝑠+1 𝑙 𝑠≥0 𝑎 𝑠+1 𝑙  ! Õ 𝑠−1 2 2𝜂 𝜏 ≤ exp −2𝑎 𝑙𝑥 − 2 ln 𝑠≥0 𝑎 𝐾𝑎 𝑠+1 𝑙 Õ 𝐾𝑙𝑎 𝑠   𝑠−2 2 = exp −2𝑎 𝑙𝑥 . 𝑠≥1 𝜏 Let 𝑏 = 2𝑥 2 𝑙/𝑎 2 . It follows that Õ 𝐾𝑙𝑎 𝑠 ∫ +∞ 𝑠 𝐾𝑙  exp −𝑏𝑎 ≤ 𝑎 𝑦+1 exp − 𝑏𝑎 𝑦 𝑑𝑦 𝑠≥1 𝜏 𝜏 0 ∫ +∞ 𝐾𝑙𝑎  = exp(−𝑏𝑧)𝑑𝑧 where we set 𝑧 = 𝑎 𝑦 𝜏 ln(𝑎) 1 𝐾𝑙𝑎𝑒 −𝑏 = , 𝜏𝑏 ln(𝑎) which concludes the bound for the probability of event 𝐴. By using upper tail bound, similar result exists for event 𝐵.  We now leverage Lemma 5.3 to get an upper bound on the worst-case regret for SW-MOSS. Theorem 5.4. 
For the nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, variation budget l 1 2m 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , the worst-case regret of SW-MOSS satisfies 1 2 sup 𝑅𝑇SW-MOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Proof. The proof consists of the following five steps. Step 1: Recall that 𝑣 𝑖 is the variation within T𝑖 . Here, we trivially assign T0 = ∅ and 𝑣 0 = 0. Then, for each 𝑖 ∈ {1, . . . , 𝑁 }, let Δ𝑖𝑘 := 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 − 2𝑣 𝑖−1 − 2𝑣 𝑖 , ∀𝑘 ∈ K. Define the set of bad arms for SW-MOSS in T𝑖 as B𝑖SW := {𝑘 ∈ K | Δ𝑖𝑘 ≥ 𝜖 }, 61 p where we assign 𝜖 = 4 𝑒𝜂𝐾/𝜏. Step 2: We decouple the regret in this step. For any 𝑡 ∈ T𝑖 , since 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖 for any 𝑘 ∈ K, it satisfies that 𝜇𝑡∗ − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 𝜑 𝜑 𝜑 n o SW 𝜑 ≤ 1 𝜑𝑡 ∈ B𝑖 (Δ𝑖 𝑡 − 𝜖) + 2𝑣 𝑖−1 + 3𝑣 𝑖 + 𝜖 . Then we get the following inequalities, Õ Õ 𝑁 Õ n o 𝜇𝑡∗ 𝜑 𝜑 − 𝜇𝑡 𝑡 ≤ 1 𝜑𝑡 ∈ B𝑖SW (Δ𝑖 𝑡 − 𝜖) + 2𝑣 𝑖−1 + 3𝑣 𝑖 + 𝜖 𝑡∈T 𝑖=1 𝑡∈T𝑖 Õ𝑁 Õ n o 𝜑 ≤5𝜏𝑉𝑇 + 𝑇𝜖 + 1 𝜑𝑡 ∈ B𝑖SW (Δ𝑖 𝑡 − 𝜖). (5.6) 𝑖=1 𝑡∈T𝑖 To continue, we take a decomposition inspired by the analysis of MOSS in [70] below, Õ n o  SW 𝜑𝑡 1 𝜑𝑡 ∈ B𝑖 Δ𝑖 − 𝜖 𝑡∈T𝑖 𝜑  Õ  SW 𝜅 𝑖 𝜅𝑖 Δ𝑖 𝑡 𝜑𝑡 ≤ 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 (5.7) 𝑡∈T𝑖 4 𝜑  Õ  SW 𝜅 𝑖 𝜅𝑖 Δ𝑖 𝑡  𝜑𝑡  + 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 ≤ 𝑀𝑡 − Δ𝑖 − 𝜖 , (5.8) 𝑡∈T 4 𝑖 where summands (5.7) describes the regret when arm 𝜅𝑖 is fairly estimated and summand (5.8) quantifies the regret incurred by underestimating arm 𝜅𝑖 . 𝜑 Step 3: In this step, we bound the expectation of (5.7). Since 𝑔𝑡 𝑡 ≥ 𝑔𝑡𝜅𝑖 , 𝜑  𝜑  Δ𝑖 𝑡 𝜑𝑡 Õ Δ𝑖 𝑡 𝜑𝑡 Õ   SW 𝜅 𝑖 𝜅𝑖 SW 𝜑𝑡 𝜅𝑖 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 ≤ 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 𝑡∈T𝑖 4 𝑡∈T𝑖 4 Õ Õ  Δ𝑖𝑘 𝑘  𝑘 𝜅𝑖 = 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 . (5.9) SW 𝑡∈T 4 𝑘 ∈B𝑖 𝑖 Notice that for any 𝑡 ∈ T𝑖−1 ∪ T𝑖 , 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖−1 + 𝑣 𝑖 , ∀𝑘 ∈ K. 62 It indicates that an arm 𝑘 ∈ B𝑖SW is at least Δ𝑖𝑘 worse in mean reward than arm 𝜅𝑖 at any time slot 𝑡 ∈ T𝑖−1 ∪ T𝑖 . Since W𝑡 ⊂ T𝑖−1 ∪ T𝑖 , for any 𝑡 ∈ T𝑖 𝑀𝑡𝜅𝑖 − 𝑀𝑡𝑘 ≥ Δ𝑖𝑘 ≥ 𝜖, ∀𝑘 ∈ B𝑖SW . It follows from (5.9) that Δ𝑖𝑘 𝑘 3Δ𝑖𝑘 𝑘 Õ Õ   Õ Õ   𝑘 𝜅𝑖 𝑘 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 ≤ 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 + Δ𝑖 . (5.10) SW 𝑡∈T 4 SW 𝑡∈T 4 𝑘∈B𝑖 𝑖 𝑘∈B𝑖 𝑖 SW 𝑠 be the 𝑠-th time slot when arm 𝑘 is selected within T𝑖 . Then, for any 𝑘 ∈ B𝑖 , Let 𝑡 𝑖𝑘 Õ  3Δ𝑖𝑘  𝑘 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 + 𝑡∈T𝑖 4 Õ  3Δ𝑖𝑘  𝑘 𝑘 = 1 𝑔𝑡 𝑖𝑘 > 𝑀𝑡 𝑖𝑘 + 𝑠≥1 𝑠 𝑠 4 3Δ𝑖𝑘 Õ   𝑘 𝑘 𝑘 ≤𝑙𝑖 + 1 𝑔𝑡 𝑖𝑘 > 𝑀𝑡 𝑖𝑘 + , (5.11) 𝑠 𝑠 4 𝑠≥𝑙 𝑖 +1 𝑘        2 𝑘 2 4 𝜏 Δ𝑖 where we set 𝑙𝑖𝑘 = 𝜂 Δ𝑘 ln 𝜂𝐾 4 . Since Δ𝑖𝑘 ≥ 𝜖, for 𝑘 ∈ B𝑖SW , we have 𝑖 2  𝜏  2 l  2 m  𝑙𝑖𝑘 ≥ 𝜂 4/Δ𝑖 ln 𝑘 𝜖/4 ≥ 𝜂 4/Δ𝑖𝑘 , 𝜂𝐾 p where the second inequality follows by substituting 𝜖 = 4 𝑒𝜂𝐾/𝜏. Additionally, since 𝑡1𝑖𝑘 , . . . , 𝑡 𝑖𝑘 𝑠−1 ∈ W𝑡 𝑠𝑖𝑘 , we get 𝑛 𝑘 (𝑡 𝑖𝑘 𝑠 ) ≥ 𝑠 − 1. Furthermore, since 𝑐 𝑚 is monotonically decreasing with 𝑚, v u t  2! 𝜂 𝜏 Δ𝑖𝑘 Δ𝑖𝑘 𝑐 𝑛 𝑘 (𝑡 𝑠𝑘 ) ≤ 𝑐 𝑙 𝑘 ≤ ln ≤ , 𝑖 𝑙𝑖𝑘 𝜂𝐾 4 4 for 𝑠 ≥ 𝑙𝑖𝑘 + 1. Therefore, we continue from (5.11) to get 3Δ 𝑘 Δ𝑖𝑘 Õ   Õ   𝑙𝑖𝑘 + 1 𝑔𝑡𝑘𝑖𝑘 > 𝑀𝑡𝑘𝑖𝑘 + 𝑖 ≤ 𝑙𝑖𝑘 + 1 𝑔𝑡𝑘𝑖𝑘 − 2𝑐 𝑛 𝑘 (𝑡 𝑠𝑖𝑘 ) > 𝑀𝑡𝑘𝑖𝑘 + . 𝑠 𝑠 4 𝑠 𝑠 4 𝑠≥𝑙𝑖𝑘 +1 𝑠≥𝑙𝑖𝑘 +1 63 By applying Lemma 5.3, considering 𝑛 𝑘 (𝑡 𝑖𝑘 𝑠 ) ≥ 𝑠 − 1, Δ𝑘 Õ   P 𝑔𝑡𝑘𝑖𝑘 − 2𝑐 𝑛 𝑘 (𝑡 𝑠𝑖𝑘 ) > 𝑀𝑡𝑘𝑖𝑘 + 𝑖 𝑠 𝑠 4 𝑠≥𝑙𝑖𝑘 +1 ! Õ (2𝜂) 23 𝐾  4  2  2 𝑠 Δ𝑖𝑘 ≤ exp − 𝑘 ln(2𝜂) 𝜏 Δ𝑖𝑘 𝜂 4 𝑠≥𝑙𝑖 +∞ 3  2  2! 𝑦 Δ𝑖𝑘 ∫ (2𝜂) 2 𝐾 4 ≤ 𝑘 exp − 𝑑𝑦 𝑙𝑖𝑘 −1 ln(2𝜂) 𝜏 Δ𝑖 𝜂 4 3  4 (2𝜂) 2 𝜂𝐾 4 ≤ . (5.12) ln(2𝜂) 𝜏 Δ𝑖𝑘   p Let ℎ(𝑥) = 16𝜂/𝑥 ln 𝜏𝑥 2 /16𝜂𝐾 which achieves maximum at 4𝑒 𝜂𝐾/𝜏. Combining (5.12), (5.11), (5.10), and (5.9), we obtain Õ (2𝜂) 23 𝜂𝐾 256 𝑘 𝑘 E [(5.7)] ≤ ln(2𝜂) 𝜏   3 + 𝑙𝑖 Δ𝑖 𝑘 ∈B𝑖 Δ𝑖𝑘 Õ (2𝜂) 32 𝜂𝐾 256 𝑘 𝑘 ≤   3 + ℎ(Δ𝑖 ) + Δ𝑖 ln(2𝜂) 𝜏 𝑘 ∈B𝑖 Δ𝑖𝑘 Õ (2𝜂) 32 𝜂𝐾 256  p  ≤ + ℎ 4𝑒 𝜂𝐾/𝜏 + 𝑏 𝑘 ∈B𝑖 ln(2𝜂) 𝜏 𝜖 3 √ √   2.6𝜂 ≤ + 3 𝜂 𝐾𝜏 + 𝐾𝑏. 
ln(2𝜂)  𝜑 Step 4: In this step, we bound expectation of (5.8). When event 𝜑𝑡 ∈ B𝑖SW , 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − Δ𝑖 𝑡 /4 happens, we know 𝜑 𝜖 Δ𝑖 𝑡 ≤ 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 and 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − . 4 Thus, we have 𝜑  Δ𝑖 𝑡  𝜑𝑡   1 𝜑𝑡 ∈ B𝑖SW , 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − Δ𝑖 − 𝜖 4   𝜖  ≤1 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − × 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 − 𝜖 := 𝑌 . 4 64 Since 𝑌 is a nonnegative random variable, its expectation can be computed involving only its cumulative density function: ∫ +∞ E [𝑌 ] = P (𝑌 > 𝑥) 𝑑𝑥 0 ∫ +∞   ≤ P 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 − 𝜖 ≥ 𝑥 𝑑𝑥 ∫ 0 +∞   = P 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 > 𝑥 𝑑𝑥 𝜖 ∫ +∞ 3 3 16(2𝜂) 2 𝐾 16(2𝜂) 2 𝐾 ≤ 2 𝑑𝑥 = . 𝜖 ln(2𝜂) 𝜏𝑥 ln(2𝜂) 𝜏𝜖 3  Hence, E [(5.8)] ≤ 16(2𝜂) 2 𝐾 |T𝑖 | / ln(2𝜂)𝜏𝜖 . Step 5: With bounds on E [(5.7)] and E [(5.8)] from previous steps, 3 √ √   2.6𝜂 16(2𝜂) 2 𝐾𝑇 1 2 E [(5.6)] ≤5𝜏𝑉𝑇 + 𝑇𝜖 + 𝑁 + 3 𝜂 𝐾𝜏 + 𝑁𝐾𝑏 + ≤ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 , ln(2𝜂) ln(2𝜂) 𝜏𝜖 for some constant 𝐶, which concludes the proof.  We have shown that SW-MOSS also enjoys order optimal worst-case regret. One drawback of the sliding window method is that all sampling history within the observation window needs to be  1 2 stored. Since window size is selected to be 𝜏 = 𝐾 3 (𝑇/𝑉𝑇 ) 3 , large memory is needed for large horizon length 𝑇. The next policy resolves this problem. 5.2.3 Discounted UCB Algorithm The discount factor is widely used in estimators to forget old information and put more attention on recent information. In [20], such an estimation is used together with UCB1 to solve the piecewise stationary MAB problem, and the policy designed is called Discounted UCB (D-UCB). Here, we tune D-UCB to work in the nonstationary environment with variation budget 𝑉𝑇 . Specifically, the mean estimator used is discounted empirical average given by 𝑡−1 𝑡−1 𝑘 1 Õ 𝑡−𝑠 𝑘 Õ 𝜇ˆ 𝛾,𝑡 = 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 }𝑋𝑠 , 𝑛𝛾,𝑡 = 𝛾 𝑡−𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛𝛾,𝑡 𝑠=1 𝑠=1 65 Algorithm 7: The D-UCB Algorithm 1 Input : 𝑉𝑇 ∈ R>0 , 𝑇 ∈ N and 𝜉 > 2 1 2 Set : 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 Output : sequence of arm selection 1 for 𝑡 ∈ {1, . . . , 𝐾 } do Pick arm 𝜑𝑡 = 𝑡 and set 𝑛𝑡 ← 𝛾 𝐾 −𝑡 and 𝜇ˆ 𝑡 ← 𝑋𝑡𝑡 ; 2 while 𝑡 ≤ 𝑇 do r 𝑘 𝜉 ln(𝜏) Pick arm 𝜑𝑡 = arg max 𝜇ˆ + 2 ; 𝑘 ∈K 𝑛𝑘 For each arm 𝑘 ∈ K, set 𝑛 𝑘 ← 𝛾𝑛 𝑘 ; 1 𝜑 Set 𝑛 𝜑𝑡 ← 𝑛 𝜑𝑡 + 1 & 𝜇ˆ 𝜑𝑡 ← 𝜇ˆ 𝜑𝑡 + 𝑛 𝜑𝑡 (𝑋𝑡 𝑡 − 𝑋¯ 𝜑𝑡 ); 1 2 where 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 is the discount factor. Besides, the UCB is designed as 𝑔𝑡𝑘 = 𝜇ˆ 𝑡𝑘 + 2𝑐 𝑡𝑘 , q 𝑘 = where 𝑐 𝛾,𝑡 𝜉 ln(𝜏)/𝑛𝛾,𝑡𝑘 for some constant 𝜉 > 1/2. The pseudo code for D-UCB is reproduced in Algorithm 7. It can be noticed that the memory size is only related to the number of arms, so D-UCB requires small memory. To proceed the analysis, we review the concentration inequality for discounted empirical average, which is an extension of Chernoff-Hoeffding bound. Let 𝑡−1 1 Õ 𝑡−𝑠 𝑘 𝑀𝛾,𝑡 := 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 }𝜇 𝑠𝑘 . 𝑛𝛾,𝑡 𝑠=1 Then, the following fact is a corollary of [20, Theorem 18]. Fact 5.2 (A Hoeffding-type inequality for discounted empirical average with a random number of  q  summands). For any 𝑡 ∈ T and for any 𝑘 ∈ K, the probability of event 𝐴 = 𝜇ˆ 𝛾,𝑡 − 𝑀𝛾,𝑡 ≥ 𝛿/ 𝑛𝛾,𝑡 𝑘 𝑘 𝑘 is no greater than     2 2 log1+𝜆 (𝜏) exp −2𝛿 1 − 𝜆 /16 (5.13)  q  for any 𝛿 > 0 and 𝜆 > 0. The probability of event 𝐵 = 𝜇ˆ 𝛾,𝑡 − 𝑀𝛾,𝑡 ≤ −𝛿/ 𝑛𝛾,𝑡 𝑘 𝑘 𝑘 is also upper bounded by (5.13). 66 Theorem 5.5. For the nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, variation budget 1 2 𝑉𝑇 > 0, and 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 , if 𝜉 > 1/2, the worst case regret of D-UCB satisfies 1 2 sup 𝑅𝑇D-UCB ≤ 𝐶 ln(𝑇)(𝐾𝑉𝑇 ) 3 𝑇 3 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Proof. We establish the theorem in four steps. 𝑘 − 𝑀 𝑘 at some time slot 𝑡 ∈ T . 
Let 𝜏0 = log (1 − 𝛾)𝜉 ln(𝜏)/𝑏 2  Step 1: In this step, we analyze 𝜇 𝛾,𝑡 𝛾,𝑡 𝑖 𝛾 and take 𝑡 − 𝜏0 as a dividing point, then we obtain 𝑡−1 1 Õ 𝑡−𝑠 𝜇𝜏𝑘𝑖 − 𝑀𝛾,𝑡 𝑘 ≤ 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 𝑛𝛾,𝑡 𝑠=1 1 Õ 𝑡−𝑠 ≤ 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 (5.14) 𝑛𝛾,𝑡 𝑠≤𝑡−𝜏 0 𝑡−1 1 Õ 𝑡−𝑠 + 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 . (5.15) 𝑛𝛾,𝑡 𝑠≥𝑡−𝜏 0 Since 𝜇𝑡𝑘 ∈ [𝑎, 𝑎 + 𝑏] for all 𝑡 ∈ T , we have (5.14) ≤ 𝑏. Also, 0 1 Õ 𝑡−𝑠 𝑏𝛾 𝜏 𝜉 ln(𝜏) (5.14) ≤ 𝑘 𝑏𝛾 ≤ = . 𝑛𝛾,𝑡 𝑠≤𝑡−𝜏 0 (1 − 𝛾)𝑛𝛾,𝑡 𝑘 𝑘 𝑏𝑛𝛾,𝑡 Accordingly, we get ! s 𝜉 ln(𝜏) 𝜉 ln(𝜏) (5.14) ≤ min 𝑏, 𝑘 ≤ 𝑘 . 𝑏𝑛𝛾,𝑡 𝑛𝛾,𝑡 Furthermore, for any 𝑡 ∈ T𝑖 , Õ 𝑖 (5.15) ≤ max 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 ≤ 𝑣𝑗, 𝑠∈[𝑡−𝜏 0 ,𝑡−1] 𝑗=𝑖−𝑛 0 where 𝑛0 = d𝜏0/𝜏e and 𝑣 𝑗 is the variation within T𝑗 . So we conclude that for any 𝑡 ∈ T𝑖 , Õ 𝑖 𝑘 𝜇 𝜅𝑘𝑖 − 𝑘 𝑀𝛾,𝑡 ≤ 𝑐 𝛾,𝑡 + 𝑣𝑗, ∀𝑘 ∈ K. (5.16) 𝑗=𝑖−𝑛 0 Step 2: Within partition T𝑖 , let Õ 𝑖 Δ̂𝑖𝑘 = 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 −2 𝑣𝑗, 𝑗=𝑖−𝑛 0 67 and define a subset of bad arms as   0 B𝑖D = 𝑘∈K| Δ̂𝑖𝑘 ≥𝜖 , p where we select 𝜖 0 = 4 𝜉𝛾 1−𝜏 𝐾 ln(𝜏)/𝜏. Since 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖 for any 𝑡 ∈ T𝑖 and for any 𝑘 ∈ K Õ Õ𝑁 Õ 𝜇𝑡∗ 𝜑 𝜑 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 𝑡∈T 𝑖=1 𝑡∈T𝑖 Õ 𝑁 Õ n o Õ𝑖  D 𝜑𝑡 0 ≤𝜏𝑉𝑇 + 1 𝜑𝑡 ∈ B𝑖 Δ̂𝑖 + 2 𝑣𝑗 + 𝜖 𝑖=1 𝑡∈T𝑖 𝑗=𝑖−𝑛 0 Õ 𝑁 Õ Õ  0 0 ≤(2𝑛 + 3)𝜏𝑉𝑇 + 𝑁𝜖 𝜏+ Δ̂𝑖𝑘 1 𝜑𝑡 = 𝑘 . (5.17) 𝑖=1 𝑘∈B𝑖D 𝑡∈T𝑖  Í   Step 3: In this step, we bound E Δ̂𝑖𝑘 𝑡∈T𝑖 1 𝜑𝑡 = 𝑘 for an arm 𝑘 ∈ B𝑖D . Let 𝑡𝑖𝑘 (𝑙) be the 𝑙-th 𝜑 time slot arm 𝑘 is selected within T𝑖 . From arm selection policy, we get 𝑔𝑡 𝑡 ≥ 𝑔𝑡𝜅𝑖 , which result in Õ  Õ n o 1 𝜑𝑡 = 𝑘 ≤ 𝑙𝑖𝑘 + 1 𝑔𝑡𝑘 ≥ 𝑔𝑡𝜅𝑖 , 𝑡 > 𝑡𝑖𝑘 (𝑙𝑖𝑘 ) , (5.18) 𝑡∈T𝑖 𝑡∈T𝑖 l  2m where we pick 𝑙𝑖 = 16𝜉𝛾 ln(𝜏)/ Δ̂𝑖 . Note that 𝑔𝑡𝑘 ≥ 𝑔𝑡𝜅𝑖 is true means at least one of the 𝑘 1−𝜏 𝑘 followings holds, 𝑘 𝑘 𝑘 𝜇ˆ 𝛾,𝑡 ≥ 𝑀𝛾,𝑡 + 𝑐 𝛾,𝑡 , (5.19) 𝜅𝑖 𝜅𝑖 𝜅𝑖 𝜇ˆ 𝛾,𝑡 ≤ 𝑀𝛾,𝑡 − 𝑐 𝛾,𝑡 , (5.20) 𝜅𝑖 𝜅𝑖 𝑘 𝑘 𝑀𝛾,𝑡 + 𝑐 𝛾,𝑡 < 𝑀𝛾,𝑡 + 3𝑐 𝛾,𝑡 . (5.21) For any 𝑡 ∈ T𝑖 , since every sample before 𝑡 within T𝑖 has a weight greater than 𝛾 𝜏−1 , if 𝑡 > 𝑡𝑖𝑘 (𝑙𝑖𝑘 ), s s 𝑘 𝜉 ln(𝜏) 𝜉 ln(𝜏) Δ̂𝑖𝑘 𝑐 𝛾,𝑡 = 𝑘 ≤ ≤ . 𝑛𝛾,𝑡 𝛾 𝜏−1 𝑙𝑖𝑘 4 Combining it with (5.16) yields Õ 𝑖 𝜅𝑖 𝑘 𝜅𝑖 𝑀𝛾,𝑡 − 𝑀𝛾,𝑡 ≥ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 − 𝑐 𝛾,𝑡 − 𝑐 𝛾,𝑡𝑘 −2 𝑣𝑗 𝑗=𝑖−𝑛 0 𝜅𝑖 𝜅𝑖 ≥ Δ̂𝑖𝑘 − 𝑐 𝛾,𝑡 𝑘 − 𝑐 𝛾,𝑡 ≥ 3𝑐 𝛾,𝑡 𝑘 − 𝑐 𝛾,𝑡 , 68 p which indicates (5.21) is false. As 𝜉 > 1/2, we select 𝜆 = 4 1 − 1/(2𝜉) and apply Fact 5.2 to get     −2𝜉 (1−𝜆2 /16) log1+𝜆 (𝜏) P((5.19) is true) ≤ log1+𝜆 (𝜏) 𝜏 ≤ . 𝜏 The probability of (5.20) to be true shares the same bound. Then, it follows from (5.18) that  Í   E Δ̂𝑖𝑘 𝑡∈T𝑖 1 𝜑𝑡 = 𝑘 is upper bounded by Õ Δ̂𝑖𝑘 𝑙𝑖𝑘 + Δ̂𝑖𝑘 P ((5.19) or (5.20) is true) 𝑡∈T𝑖 16𝜉𝛾 1−𝜏 ln(𝜏)   ≤ + Δ̂𝑖𝑘 + 2Δ̂𝑖𝑘 log1+𝜆 (𝜏) Δ̂𝑖𝑘 16𝜉𝛾 1−𝜏 ln(𝜏)   ≤ 0 + 𝑏 + 2𝑏 log1+𝜆 (𝜏) , (5.22) 𝜖 where we use 𝜖 0 ≤ Δ̂𝑖𝑘 ≤ 𝑏 in the last step. Step 4: From (5.17) and (5.22), and plugging in the value of 𝜖 0, an easy computation results in q 0 𝑅𝑇D-UCB ≤(2𝑛 + 3)𝜏𝑉𝑇 + 8𝑁 𝜉𝛾 1−𝜏 𝐾𝜏 ln(𝜏) + 2𝑁 𝑏 + 2𝑁 𝑏 log1+𝜆 (𝜏) , where the dominating term is (2𝑛0 + 3)𝜏𝑉𝑇 . Considering 2  2  ln (1 − 𝛾)𝜉 ln(𝜏)/𝑏 − ln (1 − 𝛾)𝜉 ln(𝜏)/𝑏 𝜏0 = ≤ , ln 𝛾 1−𝛾 we get 𝑛0 ≤ 𝐶 0 ln(𝑇) for some constant 𝐶 0. Hence there exists some absolute constant 𝐶 such that 1 2 𝑅𝑇D-UCB ≤ 𝐶 ln(𝑇)(𝐾𝑉𝑇 ) 3 𝑇 3 .  Although the discount factor method requires less memory, there exists an extra factor ln(𝑇) in the upper bound on the worst-case regret for D-UCB comparing with the minimax regret. This is due to the fact that the discount factor method does not entirely cut off outdated sampling history like periodic resetting or sliding window techniques. 
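To illustrate why D-UCB needs only O(K) memory, the following sketch maintains the discounted counts and means recursively, as in Algorithm 7. The initialization is simplified to a single pull of each arm, reward_fn is a user-supplied reward oracle, and the choice ξ = 0.6 is merely an example of a constant exceeding 1/2.

import numpy as np

def d_ucb(reward_fn, n_arms, horizon, variation_budget, xi=0.6):
    # Discounted UCB (Algorithm 7): only the K discounted counts and means are stored.
    gamma = 1.0 - n_arms ** (-1 / 3) * (horizon / variation_budget) ** (-2 / 3)
    tau = 1.0 / (1.0 - gamma)                  # effective window, sets the ln(tau) term
    n = np.zeros(n_arms)                       # discounted counts n^k_{gamma,t}
    mu = np.zeros(n_arms)                      # discounted empirical means
    choices = []
    for t in range(horizon):
        if t < n_arms:                         # simplified initialization: one pull per arm
            arm = t
        else:
            arm = int(np.argmax(mu + 2.0 * np.sqrt(xi * np.log(tau) / n)))
        x = reward_fn(arm, t)
        n *= gamma                             # discount every arm's count
        n[arm] += 1.0
        mu[arm] += (x - mu[arm]) / n[arm]      # incremental discounted-mean update
        choices.append(arm)
    return choices

The means of the unplayed arms need no update because discounting scales the numerator and denominator of the discounted empirical average by the same factor 𝛾, so only the selected arm's mean changes.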
69 5.3 UCB Policies for Heavy-tailed Nonstationary Stochastic MAB Problems In this section, we propose and analyze UCB algorithms for the non-stationary stochastic MAB problem with heavy-tailed rewards defined in Assumption 5.2. For the stationary heavy-tailed MAB problem, we have shown Robust MOSS in chapter 2 achieve order optimal worst-case regret. We extend it to the nonstationary setting and design resetting robust MOSS algorithm and sliding-window robust MOSS algorithm. 5.3.1 Resetting robust MOSS for the non-stationary heavy-tailed MAB problem Like R-MOSS, Resetting Robust MOSS (R-RMOSS) restarts Robust MOSS after every 𝜏 time slots. For a stationary heavy-tailed MAB problem, it has been shown in theorem 2.10 that the √ worst-case regret of Robust MOSS belongs to O ( 𝐾𝑇). This result along with an analysis similar to the analysis for R-MOSS in Theorem 5.2 yield the following theorem for R-RMOSS. For brevity, we skip the proof. Theorem 5.6. For the nonstationary heavy-tailed MAB problem with 𝐾 arms, horizon 𝑇, variation l 1 2m budget 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁, the worst-case regret of R-RMOSS satisfies 1 2 sup 𝑅𝑇R-RMOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 5.3.2 SW-RMOSS for the non-stationary heavy-tailed MAB problem In Sliding-Window Robust MOSS (SW-RMOSS), 𝑛 𝑘 (𝑡) and 𝜇¯ 𝑛 𝑘 (𝑡) are computed from the sampling q  history within W𝑡 , and 𝑐 𝑛 𝑘 (𝑡) = ln+ 𝐾𝑛𝜏𝑘 (𝑡) /𝑛 𝑘 (𝑡). To analyze SW-RMOSS, we want to establish a similar property as Lemma 5.3 to bound the probability about an arm being under or over estimated. Toward this end, we need the following properties for truncated random variable. 70 Lemma 5.7. Let 𝑋 be a random variable with expected value 𝜇 and E [𝑋 2 ] ≤ 1. Let 𝑑 := sat(𝑋, 𝐵) − E [sat(𝑋, 𝐵)]. Then for any 𝐵 > 0, it satisfies (i) |𝑑| ≤ 2𝐵 (ii) E [𝑑 2 ] ≤ 1 (iii) E [sat(𝑋, 𝐵)] − 𝜇 ≤ 1/𝐵. Proof. Property (i) follows immediately from definition of 𝑑 and property (ii) follows from     E [𝑑 2 ] ≤ E sat2 (𝑋, 𝐵) ≤ E 𝑋 2 . To see property (iii), since     𝜇 = E 𝑋 1 |𝑋 | ≤ 𝐵 + 1 |𝑋 | > 𝐵 , one have " # h   i h  i 𝑋2 E [sat(𝑋, 𝐵)] − 𝜇 ≤ E |𝑋 | − 𝐵 1 |𝑋 | > 𝐵 ≤ E |𝑋 | 1 |𝑋 | > 𝐵 ≤ E . 𝐵  Moreover, we will also use a maximal Bennett type inequality as shown in the following. Lemma 5.8 (Maximal Bennett’s inequality [75]). Let {𝑋𝑖 }𝑖∈{1,...,𝑛} be a sequence of bounded random variables with support [−𝐵, 𝐵], where 𝐵 ≥ 0. Suppose that E [𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] = 𝜇𝑖 and Í𝑚 Var[𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] ≤ 𝑣. Let 𝑆 𝑚 = 𝑖=1 (𝑋𝑖 − 𝜇𝑖 ) for any 𝑚 ∈ {1, . . . , 𝑛}. Then, for any 𝛿 ≥ 0  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≥ 𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≤ −𝛿 ≤ exp − 𝜓 . 𝐵 𝑛𝑣 Now, we are ready to establish a concentration property for saturated sliding window empirical mean. Lemma 5.9. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 }, if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁,   the probability of either event 𝐴 = 𝑔𝑡𝑘 ≤ 𝑀𝑡𝑘 − 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 or event 𝐵 = 𝑔𝑡𝑘 − 2𝑐 𝑛 𝑘 (𝑡) ≥ 𝑀𝑡𝑘 + 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 , for any 𝑥 > 0 and any 𝑙 ≥ 1, is no greater than 2𝑎 𝐾 p  p  (𝛽𝑥 ℎ(𝑙)/𝑎 + 1) exp −𝛽𝑥 ℎ(𝑙)/𝑎 , 𝛽2 ln(𝑎) 𝜏𝑥 2  where 𝛽 = 𝜓 2𝜁/𝑎 /(2𝑎). 71 Proof. Recall that 𝑢𝑖𝑘𝑡 is the 𝑖-th time slot when arm 𝑘 is selected within W𝑡 . Since 𝑐 𝑚 is a monotonically decreasing in 𝑚, 1/𝐵𝑚 = 𝑐 ℎ(𝑚) ≤ 𝑐 𝑚 due to ℎ(𝑚) ≥ 𝑚. Then, it follows from property (iii) in Lemma 5.7 that  Õ 𝑚 𝜇 𝑘 𝑘𝑡  𝑢 P( 𝐴) ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝜇¯ 𝑚𝑘 ≤ 𝑖 − (1 + 𝜁)𝑐 𝑚 − 𝑥 𝑖=1 𝑚  𝑚 ¯𝑘𝑡  Õ 𝑑𝑖𝑚 1 ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : ≤ − (1 + 𝜁)𝑐 𝑚 − 𝑥 𝑖=1 𝑚 𝐵 𝑚  𝑚  1 Õ ¯𝑘𝑡 ≤ P ∃𝑚 ∈ {𝑙, . . . 
, 𝜏} : 𝑑 ≤ −𝑥 − 𝜁 𝑐 𝑚 , (5.23) 𝑚 𝑖=1 𝑖𝑚    where 𝑑¯𝑖𝑚 𝑘𝑡 = sat 𝑋 𝑘 , 𝐵 𝑘𝑡 𝑚 − E sat 𝑋 𝑘𝑘𝑡 , 𝐵𝑚 . Recall we select 𝑎 > 1. Again, we apply a peeling 𝑢𝑖 𝑢𝑖   argument with geometric grid 𝑎 𝑠 ≤ 𝑚 < 𝑎 𝑠+1 over time interval {𝑙, . . . , 𝜏}. Let 𝑠0 = log𝑎 (𝑙) . Since 𝑐 𝑚 is monotonically decreasing with 𝑚, we continue from (5.23) to get Õ Õ 𝑚  𝑠 𝑠+1 𝑘𝑡 𝑠  P( 𝐴) ≤ P ∃𝑚 ∈ [𝑎 , 𝑎 ): 𝑑¯𝑖𝑚 ≤ −𝑎 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 . (5.24) 𝑠≥𝑠0 𝑖=1   For all 𝑚 ∈ [𝑎 𝑠 , 𝑎 𝑠+1 ), since 𝐵𝑚 = 𝐵𝑎 𝑠 , from Lemma 5.7 we know 𝑑¯𝑖𝑚 𝑘𝑡 ≤ 2𝐵 𝑠 and Var 𝑑¯𝑘𝑡 ≤ 1. 𝑎 𝑖𝑚 Continuing from (5.24), we apply Maximal Bennett’s inequality in Lemma 2.7 to get   ! Õ 𝑎 𝑠 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 2𝐵𝑎 𝑠  P( 𝐴) ≤ exp − 𝜓 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 𝑠≥𝑠 2𝐵𝑎 𝑠 𝑎 0  since 𝜓(𝑥) is monotonically increasing   ! Õ 𝑎 𝑠 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 2𝜁 ≤ exp − 𝜓 𝐵𝑎 𝑠 𝑐 𝑎 𝑠+1 𝑠≥𝑠 2𝐵 𝑎 𝑠 𝑎 0 (substituting 𝑐 𝑎 𝑠+1 , 𝐵𝑎 𝑠 and using ℎ(𝑎 𝑠 ) = 𝑎 𝑠+1 )   ! Õ 𝑥 𝜓 2𝜁/𝑎 = exp −𝑎 𝑠 + 𝜁 𝑐2𝑎 𝑠 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0  since 𝜁𝜓(2𝜁/𝑎) ≥ 2𝑎 ! 𝐾 Õ 𝑠 𝑥 𝜓 2𝜁/𝑎 ≤ 𝑎 exp −𝑎 𝑠 . 𝜏 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0 72  Let 𝑏 = 𝑥𝜓 2𝜁/𝑎 /(2𝑎). Since ln+ (𝑥) ≥ 1 for all 𝑥 > 0, ! 𝐾 Õ 𝑠 𝑥 𝜓 2𝜁/𝑎 𝑎 exp −𝑎 𝑠 𝜏 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0 𝐾 Õ 𝑠  √  ≤ 𝑎 exp −𝑏 𝑎 𝑠 𝜏 𝑠≥𝑠 +1 0 ∫ +∞ 𝐾 p  ≤ 𝑎 𝑦 exp − 𝑏 𝑎 𝑦−1 𝑑𝑦 𝜏 𝑠0 +1 ∫ +∞ 𝐾 √  = 𝑎 𝑎 𝑦 exp − 𝑏 𝑎 𝑦 𝑑𝑦 𝜏 𝑠0 ∫ +∞ 𝐾 2𝑎  √ = √ 𝑧 exp − 𝑧 𝑑𝑧 (where 𝑧 = 𝑏 𝑎𝑦) 𝜏 ln(𝑎)𝑏 2 𝑏 𝑎 𝑠0 𝐾 2𝑎 √ √ ≤ (𝑏 𝑎 𝑠0 + 1) exp(−𝑏 𝑎 𝑠0 ), 𝜏 ln(𝑎)𝑏 2 which concludes the proof.  With Lemma 5.9, the upper bound on the worst-case regret for SW-RMOSS in the nonstationary heavy-tailed MAB problem can be analyzed similarly as Theorem 5.4. Theorem 5.10. For the nonstationary heavy-tailed MAB problem with 𝐾 arms, time horizon 𝑇, l 1 2m variation budget 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁, the worst-case regret of SW-RMOSS satisfies 1 2 sup 𝑅𝑇SW-RMOSS ≤ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Sketch of the proof. The procedure is similar as the proof of Theorem 5.4. The key difference is due to the nuance between the concentration properties on mean estimator. Neglecting the leading constants, the probability upper bound in Lemma 5.3 has a factor exp(−𝑥 2 𝑙/𝜂) comparing with p  p  (𝛽𝑥 ℎ(𝑙)/𝑎 + 1) exp −𝛽𝑥 ℎ(𝑙)/𝑎 in Lemma 5.9. Since both factors are no greater than 1, by simply replacing 𝜂 with (1+ 𝜁) 2 and taking similar calculation in every step except inequality (5.12), comparable bounds that only differs in leading constants can be obtained. Applying Lemma 5.9, 73 we revise the computation of (5.12) as the following, Δ𝑘 Õ   P 𝑔𝑡𝑘𝑠 − 2𝑐 𝑛 𝑘 (𝑡 𝑠 ) > 𝑀𝑡𝑘𝑠 + 𝑖 4 𝑠≥𝑙𝑖𝑘 +1 r ! r ! Õ 0 𝛽Δ𝑖𝑘 ℎ(𝑙) 𝛽Δ𝑖𝑘 ℎ(𝑙) ≤ 𝐶 + 1 exp − 𝑘 4 𝑎 4 𝑎 𝑠≥𝑙𝑖 ∫ +∞ ! ! 𝛽Δ 𝑘r 𝛽Δ 𝑘r 𝑦 𝑦 ≤ 𝐶0 𝑖 + 1 exp − 𝑖 𝑑𝑦 𝑙𝑖 −1 𝑘 4 𝑎 4 𝑎  4 6𝑎 2𝑎 𝐾 4 ≤ 2 2 . (5.25) 𝛽 𝛽 ln(𝑎) 𝜏 Δ𝑖𝑘 2 where 𝐶 0 = 2𝑎𝐾 4/Δ𝑖𝑘 / 𝛽2 ln(𝑎)𝜏 .The second inequality is due to the fact that (𝑥 + 1) exp(−𝑥)  is monotonically decreasing in 𝑥 for 𝑥 ∈ [0, ∞) and ℎ(𝑙) > 𝑙. In the last inequality, we change the lower limits of the integration from 𝑙𝑖𝑘 − 1 to 0 since 𝑙𝑖𝑘 ≥ 1 and plug in the value of 𝐶 0. Comparing with (5.12), this upper bound only varies in constant multiplier. So is the worst-regret upper bound.  Remark 5.1. The benefit of the discount factor method is that it is memory-friendly. This advantage is lost if the truncated empirical mean is used. As 𝑛 𝑘 (𝑡) could both increase and decrease with time, the truncated point could both grow and decline, so all sampling history needs to be recorded. It remains an open problem how to effectively use the discount factor in a nonstationary heavy-tailed MAB problem. 
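To make the modification for heavy-tailed rewards concrete, the sketch below maintains the saturated (truncated) sliding-window statistics on which SW-RMOSS builds its index. Two simplifications are made for illustration: the window stores the last 𝜏 samples of a single arm rather than the samples falling in the last 𝜏 global time slots, and the truncation level is B_m = 1/c_m instead of 1/c_{h(m)} used in the Robust MOSS construction of Chapter 2. These simplifications and the class and method names are assumptions of this sketch, not the analyzed policy.

import numpy as np
from collections import deque

def sat(x, level):
    # Saturation operator sat(x, B): clip the sample to [-B, B].
    return float(np.clip(x, -level, level))

class SlidingRobustStats:
    # Saturated sliding-window statistics for one arm, as used by SW-RMOSS.
    def __init__(self, tau, n_arms, zeta=2.2):
        self.tau, self.K, self.zeta = tau, n_arms, zeta
        self.samples = deque(maxlen=tau)       # raw samples of this arm in the window

    def c(self, m):
        # Exploration width c_m = sqrt(ln+(tau / (K m)) / m), with ln+(x) = max(ln x, 1).
        return np.sqrt(max(np.log(self.tau / (self.K * m)), 1.0) / m)

    def add(self, x):
        self.samples.append(x)

    def robust_mean(self):
        # Truncated empirical mean: every sample is saturated at B_m = 1 / c_m.
        m = len(self.samples)
        level = 1.0 / self.c(m)
        return float(np.mean([sat(x, level) for x in self.samples]))

    def index(self):
        # Upper confidence index: truncated mean plus (1 + zeta) c_m.
        m = len(self.samples)
        return self.robust_mean() + (1.0 + self.zeta) * self.c(m)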
5.4 Numerical Experiments We complement the theoretical results in the previous section with two Monte-Carlo experiments. For the light-tailed setting, we compare R-MOSS, SW-MOSS, and D-UCB with other state-of-art policies. For the heavy-tailed setting, we test the robustness of R-RMOSS and SW-RMOSS against both heavy-tailed rewards and nonstationarity. Each result in this section is derived by running designated policies 500 times. And parameter selections for compared policies are strictly coherent with referred literature. 74 5.4.1 Bernoulli Nonstationay Stochastic MAB Experiment To evaluate the performance of different policies, we consider two nonstationary environments as shown in Figs. 5.1a and 5.1b, which both have 3 arms with nonstationary Bernoulli rewards. The success probability sequence at each arm is a Brownian motion in environment 1 and a sinusoidal function of time 𝑡 in environment 2. And the variation budget 𝑉𝑇 is 8.09 and 3 respectively. (a) Environment 1 (b) Environment 2 (c) Regrets for environment 1 (d) Regrets for environment 2 Figure 5.1: Comparison of different policies. The growths of regret in Figs. 5.1c and 5.1d show that UCB based policies (R-MOSS, SW- MOSS, and D-UCB) maintain their superior performance against adversarial bandit-based policies (Rexp3 and Exp3.S) for stochastic bandits even in nonstationary settings, especially for R-MOSS and SW-MOSS. Besides, DTS outperforms other policies when the best arm does not switch. While each switch of the best arm seems to incur larger regret accumulation for DTS, which results in 75 larger regret compared with SW-MOSS and R-MOSS. 5.4.2 Heavy-tailed Nonstationary Stochastic MAB Experiment Again we consider the 3-armed bandit problem with sinusoidal mean rewards. In particular, for each arm 𝑘 ∈ {1, 2, 3},  𝜇𝑡𝑘 = 0.3 sin 0.001𝜋𝑡 + 2𝑘 𝜋/3 , 𝑡 ∈ {1, . . . , 5000}. Thus, the variation budget is 3. Besides, mean reward is contaminated by additive sampling noise 𝜈, where |𝜈| is a generalized Pareto random variable and the sign of 𝜈 has equal probability to be “+" and “−". So the probability distribution for 𝑋𝑡𝑘 is − 𝜉1 −1 1 © 𝜉𝑥− 𝜇𝑡𝑘 ª 𝑓𝑡𝑘 (𝑥) = ­1 + ® for 𝑥 ∈ (−∞, +∞). 2𝜎 ­ 𝜎 ® « ¬ We select 𝜉 = 0.4 and 𝜎 = 0.23 such that Assumption 5.2 is satisfied. We select 𝑎 = 1.1 and 𝜁 = 2.2 for both R-RMOSS and SW-RMOSS such that condition 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁 is met. (a) Regret (b) Histogram of 𝑅𝑇 Figure 5.2: Performances with heavy-tailed rewards. Figure. 5.2a show RMOSS based polices and slightly outperform MOSS-based polices in heavy-tailed settings. While by comparing the estimated histogram of 𝑅𝑇 for different policies in Figure. 5.2b, R-RMOSS and SW-RMOSS have a better consistency and a smaller possibility of a particular realization of the regret deviating significantly from the mean value. 76 5.5 Summary We studied the general nonstationary stochastic MAB problem with variation budget and provided three UCB based policies for the problem. Our analysis showed that the proposed policies enjoy the worst-case regret that is within a constant factor of the minimax regret lower bound. Besides, the sub-Gaussian assumption on reward distributions is relaxed to define the nonstationary heavy-tailed MAB problem. We show the order optimal worst-case regret can be maintained by extending the previous policies to robust versions. There are several possible avenues for future research. In this work, we relied on passive methods to balance the remembering-versus-forgetting tradeoff. 
The general idea is to keep taking in new information and removing outdated information. Parameter-free active approaches that adaptively detect and react to environment changes are promising alternatives and may result in better experimental performance. Also, extensions from the single decision-maker to distributed multiple decision-makers are of interest. Another possible direction is the nonstationary version of rested and restless bandits. 5.6 Bibliographic Remarks The adversarial MAB [19] is a paradigmatic nonstationary problem. In this model, the bounded reward sequence at each arm is arbitrary. The performance of a policy is evaluated using the weak regret, which is the difference in the cumulated reward of a policy compared with the best √ single action policy. A Ω( 𝐾𝑇) lower bound on the weak regret and a near-optimal policy Exp3 is also presented in [19]. While being able to capture the nonstationarity, the generality of the reward model in the adversarial MAB makes the investigation of globally optimal policies very challenging. The nonstationary stochastic MAB can be viewed as a compromise between the stationary stochastic MAB and the adversarial MAB. It maintains the stochastic nature of the reward sequence while allowing some degree of nonstationarity in reward distributions. Instead of the weak regret analyzed in adversarial MAB, a strong notion of regret defined with respect to the best arm at 77 each time step is studied in these problems. As a result, this problem can be studied from two perspectives by extending ideas from adversarial bandits or stochastic bandits. After formulating the nonstationary stochastic MAB problem in [21], the authors tune the Exp3.S policy for adversarial bandits [19] to achieve a near-optimal worst-case regret in their subsequent work [102]. Discounted Thomson Sampling (DTS) [103] has also been shown to have a good experimental performance within this general framework. However, we are not aware of any analytic regret bounds for the DTS algorithm. The variation budget idea has already been extended to more general problem settings such as nonstationary linear contextual bandits, and the ideas of using periodic resetting, discounting factor, and sliding observation windows have been shown to be applicable therein [104–106]. Nevertheless, to achieve exact order optimal worst-case regret remains unsolved for those generalized problem setups 78 CHAPTER 6 MULTI-TARGET SEARCH VIA MULTI-FIDELITY GAUSSIAN PROCESSES The robotic target search problems have a natural connection with MAB problems discussed in the previous section. In particular, the class of robotic search problems in which a robot team searches for a target from a set of view-points (arms), or monitors an environment from a set of viewpoints, maps directly to the MAB problems. In this chapter, we focus on a class of search problems involving the search of an unknown number of targets in a large or continuous space instead of a small number of viewpoints. We consider a scenario in which an autonomous vehicle equipped with a downward-facing camera operates in a 3D environment, and the task is to search for an unknown number of sta- tionary targets on the 2D floor of the environment. For such a problem, there exists an intrinsic fidelity-vs-coverage trade-off: sensing at a higher altitude provides more global but less accurate information compared with sensing at a lower altitude. 
To capture this phenomenon, we model the sensing information available at different altitudes from the floor using a multi-fidelity Gaussian process [12]. The key idea to address the fidelity-vs-coverage trade-off is to use the low-fidelity information to remove regions unlikely to contain targets. This enables the robot to quickly shift its focus to areas likely to contain targets, thus expediting the search process. This chapter is a slightly modified version of our published work on multi-target search with a multi-fidelity Gaussian process sensing model, and it is reproduced here with the permission of the copyright holder (©2020 IEEE; reprinted with permission from [107]). The proposed multi-target search strategy leverages information-theoretic techniques to efficiently explore the environment, and employs Bayesian techniques to accurately identify targets and construct an occupancy map. The target search accuracy and efficiency are established through theoretical analysis and verified by simulation results.

6.1 Multi-target Search Problem Description

We consider an autonomous vehicle that moves in a 3D environment, e.g., an aerial or an underwater vehicle. We assume that the vehicle either moves with unit speed or hovers at a location. The vehicle is tasked with searching for multiple targets on the 2D floor of the environment. Let 𝐷 ⊂ R² be the area of the floor in which the targets may be present. The vehicle is equipped with a fixed camera that points towards the floor. The vehicle travels across the environment and collects images/videos of the floor (samples) from different sampling points. These sampling points may be located at different altitudes relative to the floor of the environment. We assume that no sample is collected during the movement between sampling points, to avoid misleading low-quality sensing information. The collected samples are processed with a computer vision algorithm that outputs a score, which corresponds to the likelihood of a target being present, for each frame. An example of such a computer vision algorithm is the state-of-the-art deep neural network YOLOv3 [108]. The score is used to update the estimate of the sensing output, i.e., the estimated score function 𝑓 : 𝐷 → [0, 1], which in turn is used to determine the locations of the targets. The stochastic model for 𝑓 is introduced below.

6.1.1 Multi-fidelity Sensing Model

GPs are widely used models for spatially distributed sensing outputs. In [52], a GP is used to model the target detection output of a computer vision algorithm. While target presence is a binary event, computer vision algorithms such as YOLOv3 yield a score which is a function of the saliency and location of the target in the image. GPs are appropriate models for such score functions. So far in the literature, GPs have been used in the context of single-fidelity measurements. To characterize the inherent fidelity-coverage trade-off in sensing the floor scene by an autonomous vehicle operating in 3D space, we employ a novel multi-fidelity GP model. The two key physical sensing characteristics the model seeks to capture are: (i) there is some information that can only be accessed at lower altitudes, and (ii) the sensing outputs are more spatially correlated at higher altitudes, since the fields of view at neighboring locations overlap more. We assume that the vehicle can collect samples of the floor from 𝑀 possible heights above the floor, 𝑧_1 > 𝑧_2 > · · · > 𝑧_𝑀.
We refer to these heights as the fidelity level of the measurement, with 𝑀 (resp. 1) corresponding to the highest (resp. lowest) level of fidelity. Let the score function 𝑔𝑚 : 𝐷 → [0, 1] be defined by the output of the computer vision algorithm for an ideal noise-free image collected at fidelity level 𝑚 ∈ {1, . . . , 𝑀 } with the field of view of the camera centered at 𝒙 ∈ 𝐷. We assume that the score functions for a location 𝒙 obtained from different altitudes (fidelity levels) are related to each other in an autoregressive manner as follows 𝑔 𝑚 (𝒙) = 𝑎 𝑚−1 𝑔 𝑚−1 (𝒙) + 𝑏 𝑚 (𝒙), (6.1) where 𝑎 𝑚−1 is a scale parameter and 𝑏 𝑚 is the bias term that captures the information that can Î  𝑀−1 be only be accessed at fidelities levels greater than 𝑚. Let 𝑓 𝑚 (𝒙) = 𝑖=𝑚 𝑖 𝑔 (𝒙) and 𝑎 𝑚 Î  𝑀−1 ℎ𝑚 (𝒙) = 𝑖=𝑚 𝑖 𝑏 (𝒙). Then, equation (6.1) reduces to 𝑎 𝑚 𝑓 𝑚 (𝒙) = 𝑓 𝑚−1 (𝒙) + ℎ𝑚 (𝒙), (6.2) where 𝑓 0 (𝒙) = 0 and 𝑓 (𝒙) := 𝑓 𝑀 (𝒙) is the score function at the highest fidelity level which we treat as ground truth. We model the influence of systemic errors in sample collection and environmental uncertainty on the output of the computer vision algorithm for an input at fidelity level 𝑚 through an additive zero mean Gaussian random variable 𝜖 𝑚 with variance 𝑠2𝑚 , i.e., 𝜖 𝑚 ∼ 𝑁 (0, 𝑠2𝑚 ). Consequently, the (scaled) score obtained by collecting a sample at location 𝒙 is a random variable 𝑦 = 𝑓𝑚 (𝒙) + 𝜖 𝑚 . We assume that each ℎ𝑚 is a realization of a Gaussian process with a constant mean 𝜇𝑚 and a squared exponential kernel function 𝑘 𝑚 (𝒙, 𝒙 0) expressed as ! k𝒙 − 𝒙 0 k2 𝑘 𝑚 (𝒙, 𝒙 0) = 𝑣 2𝑚 exp − 2 , (6.3) 2𝑙 𝑚 where 𝑙 𝑚 is the length scale parameter, and 𝑣 𝑚 is the variance parameter that satisfies 𝑣 1 > 𝑣 2 > · · · > 𝑣 𝑀 . This kernel function describes the spatial correlation of score function at neighboring 81 locations at each fidelity level. Since the fields of view are more overlapped at lower fidelity levels, it results in 𝑙 1 > 𝑙2 > · · · > 𝑙 𝑀 . We make the following assumptions about the highest-fidelity sample. If the target is not in the field of view at (𝒙, 𝑧 𝑀 ), the mean score of the computer vision algorithm 𝑓 (𝒙) is smaller than a threshold th. If a target is at the center of image collected at (𝒙, 𝑧 𝑀 ), 𝑓 (𝒙) ≥ th + Δ, for some constant Δ > 0. Here, 1/Δ can be viewed as a measure of detection difficulty that depends both on the quality of the computer vision algorithm and the environment complexity. 6.1.2 Objective of the Multi-target Search Algorithm Our objective is to design an algorithm for sequentially determining sampling points that lead to expedited detection and localization of targets within desired misclassification rate 𝛿 ∈ (0, 1/2). In particular, the algorithm should estimate the region containing targets 𝐷 𝑡 ⊆ 𝐷 such that (i)   ∀𝒙 ∈ 𝐷 𝑡 : P 𝑓 (𝒙) < th ≤ 𝛿 and (ii) ∀𝒙 ∈ 𝐷 \ 𝐷 𝑡 : P 𝑓 (𝒙) ≥ th + Δ ≤ 𝛿. The requirements about both false alarm and mis-detection rate are set by above two conditions. Let 𝑡 (Δ, 𝛿) be the total (traveling and sampling) time to finish the search task with misclassi- fication rate smaller than 𝛿. Then, the objective of the algorithm is to determine the sequence of sampling points that minimize 𝑡 (Δ, 𝛿). 6.2 Expedited Multi-target Search Algorithm The proposed Expedited Multi-target Search (EMTS) algorithm is illustrated in Figure. 6.1. It operates using an epoch-based structure. In each epoch, the sampling and fidelity planner computes a set of sampling points and the path planner optimizes a TSP tour going through those points. 
The vehicle follows the TSP tour to collect measurements at sampling points and the inference algorithm uses these measurements to update the estimate of the score function 𝑓 . Then, the Bayesian classification uses these estimates to compute an occupancy map of the floor and the region elimination module removes regions with no target with sufficiently high probability from the search space. In the following, we describe each of these modules in detail. 82 Figure 6.1: Architecture of EMTS. 6.2.1 Inference Algorithm for Multi-fidelity GPs The Bayesian inference method for multi-fidelity GPs discussed in this section is an extension of the inference procedure in [12] for the case of no sampling noise. Let the set of sampling location-  score-fidelity tuples after 𝑛 observations be P𝑛 = { 𝒙 𝑖 , 𝑦𝑖 , 𝑚𝑖 | 𝑖 ∈ {1, . . . , 𝑛}}. For each fidelity 𝑚, define a subset of P𝑛 ,  𝑃𝑛𝑚 = { 𝒙 𝑖 , 𝑦𝑖 , 𝑚𝑖 ∈ P𝑛 | 𝑚𝑖 = 𝑚}, and |𝑃𝑛𝑚 | denote the cardinality of 𝑃𝑛𝑚 . Recall that 𝑘 𝑖 (𝒙, 𝒙 0) is the kernel function for the GP ℎ𝑖 at 𝑖-th 0 0 0 fidelity level. Let 𝑲 𝑖0 𝑃𝑛𝑚 , 𝑃𝑛𝑚 be a |𝑃𝑛𝑚 | × |𝑃𝑛𝑚 | matrix with entries 𝑘 𝑖 (𝒙, 𝒙 0), 𝒙 ∈ 𝑃𝑛𝑚 , 𝒙 0 ∈ 𝑃𝑛𝑚 and 𝑲 𝑖0 (𝑃𝑛𝑚 , 𝒙) be a |𝑃𝑛𝑚 | dimensional vector with entries 𝑘 0𝑖 (𝒙 0, 𝒙), 𝒙 0 ∈ 𝑃𝑛𝑚 . Let 𝑲 be a 𝑀 × 𝑀 block matrix with (𝑚, 𝑚0) block submatrix 0 Õ ) min(𝑚,𝑚 0  𝑲 𝑚,𝑚 0 = 𝑲 𝑖 𝑃𝑛(𝑚) , 𝑃𝑛(𝑚 ) . 𝑖=1 Let 𝒌 (𝒙) be a |P𝑛 | dimensional vector constructed by concatenating 𝑀 sub-vectors 𝒌 (𝒙) =  𝒌 1 (𝒙), . . . , 𝒌 𝑀 (𝒙) , where Õ 𝑚 𝑚 𝒌 (𝒙) = 𝑲 𝑖 (𝑃𝑛𝑚 , 𝒙), ∀𝑚 ∈ {1, . . . , 𝑀 }. (6.4) 𝑖=1 Denoted by 𝚯 is the 𝑀 × 𝑀 diagonal matrix with the variance of sampling noise at diagonal entries n o 𝚯 = diag 𝑠2𝑚 𝑰 |𝑃𝑛𝑚 | . 𝑚={1,...,𝑀 } 83  Let 𝝂 𝑛 = [𝜈1 , . . . , 𝜈𝑛 ] be the a priori mean of the sample 𝒚 𝑛 = 𝑦 1 , . . . , 𝑦 𝑛 . In particular, if Í𝑚 𝑦 𝑗 is a sample at fidelity 𝑚, then 𝜈 𝑗 = 𝑖=1 𝜇𝑖 . The a priori covariance of 𝒚 𝑛 is 𝑲 + 𝚯. In the training process with training dataset P𝑛 , the hyperparameters {𝜇𝑚 , 𝑣 𝑚 , 𝑙 𝑚 , 𝑠𝑚 } 𝑚=1 𝑀 and {𝑎 } 𝑀−1 𝑚 𝑚=1 in the multi-fidelity GP can be learned by maximizing a log marginal likelihood function 1   1 𝒚 − 𝝂 𝑛 (𝑲 + 𝚯) −1 𝒚 − 𝝂 𝑛 . 𝑇  − log det 2𝜋 (𝑲 + 𝚯) − 2 2 Such training can be performed using the GP toolbox [109]. Due to the multi-fidelity structure described in (6.1) and (6.2), the prior mean and covariance of 𝑓 are Õ𝑀 Õ𝑀 𝜇0 (𝒙) = 𝜇𝑚 , 𝑘 0 (𝒙, 𝒙 0) = 𝑘 𝑚 (𝒙, 𝒙 0). 𝑚=1 𝑚=1 When running EMTS with learned hyperparameters, it can be shown that the posterior mean and covariance functions of 𝑓 after 𝑛 measurements are 𝜇𝑛 (𝒙) = 𝜇0 (𝒙) + 𝒌 𝑇 (𝒙) (𝑲 + 𝚯) −1 𝒚 − 𝝂 𝑛  (6.5) 𝑘 𝑛 𝒙, 𝒙 0 = 𝑘 0 𝒙, 𝒙 0 − 𝒌 𝑇 (𝒙) (𝑲 + 𝚯) −1 𝒌 (𝒙 0).   Note that the posterior variance 𝜎𝑛2 (𝒙) = 𝑘 𝑛 (𝒙, 𝒙) is a measure of uncertainty that will be utilized to classify 𝒙. It should be noted that the measurements collected at different fidelity levels are appropriately scaled in inference (6.5). 6.2.2 Multi-fidelity Sampling & Path Planning For each epoch 𝑗, we seek to design an efficient sampling tour through sampling locations {(𝒙 𝑛 𝑗 +1 , 𝑧 𝑛 𝑗 +1 ), . . . , (𝒙 𝑛 𝑗+1 , 𝑧 𝑛 𝑗+1 )} to ensure  max 𝜎𝑛 𝑗+1 (𝒙) max 𝜎𝑛 𝑗 (𝒙) ≤ 𝛼, 𝒙∈𝐷 𝒙∈𝐷 where 𝑛 𝑗 is the number of samples collected before the beginning of the 𝑗-th epoch and the selection of uncertainty reduction threshold 𝛼 is discussed in Section 6.2.3. Notice that the posterior variance update in (6.5) depends only on the location of the observations 𝒚 𝑛 , but not on the realized value of 𝒚 𝑛 . Therefore, the sequence of sampling location-fidelity 84 tuples can be computed before physically visiting the locations. 
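A compact sketch of the two-fidelity special case of the update (6.5) is given below; it makes the preceding point explicit, since the posterior variance is returned even when no measurement vector is supplied. The hyperparameter values, the function names, and the restriction to M = 2 are illustrative assumptions, not the released implementation; X_train and X_test are arrays of planar locations and fid is an integer array of fidelity labels.

import numpy as np

def sq_exp(X1, X2, v, length):
    # Squared exponential kernel (6.3) with variance v^2 and length scale `length`.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return v ** 2 * np.exp(-d2 / (2.0 * length ** 2))

def mf_posterior(X_train, fid, X_test, y=None, v=(0.5, 0.3), length=(8.0, 3.0),
                 s=(0.05, 0.05), mean_h=(0.4, 0.0)):
    # Posterior of the highest-fidelity score f at X_test for M = 2 fidelity levels.
    # fid[i] is in {1, 2}; entry (i, j) of the prior sample covariance is the sum of
    # k_m(x_i, x_j) over m <= min(fid[i], fid[j]), as in the block matrix K of 6.2.1.
    n = len(X_train)
    K = np.zeros((n, n))
    k_vec = np.zeros((n, len(X_test)))
    for m in (1, 2):
        Km = sq_exp(X_train, X_train, v[m - 1], length[m - 1])
        K += (np.minimum(fid[:, None], fid[None, :]) >= m) * Km
        k_vec += (fid[:, None] >= m) * sq_exp(X_train, X_test, v[m - 1], length[m - 1])
    Theta = np.diag([s[f - 1] ** 2 for f in fid])
    A = np.linalg.solve(K + Theta, k_vec)                   # (K + Theta)^{-1} k(x)
    prior_var = sum(vi ** 2 for vi in v)
    var = prior_var - np.einsum('ij,ij->j', k_vec, A)       # needs no measurements
    if y is None:
        return None, var
    nu = np.array([sum(mean_h[:f]) for f in fid])           # prior means of the samples
    mean = sum(mean_h) + A.T @ (y - nu)
    return mean, var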
Such deterministic evolution of the variance has also been leveraged within the context of single-fidelity GP planning to design efficient sampling tours [110]. Sampling Point Selection. The vehicle follows a greedy sampling policy at each fidelity level, i.e., at each sampling round the vehicle selects the most uncertain point as the next sampling point 𝒙 𝑛 = arg max 𝜎𝑛−1 (𝒙). (6.6) 𝒙∈𝐷 In the information theoretic view [58], the greedy policy is near-optimal in terms of maximizing an appropriate measure of uncertainty reduction. Fidelity Selection. For each sampling point 𝒙 𝑛 , a fidelity level (or sampling altitude) needs to be assigned. We let the vehicle start at fidelity level 1 and successively visit all fidelity levels from the lowest to the highest. Since sampling 𝑓 𝑚 is not able to reduce the uncertainty about 𝑓 introduced by the subsequent bias terms ℎ𝑚+1 , . . . , ℎ 𝑀 , we define the inaccessible uncertainty at fidelity level Í𝑀 𝑚 as 𝜉𝑚 = 𝑖=𝑚+1 𝑣 𝑖2 . Accordingly, we define the accessible uncertainty about 𝑓 at fidelity level 𝑚 by 𝑟 𝑛𝑚 = max𝒙∈𝐷 𝜎𝑛2 (𝒙) − 𝜉𝑚 . The assigned fidelity level to sample point 𝒙 𝑛 is designed to change from fidelity 𝑚 to 𝑚 + 1 when 𝑟 𝑛𝑚 ≤ 𝑣 2𝑚+1 𝑙 𝑚+1 2 2 /𝑙 𝑚 . Notice that before the vehicle begins to sample at fidelity level 𝑚, 𝑟 𝑛𝑚 ≥ 𝑣 2𝑚 ≥ 𝑣 2𝑚+1 𝑙 𝑚+1 2 /𝑙 2 , where 𝑚 the second inequality is due to the assumption that 𝑣 𝑚 > 𝑣 𝑚+1 and 𝑙 𝑚 > 𝑙 𝑚+1 . This ensures that all fidelity levels are visited from the lowest to the highest successively. Path Planning. Since the order of sampling locations does not influence the eventual posterior mean and variance, the path going through the sampling location can be optimized by computing an approximate TSP tour using packages, such as Concorde [111]. Such a tour-based sampling policy allows for energy and time efficient operation of the vehicle. If all measurements within epoch 𝑗 are collected at the same fidelity level, the vehicle traverses the TSP tour TSP(𝒙 𝑛 𝑗 +1 , . . . , 𝒙 𝑛 𝑗+1 ) to collect measurements from sampling points and update posterior distribution of 𝑓 . Otherwise, a TSP tour is designed at each fidelity level. 85 6.2.3 Classification and Region Elimination The classification and elimination of regions follow a confidence-bound-based rule, which has been widely used in pure exploration multi-armed bandit algorithms [112] and robotic source seeking [113]. We extend these ideas to the case of multi-fidelity GP setting. Conditioned on P𝑛 , the distribution of 𝑓 (𝒙) is Gaussian with mean function 𝜇𝑛 (𝒙) and vari- ance 𝜎𝑛2 (𝒙). Let (𝐿 𝑛 (𝒙, 𝜀), 𝑈𝑛 (𝒙, 𝜀)) be the Bayesian confidence interval containing 𝑓 (𝒙) with probability greater than (1 − 2𝜀). Here, the lower confidence bound 𝐿 𝑛 and upper confidence bound 𝑈𝑛 are defined by 𝐿 𝑛 (𝒙, 𝜀) = 𝜇𝑛 (𝒙) − 𝑐(𝜀)𝜎𝑛 (𝒙) , 𝑈𝑛 (𝒙, 𝜀) = 𝜇𝑛 (𝒙) + 𝑐(𝜀)𝜎𝑛 (𝒙) , with q  𝑐(𝜀) = 2 ln 1/(2𝜀) . Given the desired maximum misclassification rate 𝛿, at the end of epoch 𝑗, a location 𝒙 is   classified as target, if 𝐿 𝑛 𝑗 𝒙, 𝛿/2 𝑗 ≥ th, and is added to 𝐷 𝑡 ; while it is classified as empty, if   𝑈𝑛 𝑗 𝒙, 𝛿/2 𝑗 < th, and is added to the set 𝐷 𝑒 . Note that the confidence parameter 𝜀 = 𝛿/2 𝑗 defining the lower and upper bounds is decreased exponentially with epochs, and we will show that it ensures a misclassification rate smaller than 𝛿. The locations in the set 𝐷 𝑒 are removed from sampling space 𝐷 at the end of each epoch. EMTS is terminated if max𝒙∈𝐷 2𝜎𝑛 𝑗 (𝒙) ≤ Δ/𝑐(𝛿/2 𝑗 ). 
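The two decision rules that drive an epoch, the fidelity-switch test and the confidence-bound classification, can be summarized by the following sketch. The list indexing convention (v[0] corresponding to fidelity level 1) and the function names are ours; the thresholds themselves follow the rules stated above.

import numpy as np

def conf_width(eps):
    # c(eps) = sqrt(2 ln(1 / (2 eps))) from the Bayesian confidence interval.
    return np.sqrt(2.0 * np.log(1.0 / (2.0 * eps)))

def next_fidelity(m, max_post_var, v, length, n_levels):
    # Switch from level m to m + 1 once the accessible uncertainty r_n^m is small.
    # v and length are 0-indexed lists, so v[0] and length[0] describe fidelity level 1.
    if m == n_levels:
        return m
    xi_m = sum(vi ** 2 for vi in v[m:])                     # inaccessible uncertainty
    r_m = max_post_var - xi_m                               # accessible uncertainty
    switch = r_m <= (v[m] ** 2) * (length[m] ** 2) / (length[m - 1] ** 2)
    return m + 1 if switch else m

def classify_epoch(mu_n, sigma_n, th, delta, epoch):
    # Label each candidate location: +1 target (D_t), -1 empty (D_e), 0 still uncertain.
    c = conf_width(delta / 2.0 ** epoch)                    # epsilon = delta / 2^j
    labels = np.zeros(len(mu_n), dtype=int)
    labels[mu_n - c * sigma_n >= th] = 1
    labels[mu_n + c * sigma_n < th] = -1
    return labels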
The selection of 𝛼 depends on the balance between the efficiency of the TSP path planer and region elimination. TSP path planer is more effective with smaller 𝛼 since each exploration tour includes more sample points. While region elimination favors bigger 𝛼 so that regions not likely to contain targets are removed more frequently. 6.3 An Illustrative Example In this section, we illustrate EMTS using the Unmanned Underwater Vehicle Simulator [114], which is a ROS package designed for Gazebo robot simulation environment. We integrate it with YOLOv3 [108] for image classification and Concorde solver [111] to compute TSP tours. We use 2 fidelity levels situated at 11m and 5m from the water floor, respectively. In Figure. 6.2, the left figure shows our simulation setup, where an underwater vehicle is equipped with a downward camera and a flashlight to facilitate the searching task in a dark underwater environment. The middle figure 86 Figure 6.2: Underwater victim search simulation setups. and right figure in Figure 6.2 are the detection results with YOLOv3 at a high fidelity level and a low fidelity level, respectively. There are 3 victims located at different unknown locations on a 40m × 40m water floor. At each sampling point, the vehicle takes 20 images and YOLOv3 returns an average score about the confidence level of the existence of victims in the view. The first three subplots of Figure. 6.3 show the classification of regions before each epoch, the sampling points selected by the greedy policy and the planned path. Classifications of the environment are represented by 3 colors: red means target exist, blue means no target, and green means uncertain. The dark green points and lines are the planned sampling locations and paths at the low fidelity level and red points and lines are sampling locations and paths at the high fidelity level. At the beginning of epoch 1, all regions are classified as uncertain. After each epoch, the region of targets is narrowed down. The search task is terminated after three epochs. Notice that the vehicle switches to the high fidelity level at epoch 2. The tours at low and high fidelity levels are plotted using two different colors. The vehicles do not sample in blue regions since they have been classified as empty. In the final result, the regions with target are successfully found. A video of the simulation is available online2. Figure. 6.4a shows the heat map of posterior variance for the whole region at the end of simulations. It reflects the nature of uncertainty reduction with EMTS, i.e., the posterior variance is low only at areas that likely contain a target. The regions classified as empty have larger posterior variance since they have been eliminated from sampling space in the early phase. This shows that 2 https://mediaspace.msu.edu/media/EMTS/1_phbul7ui 87 (a) Epoch 1 (b) Epoch 2 (c) Epoch 3 (d) Final result Figure 6.3: Simulation result of EMTS. EMTS is able to put more focus on areas likely to contain victims. The uncertainty reduction, i.e. the decrease in maximum posterior variance, for multi-fidelity greedy sampling and single- fidelity greedy sampling, are compared in Figure. 6.4b. It shows that greedy multi-fidelity sampling can reduce uncertainty much faster at the beginning stage, which will enable EMTS to eliminate unoccupied regions quickly, and hence, accelerate target search. (a) Final posterior variance (b) Convergence of 𝜎𝑛2 Figure 6.4: Uncertainty reduction results. 
88 6.4 Analysis of the EMTS Algorithms In this section, we analyze the modules of the EMTS algorithm and use these analyses to derive an upper bound on the expected detection time for the overall algorithm. 6.4.1 Analysis of the classification algorithm We first characterize the Bayesian confidence interval for 𝑓 (𝒙), and then use this result to establish that the EMTS algorithm ensures the desired classification accuracy.   Lemma 6.1 (Bayesian confidence interval). For 𝑓 (𝒙) | P𝑛 ∼ 𝑁 𝜇𝑛 (𝒙), 𝜎𝑛2 (𝒙) and 𝜀 ∈ (0, 1/2),   P 𝑓 (𝒙) ≤ 𝐿 𝑛 (𝒙, 𝜀) = P 𝑓 (𝒙) ≥ 𝑈𝑛 (𝒙, 𝜀) ≤ 𝜀.  q  Proof. To normalize 𝑓 (𝒙), let 𝑟 = 𝑓 (𝒙) − 𝜇(𝒙) /𝜎(𝒙) and 𝑐(𝜀) = 2 ln 1/(2𝜀) . Now 𝑟 ∼ 𝑁 (0, 1), and from tail-inequality for standard normal distribution [115] ! 1 𝑐2 P (𝑟 ≥ 𝑐) ≤ exp − = 𝜀, 2 2  which prove the P 𝑓 (𝒙) ≥ 𝑈𝑛 (𝒙, 𝜀) ≤ 𝜀. Similar result holds for lower confidence bound.  Theorem 6.2 (Misclassification Rate). For the classification strategy in the EMTS algorithm, a location 𝒙 ∈ 𝐷 is misclassified with probability at most equal to 𝛿. Proof. Consider a location 𝒙 such that 𝑓 (𝒙) ≤ th, i.e., the true classification of 𝒙 is empty. Since at the end of epoch 𝑗, the lower and upper confidence bounds used for classification employ 𝜀 = 𝛿/2 𝑗 , we apply a union bound to show the probability of classifying 𝒙 as a target satisfies Õ∞   Õ ∞   𝑗 P 𝐿 𝑛 𝑗 (𝒙, 𝛿/2 ) > th ≤ P 𝐿 𝑛 𝑗 (𝒙, 𝛿/2 𝑗 ) > 𝑓 (𝒙) . 𝑗=1 𝑗=1 Í∞ Then, it follows from Lemma 6.1 that the misclassification probability is no greater than 𝑗=1 𝛿/2 𝑗 = 𝛿. The case of location 𝒙 being occupied by a target follows similarly.  89 6.4.2 Analysis of the Sampling and Fidelity Planner We now analyze the information gain and uncertainty reduction properties for our sampling and fidelity planner. We first recall some results for the single fidelity planner and then extend them to the case of the multi-fidelity planner. Consider a single-fidelity GP 𝑓 that is sampled with additive Gaussian noise with variance 𝑠2 . Let 𝑋𝑛 be the set of first 𝑛 sampling points and let the vector of associated observations be 𝒚 𝑋𝑛 . It is shown in [44, Lemma 5.3] that the mutual information between 𝒚 𝑋𝑛 and 𝑓 is   1Õ 𝑛   𝐼 𝒚 𝑋𝑛 ; 𝑓 = log 1 + 𝑠−2 𝜎𝑖−1 2 (𝒙 𝑖 ) , (6.7) 2 𝑖=1 where 𝒇 𝑋𝑛 is the vector of 𝑓 (𝒙) calculated at points in 𝑋𝑛 . Let the maximal mutual information gain with 𝑛 samples be  𝛾𝑛 := max 𝐼 𝒚𝑍 ; 𝑓 . 𝑍 ∈𝐷:|𝑍 |=𝑛 Let 𝐼greedy be the total mutual information gain using a greedy policy that maximizes the summand   in (6.7) at each sampling step. It follows, due to submodularity [116] of 𝐼 𝒚 𝑋𝑛 ; 𝑓 , that     1 1− 𝛾𝑛 ≤ 𝐼greedy 𝒚 𝑋𝑛 ; 𝑓 ≤ 𝛾𝑛 , 𝑒 While giving an exact value of 𝛾𝑛 is difficult, an upper bound on 𝛾𝑛 for squared exponential kernel derived in [44] is presented in the following Lemma 6.3. Lemma 6.3 (Information gain for squared exp. kernel). Let a GP 𝑓 be defined on domain 𝐷 ⊂ R2 . If 𝑓 has squared exponential kernel with length scale 𝑙, then the maximum mutual information satisfies 𝛾𝑛 (𝑙) ∈ 𝑂 (𝑙 −2 (log 𝑛) 3 ). Proof. For a GP defined on 𝐷 ∈ [0, 1] 2 with squared exponential kernel function 𝑘 (𝒙, 𝒙 0) = exp(−k𝒙 − 𝒙 0 k 2 /2), 𝛾𝑛 ∈ 𝑂 ((log 𝑛) 3 ) [44]. It is shown in [117] that 𝛾𝑛 scales with the area of 𝐷.   Thus, if the diameter of 𝐷 is 𝑑, then 𝛾𝑛 ∈ 𝑂 𝑑 2 (log 𝑛) 3 . Note that having length scale 𝑙 in kernel   function is equivalent to scale 𝐷 by 1/𝑙. Accordingly, 𝛾𝑛 ∈ 𝑂 𝑑 2 𝑙 −2 (log 𝑛) 3 . For fixed 𝐷, we omit diameter 𝑑 from the order notation and write 𝛾𝑛 (𝑙) ∈ 𝑂 (𝑙 −2 (log 𝑛) 3 ).  90 Lemma 6.3 provides a bound on the mutual information gain at the first fidelity level. 
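Before moving to higher fidelity levels, the following small sketch shows how (6.7) is evaluated along a greedy maximum-variance sampling sequence, using the standard rank-one Gaussian conditioning update; the prior covariance matrix over a finite set of candidate points and the noise variance are assumed to be given.

import numpy as np

def greedy_information_gain(K_prior, noise_var, n_samples):
    # Evaluate (6.7) along a greedy maximum-variance sampling sequence over a finite
    # set of candidate points with prior covariance K_prior and noise variance s^2.
    K = K_prior.copy()
    info = 0.0
    for _ in range(n_samples):
        i = int(np.argmax(np.diag(K)))                      # most uncertain point
        var_i = K[i, i]
        info += 0.5 * np.log(1.0 + var_i / noise_var)
        k_i = K[:, i].copy()
        K -= np.outer(k_i, k_i) / (var_i + noise_var)       # rank-one conditioning update
    return info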
For higher fidelity levels, the Gaussian process is composed of the summation of independent GPs. We now establish that the information gained by sampling the sum of GPs is smaller than the information gained by sampling them independently. and then use this result to establish the bound on information gain for multi-fidelity GPs. Lemma 6.4 (Information gain for sum of GPs). Let ℎ1 ∼ 𝐺𝑃(𝜇1 (𝒙), 𝑘 1 (𝒙, 𝒙 0)) and ℎ2 ∼ 𝐺𝑃(𝜇2 (𝒙), 𝑘 2 (𝒙, 𝒙 0)) be independent GPs. Consider a measurement 𝑦 = ℎ1 (𝒙) + ℎ2 (𝒙) + 𝜖 at point 𝒙, where 𝜖 is additive measurement noise independent of ℎ1 and ℎ2 . Let 𝒚 𝑋 = 𝒉1,𝑋 + 𝒉2,𝑋 + 𝝐 be the vector of such measurements at sampling points in a set 𝑋, where 𝝐 is the vector of i.i.d. measurement noise. Then, 𝐼 (𝒚 𝑋 ; ℎ1 + ℎ2 ) ≤ 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 ) + 𝐼 (𝒉2,𝑋 + 𝝐; ℎ2 ). Proof. The data processing inequality [118, Theorem 2.8.1] indicates 𝐼 (𝒚 𝑋 ; ℎ1 + ℎ2 ) ≤ 𝐼 ( 𝒚 𝑋 ; ℎ1 , ℎ2 ) = 𝐼 ( 𝒚 𝑋 ; ℎ1 ) + 𝐼 ( 𝒚 𝑋 ; ℎ2 | ℎ1 ). Applying the data processing inequality again, we get 𝐼 ( 𝒚 𝑋 ; ℎ1 ) ≤ 𝐼 (𝒉1,𝑋 + 𝝐, 𝒉2,𝑋 ; ℎ1 ) = 𝐼 (𝒉2,𝑋 ; ℎ1 ) + 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 | 𝒉2,𝑋 ) = 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 ), where in the last step follows due to the independence of ℎ1 , ℎ2 and 𝝐. Similarly, it can be shown 𝐼 ( 𝒚 𝑋 ; ℎ2 | ℎ1 ) = 𝐼 (𝒉1,𝑋 + 𝒉2,𝑋 + 𝝐; ℎ2 | ℎ1 ) = 𝐼 (𝒉2,𝑋 + 𝝐; ℎ2 ). This establishes the lemma.  Let 𝛾𝑛𝑚 be the maximal mutual information gain at fidelity 𝑚. It follows from Lemma 6.4 Í𝑚 and the multi-fidelity GP model in (6.2) that 𝛾𝑛𝑚 ≤ 𝑖=1 𝛾𝑛 (𝑙𝑖 ). Combining this inequality with Lemma 6.3, we obtain the following result. 91 Corollary 6.5 (Information gain for multi-fidelity GPs). The maximal mutual information gain at fidelity 𝑚 satisfies Õ 𝑚  𝛾𝑛𝑚 ∈ 𝑂 𝑙𝑖−2 (log 𝑛) 3 . 𝑖=1 This corollary gives us an insight on the size of 𝛾𝑛𝑚 at different fidelity levels. It follows that 𝛾𝑛(𝑚) grows faster at higher fidelity levels. We now derive a bound on the posterior variance for the multi-fidelity GP in terms of the maximum mutual information gain. Lemma 6.6 (Uncertainty reduction for multi-fidelity GPs). Let 𝑓 ∼ 𝐺𝑃 𝜇0 (𝒙), 𝑘 0 (𝒙, 𝒙 0) and  𝜎02 (𝒙) ≤ 𝜎 2 , for each 𝒙 ∈ 𝐷. An additive sampling noise 𝜖 ∼ 𝑁 (0, 𝑠2 ) is incurred every time 𝑓 is accessed. Under the greedy sampling policy the posterior variance after 𝑛 sampling rounds satisfies 2𝜎 2 𝛾𝑛 max 𝜎𝑛2 (𝒙) ≤ −2  . 𝒙∈𝐷 log 1 + 𝑠 𝜎 𝑛 2 Proof. For any 𝒙 ∈ 𝐷, 𝜎𝑛2 (𝒙) is monotonically non-increasing in 𝑛. So we get max 𝜎𝑛2 (𝒙) = 𝜎𝑛2 (𝒙 𝑛+1 ) ≤ 𝜎𝑛−1 2 2 (𝒙 𝑛+1 ) ≤ 𝜎𝑛−1 (𝒙 𝑛 ), (6.8) 𝒙∈𝐷 where the second inequality is due to the fact 𝒙 𝑛 = arg max 𝒙∈𝐷 𝜎𝑛−1 2 (𝒙). Again since 𝒙 𝑛+1 = arg max 𝒙∈𝐷 𝜎𝑛2 (𝒙), inequality (6.8) also indicates that 𝜎𝑛−1 2 (𝒙 ) is monotonically non-increasing. 𝑛     Hence, from (6.7), log 1 + 𝑠−2 𝜎𝑛−1 2 (𝒙 ) ≤ 2𝐼  2 2 is 𝑛 greedy 𝒚 𝑋 ; 𝑓 /𝑛 ≤ 2𝛾𝑛 /𝑛. Since 𝑠 /log 1 + 𝑠 an increasing function on [0, ∞), 2 𝜎2  −2 2  𝜎𝑛−1 (𝒙 𝑛 ) ≤  log 1 + 𝑠 𝜎𝑛−1 (𝒙 𝑛 ) . log 1 + 𝑠−2 𝜎 2 Substituting (6.8) into it, we conclude that 2𝜎 2 𝛾𝑛 max 𝜎𝑛2 (𝒙) ≤ 2 𝜎𝑛−1 (𝒙 𝑛 ) ≤ −2  . 𝒙∈𝐷 log 1 + 𝑠 𝜎 𝑛 2  Lemma 6.6 indicates that the smaller and the more slowly growing 𝛾𝑛 is, the faster max𝒙∈𝐷 𝜎𝑛 (𝒙) converges. This result explains our idea of using a multi-fidelity model. 92 6.4.3 Analysis of Expected Detection Time We now derive an upper bound on the number of samples needed to classify a location using the EMTS algorithm and then use this result to compute the total sampling and travel time required for classification. Lemma 6.7 (Sample complexity for uncertainty reduction). 
In the autoregressive multi-fidelity model (6.3), if each ℎ (𝑚) has a squared exponential kernel, then 3! 𝜎02  𝜎0 min{𝑛 ∈ N | max 𝜎𝑛 (𝒙) ≤ Δ} ∈ 𝑂 2 ln . 𝒙∈𝐷 Δ Δ Proof. It follows from Lemma 6.6 that 𝑛 2𝜎02 ≤ . 𝛾𝑛 max𝒙∈𝐷 𝜎𝑛2 (𝒙) Since 𝑣 𝑚 , 𝑠𝑚 and 𝑙 𝑚 for all fidelity levels are finite, it follows from Corollary 6.5 that 𝛾𝑛 ∈ 𝑂 ((ln 𝑛) 3 ). Combining these results, the lemma follows by inspection.  Lemma 6.8 (Sample complexity for EMTS). For a given misclassification tolerance 𝛿, let 𝑛(𝒙, 𝛿) be the number of samples required to classify 𝒙 ∈ 𝐷. Then, the expected number of samples satisfies  3 E [𝑛(𝒙, 𝛿) | Δ(𝒙)] ∈ 𝑂 𝜑(Δ(𝒙), 𝛿) ln 𝜑(Δ(𝒙), 𝛿) , 𝜎02   3𝜎0 where Δ(𝒙) = 𝑓 (𝒙) − th and 𝜑(Δ(𝒙), 𝛿) = Δ2 (𝒙) ln 𝛿Δ(𝒙) .  𝑗+1 Proof. Since 𝛿 < 1/2, function 𝑐(𝛿/2 𝑗 ) 3/4 is monotonically decreasing for 𝑗 ≥ 2. We define  s     3𝜎 0 3𝜎0 ª 𝐽 = log4/3 ­ ® + 1. ©  2 ln  Δ(𝒙) 𝛿Δ(𝒙)   « ¬ It can be shown that the choice of 𝐽 ensures, for 𝑗 ≥ 𝐽,  𝑗+1  𝐽+1 𝑈 (𝒙, 𝛿/2 𝑗 ) − 𝐿(𝒙, 𝛿/2 𝑗 ) ≤2𝑐(𝛿/2 𝑗 ) 3/4 𝜎0 ≤ 2𝑐(𝛿/2𝐽 ) 3/4 𝜎0 v u u   u u q 3𝜎 t 𝛼 ln 𝛿Δ(𝒙) 2 ln 𝛿Δ(𝒙) u 3𝜎 Δ(𝒙) ≤ 3𝜎 < Δ(𝒙) (6.9) 2 ln 𝛿Δ(𝒙) 93 where 𝛼 = log4/3 2 and the second inequality is due to the fact ln(𝑥 ln(𝑥))/ln(𝑥) ≤ (1 + 𝑒)/𝑒. For a point 𝒙 at which 𝑐∗ (𝒙) = 1 and Δ(𝒙) > 0, based on (6.9), the number of sampling rounds to classify 𝒙 satisfies Õ∞ 𝑛(𝒙, 𝛿) ≤ 𝑛 𝐽 + 1𝐿(𝒙, 𝛿/2 𝑗 ) < th ≤ 𝑈 (𝒙, 𝛿/2 𝑗 ) 𝑗=𝐽+1 Õ∞ ≤ 𝑛𝐽 + 1𝐿(𝒙, 𝛿/2 𝑗 ) < th 𝑗=𝐽+1 Õ∞ ≤ 𝑛𝐽 + 1𝑈 (𝒙, 𝛿/2 𝑗 ) < 𝑓 (𝒙), 𝑗=𝐽+1 where 𝑛 𝐽 is the number of samples collected in the first 𝐽 epochs. Then the expected sampling rounds can be bounded as Õ∞   ¯ 𝛿) ≤ 𝑛 𝐽 + 𝑛(𝒙, P 𝐿 (𝒙, 𝛿/2 𝑗 ) ≥ th 𝑗=𝐽+1 Õ∞   𝑗 ≤ 𝑛𝐽 + P 𝐿 (𝒙, 𝛿/2 ) ≥ th 𝑗=𝐽+1 ∞ Õ 𝑛𝑗 ≤ 𝑛𝐽 + . 𝑗=1 2𝑗 From Lemma 6.7, we has 𝑛𝑖 ∈ 𝑂˜ ((16/9) 𝑗 ). Therefore ∞ Í 𝑗=1 𝑛 𝑗 /2 is finite. So we conclude 𝑗  3 ¯ 𝛿) ∈ 𝑂 𝜑(Δ(𝒙), 𝛿) ln 𝜑(Δ(𝒙), 𝛿) . 𝑛(𝒙,  Remark 6.1 (Comparison with sample complexity of multiarmed bandits). Notice that   1 E [𝑛(𝒙, 𝛿) | Δ(𝒙)] ∈ 𝑂˜ 2 Δ (𝒙) describes the complexity to of classification of 𝒙, i.e., for a point with 𝑓 (𝒙) close to th more time is needed. This term is similar to the sampling complexity [73] in a pure-exploration multi-armed bandit problem. This result is based on the assumption that GPs all have squared exponential kernel. For kernels characterizing less correlations, e.g. Matérn kernels, more sampling rounds are expected.  94 We now derive an upper bound on detection time for EMTS. Theorem 6.9 (Target search time for EMTS). For a given misclassification tolerance 𝛿 and detection difficulty measure 1/Δ, the target search time satisfies  3 𝑡 (Δ, 𝛿) ∈ 𝑂 𝑑 2 𝜑(Δ, 𝛿) ln 𝜑(Δ, 𝛿) . Proof. Since we assume unit sampling time, the total sampling time is in the same order as 𝑛(𝒙, 𝛿). Then we consider the traveling time spent in order to collected those samples. Since EMTS requires the vehicle to search from low fidelity level to high fidelity level, the total number of altitude switches is no greater than 𝑀 − 1. As presented in [119], for 𝑛 points in [0, 1] 2 , the length of the shortest TSP √ p  Tour < 0.984 2𝑛 + 11. Therefore, the expected traveling time belongs to 𝑂 𝑑 𝑛(𝒙, ¯ 𝛿) , where 𝑑 is the diameter of 𝐷. Thus, the expected traveling time belongs to 𝑜( 𝑛(𝒙, ¯ 𝛿)). Considering both  2 3 sampling and traveling time, we conclude 𝑡 (Δ, 𝛿) ∈ 𝑂 𝑑 𝜑(Δ, 𝛿) ln 𝜑(Δ, 𝛿) .  Theorem 6.9 illustrates the efficiency of the EMTS algorithm, we conjecture it to be near- optimal. 
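Up to the unspecified constants, the dominant term of this bound is easy to evaluate numerically, which helps visualize the dependence on Δ and 𝛿; the values of 𝜎_0 and the environment diameter below are placeholders.

import numpy as np

def phi(gap, delta, sigma0=1.0):
    # phi(Delta, delta) = (sigma0^2 / Delta^2) ln(3 sigma0 / (delta Delta)) from Lemma 6.8.
    return sigma0 ** 2 / gap ** 2 * np.log(3.0 * sigma0 / (delta * gap))

def detection_time_order(gap, delta, diam=1.0, sigma0=1.0):
    # Dominant term d^2 phi (ln phi)^3 of the Theorem 6.9 bound, constants dropped.
    p = phi(gap, delta, sigma0)
    return diam ** 2 * p * np.log(p) ** 3

for gap in (0.5, 0.2, 0.1):
    print(gap, detection_time_order(gap, delta=0.05))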
This upper bound has a natural implication that the target search time increases with the detection difficulty 1/Δ and with the desired classification accuracy 1 − 𝛿.

6.5 Summary

We studied the autonomous robotic search of an unknown number of targets located on the 2D floor of an unknown and uncertain 3D environment. The novelty of this work lies in using autoregressive multi-fidelity GPs [12, 117] to model the likelihood of the presence of a target at a location, which is computed by a computer vision algorithm using the sample collected at that location at a given altitude. The multi-fidelity GP sensing model captures the fact that a high-altitude (low-fidelity) sample provides more global but less accurate information compared with a low-altitude (high-fidelity) sample. We designed a multi-target search algorithm, EMTS, that leverages multi-fidelity GPs to capture the fidelity-coverage trade-off, information-theoretic techniques to efficiently explore the environment, and Bayesian techniques to accurately identify targets and construct an occupancy map. With rigorous analysis, we establish formal guarantees on the target detection accuracy and the expected detection time.

6.6 Bibliographic Remarks

Autonomous multi-target search requires the autonomous system to quickly and accurately locate multiple targets of interest in an unknown and uncertain environment. Examples include search and rescue missions, mineral exploration, and tracking natural phenomena. To improve the target search efficiency, the trajectory should be designed to balance the explore-exploit tension: the robot should spend more time at target locations while still learning where the targets are. There have been some efforts to address such explore-exploit tension within the context of informative path planning [9, 11, 36, 45–53]. Gaussian processes (GPs) are the most widely used models for capturing spatiotemporal sensing fields in robotics [42, 43]. Informative path planning using such models of the environment has been studied in [51, 53, 58, 120–122]. In a broader class of search problems, robot trajectories are designed to maximize the information collected along the waypoints while ensuring that the distance traveled is within a prescribed budget. Such informative path planning problems are studied in [54–58]. While GP-based approaches have been used extensively, most of them rely on single-fidelity measurements. Besides, most of these works focus on maximizing the reduction in uncertainty of the estimates instead of the efficiency of the target search. The multi-target search can also be viewed as a hot-spot identification problem in which, instead of the global maximum of the field, all locations with values greater than a threshold need to be identified. Such problems have been studied in the multi-armed bandit literature [123, 124]; however, we are not aware of any such studies in the GP setting. Furthermore, all these works focus on single-fidelity measurements, while we consider multiple fidelities of measurements. The multi-target search policy in this chapter can be viewed as a combination of informative path planning and MAB methods in an environment with multi-fidelity sensing information.

CHAPTER 7
ONLINE ESTIMATION AND COVERAGE CONTROL

An intuitive idea to extend the single-robot search policy to 𝑁 robots is to partition the environment into 𝑁 regions and allocate each robot to one partition. Ideally, the workload needs to be equitably distributed across all regions to maximize time efficiency.
Coverage control focuses on such equitable partitioning problems for a team of robots providing service to a large or continuous environment. The workload is typically referred to as the service demand in the coverage control literature. In the coverage problem [13], a demand function 𝜙 is defined over the environment, specifying the degree to which a robot is "needed" at each point. The team of agents aims to partition the environment and achieve a configuration that minimizes the coverage cost, defined by the sum of the 𝜙-weighted distances from every point in the environment to the nearest agent. Intuitively, more robots should concentrate in the regions with higher demand in order to reduce the coverage cost. This chapter is a slightly modified version of our published work on adaptive coverage control, and it is reproduced here with the permission of the copyright holder (©2021 IEEE; reprinted with permission from [125]). Unlike classic coverage control, which assumes the demand function 𝜙 to be known, we model it as a realization of a Gaussian process that can be learned by taking samples.

7.1 Online Estimation and Coverage Problem

We consider a team of 𝑁 agents tasked with providing coverage to a finite set of points in an environment represented by an undirected graph. The team is required to navigate within the graph to learn an unknown demand function while maintaining a near-optimal configuration. In this section, we present the preliminaries of the estimation and coverage problem.

7.1.1 Graph Representation of Environment

We consider a discrete environment modeled by an undirected graph 𝐺 = (𝑉, 𝐸), where the vertex set 𝑉 contains the finite set of points to be covered and the edge set 𝐸 ⊆ 𝑉 × 𝑉 is the collection of physically adjacent pairs of vertices that can be reached from each other without passing through other vertices. Let the weight map 𝑤 : 𝐸 → R_{>0} indicate the distance between connected vertices. We assume 𝐺 is connected. Following the standard definition of a weighted undirected graph, a path in 𝐺 is an ordered sequence of vertices such that there exists an edge between consecutive vertices. The distance between vertices 𝑣_𝑖 and 𝑣_𝑗 in 𝐺, denoted by 𝑑_𝐺(𝑣_𝑖, 𝑣_𝑗), is defined as the minimum over all paths between 𝑣_𝑖 and 𝑣_𝑗 of the sum of the edge weights along the path. Suppose there exists an unknown demand function 𝜙 : 𝑉 → R_{>0} that assigns a nonnegative weight to each vertex in 𝐺. Intuitively, 𝜙(𝑣_𝑖) could represent the intensity of a signal of interest, such as brightness or the volume of a sound. A robot at vertex 𝑣_𝑖 is capable of measuring 𝜙(𝑣_𝑖) by collecting a sample 𝑦 = 𝜙(𝑣_𝑖) + 𝜖, where 𝜖 ∼ N(0, 𝜎²) is additive zero-mean Gaussian noise.

7.1.2 Nonparametric Estimation

Let 𝝓 be the vector with 𝑖-th entry 𝜙(𝑣_𝑖), 𝑖 ∈ {1, . . . , |𝑉|}, where | · | denotes set cardinality. We assume a multivariate Gaussian prior for 𝝓 such that 𝝓 ∼ N(𝝁_0, 𝚲_0^{-1}), where 𝝁_0 is the mean vector and 𝚲_0 is the inverse covariance matrix. Let 𝑛_𝑖(𝑡) be the number of samples and 𝑠_𝑖(𝑡) the sum of the sampling results collected at 𝑣_𝑖 until time 𝑡. Then, the posterior distribution of 𝝓 at time 𝑡 is N(𝝁(𝑡), 𝚲^{-1}(𝑡)) [126, Chapter 10], where

\Lambda(t) = \Lambda_0 + \sum_{i=1}^{|V|} \frac{n_i(t)}{\sigma^2}\, \boldsymbol{e}_i \boldsymbol{e}_i^\top, \qquad \boldsymbol{\mu}(t) = \Lambda^{-1}(t) \left( \Lambda_0 \boldsymbol{\mu}_0 + \sum_{i=1}^{|V|} \frac{s_i(t)}{\sigma^2}\, \boldsymbol{e}_i \right). \tag{7.1}

Here, 𝒆_𝑖 is the standard unit vector whose 𝑖-th entry is 1.

7.1.3 Voronoi Partition and Coverage Problem

We define an 𝑁-partition of the graph 𝐺 as a collection 𝑃 = {𝑃_𝑖}_{𝑖=1}^𝑁 of 𝑁 nonempty subsets of 𝑉 such that ∪_{𝑖=1}^𝑁 𝑃_𝑖 = 𝑉 and 𝑃_𝑖 ∩ 𝑃_𝑗 = ∅ for any 𝑖 ≠ 𝑗.
7.1.3 Voronoi Partition and Coverage Problem
We define an 𝑁-partition of the graph 𝐺 as a collection 𝑃 = {𝑃𝑖}ᵢ₌₁ᴺ of 𝑁 nonempty subsets of 𝑉 such that ∪ᵢ₌₁ᴺ 𝑃𝑖 = 𝑉 and 𝑃𝑖 ∩ 𝑃𝑗 = ∅ for any 𝑖 ≠ 𝑗. 𝑃 is said to be connected if the subgraph induced by 𝑃𝑖, denoted by 𝐺[𝑃𝑖], is connected for each 𝑖 ∈ {1, . . . , 𝑁}. Here, 𝐺[𝑃𝑖] being an induced subgraph means that its vertex set is 𝑃𝑖 and its edge set contains all edges of 𝐺 whose end vertices both belong to 𝑃𝑖. The configuration of the robot team is a vector of 𝑁 vertices 𝜼 ∈ 𝑉ᴺ occupied by the robot team, where the 𝑖-th entry 𝜂𝑖 corresponds to the position of the 𝑖-th robot. The 𝑖-th robot is tasked to cover the vertices in 𝑃𝑖. Then, the coverage cost corresponding to configuration 𝜼 and connected 𝑁-partition 𝑃 is defined as
\[
\mathcal{H}(\boldsymbol{\eta}, P) = \sum_{i=1}^{N} \sum_{v' \in P_i} d_{G[P_i]}(\eta_i, v')\, \phi(v'). \tag{7.2}
\]
In a coverage problem, the objective is to minimize this coverage cost by selecting an appropriate configuration 𝜼 and connected 𝑁-partition 𝑃. However, how to efficiently find the optimal configuration-partition pair in a large graph with an arbitrary demand function 𝜙 remains an open problem. There are two intermediate results about the optimal selection of the configuration or the partition when the other is fixed [127].
Optimal Partition with Fixed Configuration: For a fixed configuration 𝜼 with distinct entries, an optimal connected 𝑁-partition 𝑃 minimizing the coverage cost is the Voronoi partition, denoted by V(𝜼). Formally, for each 𝑃𝑖 ∈ V(𝜼) and any 𝑣′ ∈ 𝑃𝑖,
\[
d_G(v', \eta_i) \le d_G(v', \eta_j), \quad \forall\, j \in \{1, \dots, N\}.
\]
Optimal Configuration with Fixed Partition: For a fixed connected 𝑁-partition 𝑃, the centroid of the 𝑖-th region 𝑃𝑖 ∈ 𝑃 is defined by
\[
c_i := \arg\min_{v \in P_i} \sum_{v' \in P_i} d_{G[P_i]}(v, v')\, \phi(v'),
\]
and the optimal configuration is to place one robot at the centroid of each 𝑃𝑖 ∈ 𝑃. We denote the vector of centroids of 𝑃 by 𝒄(𝑃), with 𝑐𝑖 as its 𝑖-th element.
Building upon the above two properties, the classic Lloyd algorithm iteratively moves each robot to the centroid of its region in the current Voronoi partition and then computes the new Voronoi partition using the updated configuration. It is known that the robot team eventually converges to a class of partitions called centroidal Voronoi partitions, defined below.
Definition 7.1 (Centroidal Voronoi Partition, [128]). An 𝑁-partition 𝑃 is a centroidal Voronoi partition of 𝐺 if 𝑃 is a Voronoi partition generated by some configuration 𝜼 with distinct entries, i.e., 𝑃 = V(𝜼), and 𝒄(V(𝜼)) = 𝜼.
It needs to be noted that an optimal partition and configuration pair minimizing the coverage cost H(𝜼, 𝑃) is of the form (𝜼*, V(𝜼*)), where 𝜼* has distinct entries and V(𝜼*) is a centroidal Voronoi partition. A configuration-partition pair (𝜼′, V(𝜼′)) is considered an efficient solution to the coverage problem if V(𝜼′) is a centroidal Voronoi partition, even though it may be suboptimal [128].
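To make these quantities concrete, the sketch below computes graph distances with Dijkstra's algorithm, forms the Voronoi partition V(𝜼) of a weighted graph for a given configuration, and evaluates a coverage cost. It is a minimal illustration with names of our choosing, and for brevity it measures distances in 𝐺 rather than in the induced subgraphs 𝐺[𝑃𝑖] appearing in (7.2).

    import heapq

    def graph_distances(adj, source):
        """Dijkstra over a weighted adjacency dict {u: {v: w, ...}}; returns d_G(source, .)."""
        dist = {source: 0.0}
        pq = [(0.0, source)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        return dist

    def voronoi_partition(adj, config):
        """Assign every vertex to the nearest agent in `config` (ties broken by lower index)."""
        dists = [graph_distances(adj, eta_i) for eta_i in config]
        parts = [set() for _ in config]
        for v in adj:
            i = min(range(len(config)), key=lambda k: dists[k].get(v, float("inf")))
            parts[i].add(v)
        return parts

    def coverage_cost(adj, config, parts, phi):
        """Approximate coverage cost, cf. (7.2), with distances taken in G instead of G[P_i]."""
        return sum(graph_distances(adj, eta_i)[v] * phi[v]
                   for eta_i, P_i in zip(config, parts) for v in P_i)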
7.1.4 Coverage Performance Evaluation
To achieve efficient coverage, the agents need to balance the trade-off between sampling the environment to learn 𝝓 (exploration) and achieving a centroidal Voronoi configuration defined using the estimated 𝜙 (exploitation). To characterize this trade-off, we introduce a notion of coverage regret.
Definition 7.2 (Coverage Regret). At each time 𝑡, let the team configuration be 𝜼𝑡 and the connected 𝑁-partition be 𝑃𝑡. The coverage regret until time 𝑇 is defined by Σ_{𝑡=1}^{𝑇} 𝑅𝑡(𝜙), where 𝑅𝑡(𝜙) is the instantaneous coverage regret with respect to the demand function 𝜙, defined by
\[
R_t(\phi) = 2\mathcal{H}(\boldsymbol{\eta}_t, P_t) - \mathcal{H}(\boldsymbol{c}(P_t), P_t) - \mathcal{H}(\boldsymbol{\eta}_t, \mathcal{V}(\boldsymbol{\eta}_t)),
\]
which is the sum of the two terms H(𝜼𝑡, 𝑃𝑡) − H(𝒄(𝑃𝑡), 𝑃𝑡) and H(𝜼𝑡, 𝑃𝑡) − H(𝜼𝑡, V(𝜼𝑡)). The former (resp., latter) term is the regret induced by the deviation of the current configuration (resp., partition) from the optimal configuration (resp., partition) for the current partition (resp., configuration). Accordingly, no regret is incurred at time 𝑡 if and only if 𝑃𝑡 is a centroidal Voronoi 𝑁-partition and 𝜼𝑡 = 𝒄(𝑃𝑡). There are two sources contributing to the coverage regret: first, the estimation error in the demand function 𝜙; second, the deviation from a centroidal Voronoi partition while sampling the environment to learn 𝜙.

7.2 Deterministic Sequencing of Learning and Coverage Algorithm
In this section, we describe the Deterministic Sequencing of Learning and Coverage (DSLC) algorithm (Algorithm 8). It operates with an epoch-wise structure, and each epoch consists of an exploration (learning) phase and an exploitation (coverage) phase. The exploration phase comprises two sub-phases: estimation and information propagation.

Algorithm 8: Deterministic Sequencing of Learning and Coverage (DSLC)
Input: environment graph 𝐺, prior 𝝁0, 𝚲0; Set: 𝛼 ∈ (0, 1) and 𝛽 > 1
for epoch 𝑗 = 1, 2, . . . do
    Exploration phase: the robot team samples at vertices in 𝑉 until max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) ≤ 𝛼^𝑗 𝜎0²;
    Information propagation phase: each robot propagates its sampling results to the team and updates the estimated demand function 𝜙̂;
    Coverage phase: for 𝑡𝑗 = 1, 2, . . . , ⌈𝛽^𝑗⌉, run the pairwise partitioning algorithm based on 𝜙̂.

7.2.1 Estimation Phase
Let 𝜎𝑖²(𝑡) be the marginal posterior variance of 𝜙(𝑣𝑖) at time 𝑡, i.e., the 𝑖-th diagonal entry of 𝚲⁻¹(𝑡). Suppose the marginal prior variance satisfies 𝜎𝑖²(0) ≤ 𝜎0² for each 𝑖. Within each epoch 𝑗, agents first determine the points to be sampled in order to reduce max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) below a threshold 𝛼^𝑗 𝜎0², where 𝛼 ∈ (0, 1) is a prespecified parameter. Note that the posterior covariance computed in (7.1) depends only on the number of samples at each vertex and does not require the actual sampling results. Therefore, the sequence of sampling locations can be computed before physically visiting the locations. Leveraging this deterministic evolution of the covariance, we adopt a greedy sampling policy that repeatedly selects the point 𝑣𝑖𝑡 with maximum marginal posterior variance, i.e.,
\[
i_t = \arg\max_{i \in \{1,\dots,|V|\}} \sigma_i^2(t), \tag{7.3}
\]
for 𝑡 ∈ {𝑡𝑗, . . . , 𝑡̄𝑗}, where 𝑡𝑗 and 𝑡̄𝑗 are the starting and ending times of the estimation phase in the 𝑗-th epoch. It has been shown that the greedy sampling policy is near-optimal in terms of maximizing the mutual information between the sampling results and the demand function 𝜙 [58].
Let the set of points to be sampled during epoch 𝑗 be 𝑋^𝑗, and let 𝑋𝑟^𝑗 = 𝑋^𝑗 ∩ 𝑃𝑡𝑗,𝑟 be the set of sampling points that belong to 𝑃𝑡𝑗,𝑟, the region assigned to agent 𝑟 at time 𝑡𝑗. Each agent 𝑟 computes a path through the sampling points in 𝑋𝑟^𝑗 and collects noisy measurements from those points. The traveling path can be optimized by solving a Traveling Salesperson Problem (TSP).
Remark 7.1. With 𝚲0 as common knowledge, the set of sampling points 𝑋^𝑗 for each epoch 𝑗 can be computed independently by each robot following the greedy sampling policy. If the same tie-breaking rule is followed, the computed 𝑋^𝑗 and the number of samples at each sampling point are the same for all agents.
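Because the covariance update in (7.1) is independent of the measured values, the sampling set for epoch 𝑗 can be planned in advance by simulating the greedy rule (7.3). The sketch below is a minimal version of that planning step; the function name and the dense-matrix implementation are ours.

    import numpy as np

    def plan_epoch_samples(Lambda0, sigma, threshold):
        """Greedily pick vertices of maximal marginal posterior variance, cf. (7.3),
        until every marginal variance is at most `threshold` (i.e., alpha^j * sigma0^2).
        Returns the list of planned sampling locations; no measurements are needed."""
        Lambda = Lambda0.astype(float).copy()
        plan = []
        while True:
            variances = np.diag(np.linalg.inv(Lambda))      # marginal variances sigma_i^2
            i = int(np.argmax(variances))
            if variances[i] <= threshold:
                return plan
            plan.append(i)
            Lambda[i, i] += 1.0 / sigma**2                  # effect of one future sample at v_i

Every agent running this procedure with the same 𝚲0 and tie-breaking rule obtains the same plan, which is the content of Remark 7.1; the planned points that fall in an agent's region are then visited along a TSP tour.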
7.2.2 Information Propagation Phase
After the estimation phase, the sampling results from each agent must be passed to all other agents. There are several mechanisms to accomplish this in a finite number of steps. For example, agents can communicate with their neighboring agents and use flooding algorithms [129] to relay their sampling results to every agent. Alternatively, the agents may be able to send their sampling results to a cloud and receive global estimates after a finite delay. Another possibility is for the agents to use finite-time consensus protocols [130] in the distributed inference algorithm discussed in [28]. For any of the above mechanisms, the sampling results from the entire robot team can be propagated to each robot in finite time. Then, each agent has an identical posterior distribution N(𝝁(𝑡), 𝚲⁻¹(𝑡)) of 𝝓, and 𝜙̂ := 𝝁(𝑡) will be used as the estimate of the demand function.

7.2.3 Coverage Phase
After the estimation and information propagation phases, agents have the same estimate 𝜙̂ of the demand function. The coverage phase involves no environmental sampling, and its length is designed to grow exponentially with epochs, i.e., the number of time steps in the coverage phase of the 𝑗-th epoch is ⌈𝛽^𝑗⌉ for some 𝛽 > 1. We use a distributed coverage algorithm proposed in [127], called pairwise partitioning, with the estimated demand function 𝜙̂. In a connected 𝑁-partition 𝑃, 𝑃𝑖 and 𝑃𝑗 are said to be adjacent if there exists a vertex pair 𝑣 ∈ 𝑃𝑖 and 𝑣′ ∈ 𝑃𝑗 such that there is an edge in 𝐸 connecting 𝑣 and 𝑣′. At each time, a random pair of agents (𝑖, 𝑗), with 𝑃𝑖 and 𝑃𝑗 adjacent, computes an optimal pair of vertices (𝑎*, 𝑏*) within 𝑃𝑖 ∪ 𝑃𝑗 that minimizes
\[
\sum_{v' \in P_i \cup P_j} \hat{\phi}(v') \min\big\{ d_{G[P_i \cup P_j]}(a, v'),\; d_{G[P_i \cup P_j]}(b, v') \big\}.
\]
Then, agents 𝑖 and 𝑗 move to 𝑎* and 𝑏*. Subsequently, 𝑃𝑖 and 𝑃𝑗 are updated to
\[
P_i \leftarrow \{ v \in P_i \cup P_j \mid d_{G[P_i \cup P_j]}(\eta_i, v) \le d_{G[P_i \cup P_j]}(\eta_j, v) \}, \qquad
P_j \leftarrow \{ v \in P_i \cup P_j \mid d_{G[P_i \cup P_j]}(\eta_i, v) > d_{G[P_i \cup P_j]}(\eta_j, v) \}.
\]
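A minimal, self-contained sketch of one pairwise repartitioning step is given below. It computes all-pairs distances on the induced subgraph G[P_i ∪ P_j] with Floyd-Warshall and brute-forces the minimizing pair (a*, b*), which is adequate for the small regions exchanged in a gossip step; the helper and variable names are ours.

    from itertools import product

    def all_pairs(adj, verts):
        """Floyd-Warshall distances on the induced subgraph G[verts] of a weighted adjacency dict."""
        INF = float("inf")
        d = {u: {v: (0.0 if u == v else adj[u].get(v, INF)) for v in verts} for u in verts}
        for k in verts:
            for u in verts:
                for v in verts:
                    if d[u][k] + d[k][v] < d[u][v]:
                        d[u][v] = d[u][k] + d[k][v]
        return d

    def pairwise_update(adj, P_i, P_j, phi_hat):
        """One gossip step of pairwise partitioning between adjacent regions P_i and P_j (sets)."""
        union = P_i | P_j
        d = all_pairs(adj, union)                            # distances in G[P_i ∪ P_j]
        cost = lambda a, b: sum(phi_hat[v] * min(d[a][v], d[b][v]) for v in union)
        a_star, b_star = min(product(union, repeat=2), key=lambda ab: cost(*ab))
        new_Pi = {v for v in union if d[a_star][v] <= d[b_star][v]}
        return a_star, b_star, new_Pi, union - new_Pi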
7.3 Analysis of the DSLC Algorithm
In this section, we analyze DSLC to provide a performance guarantee on the expected cumulative coverage regret. To this end, we leverage the information gain from the estimation phase to analyze the convergence rate of the uncertainty. Then, we recall the convergence properties of the pairwise partitioning algorithm used in DSLC. Based on these results, we establish the main result of this chapter, i.e., an upper bound on the expected cumulative coverage regret.

7.3.1 Mutual Information and Uncertainty Reduction
Let 𝑋^𝑔 = (𝑣𝑖₁, . . . , 𝑣𝑖ₙ) be a sequence of 𝑛 vertices selected by the greedy policy. With a slight abuse of notation, we denote the marginal posterior variance of 𝜙(𝑣𝑖) after sampling at 𝑣𝑖₁, . . . , 𝑣𝑖ₖ by 𝜎𝑖²(𝑘). We now present a bound on the maximal posterior variance after sampling at the vertices in 𝑋^𝑔. The following lemma is adapted from Lemma 6.6 to incorporate the discrete environment. Since the proof steps are similar, we skip them for brevity.
Lemma 7.1 (Uncertainty reduction). Under the greedy sampling policy, the maximum posterior variance after 𝑛 sampling rounds satisfies
\[
\max_{i \in \{1,\dots,|V|\}} \sigma_i^2(n) \le \frac{2 \sigma_0^2 \gamma_n}{n \log\!\big(1 + \sigma^{-2} \sigma_0^2\big)},
\]
where 𝛾𝑛 is the maximal mutual information gain that can be achieved with 𝑛 samples.
Typically, it is hard to characterize 𝛾𝑛 for a general 𝚺0. Therefore, we assume a squared exponential kernel for 𝝓.
Assumption 7.1. Vertices in 𝑉 lie in a convex and compact set 𝐷 ⊂ ℝ², and the covariance of any pair 𝜙(𝑣𝑖) and 𝜙(𝑣𝑗) is determined by a squared exponential kernel function
\[
k\big(\phi(v_i), \phi(v_j)\big) = \sigma_v^2 \exp\!\left( - \frac{d_{\mathrm{eu}}^2(v_i, v_j)}{2 l^2} \right),
\]
where 𝑑eu(𝑣𝑖, 𝑣𝑗) is the Euclidean distance between 𝑣𝑖 and 𝑣𝑗, 𝑙 is the length scale, and 𝜎𝑣² is the variability parameter.
We now recall an upper bound on 𝛾𝑛 from [44].
Lemma 7.2 (Information gain for squared exp. kernel). With Assumption 7.1, the maximum mutual information satisfies 𝛾𝑛 ∈ 𝑂((log(|𝑉|𝑛))³).
Remark 7.2. If the correlation information is ignored, i.e., 𝜙(𝑣𝑖), 𝑖 ∈ {1, . . . , |𝑉|}, are treated as independent, it can be seen that max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) ∈ 𝑂(|𝑉|/𝑛) under the greedy sampling policy. In contrast, if the correlation information is considered, then substituting the result of Lemma 7.2 into Lemma 7.1 yields max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) ∈ 𝑂((log(|𝑉|𝑛))³/𝑛), which shows a clear advantage in reducing uncertainty when |𝑉| is large (i.e., the environment is finely discretized).

7.3.2 Convergence within Coverage Phase
Before each coverage phase, since the sampling results of each agent are relayed to the entire team, the agents have a consensus estimate 𝜙̂ of the demand function. It has been shown in [127] that, using the pairwise partitioning algorithm, the 𝑁-partition 𝑃 of the team converges almost surely to a class of near-optimal partitions defined below.
Definition 7.3 (Pairwise-optimal Partition). A connected 𝑁-partition 𝑃 is pairwise-optimal if, for each pair of adjacent regions 𝑃𝑖 and 𝑃𝑗,
\[
\sum_{v' \in P_i} d_G\big(c(P_i), v'\big) \phi(v') + \sum_{v' \in P_j} d_G\big(c(P_j), v'\big) \phi(v')
= \min_{a, b \in P_i \cup P_j} \sum_{v' \in P_i \cup P_j} \phi(v') \min\big\{ d_G(a, v'),\, d_G(b, v') \big\}.
\]
This means that, within the induced subgraph generated by any pair of adjacent regions, the 2-partition is optimal. It is proved in [127] that if a connected 𝑁-partition 𝑃 is pairwise-optimal, then it is also a centroidal Voronoi partition. The following result on the convergence time of the pairwise partitioning algorithm is established in [127].
Lemma 7.3 (Expected Convergence Time). Using the pairwise partitioning algorithm, the expected time to converge to a pairwise-optimal 𝑁-partition is finite.
For each coverage phase, Lemma 7.3 implies that the expected time for the instantaneous regret 𝑅𝑡(𝜙̂) to converge to 0 is finite.

7.3.3 An Upper Bound on Expected Coverage Regret
We now present the main result of this chapter.
Theorem 7.4. For DSLC and any time horizon 𝑇, if Assumption 7.1 holds and 𝛼 = 𝛽^{-2/3}, then the expected cumulative coverage regret with respect to the demand function 𝜙 satisfies
\[
\mathbb{E}\Big[ \sum_{t=1}^{T} R_t(\phi) \Big] \in O\big( T^{2/3} (\log T)^3 \big).
\]
Proof. We establish the theorem using the following four steps.
Step 1 (Regret from estimation phases): Let the total number of sampling steps before the end of the 𝑗-th epoch be 𝑠𝑗. By applying Lemma 7.1, we get 𝑠𝑗 ∈ 𝑂((log 𝑇)³/𝛼^𝑗). Thus, the coverage regret incurred in the estimation phases until the end of the 𝑗-th epoch belongs to 𝑂((log 𝑇)³/𝛼^𝑗).
Step 2 (Regret from information propagation phases): The sampling information of each robot propagates to all team members in finite time. Thus, before the end of the 𝑗-th epoch, the coverage regret from the information propagation phases can be bounded by 𝑐1 𝑗 for some constant 𝑐1 > 0.
Step 3 (Regret from coverage phases): According to Lemma 7.3, in each coverage phase, the expected time before converging to a pairwise-optimal partition is finite. Thus, before the end of the 𝑗-th epoch, the expected coverage regret from the converging steps can be upper bounded by 𝑐2 𝑗 for some constant 𝑐2 > 0.
Also note that the robot team converges to a pairwise-optimal partition with respect to the estimated demand function 𝜙̂, which may deviate from the actual 𝜙. The instantaneous coverage regret 𝑅𝑡(𝜙) caused by this estimation error can be expressed as
\[
R_t(\phi) = 2\mathcal{H}(\boldsymbol{\eta}_t, P_t) - \mathcal{H}(\boldsymbol{c}(P_t), P_t) - \mathcal{H}(\boldsymbol{\eta}_t, \mathcal{V}(\boldsymbol{\eta}_t)) =: A_t^\top \boldsymbol{\phi},
\]
for some 𝐴𝑡 ∈ ℝ^{|𝑉|}. Moreover, the posterior distribution of 𝑅𝑡(𝜙) is N(𝐴𝑡ᵀ𝝁(𝑡), 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡), where 𝚺(𝑡) = 𝚲⁻¹(𝑡) is the posterior covariance matrix. Since a pairwise-optimal partition 𝑃 is also a centroidal Voronoi partition and 𝜙̂ = 𝝁(𝑡), 𝑅𝑡(𝜙̂) = 0 implies 𝐴𝑡ᵀ𝝁(𝑡) = 0. Now, we get 𝑅𝑡(𝜙) ∼ N(0, 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡) and
\[
\mathbb{E}[R_t(\phi)] \le \mathbb{E}\big[ |R_t(\phi)| \big] = \sqrt{\frac{2}{\pi} A_t^\top \boldsymbol{\Sigma}(t) A_t}.
\]
Note that 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡 is a weighted sum of the eigenvalues of 𝚺(𝑡). At any time 𝑡 in the coverage phase of the 𝑘-th epoch, max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) ≤ 𝛼^𝑘 𝜎0², and it follows that the sum of the eigenvalues of 𝚺(𝑡) equals trace(𝚺(𝑡)) ≤ |𝑉| 𝛼^𝑘 𝜎0². Thus, we get
\[
\mathbb{E}\Big[ \sum_{t \in \mathcal{T}_k^{\mathrm{cov}}} R_t(\phi) \Big] \le c_3 (\beta \sqrt{\alpha})^k,
\]
for some constant 𝑐3 > 0, where 𝒯ₖᶜᵒᵛ denotes the time slots in the coverage phase of the 𝑘-th epoch and we have used the fact that |𝒯ₖᶜᵒᵛ| = ⌈𝛽^𝑘⌉.
Step 4 (Summary): Summing up the expected coverage regret from the above steps, the expected cumulative coverage regret at the end of the 𝑗-th epoch, denoted 𝑇𝑗, satisfies
\[
\mathbb{E}\Big[ \sum_{t=1}^{T_j} R_t(\phi) \Big] \le C_1 j + C_2 s_j + \sum_{k=1}^{j} c_3 (\beta \sqrt{\alpha})^k, \tag{7.4}
\]
where 𝐶1, 𝐶2 > 0 are some constants. The theorem statement follows by plugging in 𝛼 = 𝛽^{-2/3}, using 𝑗 ∈ 𝑂(log 𝑇), and some simple calculations. ∎
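For completeness, the final calculation can be spelled out as follows; this expansion is ours and only makes explicit how the choice 𝛼 = 𝛽^{-2/3} yields the stated rate. Since the coverage phase of the 𝑗-th epoch alone has ⌈𝛽^𝑗⌉ ≥ 𝛽^𝑗 steps, 𝛽^𝑗 ≤ 𝑇𝑗, hence 𝑗 ∈ 𝑂(log 𝑇) and 𝛽^{2𝑗/3} ≤ 𝑇^{2/3}. With 𝛼 = 𝛽^{-2/3}, the three terms in (7.4) satisfy
\[
C_1 j \in O(\log T), \qquad
C_2 s_j \in O\big( (\log T)^3 \beta^{2j/3} \big) \subseteq O\big( T^{2/3} (\log T)^3 \big), \qquad
\sum_{k=1}^{j} c_3 (\beta\sqrt{\alpha})^k = c_3 \sum_{k=1}^{j} \beta^{2k/3} \in O\big( \beta^{2j/3} \big) \subseteq O\big( T^{2/3} \big),
\]
and summing the three bounds gives the 𝑂(𝑇^{2/3}(log 𝑇)³) rate in Theorem 7.4.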
7.4 Simulation Results
To illustrate the empirical performance of the proposed algorithm, we simulate its execution on a uniform grid graph superimposed on the unit square. We present numerical results which show that DSLC achieves sublinear regret, and we compare our algorithm to those proposed in [13] and [68]. Motivated by environmental applications, we construct the demand function 𝜙 over a discrete 21 × 21 point grid in [0, 1]² by performing kernel density estimation on a subset of the geospatial distribution of Australian wildfires observed by NASA in 2019 [131]. Intuitively, 𝜙 represents the probability that a wildfire may occur at a particular point of the unit square, and it is used to model the demand for an autonomous sensing agent at that point. The ground truth 𝜙 is shown on the right in Figure 7.1.

Figure 7.1: Distributed implementation of DSLC.

In each simulation, nine agents are placed uniformly at random over the grid and execute 3 epochs of length 16, 46, and 128 to achieve adaptive coverage of the environment. Partitions are initialized by iterating over the grid and assigning each point to the nearest agent. During the exploration phase of each epoch, partitions are fixed; during the exploitation phase of each epoch, partitions are updated according to the protocol established in [127], where pairwise gossip-based repartitioning occurs between randomly selected neighbors. Coverage cost, regret, and maximum variance are computed throughout using (7.2), Definition 7.2, and the maximum diagonal entry of 𝚲⁻¹(𝑡) from (7.1), respectively. From left to right, Figure 7.1 shows (i) agent positions 𝜼𝑡 and partitions 𝑃𝑡, (ii) TSP sampling tours, (iii) the posterior mean of 𝜙̂, and (iv) the posterior variance of 𝜙̂. Pairwise partition updates between gossiping agents are denoted by magenta lines in the leftmost column of panels. Points along TSP tours in the second-from-leftmost column of panels are plotted in magenta prior to sampling, and in black after sampling. A simulation video is available online.²
The demand function 𝜙 is normalized to the range [0, 1] and sampled by agents with Gaussian noise with mean 𝜇 = 0 and standard deviation 𝜎 = 0.1. A global Gaussian process model is assumed to simplify the estimation of 𝜙̂ throughout the simulation, though in practice the estimation of 𝜙̂ could be implemented in a fully distributed manner by having each agent maintain its own model of 𝜙̂ and employing an information propagation phase as described in Section 7.2. Setting the parameter 𝛼 = 0.5 to reduce the uncertainty by half within each epoch, 𝛽 = 𝛼^{-3/2} is fully determined by Theorem 7.4. Figure 7.1 visualizes the simulation of DSLC.

² https://youtu.be/nalwrZC6GiI

Figure 7.2: Comparison of DSLC, Todescato and Cortes.

Figure 7.2 compares the evolution of the coverage regret, coverage cost, and uncertainty reduction in DSLC with that of the algorithms proposed in [13] and [68], denoted Cortes and Todescato, respectively. Agents in [13] are assumed to have perfect knowledge of 𝜙 and simply go to the centroid of their cell at each iteration; in [68], agents follow a stochastic sampling approach with the probability of exploration proportional to the posterior variance in the estimate 𝜙̂ at each iteration. All results are averaged over 16 simulations of 190 iterations, aligned with the three-epoch implementation of DSLC with epoch lengths 16, 46, and 128. It can be noticed that DSLC empirically achieves sublinear regret. Spikes in regret occur during the exploration phase of each epoch, before agents converge to a pairwise-optimal coverage configuration with respect to 𝜙̂ during the exploitation phase. Though we do not include the algorithm in our simulations, it is worth noting that DSLC operates in a manner similar to that proposed in [64], where agents spend a number of iterations sampling 𝜙 to reduce the maximum posterior variance max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) below a prespecified threshold and then transition to performing coverage for all remaining iterations. Indeed, that algorithm is essentially a special case of DSLC with one epoch and can therefore be expected to perform similarly from an empirical perspective.

7.5 Summary
We propose an adaptive coverage algorithm, DSLC, that balances the exploration versus exploitation trade-off between learning 𝜙 and achieving environmental coverage. Our algorithm schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Most importantly, we introduce a novel coverage regret that characterizes the deviation of agent configurations and partitions from a centroidal Voronoi partition, and we derive analytic bounds on the expected cumulative regret for DSLC. In particular, we prove that DSLC achieves sublinear expected cumulative regret under mild assumptions. The efficacy of DSLC is illustrated through extensive simulation and comparison with existing state-of-the-art approaches to adaptive coverage.

7.6 Bibliographic Remarks
Classical approaches to coverage control [13, 61–63] assume a priori knowledge of 𝜙 and employ Lloyd's algorithm [132] to drive agents to a local minimum of the coverage cost. In these algorithms, each agent communicates with the agents in the neighboring partitions at each time step and updates its partition. Distributed gossip-based coverage algorithms [133] address potential communication bottlenecks in classical approaches by updating partitions pairwise between the agents in neighboring partitions.
Much of the work on coverage control considers continuous convex environments, where the global convergence property remains an open problem. The asymptotic convergence to a local minimum is normally based on an unproven assumption that there exist finitely many locally optimal points [13, 134]. A discrete graph representation of the environment is considered in [127], which not only enables finite-time convergence but also allows for non-convex environments. As mentioned earlier in this chapter, the gossip-based coverage algorithm in the graph environment has been proved to converge almost surely to pairwise-optimal partitions in finite time [127].
Recent works have put more focus on the problem of adaptive coverage, in which agents are not assumed to know 𝜙 a priori. Parametric estimation approaches to adaptive coverage [135, 136] model 𝜙 as a linear combination of basis functions and propose algorithms to learn the weights of the basis functions, while non-parametric approaches [64–69] model 𝜙 as the realization of a Gaussian process and make predictions by conditioning on observed values of 𝜙 sampled over the operating environment. Alternative approaches to adaptive coverage [137, 138] have also been considered.
A non-parametric adaptive coverage algorithm with provable regret guarantees was presented in this chapter. Similar adaptive coverage algorithms with formal performance guarantees are also developed in [68, 69]. Todescato et al. [68] use a Bernoulli random variable for each robot to decide between learning and coverage steps, and the distribution of this random variable is designed to ensure the convergence of the algorithm. In contrast, we leverage the so-called “doubling trick” from the MAB literature to design a deterministic schedule of learning and coverage, which allows us to derive formal regret bounds for our adaptive coverage algorithm. The most closely related work to the ideas presented in this chapter is by Benevento et al. [69], which uses a Gaussian process optimization-based approach [44] to design an adaptive coverage algorithm. They derived an upper bound on a notion of coverage regret different from Definition 7.2 in this chapter. Their result is based on a strong assumption that the coverage control algorithm can drive the system to the global minimum of the coverage cost. In contrast, our coverage regret is defined with respect to local minima, which can be achieved by many state-of-the-art coverage control policies, including the classic Lloyd's algorithm. By analyzing the coverage regret defined in this chapter, the convergence of adaptive coverage control can still be shown without requiring the global optimality assumption.

CHAPTER 8
CONCLUSIONS AND FUTURE DIRECTIONS
This dissertation has focused on optimal decision-making in the face of uncertainty. In particular, we addressed the exploration versus exploitation dilemma in the MAB setup as well as in robotic problems including target search and adaptive coverage control. All proposed algorithms are accompanied by a rigorous analysis of their convergence properties.
Since the MAB problem provides a concise mathematical formulation of the exploration versus exploitation dilemma, we have investigated a variety of MAB problem variations that capture different properties of the stochastic environment in real-world problems. For the heavy-tailed bandit, we proposed the Robust MOSS algorithm, which is the first to achieve order-optimal worst-case regret while maintaining logarithmic asymptotic regret.
For the nonstationary bandits, we studied both the piece-wise stationary bandit and the more general nonstationary bandit with a variation budget. Order-optimal or near order-optimal algorithms for these problem setups are proposed, analyzed, and compared extensively in simulations. As an extension of the single-player MAB, we studied the distributed multi-player bandit in a piece-wise stationary environment. By modifying the single-player policy, novel multi-player policies are designed and proved to maintain group regrets matching the standard single-player regret, even without communication between agents.
For the robotic target search problem, we have considered a scenario in which a robot operates in a 3D environment to search for targets on a 2D floor. The target search task is modeled as a hot-spots identification problem in which the sensing information is compromised by measurement noise. Since sensing at a location farther from the floor provides better coverage of the environment but less accurate results, we have modeled the sensing field with a multi-fidelity Gaussian process that captures the coverage-accuracy trade-off. Leveraging this novel sensing model, we established a new informative path planning strategy that allows for jointly planning the sampling locations and the associated fidelity levels, and thus reduces the target search time.
For the adaptive coverage problem, the demand for robotic service within the environment is modeled as a realization of a Gaussian process. With Bayesian techniques, we have devised a policy that balances the tradeoff between learning the demand and covering the environment. To provide analytical rigor, we have defined the coverage regret and, based on it, analyzed the convergence properties of the proposed online estimation and coverage algorithm.
There are several possible avenues of future study on the problems addressed in this dissertation. The distributed multi-player MAB problem in a general nonstationary environment is a challenging problem and is expected to find application in opportunistic spectrum access, wherein the stochastic nature of the channel vacancy changes with time. We intend to design a multi-player policy that actively detects drift in the stochastic reward processes such that the players require no prior information about the nonstationary environment. It can be foreseen that the nonstationarity will make reward estimation more difficult, so that achieving coordinated behavior to reduce collisions becomes a trickier task. To deal with this, we may allow communication among agents. For example, the agents can perform cooperative reward estimation through a bi-directional communication network by running consensus algorithms as in [27]. Since communicating sampling results may require relatively large bandwidth, to reduce the communication requirement, it is also possible to only require the players to share the ranking of arms, as mentioned in [139]. Other issues for such a problem include privacy and defense against adversarial attacks.
We are also interested in extending the single-robot target search policy to cooperative multi-robot search scenarios. As has been mentioned, coverage control is a potential method to balance the workload. Note that the workload distribution in the environment changes as the search mission progresses, so the robots need to cover dynamic demands of service. For such a problem, providing analytical rigor would be of interest.
Other workload-balancing ideas include using multi-robot path planning methods that solve a vehicle routing problem [140] or an orienteering problem [141]. Also of interest would be the implementation of the target search algorithms on underwater multi-target search testbeds.
It would be worthwhile to pursue the adaptive coverage problem from a variety of new directions. The proposed online estimation and coverage algorithm requires an information propagation phase to maintain a consistent estimate of the demands across agents, while we envision a fully distributed policy that allows for small differences in the demand estimates. Besides, the problem setup can be generalized by considering heterogeneous agents that can provide multiple types of services. The quality of service could depend on both the servicing agent and the service type. Considering both the inter-service dependencies and the spatial correlations, the multi-task Gaussian process might be a good fit to model the demands of different service types. Another interesting direction could be to use a time-varying Gaussian process to model a dynamic environment in which the demands change with time.
Like adaptive coverage control, adaptive multi-robot patrolling is also an interesting problem with an explore-exploit tradeoff. In the multi-robot patrolling problem [142], a team of robots circles around a set of important locations (viewpoints) with different known priorities. The objective is to minimize the weighted refresh time, which is the longest time interval between any two consecutive visits of a viewpoint, weighted by the viewpoint's priority. We envision addressing this problem by considering the priorities of the viewpoints to be unknown and time-varying, so that they need to be learned. Since the LM-DSEE algorithm for the piece-wise stationary MAB problem has a block allocation structure that benefits path planning, its multi-player extension and the associated distributed control methods could be promising for solving this problem.

BIBLIOGRAPHY
[1] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933. [2] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002. [3] M. Vidyasagar, “Law of large numbers, heavy-tailed distributions, and the recent financial crisis,” in Perspectives in Mathematical System Theory, Control, and Signal Processing. Springer, 2010, pp. 285–295. [4] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010. [5] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011. [6] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 66–81. [7] J. R. Krebs, A. Kacelnik, and P. Taylor, “Test of optimal sampling by foraging great tits,” Nature, vol. 275, no. 5675, pp. 27–31, 1978. [8] V. Srivastava, P. Reverdy, and N. E. Leonard, “On optimal foraging and multi-armed bandits,” in Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2013, pp. 494–499. [9] V. Srivastava, P.
Reverdy, and N. E. Leonard, “Surveillance in an abruptly changing world via multiarmed bandits,” in IEEE Conference on Decision and Control, 2014, pp. 692–697. [10] M. Y. Cheung, J. Leighton, and F. S. Hover, “Autonomous mobile acoustic relay positioning as a multi-armed bandit with switching costs,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, Nov. 2013, pp. 3368–3373. [11] Y. Sung, D. Dixit, and P. Tokekar, “Environmental hotspot identification in limited time with a uav equipped with a downward-facing camera,” arXiv preprint arXiv:1909.08483, 2019. [12] M. C. Kennedy and A. O’Hagan, “Predicting the output from a complex computer code when fast approximations are available,” Biometrika, vol. 87, no. 1, pp. 1–13, 2000. 117 [13] J. Cortés, S. Martínez, T. Karataş, and F. Bullo, “Coverage control for mobile sensing networks,” IEEE Transactions on Robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004. [14] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952. [15] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985. [16] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002. [17] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proceedings of the 24th Conference on Learning Theory, vol. 19, Budapest, Hungary, 2011, pp. 359–376. [18] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi, “Bandits with heavy tail,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7711–7717, 2013. [19] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM journal on computing, vol. 32, no. 1, pp. 48–77, 2002. [20] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in International Conference on Algorithmic Learning Theory. Springer, 2011, pp. 174–188. [21] O. Besbes and Y. Gur, “Stochastic multi-armed-bandit problem with non-stationary rewards,” in Advances in neural information processing systems, 2014, pp. 199–207. [22] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part I: IID rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968–976, 1987. [23] Y. Gai and B. Krishnamachari, “Distributed stochastic online learning policies for oppor- tunistic spectrum access,” IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6184–6193, 2014. [24] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014. [25] N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decentralized multiplayer multiarmed bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2018. 118 [26] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Multi-armed bandits in multi-agent net- works,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. [27] P. Landgren, V. Srivastava, and N. E. Leonard, “On distributed cooperative decision-making in multiarmed bandits,” in 2016 European Control Conference, Aalborg, Denmark, 2016, pp. 243–248. [28] P. Landgren, V. 
Srivastava, and N. E. Leonard, “Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms,” in IEEE Conference on Decision and Control, Las Vegas, NV, USA, Dec. 2016, pp. 167–172. [29] A. B. H. Alaya-Feki, E. Moulines, and A. LeCornec, “Dynamic spectrum access with non-stationary multi-armed bandit,” in IEEE Workshop on Signal Processing Advances in Wireless Communications, 2008, pp. 416–420. [30] Y. Li, Q. Hu, and N. Li, “A reliability-aware multi-armed bandit approach to learn and select users in demand response,” Automatica, vol. 119, p. 109015, 2020. [31] D. Kalathil and R. Rajagopal, “Online learning for demand response,” in Annual Allerton Conference on Communication, Control, and Computing, 2015, pp. 218–222. [32] D. Agarwal, B. C. Chen, P. Elango, N. Motgi, S. T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah, “Online models for content optimization,” in Advances in Neural Information Processing Systems, vol. 21. Curran Associates, Inc., 2009, pp. 17–24. [33] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in International Conference on World Wide Web, 2010, pp. 661–670. [34] C. Baykal, G. Rosman, S. Claici, and D. Rus, “Persistent surveillance of events with unknown, time-varying statistics,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2017, pp. 2682–2689. [35] R. Dimitrova, I. Gavran, R. Majumdar, V. S. Prabhu, and S. E. Z. Soudjani, “The robot routing problem for collecting aggregate stochastic rewards,” arXiv preprint arXiv:1704.05303, 2017. [36] V. Srivastava, F. Pasqualetti, and F. Bullo, “Stochastic surveillance strategies for spatial quickest detection,” The International Journal of Robotics Research, vol. 32, no. 12, pp. 1438–1458, 2013. [37] R. Agrawal, M. V. Hedge, and D. Teneketzis, “Asymptotically efficient adaptive alloca- tion rules for the multi-armed bandit problem with switching cost,” IEEE Transactions on Automatic Control, vol. 33, no. 10, pp. 899–906, 1988. 119 [38] P. Reverdy, V. Srivastava, and N. E. Leonard, “Modeling human decision making in general- ized Gaussian multiarmed bandits,” Proceedings of the IEEE, vol. 102, no. 4, pp. 544–571, 2014. [39] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg, “Batched bandit problems,” The Annals of Statistics, vol. 44, no. 2, pp. 660–681, 2016. [40] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013. [41] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013. [42] C. K. Williams and C. E. Rasmussen, Gaussian processes for Machine Learning. MIT press Cambridge, MA, 2006, vol. 2, no. 3. [43] S. Vasudevan, F. Ramos, E. Nettleton, and H. Durrant-Whyte, “Gaussian process modeling of large-scale terrain,” Journal of Field Robotics, vol. 26, no. 10, pp. 812–840, 2009. [44] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Information-theoretic regret bounds for Gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012. [45] G. A. Hollinger, B. Englot, F. S. Hover, U. Mitra, and G. S. 
Sukhatme, “Active planning for underwater inspection and the benefit of adaptivity,” The International Journal of Robotics Research, vol. 32, no. 1, pp. 3–18, 2013. [46] D. E. Soltero, M. Schwager, and D. Rus, “Generating informative paths for persistent sensing in unknown environments,” in IEEE/RSJ Int Conf on Intelligent Robots and Systems, Vilamoura, Algarve, Portugal, Oct. 2012, pp. 2172–2179. [47] J. Yu, M. Schwager, and D. Rus, “Correlated orienteering problem and its application to informative path planning for persistent monitoring tasks,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014, pp. 342–349. [48] G. A. Hollinger and G. S. Sukhatme, “Sampling-based robotic information gathering al- gorithms,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1271–1287, 2014. [49] G. Hitz, E. Galceran, M.-È. Garneau, F. Pomerleau, and R. Siegwart, “Adaptive continuous- space informative path planning for online environmental monitoring,” Journal of Field Robotics, vol. 34, no. 8, pp. 1427–1449, 2017. 120 [50] G. Hitz, A. Gotovos, M.-É. Garneau, C. Pradalier, A. Krause, R. Y. Siegwart et al., “Fully au- tonomous focused exploration for robotic environmental monitoring,” in IEEE International Conference on Robotics and Automation, 2014, pp. 2658–2664. [51] N. Atanasov, J. Le Ny, K. Daniilidis, and G. J. Pappas, “Information acquisition with sensing robots: Algorithms and error bounds,” in IEEE International Conference on Robotics and Automation, 2014, pp. 6447–6454. [52] A. A. Meera, M. Popović, A. Millane, and R. Siegwart, “Obstacle-aware adaptive informative path planning for uav-based target search,” in IEEE International Conference on Robotics and Automation, 2019, pp. 718–724. [53] X. Lan and M. Schwager, “Planning periodic persistent monitoring trajectories for sensing robots in Gaussian random fields,” in IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, May 2013, pp. 2415–2420. [54] N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. E. Davis, “Collective motion, sensor networks, and ocean sampling,” Proceedings of the IEEE, vol. 95, no. 1, pp. 48–74, 2007. [55] S. L. Smith, M. Schwager, and D. Rus, “Persistent robotic tasks: Monitoring and sweeping in changing environments,” IEEE Transactions on Robotics, vol. 28, no. 2, pp. 410–426, 2012. [56] C. G. Cassandras, X. Lin, and X. Ding, “An optimal control approach to the multi-agent persistent monitoring problem,” IEEE Transactions on Automatic Control, vol. 58, no. 4, pp. 947–961, 2013. [57] R. N. Smith, M. Schwager, S. L. Smith, B. H. Jones, D. Rus, and G. S. Sukhatme, “Persistent ocean monitoring with underwater gliders: Adapting sampling resolution,” Journal of Field Robotics, vol. 28, no. 5, pp. 714–741, 2011. [58] A. Krause and C. E. Guestrin, “Near-optimal nonmyopic value of information in graphical models,” in Proceedings of the 21st Conference Conference on Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 2005, pp. 324–331. [59] P. Auer and R. Ortner, “UCB revisited: Improved regret bounds for the stochastic multi- armed bandit problem,” Periodica Mathematica Hungarica, vol. 61, no. 1-2, pp. 55–65, 2010. [60] S. Kalyanakrishnan and P. Stone, “Efficient selection of multiple bandit arms: Theory and practice,” in ICML, 2010. [61] J. Cortés and F. Bullo, “Coordination and Geometric Optimization via Distributed Dynamical Systems,” SIAM Journal on Control and Optimization, vol. 44, no. 5, pp. 1543–1574, 2005. 
121 [62] F. Lekien and N. E. Leonard, “Nonuniform coverage and cartograms,” IEEE Conference on Decision and Control, pp. 5518–5523, 2010. [63] I. I. Hussein and D. M. Stipanovic, “Effective coverage control for mobile sensor networks with guaranteed collision avoidance,” IEEE Transactions on Control Systems Technology, vol. 15, no. 4, pp. 642–657, 2007. [64] J. Choi, J. Lee, and S. Oh, “Swarm intelligence for achieving the global maximum using spatio-temporal Gaussian processes,” Proceedings of the American Control Conference, pp. 135–140, 2008. [65] Y. Xu and J. Choi, “Adaptive sampling for learning Gaussian processes using mobile sensor networks,” Sensors, vol. 11, no. 3, pp. 3051–3066, 2011. [66] W. Luo and K. Sycara, “Adaptive Sampling and Online Learning in Multi-Robot Sensor Coverage with Mixture of Gaussian Processes,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2018, pp. 6359–6364. [67] W. Luo, C. Nam, G. Kantor, and K. Sycara, “Distributed environmental modeling and adaptive sampling for multi-robot sensor coverage,” in International Joint Conference on Autonomous Agents and Multiagent Systems, 2019, pp. 1488–1496. [68] M. Todescato, A. Carron, R. Carli, G. Pillonetto, and L. Schenato, “Multi-robots Gaus- sian estimation and coverage control: From client–server to peer-to-peer architectures,” Automatica, vol. 80, pp. 284–294, 2017. [69] A. Benevento, M. Santos, G. Notarstefano, K. Paynabar, M. Bloch, and M. Egerstedt, “Multi-robot coordination for estimation and coverage of unknown spatial fields,” in IEEE International Conference on Robotics and Automation, 2020, pp. 7740–7746. [70] J. Audibert and S. Bubeck, “Minimax policies for adversarial and stochastic bandits,” in Proceedings of the 22nd conference on learning theory, 2009, pp. 217–226. [71] A. N. Burnetas and M. N. Katehakis, “Optimal adaptive policies for sequential allocation problems,” Advances in Applied Mathematics, vol. 17, no. 2, pp. 122–142, 1996. [72] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in IEEE Annual Foundations of Computer Science, 1995, pp. 322–331. [73] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, no. Jun, pp. 623–648, 2004. [74] L. Wei and V. Srivastava, “Minimax policy for heavy-tailed bandits,” IEEE Control Systems Letters, vol. 5, no. 4, pp. 1423–1428, 2021. 122 [75] X. Fan, I. Grama, and Q. Liu, “Hoeffding’s inequality for supermartingales,” Stochastic Processes and their Applications, vol. 122, no. 10, pp. 3545–3559, 2012. [76] S. Bubeck, “Bandits games and clustering foundations,” Ph.D. dissertation, Université des Sciences et Technologie de Lille - Lille I, 2010. [77] O. C. E. Kaufmann and A. Garivier, “On bayesian upper confidence bounds for bandit problems,” in Artificial Intelligence and Statistics, 2012, pp. 592–600. [78] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, 2012, pp. 39–1. [79] N. K. E. Kaufmann and R. Munos, “Thompson sampling: An asymptotically optimal finite- time analysis,” in International Conference on Algorithmic Learning Theory. Springer, 2012, pp. 199–213. [80] P. Ménard and A. Garivier, “A minimax and asymptotically optimal algorithm for stochastic bandits,” in Algorithmic Learning Theory, vol. 76, 2017. [81] R. Degenne and V. 
Perchet, “Anytime optimal algorithms in stochastic multi-armed bandits,” in International Conference on Machine Learning, 2016, pp. 1587–1595. [82] L. Wei and V. Srivastava, “On abruptly-changing and slowly-varying multiarmed bandit problems,” in American Control Conference, Milwaukee, WI, June 2018, pp. 6291–6296. [83] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963. [84] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 148–164, 1979. [85] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010. [86] L. Kocsis and C. Szepesvári, “Discounted UCB,” in 2nd PASCAL Challenges Workshop, vol. 2, 2006. [87] C. Hartland, N. Baskiotis, S. Gelly, M. Sebag, and O. Teytaud, “Change Point Detection and Meta-Bandits for Online Learning in Dynamic Environments,” in Conférence Francophone Sur L’Apprentissage Automatique, Grenoble, France, July 2007, pp. 237–250. [88] F. Liu, J. Lee, and N. Shroff, “A change-detection based framework for piecewise-stationary multi-armed bandit problem,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 123 [89] L. Besson and E. Kaufmann, “The generalized likelihood ratio test meets klucb: an improved algorithm for piece-wise non-stationary bandits,” arXiv preprint arXiv:1902.01575, 2019. [90] Y. Cao, Z. Wen, B. Kveton, and Y. Xie, “Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit,” in International Conference on Artificial Intelli- gence and Statistics, 2019, pp. 418–427. [91] R. Allesiardo and R. Féraud, “Exp3 with drift detection for the switching bandit problem,” in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2015, pp. 1–7. [92] J. Mellor and J. Shapiro, “Thompson sampling in switching environments with bayesian online change detection,” in Artificial Intelligence and Statistics, 2013, pp. 442–450. [93] P. Auer, P. Gajane, and R. Ortner, “Adaptively tracking the best bandit arm with an unknown number of distribution changes,” in Annual Conference on Learning Theory, 2019, pp. 138–158. [94] Y. Chen, C. W. Lee, H. Luo, and C. Y. Wei, “A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free,” in Proceedings of the 32nd Conference on Learning Theory, vol. 99, Phoenix, USA, 2019, pp. 696–726. [95] L. Wei and V. Srivastava, “On distributed multi-player multiarmed bandit problems in abruptly changing environment,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 5783–5788. [96] R. Burkard, M. Dell’Amico, and S. Martello, Assignment problems: revised reprint. SIAM, 2012. [97] D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of operations research, vol. 14, no. 1, pp. 105–123, 1988. [98] E. Boursier and V. Perchet, “SIC-MMAB: Synchronisation involves communication in multi- player multi-armed bandits,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 2249–2257. [99] C. Shi and C. Shen, “On no-sensing adversarial multi-player multi-armed bandits with collision communications,” IEEE Journal on Selected Areas in Information Theory, 2021. [100] I. Bistritz and A. 
Leshem, “Distributed multi-player bandits-a game of thrones approach,” in Advances in Neural Information Processing Systems, vol. 31, 2018, pp. 7222–7232. [101] J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving pareto optimality through distributed learning,” SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014. 124 [102] O. Besbes, Y. Gur, and A. Zeevi, “Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards,” Stochastic Systems, vol. 9, no. 4, pp. 319–337, 2019. [103] V. Raj and S. Kalyani, “Taming non-stationary bandits: A Bayesian approach,” arXiv preprint arXiv:1707.09727, 2017. [104] W. C. Cheung, D. Simchi-Levi, and R. Zhu, “Learning to optimize under non-stationarity,” in Proceedings of Machine Learning Research, vol. 89, 16–18 Apr 2019, pp. 1079–1087. [105] P. Zhao, L. Zhang, Y. Jiang, and Z.-H. Zhou, “A simple approach for non-stationary linear bandits,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 746–755. [106] Y. Russac, C. Vernade, and O. Cappé, “Weighted linear bandits for non-stationary en- vironments,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 12 017–12 026. [107] L. Wei, X. Tan, and V. Srivastava, “Expedited multi-target search with guaranteed perfor- mance via multi-fidelity Gaussian processes,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV (Virtual), Oct. 2020, pp. 7095–7100. [108] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018. [109] P. Perdikaris, “Gaussian processes a hands-on tutorial,” 2017. [Online]. Available: https://github.com/paraklas/GPTutorial [110] S. Kemna, J. G. Rogers, C. Nieto-Granda, S. Young, and G. S. Sukhatme, “Multi-robot coordination through dynamic Voronoi partitioning for informative adaptive sampling in communication-constrained environments,” in IEEE International Conference on Robotics and Automation, 2017, pp. 2124–2130. [111] D. Applegate, R. Bixby, V. Chvatal, and W. Cook, “Concorde TSP solver,” 2006. [112] J. Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in Proceed- ings of the 23rd Conference on Learning Theory, 2010, pp. 41–53. [113] E. Rolf, D. Fridovich-Keil, M. Simchowitz, B. Recht, and C. Tomlin, “A successive- elimination approach to adaptive robotic sensing,” ArXiv e-prints, 2018. [114] M. M. M. Manhães, S. A. Scherer, M. Voss, L. R. Douat, and T. Rauschenbach, “UUV simulator: A Gazebo-based package for underwater intervention and multi-robot simulation,” in OCEANS 2016 MTS/IEEE Monterey. IEEE, 2016, pp. 1–8. 125 [115] M. Abramowitz and I. A. Stegun, Eds., Handbook of Mathematical Functions: with Formu- las, Graphs, and Mathematical Tables. Dover Publications, 1964. [116] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions,” Mathematical programming, vol. 14, no. 1, pp. 265–294, 1978. [117] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos, “Gaussian process bandit optimisation with multi-fidelity evaluations,” in Advances in Neural Information Processing Systems, 2016, pp. 992–1000. [118] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012. [119] H. J. Karloff, “How long can a euclidean traveling salesman tour be?” SIAM Journal on Discrete Mathematics, vol. 2, no. 1, pp. 91–99, 1989. [120] A. Singh, A. Krause, C. Guestrin, and W. J. 
Kaiser, “Efficient informative sensing using multiple robots,” Journal of Artificial Intelligence Research, vol. 34, no. 2, p. 707, 2009. [121] A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in gaussian pro- cesses: Theory, efficient algorithms and empirical studies,” Journal of Machine Learning Research, vol. 9, no. Feb, pp. 235–284, 2008. [122] J. L. Ny and G. J. Pappas, “On trajectory optimization for active sensing in Gaussian process models,” in IEEE Conf on Decision and Control and Chinese Control Conference, Shanghai, China, Dec. 2009, pp. 6286–6292. [123] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen, “Combinatorial pure exploration of multi-armed bandits,” in Advances in Neural Information Processing Systems, 2014, pp. 379–387. [124] P. Reverdy, V. Srivastava, and N. E. Leonard, “Satisficing in multi-armed bandit problems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3788 – 3803, 2017. [125] L. Wei, A. McDonald, and V. Srivastava, “Multi-robot Gaussian process estimation and coverage: a deterministic sequencing algorithm and regret analysis,” in Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, CN (Virtual), 202. [126] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I : Estimation Theory. Prentice Hall, 1993. [127] J. W. Durham, R. Carli, P. Frasca, and F. Bullo, “Discrete partitioning and coverage control for gossiping robots,” IEEE Transactions on Robotics, vol. 28, no. 2, pp. 364–378, 2012. 126 [128] F. Bullo, J. Cortés, and S. Martínez, Distributed Control of Robotic Networks, ser. Ap- plied Mathematics Series. Princeton University Press, 2009, electronically available at http://coordinationbook.info. [129] H. Lim and C. Kim, “Flooding in wireless ad hoc networks,” Computer Communications, vol. 24, no. 3-4, pp. 353–363, 2001. [130] L. Wang and F. Xiao, “Finite-time consensus problems for networks of dynamic agents,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 950–955, 2010. [131] C. Paradis, “Fires from Space: Australia,” 2019. [Online]. Available: https: //www.kaggle.com/carlosparadis/fires-from-space-australia-and-new-zeland [132] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982. [133] F. Bullo, R. Carli, and P. Frasca, “Gossip coverage control for robotic networks: Dynamical systems on the space of partitions,” SIAM Journal on Control and Optimization, vol. 50, no. 1, pp. 419–447, 2012. [134] Q. Du, M. Emelianenko, and L. Ju, “Convergence of the lloyd algorithm for computing centroidal voronoi tessellations,” SIAM journal on numerical analysis, vol. 44, no. 1, pp. 102–119, 2006. [135] M. Schwager, D. Rus, and J. J. Slotine, “Decentralized, adaptive coverage control for net- worked robots,” International Journal of Robotics Research, vol. 28, no. 3, pp. 357–375, 2009. [136] M. Schwager, M. P. Vitus, S. Powers, D. Rus, and C. J. Tomlin, “Robust adaptive coverage control for robotic sensor networks,” IEEE Transactions on Control of Network Systems, vol. 4, no. 3, pp. 462–476, 2017. [137] P. Davison, N. E. Leonard, A. Olshevsky, and M. Schwemmer, “Nonuniform Line Coverage from Noisy Scalar Measurements,” IEEE Transactions on Automatic Control, vol. 60, no. 7, pp. 1975–1980, 2015. [138] J. Choi and R. Horowitz, “Learning coverage control of mobile sensing agents in one- dimensional stochastic environments,” IEEE Transactions on Automatic Control, vol. 55, no. 3, pp. 804–809, 2010. [139] M. 
Agarwal, V. Aggarwal, and K. Azizzadenesheli, “Multi-agent multi-armed bandits with limited communication,” arXiv preprint arXiv:2102.08462, 2021. [140] P. Toth and D. Vigo, The vehicle routing problem. SIAM, 2002. 127 [141] A. Gunawan, H. C. Lau, and P. Vansteenwegen, “Orienteering problem: A survey of recent variants, solution approaches and applications,” European Journal of Operational Research, vol. 255, no. 2, pp. 315–332, 2016. [142] F. Pasqualetti, J. W. Durham, and F. Bullo, “Cooperative patrolling via weighted tours: Performance analysis and distributed algorithms,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1181–1188, 2012. 128