OPTIMAL LEARNING OF DEPLOYMENT AND SEARCH STRATEGIES FOR ROBOTIC TEAMS

By Lai Wei

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering – Doctor of Philosophy

2021

ABSTRACT

OPTIMAL LEARNING OF DEPLOYMENT AND SEARCH STRATEGIES FOR ROBOTIC TEAMS

By Lai Wei

In the problem of optimal learning, the dilemma of exploration and exploitation stems from the fact that gathering information and exploiting it are, in many cases, two mutually exclusive activities. The key to optimal learning is to strike a balance between exploration and exploitation. The Multi-Armed Bandit (MAB) problem is a prototypical example of such an explore-exploit tradeoff, in which a decision-maker sequentially allocates a single resource by repeatedly choosing one among a set of options that provide stochastic rewards. The MAB setup has been applied in many robotics problems such as foraging, surveillance, and target search, wherein the task of robots can be modeled as collecting stochastic rewards. The theoretical work of this dissertation is based on the MAB setup, and three problem variations, namely heavy-tailed bandits, nonstationary bandits, and multi-player bandits, are studied. The first two variations capture two key features of stochastic feedback in complex and uncertain environments: heavy-tailed distributions and nonstationarity; while the last one addresses the problem of achieving coordination in uncertain environments. We design several algorithms that are robust to heavy-tailed distributions and nonstationary environments. In addition, two distributed policies that require no communication among agents are designed for the multi-player stochastic bandits in a piece-wise stationary environment.

The MAB problems provide a natural framework to study robotic search problems. The above variations of the MAB problems directly map to robotic search tasks in which a robot team searches for a target from a fixed set of view-points (arms). We further focus on the class of search problems involving the search for an unknown number of targets in a large or continuous space. We view the multi-target search problem as a hot-spots identification problem in which, instead of the global maximum of the field, all locations with a value greater than a threshold need to be identified. We consider a robot moving in 3D space with a downward-facing camera sensor. We model the robot's sensing output using a multi-fidelity Gaussian Process (GP) that systematically describes the sensing information available at different altitudes from the floor. Based on the sensing model, we design a novel algorithm that (i) addresses the coverage-accuracy tradeoff: sampling at a location farther from the floor provides a wider field of view but less accurate measurements, (ii) computes an occupancy map of the floor within a prescribed accuracy and quickly eliminates unoccupied regions from the search space, and (iii) travels efficiently to collect the required samples for target detection. We rigorously analyze the algorithm and establish formal guarantees on the target detection accuracy and the detection time.

An approach to extend the single robot search policy to multiple robots is to partition the environment into multiple regions such that the workload is equitably distributed among all regions and then assign a robot to each region.
The coverage control focuses on such equitable partitioning, and the workload is equivalent to the so-called service demands in the coverage control literature. In particular, we study the adaptive coverage control problem, in which the demands of robotic service within the environment are modeled as a GP. To optimize the coverage of service demands in the environment, the team of robots aims to partition the environment and achieve a configuration that minimizes the coverage cost, which is a measure of the average distance of a service demand from the nearest robot. The robots need to address the explore-exploit tradeoff: to minimize coverage cost, they need to gather information about demands within the environment, whereas information gathering deviates them from maintaining a good coverage configuration. We propose an algorithm that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret, we analyze the algorithm and characterize its coverage performance over a finite time horizon.

Copyright by LAI WEI 2021

ACKNOWLEDGEMENTS

Over the course of my PhD, I have received encouragement and support from many people. First and foremost, I would like to express my deep and most sincere gratitude to my advisor, Prof Vaibhav Srivastava. He guided me into the fascinating world of control and learning research and led my pathway to this dissertation. I really enjoyed my research experience with him. He inspired me with his vision and expertise in every stage of my PhD study. I'll always remember his great mentorship, constant encouragement, and impeccable support.

I thank all my committee members, Prof Xiaobo Tan, Prof Ranjan Mukherjee, and Prof Shaunak Bopardikar, for their great support and advice during my PhD study. Also, I thank them for teaching me adaptive control, nonlinear control, and game theory. It has been a really nice experience to attend our robotics reading group and the National Robotics Initiative project meetings together. I thank all my teachers at Michigan State University for their detailed explanations and quick responses to my questions. I particularly thank Prof Hassan Khalil for teaching me my first PhD course, linear systems and control, and for igniting my interest in control theory. I'm also grateful to him for advising me before Dr. Srivastava joined MSU. I thank every member of Prof Srivastava's, Prof Bopardikar's, and Prof Tan's labs for their friendship, for their research presentations, and for all our brainstorming sessions. Thank you Andrew McDonald for working together with me on coverage control. Thank you Pearce Reickert and Eric Gaskell for helping me design the robotic underwater target search experiment.

My stay at MSU has truly been a fun experience. I'm blessed to have made so many friends here. I thank Prof Ning Xi for taking me to MSU. Thank you Liangliang Chen for all the help when I first came to the U.S. and all the get-togethers at your home. Thank you Hongyang Shi for helping me move. I'm so fortunate to have met my girlfriend Kemeng Wang at MSU. Thank you so much for your constant love and encouragement. Thank you for being tolerant and supportive. Last, but not least, I'd like to thank my mother and father for their unconditional love and endless support enabling me to pursue my dream.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1  INTRODUCTION
  1.1 Background and Literature Synopsis
  1.2 Contribution and Organization
CHAPTER 2  STATIONARY STOCHASTIC BANDITS
  2.1 Lower Bounds on Regrets
  2.2 Upper Confidence Bound Strategies
  2.3 Heavy-tailed Stochastic MAB
    2.3.1 A Robust Minimax Policy: Robust MOSS
    2.3.2 Analysis of Robust MOSS
    2.3.3 Numerical Illustration of Robust MOSS
  2.4 Summary
  2.5 Bibliographic Remarks
CHAPTER 3  PIECE-WISE STATIONARY STOCHASTIC BANDITS
  3.1 Preliminaries
  3.2 The LM-DSEE Algorithm
  3.3 Analysis of the LM-DSEE Algorithm
  3.4 The SW-UCB# Algorithm
  3.5 Analysis of the SW-UCB# Algorithm
  3.6 Numerical Illustration
  3.7 Summary
  3.8 Bibliographic Remarks
CHAPTER 4  MULTI-PLAYER PIECEWISE STATIONARY STOCHASTIC BANDITS
  4.1 The RR-SW-UCB# Algorithm
  4.2 Analysis of the RR-SW-UCB# Algorithm
  4.3 The SW-DLP Algorithm
  4.4 Analysis of the SW-DLP Algorithm
  4.5 Numerical Illustration
  4.6 Summary
  4.7 Bibliographic Remarks
CHAPTER 5  GENERAL NONSTATIONARY BANDITS WITH VARIATION BUDGET
  5.1 Lower Bound on Minimax Regret in Nonstationary Environment
  5.2 UCB Algorithms for Sub-Gaussian Nonstationary Stochastic Bandits
    5.2.1 Resetting MOSS Algorithm
    5.2.2 Sliding-Window MOSS Algorithm
    5.2.3 Discounted UCB Algorithm
  5.3 UCB Policies for Heavy-tailed Nonstationary Stochastic MAB Problems
    5.3.1 Resetting Robust MOSS for the Non-stationary Heavy-tailed MAB Problem
    5.3.2 SW-RMOSS for the Non-stationary Heavy-tailed MAB Problem
  5.4 Numerical Experiments
    5.4.1 Bernoulli Nonstationary Stochastic MAB Experiment
    5.4.2 Heavy-tailed Nonstationary Stochastic MAB Experiment
  5.5 Summary
  5.6 Bibliographic Remarks
CHAPTER 6  MULTI-TARGET SEARCH VIA MULTI-FIDELITY GAUSSIAN PROCESSES
  6.1 Multi-target Search Problem Description
    6.1.1 Multi-fidelity Sensing Model
    6.1.2 Objective of the Multi-target Search Algorithm
  6.2 Expedited Multi-target Search Algorithm
    6.2.1 Inference Algorithm for Multi-fidelity GPs
    6.2.2 Multi-fidelity Sampling & Path Planning
    6.2.3 Classification and Region Elimination
  6.3 An Illustrative Example
  6.4 Analysis of the EMTS Algorithm
    6.4.1 Analysis of the Classification Algorithm
    6.4.2 Analysis of the Sampling and Fidelity Planner
    6.4.3 Analysis of Expected Detection Time
  6.5 Summary
  6.6 Bibliographic Remarks
CHAPTER 7  ONLINE ESTIMATION AND COVERAGE CONTROL
  7.1 Online Estimation and Coverage Problem
    7.1.1 Graph Representation of Environment
    7.1.2 Nonparametric Estimation
    7.1.3 Voronoi Partition and Coverage Problem
    7.1.4 Coverage Performance Evaluation
  7.2 Deterministic Sequencing of Learning and Coverage Algorithm
    7.2.1 Estimation Phase
    7.2.2 Information Propagation Phase
    7.2.3 Coverage Phase
  7.3 Analysis of the DSLC Algorithm
    7.3.1 Mutual Information and Uncertainty Reduction
    7.3.2 Convergence within Coverage Phase
    7.3.3 An Upper Bound on Expected Coverage Regret
  7.4 Simulation Results
  7.5 Summary
  7.6 Bibliographic Remarks
CHAPTER 8  CONCLUSIONS AND FUTURE DIRECTIONS
BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1: Comparison of Robust MOSS with MOSS and other Robust UCB algorithms.
Figure 3.1: Comparison of LM-DSEE and SW-UCB#.
Figure 4.1: Simulation of RR-SW-UCB# and SW-DLP in a piecewise stationary environment.
Figure 5.1: Comparison of different policies.
Figure 5.2: Performances with heavy-tailed rewards.
Figure 6.1: Architecture of EMTS.
Figure 6.2: Underwater victim search simulation setups.
Figure 6.3: Simulation result of EMTS.
Figure 6.4: Uncertainty reduction results.
Figure 7.1: Distributed implementation of DSLC.
Figure 7.2: Comparison of DSLC, Todescato and Cortes.

LIST OF ALGORITHMS

1 The LM-DSEE Algorithm
2 The SW-UCB# Algorithm
3 The RR-SW-UCB# Algorithm
4 The SW-DLP Algorithm
5 The R-MOSS Algorithm
6 The SW-MOSS Algorithm
7 The D-UCB Algorithm
8 Deterministic Sequencing of Learning and Coverage (DSLC)

CHAPTER 1  INTRODUCTION

Decision-making in the face of uncertainty is one of the most fundamental problems in robotics research, which requires the system to actively learn the environment while completing assigned tasks. The exploration versus exploitation tradeoff is a key challenge in such problems. Here, exploration means learning the environment to reduce the uncertainty, while exploitation means taking the best actions according to the existing information. In applications such as robotic deployment and search, where efficiency is of major concern, it is imperative to strike a good balance between exploration and exploitation. This encourages an investigation into algorithms that enable the robotic system to address this tradeoff in uncertain environments and achieve good finite-time performance.

The problems of interest in this dissertation include both theoretical investigations of the exploration versus exploitation tradeoff and its applications to robotic systems. The theoretical work is based on the Multi-Armed Bandit (MAB) setup [1], which is a classic mathematical formulation featuring the exploration versus exploitation tradeoff. With no prior information, a decision-maker sequentially chooses one among a set of stochastic arms (options) to achieve the maximum cumulative reward. An efficient adaptive arm selection rule keeps a good balance between learning the expected mean rewards and picking the empirically most profitable option. The high-level ideas embodied in the MAB problem extend naturally to many applications dealing with resource allocation in the face of uncertainty. Two prototypical examples in robotics research are target search and multi-robot deployment, and they are studied in this work.
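To make the sequential interaction just described concrete, the following minimal Python sketch (added here for illustration; it is not part of the original text) simulates a decision-maker repeatedly choosing among stochastic arms. The Gaussian reward model and the epsilon-greedy selection rule are arbitrary placeholder choices, not the algorithms developed in this dissertation.

```python
# Minimal sketch of the MAB interaction loop: pick an arm, observe a reward,
# update the estimate, and keep track of the regret incurred.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])       # unknown mean rewards (hypothetical)
K, T, eps = len(mu), 10_000, 0.05

counts = np.zeros(K)
means = np.zeros(K)                  # empirical mean reward of each arm
regret = 0.0

for t in range(T):
    if rng.random() < eps:           # explore: try a random arm
        arm = int(rng.integers(K))
    else:                            # exploit: pick the best-looking arm
        arm = int(np.argmax(means))
    reward = mu[arm] + rng.normal(0.0, 0.1)
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
    regret += mu.max() - mu[arm]     # expected loss of this choice

print(f"cumulative regret after T={T}: {regret:.1f}")
```

The balance between the two branches of the if-statement is exactly the explore-exploit tradeoff that the policies studied in later chapters handle in a more principled way.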
In a target search problem, to quickly and accurately locate the targets of interest, the autonomous vehicle needs to effectively learn the likelihood of target positions through sensing and spend more effort at locations that are more likely to contain the target. The multi-robot deployment problem deals with the optimal allocation of a team of robots such that the robot team configuration matches the demand for services within the environment. For example, more firefighting robots should be assigned to areas with a larger probability of wildfire breakouts to ensure a shorter response time.

In this dissertation, we study three variations of the MAB problem, namely heavy-tailed bandits, nonstationary bandits, and multi-player bandits. Though the classical stochastic MAB problem has a concise formulation and well-established theoretical guarantees, it fails to capture many key properties of stochastic processes in real-world problems. The heavy-tailed MAB relaxes the assumption that rewards from each arm are bounded or sub-Gaussian. This extension is motivated by applications such as social networks [2] and financial markets [3], wherein certain variables of interest exhibit heavy-tailed distributions. In the nonstationary MAB problem, in addition to being unknown, the reward distributions are assumed to be time-varying. This formulation characterizes the drift of physical processes in unknown dynamic working environments, e.g., the frequency of certain biological activities changes with the light conditions over the course of a day. The multi-player MAB problem involves maximizing the total group reward such that no two decision-makers select the same arm. This requires achieving a coordinated behavior of multiple decision-makers facing uncertainty. An important aspect of the multi-player MAB problems is the collision model, under which the reward is either shared or not received at all if multiple agents select the same option. This formulation arises naturally in cognitive radio [4–6], wherein a transceiver can intelligently detect and use vacant communication channels, and provides a rich modeling framework for other interesting domains such as animal and robotic foraging [7, 8], autonomous surveillance [9], and acoustic relay positioning for underwater communication [10].

A canonical example of the exploration-exploitation tradeoff in robotics applications is the target search problem. In many scenarios, the search task is to find the locations that emit large signals [11]. For example, the human body is relatively hot compared with its surroundings and emits more infrared radiation, which can be used to sense the existence of victims in a search and rescue scenario. The MAB problem provides a natural framework to study robotic search problems. In particular, the class of robotic search problems in which a robot team searches for a target from a set of view-points (arms), or monitors an environment from a set of viewpoints, directly maps to the above variations of the MAB problems. We next focus on the class of search problems involving the search for an unknown number of targets in a large or continuous space that is not constrained to be observed from a small set of viewpoints. We consider a robot moving in a 3D continuous environment to search for targets located on the 2D floor. For a given location of the robot, the sensors on the robot provide a score indicating the likelihood of the presence of a victim within the sensing footprint of the robot.
We refer to these scores at all locations in the environment at a given altitude as the sensing field. The group of sensing fields at different altitudes from the floor is modeled as a multi-fidelity Gaussian process (GP) [12], which is an auto-regressive model that captures the following fact: sensing at locations farther from the floor provides a wider field of view but less accurate measurements. The objective is to design a search policy that schedules the altitude and locations of the points at which the robot should collect samples such that the search time is minimized while ensuring a desired search accuracy.

An approach to extend the single robot search policy to $N$ robots is to partition the environment into $N$ regions such that some measure of "search load" is equitably distributed across all regions. Coverage control focuses on such equitable partitioning, and the search load is typically referred to as the service demand in the coverage control literature. In particular, coverage control [13] concerns the deployment of a multi-agent team in order to meet "servicing" demands in an environment. The team of agents aims to minimize the coverage cost, which is a function of the demand distribution, agent locations in the environment, and the allocation of the set of points within the environment to robots. In a standard coverage problem [13], the demand function is assumed to be known, while in this dissertation, we assume it to be unknown and model it as a realization of a GP. Similar to the MAB problem, the coverage problem with unknown demands exhibits the exploration versus exploitation tradeoff: gathering samples to learn the demands requires deployments that are inefficient with respect to the true demands, whereas exploiting the current demand estimate may keep the team in a poor coverage configuration.
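One simple instance of the coverage cost described above is the demand-weighted average distance from each point of a discretized environment to its nearest robot. The following minimal Python sketch (added for illustration; it is not part of the original text) evaluates such a cost; the demand field, grid, and robot positions are arbitrary stand-ins.

```python
# Minimal sketch of a discretized coverage cost: each grid point is assigned
# to its nearest robot, and the cost is the demand-weighted average distance
# to that robot. The demand field below is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
n = 50                                           # grid resolution
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
points = np.stack([xs.ravel(), ys.ravel()], axis=1)
demand = np.exp(-10 * ((points[:, 0] - 0.7) ** 2 + (points[:, 1] - 0.3) ** 2))
demand /= demand.sum()                           # normalize to a distribution

def coverage_cost(robots: np.ndarray) -> float:
    """Demand-weighted average distance from each point to its nearest robot."""
    dists = np.linalg.norm(points[:, None, :] - robots[None, :, :], axis=2)
    return float(np.sum(demand * dists.min(axis=1)))

robots = rng.random((4, 2))                      # 4 robot positions in [0,1]^2
print(f"coverage cost: {coverage_cost(robots):.4f}")
```

With an unknown demand function, a policy can only evaluate such a cost against its current estimate of the demand, which is the source of the explore-exploit tension discussed above.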
Their work allows the MAB model to be used in applications such as social networks [2] and financial markets [3] wherein certain variables of interest are inherently heavy-tailed. The nonstationary MAB problem captures the dynamic aspect of the environment and has received some interest. In [19], the authors studied the bandit problem in which an adversary, rather than well-behaved stochastic arms, controls the payoffs. The performance of a policy is evaluated using the weak regret, which is the difference in the cumulated reward of a policy compared against the best single action policy. While being able to capture nonstationarity, the generality of the reward model in adversarial MAB makes the investigation of globally optimal policies very challenging. The nonstationary stochastic MAB can be viewed as a compromise between stationary stochastic MAB and adversarial MAB. It maintains the stochastic nature of the 4 reward sequence while allowing some degree of nonstationarity in reward distributions. Instead of the weak regret analyzed in adversarial MAB, a strong notion of regret defined with respect to the best arm at each time step is studied in these problems. A broadly studied nonstationary problem is piecewise stationary MAB [20], wherein the reward distributions are piecewise stationary. A more general nonstationary problem is studied in [21], wherein the cumulative maximum variation in mean rewards is subject to a variation budget. In some decision-making problems, multiple agents could get involved and the choice made by one agent could influence the selections of other agents. Most multi-player MAB studies deal with a stationary environment and the task is to maximize the total rewards collected by all the agents. As in the single-player case, the performance of the entire group of agents can be characterized using group regret, which is defined as the loss in expected total rewards caused by the agents failing to select the best set of arms every time. In [22], a lower bound on the group regret for a centralized policy is derived and algorithms that asymptotically achieve this lower bound are designed. Distributed multi-player MAB problem with no communication among players has been studied in [4, 5, 23–25]. In [26–28], distributed cooperative agents communicate through a communication network to improve their estimates of mean rewards and arm selections. The MAB problem has been applied in many scientific and technological areas. For example, it is used for opportunistic spectrum access in communication networks, wherein multiple secondary users actively detect and use vacant channels [5, 29]. The arm models the availability of a channel and the reward from an arm is eliminated if it is selected by multiple users. In the MAB formulation of online learning for demand response [30, 31], an aggregator calls upon a subset of users (arms) who have an unknown response to the request to reduce their loads. Besides, contextual bandits are widely used in recommender systems [32, 33], wherein the acceptance of a recommendation corresponds to the rewards from an arm. The MAB setup has also been used in robotic research, which is one of the major topics in this dissertation. In many robotic applications such as foraging, surveillance [7–9, 34] and acoustic relay po- sitioning for underwater communication [10], the task of robots can be modeled to be collecting 5 stochastic rewards [35, 36]. 
These rewards may correspond to, for example, the likelihood of an anomaly at a spatial location, the concentration of a certain type of algae in the ocean, the communication quality of a specific location, etc. Algorithms for the MAB problems have been extended to these problems. It needs to be noted that in robotic applications, motion is normally energy-consuming and time-demanding, while switching between arms comes at no cost in MAB models. By introducing block-allocation strategies, the exploration versus exploitation tradeoff can be balanced with a sufficiently small number of arm switches [37–39]. In [9], the robotic surveillance problem is studied in an environment that is abruptly changing due to the arrival of unknown spatial events. To solve the problem, a block-allocation strategy [20] is adapted to the piece-wise stationary MAB setting. Other algorithms that benefit path planning in robotic applications include the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithms [25, 40, 41], due to their deterministic and predictable structures.

Unlike the MAB problems, in which rewards from different arms are commonly assumed to be independent, the feedback in robotic applications, such as sensing information from different locations, is usually correlated. GPs are powerful tools to capture spatially correlated information, and they are widely used to characterize spatiotemporal sensing fields [42, 43]. Sung et al. [11] study the hot-spot identification problem in an environment within the framework of GP MAB [38, 44]. GPs have also been used in robotic inspection and search. Hollinger et al. [45] study an inspection problem in which the robot needs to classify the underwater surface. They use a combination of GP-implicit surface modeling and sequential hypothesis testing to classify surfaces.

In autonomous robotic target search, the vehicle is required to quickly and accurately locate the targets of interest in an unknown and uncertain environment. Examples include victim search and rescue, mineral exploration, and tracking natural phenomena. There have been some efforts to address target search within the context of informative path planning [9, 11, 36, 45–58], which deals with path planning for the robots to maximize the utility of data collection. For example, Meera et al. [52] model the target occupancy map as a GP and design a heuristic algorithm for target detection that handles tradeoffs among information gain, field coverage, sensor performance, and collision avoidance. In this dissertation, our multi-target search solution is inspired by successive elimination ideas from MAB research in [59, 60]. The robot sequentially collects new sensing information and removes regions unlikely to contain a target from the search task. The proposed algorithm combines informative path planning with Bayesian confidence interval estimation and enables the robot to efficiently collect information and concentrate measurements in promising areas.

Coverage control [13] is another interesting topic in robotics. It arises naturally in multi-robot systems when a team of agents is assigned to deploy themselves over an environment according to a particular demand function, which specifies the degree to which a robot is needed at each location. The objective is to minimize the coverage cost, which is determined by the demand distribution, robot locations, and task allocations.
Example applications of coverage control range from autonomous wildfire fighting, to smart agriculture, to ecological surveying, to environmental cleaning. The classical coverage control problem [13, 61–63] assumes that the demand function is known, while recent works have focused on the adaptive coverage problem, in which agents are not assumed to have knowledge of the demand function a priori. In [64–69], the demand function defined on the working environment is modeled as a realization of a GP, which can be learned by taking samples at different locations. The exploration versus exploitation dilemma in adaptive coverage control is due to the conflict between collecting samples to refine the demand estimate and maintaining a good configuration to reduce the coverage cost. As with the MAB setup, adaptive coverage also solves a stochastic optimization problem, though its task is more complicated than selecting the most rewarding arm. In the same spirit as the DSEE algorithm for MAB problems, the adaptive coverage policy proposed in this dissertation deterministically shifts its emphasis from exploration to exploitation.

1.2 Contribution and Organization

In this section, the organization of the chapters in this dissertation is outlined and the contributions in each chapter are discussed in detail.

Chapter 2. We first review the stationary stochastic MAB problem and corresponding concepts such as regret, worst-case regret, and lower bounds on regret. We then study the heavy-tailed bandits, in which the sub-Gaussian assumption on reward distributions is relaxed. Instead, the reward distributions admit moments of order $1+\epsilon$ for some $\epsilon > 0$, similarly as in [18]. We modify the MOSS [70] algorithm for the sub-Gaussian reward distribution by using a saturated empirical mean to design a new algorithm called Robust MOSS. By analyzing Robust MOSS, we show that it achieves a worst-case regret matching the lower bound while maintaining a distribution-dependent logarithmic regret. To the best of our knowledge, Robust MOSS is the first algorithm to achieve order-optimal worst-case regret for heavy-tailed bandits. This is the major contribution of this chapter. Numerical illustrations are provided to verify the robustness of the proposed algorithm against heavy-tailed rewards. We close the chapter with comments and bibliographic remarks on the classic MAB problem.

Chapter 3. In this chapter, we study a special nonstationary stochastic MAB problem called piece-wise stationary bandits. We assume the mean rewards from arms switch to unknown values at unknown time instants and the reward distribution remains stationary between consecutive switches. The main contribution of this chapter is the design of two generic algorithms, namely, the Limited Memory DSEE (LM-DSEE) and the Sliding-Window UCB# (SW-UCB#). LM-DSEE inherits the structure of the DSEE algorithm [25, 40] for the stationary bandit and comprises interleaving blocks of exploration and exploitation. In the exploitation epochs, an arm is selected based on the most recent exploration. This avoids a large bias in reward estimation in a nonstationary environment. SW-UCB# is a modification of the SW-UCB algorithm from [20] that relaxes the assumption that the horizon length is known. We rigorously show these algorithms incur sublinear regret, i.e., the time average of the regret asymptotically converges to zero. A comparison of both algorithms is made and discussed based on the simulation results.

Chapter 4.
We study the multi-player stochastic bandits in a piece-wise stationary environment. We consider a collision model in which a player receives a reward at an arm only if it is the only player to select the arm. This problem features achieving coordination in an uncertain and nonstationary environment. The contribution of this chapter is the design of two novel distributed algorithms that require no communication between agents, namely the Round-Robin SW-UCB# (RR-SW-UCB#) and the Sliding-Window Distributed Learning with Prioritization (SW-DLP). For both algorithms, it is shown that the group regret is upper bounded by a sublinear function of time even with the collision model, i.e., both algorithms achieve coordination while efficiently learning a nonstationary environment.

Chapter 5. In this chapter, we study a more general nonstationary stochastic MAB problem proposed in [21], in which the cumulative maximum variation in mean rewards is restricted to a variation budget. There is no restriction on how the reward distributions change; for example, they may change abruptly as in the piece-wise stationary bandits, or they may drift slowly between subsequent abrupt changes. The performance of a policy is measured by comparing its cumulative expected rewards with that of an oracle that selects the arm with the maximum mean reward at each time, and it is characterized using the worst-case regret, which is the regret for the worst choice of reward distribution sequences that satisfies the variation budget. We extend UCB-based policies with three different approaches, namely, periodic resetting, a sliding observation window, and a discount factor, and show that they achieve order-optimal worst-case performance. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees. The major contributions of this work are threefold. First, we extend MOSS [70] to design Resetting MOSS (R-MOSS) and Sliding-Window MOSS (SW-MOSS). Also, we show that Discounted UCB (D-UCB) [20] can be tuned to solve the problem. Second, with rigorous analysis, we show that R-MOSS and SW-MOSS achieve the exact order-optimal worst-case performance and D-UCB is near-optimal. Finally, we relax the bounded or sub-Gaussian assumption on the rewards required by these algorithms and design policies robust to heavy-tailed rewards. We show that the theoretical guarantees on the worst-case regret can be maintained by the robust policies.

Chapter 6. We consider a scenario in which an autonomous vehicle equipped with a downward-facing camera operates in a 3D environment and is tasked with searching for an unknown number of stationary targets on the 2D floor of the environment. The key challenge is to design a search policy that minimizes the search time while ensuring a high target detection accuracy. We model the sensing field using a multi-fidelity GP that systematically describes the sensing information available at different altitudes from the floor.
Based on the sensing model, we design a novel algorithm called Expedited Multi-Target Search (EMTS) that (i) addresses the coverage-accuracy tradeoff: sampling at a location farther from the floor provides a wider field of view but less accurate measurements, (ii) computes an occupancy map of the floor within a prescribed accuracy and quickly eliminates unoccupied regions from the search space, and (iii) travels efficiently to collect the required samples for target detection. We rigorously analyze the algorithm and establish formal guarantees on the target detection accuracy and the expected detection time. We illustrate the algorithm using a simulated multi-target search scenario. The primary contribution of this chapter is the extension of the classical informative path planning approach for the single-fidelity GP to the multi-fidelity GP setting. This novel extension allows for jointly planning the sampling locations and associated fidelity levels, and thus addresses the fidelity-coverage tradeoff for expedited target search. The EMTS algorithm is proposed and illustrated in an underwater victim search scenario using the Unmanned Underwater Vehicle Simulator. The algorithm is analyzed in terms of its accuracy and expected detection time.

Chapter 7. We study the problem of distributed multi-robot coverage over an unknown, nonuniform sensory field, which is a deployment problem with uncertain demands. Modeling the sensory field as a realization of a GP and using Bayesian techniques, we devise a policy that aims to balance the tradeoff between learning the sensory function and covering the environment. We propose an adaptive coverage algorithm called Deterministic Sequencing of Learning and Coverage (DSLC) that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret, which characterizes the overall coverage performance of a multi-robot team over a time horizon $T$, we analyze DSLC to provide an upper bound on the expected cumulative coverage regret. Finally, we illustrate the empirical performance of the algorithm through simulations of the coverage task over an unknown distribution of wildfires. The most important contribution of this chapter is the definition of coverage regret, which enables the finite-time analysis of the online estimation and coverage algorithm. Existing works evaluate the algorithm performance by assuming that the coverage algorithm attains global optimality and comparing the performance with a globally optimal result. Since the coverage problem itself is NP-hard, this assumption is too strong, especially with distributed deployment being considered. The coverage regret is defined with respect to locally optimal solutions, which relaxes the assumption of achieving a globally optimal coverage and characterizes the convergence properties of a policy.

Chapter 8. In this chapter, we conclude the dissertation with a summary of contributions and future directions. For the MAB problems addressed in this work, we discuss their potential applications in robotic patrolling. For the robotic target search and adaptive coverage control, the problem generalizations and possible solutions are illustrated in detail.

CHAPTER 2  STATIONARY STOCHASTIC BANDITS

In a stationary stochastic MAB problem, an agent chooses an arm $\varphi_t$ from the set of $K$ arms $\{1, \dots, K\}$ at each time $t \in \{1, \dots, T\}$ and receives the associated random reward $X_t^{\varphi_t}$.
The reward at each arm $k$ is drawn from an unknown probability distribution $f_k$ with unknown mean $\mu_k$. The problem being stationary means that the reward distribution at each arm does not change with time. Normally, it is assumed that the reward distributions are sub-Gaussian.

Definition 2.1 (Sub-Gaussian reward). For any arm $k \in \{1, \dots, K\}$, the probability distribution $f_k$ is $1/2$ sub-Gaussian, i.e., if $X \sim f_k$,
\[
\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\big[\exp(\lambda (X - \mu_k))\big] \le \exp\Big(\frac{\lambda^2}{8}\Big).
\]

The main example of sub-Gaussian rewards is random rewards with bounded support $[0,1]$, which are commonly used in the MAB literature.

The objective in the stochastic MAB problem is to maximize the expected value of the cumulative reward $S_T = \sum_{t=1}^{T} X_t^{\varphi_t}$. We assume that $\varphi_t$ is selected based upon past observations $\{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}$ following some policy $\rho$. Specifically, $\rho$ determines the conditional distribution
\[
\mathbb{P}_\rho\Big(\varphi_t = k \,\Big|\, \{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}\Big)
\]
at each time $t \in \{1, \dots, T-1\}$. If $\mathbb{P}_\rho(\cdot)$ takes binary values, we call $\rho$ deterministic; otherwise, it is called stochastic. Let the maximum mean reward among all arms be $\mu^*$. We use $\Delta_k = \mu^* - \mu_k$ to measure the suboptimality of arm $k$. For a policy $\rho$, maximizing the expected cumulative reward $\mathbb{E}[S_T]$ is equivalent to minimizing the regret defined by
\[
R_T^{\rho} := \mathbb{E}\Big[\sum_{t=1}^{T} \big(\mu^* - X_t^{\varphi_t}\big)\Big] = \sum_{k=1}^{K} \Delta_k\, \mathbb{E}_\rho[n_k(T)],
\]
where $n_k(T)$ is the total number of times arm $k$ has been chosen until time $T$, and the second expectation is taken over different realizations of the arm selections. It needs to be noted that $R_T^{\rho}$ can be viewed as the difference between the expected cumulative reward obtained by always selecting the arm with the maximum mean reward $\mu^*$ and that obtained by selecting arms $\varphi_1, \dots, \varphi_T$.

2.1 Lower Bounds on Regrets

The objective of regret minimization was originally formulated by Robbins [14]. A logarithmic problem-dependent asymptotic lower bound on the number of times a suboptimal arm is selected by a uniformly good policy was established later in [15, 71]. Here, a policy $\rho$ being uniformly good means that, for any possible set of reward distributions $\{f_1, \dots, f_K\}$,
\[
\mathbb{E}[R_T^{\rho}] = o(T^a) \quad \text{for every } a > 0,
\]
which means $\lim_{T \to \infty} \mathbb{E}[R_T^{\rho}]/T^a = 0$ for any $a > 0$.

Lemma 2.1 (Lai and Robbins' lower bound [15, 71]). Suppose there is a unique best arm with reward distribution $f^*$ and a uniformly good policy $\rho$ is applied. For any suboptimal arm $k$ and every $\epsilon > 0$,
\[
\lim_{T \to \infty} \mathbb{P}\bigg(n_k(T) \ge \frac{(1-\epsilon)\log T}{D_{\mathrm{KL}}(f_k \,\|\, f^*)}\bigg) = 1,
\]
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence between two distributions. Hence,
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[n_k(T)]}{\log T} \ge \frac{1}{D_{\mathrm{KL}}(f_k \,\|\, f^*)}.
\]

The above result indicates that a suboptimal arm needs to be selected at least a logarithmic number of times, resulting in the logarithmic lower bound for a stochastic MAB problem. It can also be seen that the regret $R_T^{\rho}$ is implicitly determined by the reward distributions $\{f_1, \dots, f_K\}$ as well as the policy $\rho$. So, $R_T^{\rho}$ is also called the distribution-dependent regret. In contrast, the distribution-independent regret, also known as the worst-case regret, is defined by taking the supremum over all possible combinations of reward distributions.

Definition 2.2 (Worst-case regret). The worst-case regret is the regret for the worst possible choice of reward distributions, and it can be expressed as
\[
R_T^{\mathrm{worst}\,\rho} = \sup_{\{f_1, \dots, f_K\}} R_T^{\rho}.
\]

The regret associated with the policy that minimizes the above worst-case regret is called the minimax regret. According to [72], the minimax regret also has a lower bound $\tfrac{1}{20}\sqrt{KT}$.
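For concreteness, the following worked example (added here for illustration; it is not part of the original text) instantiates Lemma 2.1 for Bernoulli rewards. If arm $k$ and the best arm are Bernoulli with means $\mu_k < \mu^*$, then
\[
D_{\mathrm{KL}}(f_k \,\|\, f^*) = \mu_k \ln\frac{\mu_k}{\mu^*} + (1-\mu_k)\ln\frac{1-\mu_k}{1-\mu^*},
\]
so, for example, $\mu_k = 0.4$ and $\mu^* = 0.5$ give $D_{\mathrm{KL}} \approx 0.020$, and any uniformly good policy must sample arm $k$ roughly $\log T / 0.020 \approx 50 \log T$ times.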
The minimax lower bound above concerns finite-time performance and can be derived by selecting a set of reward distributions that present challenges to the allocation policy. Consider a scenario in which there is a unique best arm and all other arms have identical mean rewards such that the gap between the optimal and suboptimal mean rewards is $\Delta$. For such a problem, it has been shown in [73] that, for any policy $\rho$,
\[
R_T^{\rho} \ge C_1 \frac{K}{\Delta} \ln\Big(\frac{T\Delta^2}{K}\Big) + C_2 \frac{K}{\Delta}, \tag{2.1}
\]
where $C_1$ and $C_2$ are some positive constants. It needs to be noted that for $\Delta = \sqrt{K/T}$, the above lower bound becomes $C_2 \sqrt{KT}$, which matches the lower bound $\tfrac{1}{20}\sqrt{KT}$.

2.2 Upper Confidence Bound Strategies

The family of Upper Confidence Bound (UCB) strategies uses the principle called optimism in the face of uncertainty. At each time slot, a UCB index, which is a statistical index composed of both a mean reward estimate and the associated uncertainty measure, is computed at each arm, and the arm with the maximum UCB index is picked. Within the family of UCB algorithms, two state-of-the-art algorithms for the stationary stochastic MAB problem are UCB1 [16] and MOSS [70]. With arm $k$ being sampled $n_k(t)$ times before time $t$, let $\hat\mu_{k, n_k(t)}$ be the associated empirical mean. Then, UCB1 computes the UCB index for each arm $k$ at time $t$ as
\[
g_{k,t}^{\mathrm{UCB1}} = \hat\mu_{k, n_k(t)} + \sqrt{\frac{2 \ln t}{n_k(t)}}.
\]
The finite-time performance guarantee of UCB1, which is stronger than an asymptotic property, has been proved in [16].

Lemma 2.2 (Regret upper bound for UCB1). For the stationary stochastic MAB problem, if the reward distributions have bounded support $[0,1]$, the regret of UCB1 after any time $T$ satisfies
\[
R_T^{\mathrm{UCB1}} \le 8 \sum_{k:\,\Delta_k > 0} \frac{\ln T}{\Delta_k} + \Big(1 + \frac{\pi^2}{3}\Big) \sum_{k=1}^{K} \Delta_k.
\]

Notice that the above upper bound matches the order of Lai and Robbins' logarithmic lower bound in Lemma 2.1. As shown in [70], the worst-case regret of UCB1 can be derived by selecting values of $\Delta_k$ that maximize the upper bound, resulting in
\[
R_T^{\mathrm{worst}\,\mathrm{UCB1}} \le 10 \sqrt{(K-1)\, T \ln T}.
\]
Comparing this result with the lower bound on the minimax regret $\tfrac{1}{20}\sqrt{KT}$, there exists an extra factor $\sqrt{\ln T}$. This issue has been resolved by the algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) [70], which is the first algorithm that enjoys both a logarithmic distribution-dependent bound and an order-optimal distribution-independent bound. With prior knowledge of the horizon length $T$, the UCB index for MOSS is expressed as
\[
g_{k,t}^{\mathrm{MOSS}} = \hat\mu_{k, n_k(t)} + \sqrt{\frac{\max\big(\ln\frac{T}{K n_k(t)},\, 0\big)}{n_k(t)}}.
\]
We now recall the worst-case regret upper bound for MOSS.

Lemma 2.3 (Worst-case regret upper bound for MOSS [70]). For the stationary stochastic MAB problem, the worst-case regret of the MOSS algorithm satisfies
\[
R_T^{\mathrm{worst}\,\mathrm{MOSS}} \le 49 \sqrt{KT}.
\]
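The two indices differ only in their exploration terms. As a complement (not part of the original text), the following is a minimal Python sketch of how the UCB1 and MOSS indices defined above could be computed and used in a simulation; the Bernoulli reward model and all parameter values are illustrative placeholders.

```python
# Minimal sketch of the UCB1 and MOSS indices from Section 2.2.
# The Bernoulli reward model is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.4, 0.5, 0.6])       # unknown mean rewards (hypothetical)
K, T = len(mu), 20_000

def run(index_fn):
    counts = np.ones(K)              # each arm is pulled once to initialize
    sums = rng.binomial(1, mu).astype(float)
    regret = 0.0
    for t in range(K + 1, T + 1):
        means = sums / counts
        arm = int(np.argmax(index_fn(means, counts, t)))
        sums[arm] += rng.binomial(1, mu[arm])
        counts[arm] += 1
        regret += mu.max() - mu[arm]
    return regret

# g^UCB1 = mean + sqrt(2 ln t / n);  g^MOSS = mean + sqrt(max(ln(T/(K n)), 0) / n)
ucb1 = lambda means, n, t: means + np.sqrt(2 * np.log(t) / n)
moss = lambda means, n, t: means + np.sqrt(np.maximum(np.log(T / (K * n)), 0) / n)

print(f"UCB1 regret: {run(ucb1):.1f}, MOSS regret: {run(moss):.1f}")
```

Note that MOSS uses the horizon $T$ inside its exploration term, which is exactly the prior knowledge assumed in Lemma 2.3.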
2.3 Heavy-tailed Stochastic MAB

This section is a slightly modified version of our published work on heavy-tailed bandits, and it is reproduced here with the permission of the copyright holder (©2018 IEEE; reprinted with permission from [74]).

The rewards being bounded or sub-Gaussian is a common assumption that gives the sample mean an exponential convergence rate and simplifies the MAB problem. However, in many applications, such as social networks [2] and financial markets [3], the rewards are heavy-tailed. Bubeck et al. [18] relax the sub-Gaussian assumption by only assuming the rewards to have finite moments of order $1+\epsilon$ for some $\epsilon \in (0,1]$. They present the Robust UCB algorithm and show that it attains a logarithmic distribution-dependent upper bound on the regret that is within a constant factor of the lower bound in the heavy-tailed setting. However, the solutions provided in [18] are not able to provably achieve an order-optimal worst-case regret. Specifically, the factor of optimality is a poly-logarithmic function of the time horizon.

The heavy-tailed stochastic MAB problem studied in this section is the stochastic MAB problem with the following assumptions.

Assumption 2.1. Let $X$ be a random reward drawn from any arm $k \in \{1, \dots, K\}$. There exists a constant $u \in \mathbb{R}_{>0}$ such that $\mathbb{E}\big[|X|^{1+\epsilon}\big] \le u^{1+\epsilon}$ for some $\epsilon \in (0,1]$.

Assumption 2.2. Parameters $T$, $K$, $u$ and $\epsilon$ are known.

We now recall the lower bound on the minimax regret for the heavy-tailed bandit problem derived in [18].

Theorem 2.4 ([18, Th. 2]). For any fixed time horizon $T$ and the stochastic MAB problem under Assumptions 2.1 and 2.2 with $u = 1$, the worst-case regret for a uniformly good policy $\rho$ satisfies
\[
R_T^{\mathrm{worst}\,\rho} \ge 0.01\, K^{\frac{\epsilon}{1+\epsilon}}\, T^{\frac{1}{1+\epsilon}}.
\]

Remark 2.1. Since $R_T$ scales with $u$, the lower bound for the heavy-tailed bandit is $\Omega\big(u K^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}\big)$. This lower bound also indicates that, within a finite horizon $T$, it is almost impossible to differentiate the optimal arm from arm $k$ if $\Delta_k \in O\big(u (K/T)^{\frac{\epsilon}{1+\epsilon}}\big)$. As a special case, rewards with bounded support $[0,1]$ correspond to $\epsilon = 1$ and $u = 1$. Then, the lower bound $\Omega(\sqrt{KT})$ is matched by the regret upper bound achieved by MOSS.

2.3.1 A Robust Minimax Policy: Robust MOSS

In Robust MOSS, to deal with heavy-tailed reward distributions, we replace the empirical mean with a saturated empirical mean. Although the saturated empirical mean is a biased estimator, it has better convergence properties. The formal definition is given later in this section. We construct a novel UCB index to evaluate the arms, and at each time slot, the arm with the maximum UCB index is picked. Let $n_k(t)$ be the number of times that arm $k$ has been selected until time $t-1$. At time $t$, let $\hat\mu_{n_k(t)}^{k}$ be the saturated empirical mean reward computed from the $n_k(t)$ samples at arm $k$. Robust MOSS initializes by selecting each arm once and subsequently, at each time $t$, selects the arm that maximizes the following UCB index:
\[
g_{n_k(t)}^{k} = \hat\mu_{n_k(t)}^{k} + (1+\eta)\, c_{n_k(t)},
\]
where $\eta > 0$ is an appropriate constant, $c_{n_k(t)} = u \cdot \phi(n_k(t))^{\frac{\epsilon}{1+\epsilon}}$ and
\[
\phi(n) = \frac{\ln_+\!\big(\frac{T}{Kn}\big)}{n},
\]
where $\ln_+(x) := \max(\ln x, 1)$. Note that both $\phi(n)$ and $c_n$ are monotonically decreasing in $n$.

The robust saturated empirical mean is similar to the truncated empirical mean used in [18], which is employed to extend UCB1 to achieve logarithmic distribution-dependent regret for the heavy-tailed MAB problem. Let $\{X_i\}_{i \in \{1,\dots,m\}}$ be a sequence of i.i.d. random variables with mean $\mu$ and $\mathbb{E}\big[|X_i|^{1+\epsilon}\big] \le u^{1+\epsilon}$, where $u > 0$. Pick $a > 1$ and let $h(m) = a^{\lfloor \log_a(m) \rfloor + 1}$ such that $h(m) \ge m$. Define the saturation point $B_m$ by
\[
B_m := u \cdot \phi\big(h(m)\big)^{-\frac{1}{1+\epsilon}}.
\]
Then, the saturated empirical mean estimator is defined by
\[
\hat\mu_m := \frac{1}{m} \sum_{i=1}^{m} \mathrm{sat}(X_i, B_m), \tag{2.2}
\]
where $\mathrm{sat}(X_i, B_m) := \mathrm{sign}(X_i) \min\big(|X_i|, B_m\big)$. Define $d_i := \mathrm{sat}(X_i, B_m) - \mathbb{E}[\mathrm{sat}(X_i, B_m)]$, which has the following properties.

Lemma 2.5. For any $i \in \{1, \dots, m\}$, $d_i$ satisfies (i) $|d_i| \le 2 B_m$, and (ii) $\mathbb{E}[d_i^2] \le u^{1+\epsilon} B_m^{1-\epsilon}$.

Proof. Property (i) follows immediately from the definition of $d_i$, and property (ii) follows from
\[
\mathbb{E}[d_i^2] \le \mathbb{E}\big[\mathrm{sat}^2(X_i, B_m)\big] \le \mathbb{E}\big[|X_i|^{1+\epsilon}\big]\, B_m^{1-\epsilon}.
\]

Lemma 2.6 below examines the estimator bias and provides an upper bound on the error of the saturated empirical mean.
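To make the estimator concrete, here is a minimal Python sketch (not from the dissertation) of the saturated empirical mean (2.2) and the resulting Robust MOSS index for a single arm, following the definitions of $\phi$, $h(m)$, $B_m$ and $c_m$ above. The sample distribution and all parameter values are illustrative assumptions.

```python
# Minimal sketch of the saturated empirical mean (2.2) and the Robust MOSS
# index for one arm, following the definitions of phi, h, B_m and c_n above.
import numpy as np

def phi(n, T, K):
    return max(np.log(T / (K * n)), 1.0) / n          # ln_+(T/(K n)) / n

def robust_moss_index(samples, T, K, u=1.0, eps=1.0, a=1.1, eta=2.2):
    m = len(samples)
    h = a ** (np.floor(np.log(m) / np.log(a)) + 1)     # h(m) = a^(floor(log_a m)+1)
    B = u * phi(h, T, K) ** (-1.0 / (1.0 + eps))       # saturation point B_m
    sat = np.sign(samples) * np.minimum(np.abs(samples), B)
    mu_hat = sat.mean()                                # saturated empirical mean
    c = u * phi(m, T, K) ** (eps / (1.0 + eps))        # confidence width c_m
    return mu_hat + (1.0 + eta) * c

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)      # heavy-tailed samples; E[X^2] = 3, so u = 2 works
print(f"Robust MOSS index: {robust_moss_index(x, T=10_000, K=3, u=2.0):.3f}")
```

The values $a = 1.1$ and $\eta = 2.2$ mirror the parameter choices used in the numerical illustration of Section 2.3.3.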
Lemma 2.6 (Bias of saturated empirical mean). For an i.i.d. sequence of random variables   {𝑋𝑖 }𝑖∈{1,...,𝑚} such that E [𝑋𝑖 ] = 𝜇 and E 𝑋𝑖1+𝜖 ≤ 𝑢 1+𝜖 , the saturated empirical mean (2.2) satisfies 𝑚 1 Õ 𝑢 1+𝜖 𝜇ˆ 𝑚 − 𝜇 − 𝑑𝑖 ≤ 𝜖 . 𝑚 𝑖=1 𝐵𝑚 h i Proof. Since 𝜇 = E 𝑋𝑖 1{|𝑋𝑖 |≤𝐵𝑚 } + 1{|𝑋𝑖 |>𝐵𝑚 } , the error of estimator 𝜇ˆ 𝑚 satisfies 𝑚 𝑚 𝑚 1 Õ  1 Õ 1 Õ  𝜇ˆ 𝑚 − 𝜇 = sat(𝑋𝑖 , 𝐵𝑚 ) − 𝜇 = 𝑑𝑖 + E [sat(𝑋𝑖 , 𝐵𝑚 )] − 𝜇 , 𝑚 𝑖=1 𝑚 𝑖=1 𝑚 𝑖=1 where the second term is the bias of 𝜇ˆ 𝑚 . We now compute an upper bound on the bias. " # 1+𝜖 h i |𝑋𝑖 | 𝑢 1+𝜖 E [sat(𝑋𝑖 , 𝐵𝑚 )] − 𝜇 ≤ E |𝑋𝑖 | 1{|𝑋𝑖 |>𝐵𝑚 } ≤ E ≤ , (𝐵𝑚 ) 𝜖 (𝐵𝑚 ) 𝜖 which concludes the proof.  2.3.2 Analysis of Robust MOSS In this section, we analyze Robust MOSS to provide both distribution-free and distribution- dependent regret bounds.To derive the concentration property of saturated empirical mean, we use a maximal Bennett type inequality as shown in Lemma 2.7. Lemma 2.7 (Maximal Bennett’s inequality [75]). Let {𝑋𝑖 }𝑖∈{1,...,𝑛} be a sequence of bounded random variables with support [−𝐵, 𝐵], where 𝐵 ≥ 0. Suppose that E [𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] = 𝜇𝑖 and Í𝑚 Var[𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] ≤ 𝑣. Let 𝑆 𝑚 = 𝑖=1 (𝑋𝑖 − 𝜇𝑖 ) for any 𝑚 ∈ {1, . . . , 𝑛}. Then, for any 𝛿 ≥ 0  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≥ 𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≤ −𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣 where 𝜓(𝑥) = (1 + 1/𝑥) ln(1 + 𝑥) − 1. 18 Remark 2.2. For 𝑥 ∈ (0, ∞), function 𝜓(𝑥) is monotonically increasing in 𝑥. Now, we establish an upper bound on the probability that the UCB underestimates the mean at arm 𝑘 by an amount 𝑥. Lemma 2.8. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 } and 𝑥 > 0, if 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎,  the probability of event 𝑔𝑛𝑘𝑘 (𝑡) ≤ 𝜇 𝑘 − 𝑥 is no greater than    ! − 1+𝜖 𝜖 𝐾 𝑎 1 𝜓 2𝜂/𝑎 𝑥 Γ +2 . 𝑇 ln(𝑎) 𝜖 2𝑎 𝑢 Proof. It follows from Lemma 2.6 that     P 𝑔𝑛𝑘𝑘 (𝑡) ≤ 𝜇 𝑘 − 𝑥 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝜇ˆ 𝑚𝑘 + (1 + 𝜂)𝑐 𝑚 ≤ 𝜇 𝑘 − 𝑥 𝑚 𝑑𝑖𝑘   Õ 𝑢 1+𝜖 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : ≤ 𝜖 − (1 + 𝜂)𝑐 𝑚 − 𝑥 𝑖=1 𝑚 𝐵𝑚  𝑚  1 Õ 𝑘 ≤P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝑑 ≤ −𝑥 − 𝜂𝑐 𝑚 , 𝑚 𝑖=1 𝑖 where 𝑑𝑖𝑘 is defined similarly to 𝑑𝑖 for i.i.d. reward sequence at arm 𝑘 and the last inequality is due to 𝑢 1+𝜖   𝜖   𝜖 𝜖 = 𝑢 𝜙 ℎ(𝑚) 1+𝜖 ≤ 𝑢 𝜙(𝑚) 1+𝜖 = 𝑐 𝑚 . (2.3) 𝐵𝑚 Recall 𝑎 > 1. We apply a peeling argument [76, Sec 2.2] with geometric grid 𝑎 𝑠 ≤ 𝑚 < 𝑎 𝑠+1 over time interval {1, . . . , 𝑇 }. Since 𝑐 𝑚 is monotonically decreasing with 𝑚,  𝑚  1 Õ 𝑘 P ∃𝑚 ∈ {1, . . . , 𝑇 } : 𝑑 ≤ −𝑥 − 𝜂𝑐 𝑚 𝑚 𝑖=1 𝑖 Õ  Õ 𝑚  𝑠 𝑠+1 𝑘 𝑠  ≤ P ∃𝑚 ∈ [𝑎 , 𝑎 ) : 𝑑𝑖 ≤ −𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 . 𝑠≥0 𝑖=1 Also notice that 𝐵𝑚 = 𝐵𝑎 𝑠 for all 𝑚 ∈ [𝑎 𝑠 , 𝑎 𝑠+1 ). Then with properties in Lemma 2.5, we apply 19 Lemma 2.7 to get Õ  Õ 𝑚  𝑠 𝑠+1 𝑘 𝑠  P ∃𝑚 ∈ [𝑎 , 𝑎 ) : 𝑑𝑖 ≤ −𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 𝑠≥0 𝑖=1  ! © 𝑎 𝑥 + 𝜂𝑐 𝑎 𝑠+1 2𝐵𝑎 𝑠 𝑥 + 𝜂𝑐 𝑎 𝑠+1 Õ 𝑠 ≤ exp ­− ª 𝜓 ® 𝑠≥0 2𝐵𝑎 𝑠 𝑎𝑢 1+𝜖 𝐵1−𝜖𝑎𝑠 « ¬  𝜓(𝑥) is monotonically increasing   ! Õ 𝑎 𝑠 𝑥 + 𝜂𝑐 𝑎 𝑠+1 2𝜂𝐵𝜖𝑎 𝑠 𝑐 𝑎 𝑠+1 ≤ exp − 𝜓 𝑠≥0 2𝐵 𝑎 𝑠 𝑎𝑢 1+𝜖 plug in 𝑐 𝑎 𝑠+1 , 𝐵𝑎 𝑠 and use ℎ(𝑎 𝑠 ) = 𝑎 𝑠+1 )   ! Õ 𝑥 𝜓 2𝜂/𝑎 = exp −𝑎 𝑠 + 𝜂𝜙(𝑎 𝑠 ) . 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 Plugging in 𝜙(𝑎 𝑠 ), with 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 and ln+ (𝑦) ≥ ln(𝑦), we have   ! Õ ! Õ 𝑥 𝜓 2𝜂/𝑎 𝑥 𝜓 2𝜂/𝑎 𝐾 𝑠 exp −𝑎 𝑠 + 𝜂𝜙(𝑎 𝑠 ) ≤ exp −𝑎 𝑠 𝑎 . 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 𝑠≥1 𝐵 𝑎 𝑠−1 2𝑎 𝑇  𝑠 Let 𝑏 = 𝑥𝜓 2𝜂/𝑎 /(2𝑎𝑢). Since 𝐵𝑎 𝑠−1 ≤ 𝑢𝑎 1+𝜖 , we have ! Õ 𝑠 𝑥 𝜓 2𝜂/𝑎 𝐾 𝑠 𝐾Õ 𝑠  𝜖𝑠  exp −𝑎 𝑎 ≤ 𝑎 exp −𝑏𝑎 1+𝜖 𝑠≥1 𝐵𝑎 𝑠−1 2𝑎 𝑇 𝑇 𝑠≥1 ∫ +∞ 𝐾 ( 𝑦−1) 𝜖  ≤ 𝑎 𝑦 exp − 𝑏𝑎 1+𝜖 𝑑𝑦 𝑇 1 ∫ +∞ 𝐾 𝑦𝜖  = 𝑎 𝑎 𝑦 exp − 𝑏𝑎 1+𝜖 𝑑𝑦 𝑇 0  𝑦𝜖  where we set 𝑧 = 𝑏𝑎 1+𝜖 ∫ +∞ 𝐾 𝑎 1 + 𝜖 − 1+𝜖 1+𝜖 𝑧 𝜖 −1 exp − 𝑧 𝑑𝑧  = 𝑏 𝜖 𝑇 ln(𝑎) 𝜖   𝑏 𝐾 𝑎 1 1+𝜖 ≤ Γ + 2 𝑏− 𝜖 , 𝑇 ln(𝑎) 𝜖 which concludes the proof.  
The following is a straightforward corollary of Lemma 2.8. Corollary 2.9. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 } and 𝑥 > 0, if 𝜂𝜓(2𝜂/𝑎) ≥  2𝑎, the probability of event 𝑔𝑛𝑘𝑘 (𝑡) − 2(1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) ≥ 𝜇 𝑘 + 𝑥} shares the same bound in Lemma 2.8. 20 The distribution-free upper bound for Robust MOSS, which is the main result for the paper, is presented in this section. We show that the algorithm achieves order optimal worst-case regret. Theorem 2.10. For the heavy-tailed stochastic MAB problem with 𝐾 arms and time horizon 𝑇, if 𝜂 and 𝑎 are selected such that 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎, then Robust MOSS satisfies 𝜖 1 worst Robust MOSS 𝑅𝑇 ≤ 𝐶𝑢𝐾 1+𝜖 (𝑇/𝑒) 1+𝜖 + 2𝑢𝐾, where   1    1+𝜖 𝐶 =Γ 1/𝜖 + 2 𝑎/ 6 + 3𝜂 𝜖 3/𝜓 6 + 3𝜂 𝜖  − 1   1+𝜖  −𝜖  + 𝜖Γ 1/𝜖 + 2 6 + 3𝜂 𝜖 6𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎) + 6 + 3𝜂 𝑒 + (1 + 𝜖)𝑒 1+𝜖 . Remark 2.3. Parameter 𝑎 and 𝜂 as inputs to Robust MOSS can be selected by minimizing the leading constant 𝐶 in the upper bound on the regret in Theorem 2.10. We have found that selecting 𝑎 slightly larger than 1 and selecting smallest 𝜂 that satisfies 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 yields good performance. Proof. Since both the UCB and the regret scales with 𝑢 defined in Assumption 2.1, to simplify the expressions, we assume 𝑢 = 1. Also notice that Assumption 2.1 indicates 𝜇 𝑘 ≤ 𝑢, so Δ 𝑘 ≤ 2 for any 𝑘 ∈ {1, . . . , 𝐾 }. In the following, any terms with superscript or subscript “∗" and “𝑘" are with respect to the best and the 𝑘-th arm, respectively. The proof is divided into 4 steps. Step 1: We follow a decoupling technique inspired by the proof of regret upper bound in MOSS [70]. Take the set of 𝛿-bad arms as B𝛿 as B𝛿 := {𝑘 ∈ {1, . . . , 𝐾 } | Δ 𝑘 > 𝛿}, (2.4)   𝜖 where we assign 𝛿 = 6 + 3𝜂 𝑒𝐾/𝑇 1+𝜖 . Thus, 𝐾 " 𝑇 # Õ Õ  𝑅𝑇 ≤ 𝑇 𝛿 + Δ𝑘 + E 1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 𝑡=1 𝑡=𝐾+1 " 𝑇 # Õ  ≤ 𝑇 𝛿 + 2𝐾 + E 1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 . (2.5) 𝑡=𝐾+1 21 Furthermore, we make the following decomposition 𝑇 𝑇   Õ Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗   1{𝜑𝑡 ∈ B𝛿 } Δ𝜑𝑡 − 𝛿 = 1 𝜑𝑡 ∈ ≤𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 (2.6) 𝑡=𝐾+1 𝑡=𝐾+1 3 𝑇   Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗  + 1 𝜑𝑡 ∈ >𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 . 𝑡=𝐾+1 3 Notice that the first summand (2.6) describes regret from underestimating optimal arm ∗. For the ≥ 𝑔𝑛∗∗ (𝑡) and 𝜇∗ = 𝜇 𝜑𝑡 + Δ𝜑𝑡 , 𝜑 second summand, since 𝑔𝑛 𝑡 (𝑡) 𝜑𝑡 𝑇   Õ Δ𝜑 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗  1 𝜑𝑡 ∈ >𝜇 − 𝑡 Δ𝜑𝑡 − 𝛿 𝑡=𝐾+1 3 𝑇   Õ 𝜑 2Δ𝜑𝑡 ≤ 1 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛 𝑡 (𝑡) > 𝜇 𝜑𝑡 + Δ𝜑𝑡 𝑡=𝐾+1 𝜑𝑡 3 𝑇   Õ Õ 𝑘 2Δ 𝑘 = 1 𝜑𝑡 = 𝑘, 𝑔𝑛 𝑘 (𝑡) > 𝜇 𝑘 + Δ𝑘 , (2.7) 𝑘∈B 𝑡=𝐾+1 3 𝛿 which characterizes the regret caused by overestimating 𝛿-bad arms. n o Step 2: In this step, we bound the expectation of (2.6). When event 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛∗ (𝑡) ≤ 𝜇 − Δ𝜑𝑡 /3 ∗ ∗ happens, we know 𝛿 Δ𝜑 ≤ 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) and 𝑔𝑛∗∗ (𝑡) < 𝜇∗ − . 3 Thus, we get     Δ𝜑𝑡 ∗ 𝛿 B𝛿 , 𝑔𝑛∗∗ (𝑡) ∗ ∗ × 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) − 𝛿 := 𝑌𝑡  1 𝜑𝑡 ∈ ≤𝜇 − (Δ𝜑𝑡 − 𝛿) ≤ 1 𝑔𝑛∗ (𝑡) < 𝜇 − 3 3 Since 𝑌𝑡 is a positive random variable, its expected value can be computed involving only its cumulative density function: ∫ +∞ ∫ +∞   E [𝑌𝑡 ] = P (𝑌𝑡 > 𝑥) 𝑑𝑥 ≤ P 3𝜇∗ − 3𝑔𝑛∗∗ (𝑡) − 𝛿 > 𝑥 𝑑𝑥 ∫0 +∞  0  𝑥 = P 𝜇∗ − 𝑔𝑛∗∗ (𝑡) > 𝑑𝑥. 𝛿 3 Then we apply Lemma 2.8 at optimal arm ∗ to get ∫ +∞ 𝐾𝐶1 1 − 1+𝜖 𝐾𝐶1 E [𝑌𝑡 ] ≤ 𝑥 𝜖 𝑑𝑥 = 1 𝑇 𝛿 𝜖 𝑇𝛿 𝜖 22   1+𝜖 where 𝐶1 = 𝜖Γ 1/𝜖 + 2 6𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎). We conclude this step by 𝑇   𝑇 h Õ Δ𝜑 i Õ 1 E 1 𝜑𝑡 ∈ B𝛿 , 𝑔𝑛∗∗ (𝑡) ≤𝜇 − 𝑡∗ Δ𝜑𝑡 − 𝛿 ≤ 𝑌𝑡 ≤ 𝐶1 𝐾𝛿− 𝜖 . 𝑡=𝐾+1 3 𝑡=𝐾+1 Step 3: In this step, we bound the expectation of (2.7). 
For each arm 𝑘 ∈ B𝛿 , 𝑇   Õ 2Δ 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝜇𝑘 + 𝑡=𝐾+1 3 𝑇 Õ 𝑡−𝐾   Õ  2Δ 𝑘 = 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) = 𝑚 1 𝑔𝑚𝑘 ≥ 𝜇𝑘 + 𝑡=𝐾+1 𝑚=1 3 𝑇−𝐾 Õ   𝑇 𝑘 2Δ 𝑘 Õ  = 1 𝑔𝑚 ≥ 𝜇 𝑘 + 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) =𝑚 𝑚=1 3 𝑡=𝑚+𝐾 𝑇   Õ 𝑘 2Δ 𝑘 ≤ 1 𝑔𝑚 ≥ 𝜇 𝑘 + 𝑚=1 3 𝑇  Õ 𝑚  Õ 1 𝑘 2Δ 𝑘 ≤ 1 𝑑𝑖 ≥ − (2 + 𝜂)𝑐 𝑚 , (2.8) 𝑚=1 𝑚 𝑖=1 3 where in the last inequality we apply Lemma 2.6 and use the fact that 𝑢 1+𝜖 /𝐵𝑚 𝜖 ≤ 𝑐 in (2.3). We 𝑚 set !   1+𝜖   1+𝜖  6 + 3𝜂 𝜖 𝑇 Δ 𝑘 𝜖  𝑙 𝑘 =  ln .  Δ𝑘 𝐾 6 + 3𝜂     With Δ 𝑘 ≥ 𝛿, we get 𝑙 𝑘 is no less than   1+𝜖    1+𝜖    1+𝜖 6 + 3𝜂 𝜖 𝑇 𝛿 𝜖 6 + 3𝜂 𝜖 ln = . Δ𝑘 𝐾 6 + 3𝜂 Δ𝑘 Furthermore, since 𝑐 𝑚 is monotonically decreasing with 𝑚, for 𝑚 ≥ 𝑙 𝑘 , " 𝑇 Δ𝑘  1+𝜖 𝜖 𝜖  # 1+𝜖 ln+ 𝐾 6+3𝜂 Δ𝑘 𝑐𝑚 ≤ 𝑐𝑙𝑘 ≤ ≤ . (2.9) 𝑙𝑘 6 + 3𝜂 With this result and 𝑙 𝑘 ≥ 1, we continue from (2.8) to get Õ𝑇  Õ 𝑚  𝑇  Õ 𝑚  1 𝑘 2Δ 𝑘 Õ 1 𝑘 2Δ 𝑘 E 1 𝑑 ≥ − (2 + 𝜂)𝑐 𝑚 ≤𝑙 𝑘 − 1 + P 𝑑 ≥ − (2 + 𝜂)𝑐 𝑚 𝑚=1 𝑚 𝑖=1 𝑖 3 𝑚=𝑙 𝑘 𝑚 𝑖=1 𝑖 3 𝑇  Õ 𝑚  Õ 1 𝑘 Δ𝑘 ≤𝑙 𝑘 − 1 + P 𝑑𝑖 ≥ (2.10) 𝑚=𝑙 𝑚 𝑖=1 3 𝑘 23 Therefore by using Lemma 2.7 together with statement (ii) from Lemma 2.5, we get 𝑇 ( 𝑚 ) 𝑇   Õ 𝑇   Õ 1 Õ 𝑘 Δ𝑘 Õ 𝑚Δ 𝑘 𝜖  𝑚Δ 𝑘  P 𝑑 ≥ ≤ exp − 𝜓 𝐵𝑚 Δ 𝑘 ≤ exp − 𝜓 6 + 3𝜂 , 𝑚=𝑙 𝑚 𝑖=1 𝑖 3 𝑚=𝑙 3𝐵𝑚 𝑚=𝑙 3𝐵𝑚 𝑘 𝑘 𝑘 where the last step is due to that 𝜓(𝑥) is monotonically increasing and 𝐵𝑚 𝜖 Δ ≥ (6+3𝜂)𝐵 𝜖 𝑐 ≥ 6+3𝜂 𝑘 𝑚 𝑚 1  − 1+𝜖 1 1 from (2.9) and (2.3). Since 𝐵𝑚 = 𝜙 ℎ(𝑚) ≤ 𝜙(𝑎𝑚) − 1+𝜖 ≤ (𝑎𝑚) 1+𝜖 , we have 𝑇   Õ 𝑇   Õ 𝑚Δ 𝑘  𝜖 1 − 1+𝜖  Δ𝑘 exp − 𝜓 6 + 3𝜂 ≤ exp −𝑚 𝑎 1+𝜖 𝜓 6 + 3𝜂 . 𝑚=𝑙 𝑘 3𝐵𝑚 𝑚=1 3 ∫ +∞   𝜖 ≤ exp −𝛽𝑦 1+𝜖 𝑑𝑦 0 1 𝜖 where we set 𝛽 = 𝑎 − 1+𝜖 𝜓 6 + 3𝜂 Δ 𝑘 /3. Taking 𝑧 = 𝛽𝑦 1+𝜖 , we obtain  ∫ +∞ ∫ +∞    𝜖  1 + 𝜖 − 1+𝜖 1+𝜖 1 1+𝜖 exp −𝛽𝑦 1+𝜖 𝑑𝑦 = 𝛽 𝜖 𝑧 𝜖 −1 exp (−𝑧) 𝑑𝑦 = Γ + 2 𝛽− 𝜖 . 0 𝜖 0 𝜖 Plugging it into (2.10), Õ 𝑇  Õ 𝑚   𝑇 1+𝜖  1 𝑘 2Δ 𝑘 − 1+𝜖 − 1+𝜖 E 1 𝑑𝑖 ≥ − (2 + 𝜂)𝑐 𝑚 ≤ 𝐶2 Δ 𝑘 𝜖 + 𝐶3 Δ 𝑘 𝜖 ln Δ𝑘 𝜖 𝑚=1 𝑚 𝑖=1 3 𝐾𝐶3  1    1+𝜖  1+𝜖 where 𝐶2 = Γ 1/𝜖 + 2 𝑎 𝜖 3/𝜓 6 + 3𝜂 𝜖 and 𝐶3 = 6 + 3𝜂 𝜖 . Putting it together with Δ 𝑘 ≥ 𝛿 for all 𝑘 ∈ B𝛿 , the expectation of (2.7) is no greater than   −1 −1 𝑇 1+𝜖 −𝜖 Õ 1 1 𝐶2 Δ 𝑘 𝜖 + 𝐶3 Δ 𝑘 𝜖 ln Δ𝑘 𝜖 ≤ 𝐶2 𝐾𝛿− 𝜖 + (1 + 𝜖)𝑒 1+𝜖 𝐶3 𝐾𝛿− 𝜖 , 𝑘∈B 𝛿 𝐾𝐶3 1 1+𝜖 where we use the fact that 𝑥 − 𝜖 ln 𝑇𝑥  𝜖 /(𝐾𝐶3 ) takes its maximum at 𝑥 = 𝛿 exp(𝜖 2 /(1 + 𝜖)). Step 4: Plugging the results in step 2 and step 3 into (2.5), h −𝜖 i 1 worst Robust MOSS 𝑅𝑇 ≤ 𝑇 𝛿 + 𝐶1 + 𝐶2 + (1 + 𝜖)𝑒 1+𝜖 𝐶3 𝐾𝛿− 𝜖 + 2𝐾. Straightforward calculation concludes the proof.  We now show that robust MOSS also preserves a logarithmic upper bound on the distribution- dependent regret. 24 Theorem 2.11. For the heavy-tailed stochastic MAB problem with 𝐾 arms and time horizon 𝑇, if 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎, the regret 𝑅𝑇 for Robust MOSS is no greater than "   # Õ  𝑢 1+𝜖  𝜖1 𝑇  Δ 𝑘  1+𝜖𝜖 𝐶1 ln + 𝐶2 𝐾 + Δ 𝑘 , 𝑘:Δ >0 Δ 𝑘 𝐾𝐶 1 𝑢 𝑘  1+𝜖   1+𝜖  where 𝐶1 = 4 + 4𝜂 𝜖 and 𝐶2 = max 𝑒𝐶1 , 2Γ(1/𝜖 + 2) 8𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎) .   𝜖 Proof. Let 𝛿 = 4 + 4𝜂 𝑒𝐾/𝑇 1+𝜖 and define B𝛿 the same as (2.4). Since Δ 𝑘 ≤ 𝛿 for all 𝑘 ∉ B𝛿 , the regret satisfies Õ Õ 𝑇 𝑅𝑇Robust MOSS ≤ 𝑇Δ 𝑘 + 1{𝜑𝑡 ∈ B𝛿 }Δ𝜑𝑡 𝑘∉B 𝛿 𝑡=1   1+𝜖 𝑇 Õ 4 + 4𝜂 𝜖 Õ Õ ≤ 𝑒𝐾 Δ𝑘 + 1{𝜑𝑡 = 𝑘 }Δ 𝑘 . (2.11) 𝑘∉B 𝛿 Δ𝑘 𝑘∈B 𝛿 𝑡=1 Pick arbitrary 𝑙 𝑘 ∈ Z+ , thus Õ 𝑇 Õ 𝑇  1{𝜑𝑡 = 𝑘 } ≤ 𝑙 𝑘 + 1 𝜑𝑡 = 𝑘, 𝑛 𝑘 (𝑡) ≥ 𝑙 𝑘 𝑡=1 𝑡=𝐾+1 Õ 𝑇 n o ≤ 𝑙𝑘 + 1 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) , 𝑛 𝑘 (𝑡) ≥ 𝑙𝑘 . 𝑡=𝐾+1 Observe that 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) implies at least one of the following is true 𝑔𝑛∗∗ (𝑡) ≤ 𝜇∗ − Δ 𝑘 /4, (2.12) 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝜇 𝑘 + Δ 𝑘 /4 + 2(1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) , (2.13) (1 + 𝜂)𝑐 𝑛 𝑘 (𝑡) > Δ 𝑘 /4. (2.14) We select !   1+𝜖   1+𝜖  4 + 4𝜂 𝜖 𝑇 Δ𝑘 𝜖  𝑙 𝑘 =  ln .  Δ𝑘 𝐾 4 + 4𝜂     Similarly as (2.9), 𝑛 𝑘 (𝑡) ≥ 𝑙 𝑘 indicates 𝑐 𝑛 𝑘 (𝑡) ≤ Δ 𝑘 /(4 + 4𝜂), so (2.14) is false. 
Then we apply Lemma 2.8 and Corollary 2.9, n o 𝐶20 𝐾 − 1+𝜖 P 𝑔𝑛𝑘𝑘 (𝑡) ≥ 𝑔𝑛∗∗ (𝑡) , 𝑛 𝑘 (𝑡) ≥ 𝑙𝑘 ≤ P ((2.12) or (2.13) is true ) ≤ Δ 𝜖 , 𝑇 𝑘 25  1+𝜖 where 𝐶20 = 2Γ 1/𝜖 + 2 8𝑎/𝜓(2𝜂/𝑎) 𝜖 𝑎/ln(𝑎). Substituting it into (2.11),    Õ 𝑒𝐶1 𝐾 Õ  𝐶1  𝑇 1+𝜖  𝐶 0𝐾  2 𝑅𝑇Robust MOSS ≤ + Δ𝑘 + 1 + Δ𝑘  . 𝜖  1  1 ln 𝐾𝐶1 Δ 𝑘𝜖 𝑘∈B 𝛿  Δ 𝜖 Δ 𝑘𝜖   𝑘∉B 𝛿   𝑘  Considering the scaling factor 𝑢, the proof can be concluded with easy computation.  2.3.3 Numerical Illustration of Robust MOSS In this section, we compare Robust MOSS with MOSS and Robust UCB (with truncated empirical mean or Catoni’s estimator) [18] in a 3-armed heavy-tailed bandit setting. The mean rewards are 𝜇1 = −0.3, 𝜇2 = 0 and 𝜇3 = 0.3 and sampling at each arm 𝑘 returns a random reward equals to 𝜇 𝑘 added by sampling noise 𝜈, where |𝜈| is a generalized Pareto random variable and the sign of 𝜈 has equal probability to be positive and negative. The PDF of reward at arm 𝑘 is ! − 𝜉1 −1 1 𝜉 𝑥 − 𝜇𝑘 𝑓 𝑘 (𝑥) = 1+ for 𝑥 ∈ (−∞, +∞), 2𝜎 𝜎 where we select 𝜉 = 0.33 and 𝜎 = 0.32. Thus, for a random reward 𝑋 from any arm, we know E [𝑋 2 ] ≤ 1, which means 𝜖 = 1 and 𝑢 = 1. We select parameters 𝑎 = 1.1 and 𝜂 = 2.2 for Robust MOSS so that condition 𝜂𝜓(2𝜂/𝑎) ≥ 2𝑎 is met. Figure.2.1 shows the mean regret together with quantiles of regret distribution as a function of time, which are computed using 200 simulations of each policy. On each graph, the bold curve is the emprical mean regret while light shaded and dark shaded regions correspond respectively to upper 5% and lower 95% quantile cumulative regrets. The simulation result shows that there is a chance MOSS loses stability in heavy-tailed MAB and suffers linear regret while other algorithms work consistently and maintain sub-linear regrets. Robust MOSS slightly outperforms Robust UCB in this specific problem. 2.4 Summary We reviewed the stationary stochastic MAB problem and concepts including regret and regret lower bound. Specially, we studied the heavy-tailed bandit problem and proposed the Robust 26 Figure 2.1: Comparison of Robust MOSS with MOSS and other Robust UCB algorithms. MOSS algorithm. We evaluate it by deriving upper bounds on the associated distribution-free and distribution-dependent regrets. Our analysis shows that Robust MOSS achieves order optimal performance in both scenarios. It can be noticed that the saturated mean estimator centers at zero so that the algorithm is not translation invariant. Exploration of translation invariant robust mean estimator in this context remains an open problem. 2.5 Bibliographic Remarks Since the seminal work by Lai and Robbins [15], several subsequent works design simpler al- gorithms by assuming that the rewards are bounded or more generally, sub-Gaussian. By using Kullback-Leibler(KL) divergence-based uncertainty estimates, Garivier and Cappé [17] designed KL-UCB and proved that it strictly dominates UCB1 [16], which uses Hoeffding inequality-based uncertainty estimates. Aside from the nonBayesian policies mentioned above, Bayesian strategies have also been proved to be effective for MAB problems. The Bayes-UCB algorithm by Kaufmann et al. [77] is the first Bayesian algorithm proved to be asymptotic optimal. Thompson sampling [1], 27 proposed in 1933, has long been shown to perform very well in practice. Very recently, asymp- totic and finite-time performance guarantees that are very close to the optimal have been proved for Thompson sampling [78, 79]. 
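As a companion to the numerical study in Section 2.3.3, the following minimal sketch checks the condition ηψ(2η/a) ≥ 2a for the choice a = 1.1 and η = 2.2, and draws rewards from the signed generalized Pareto model used there. The helper names are mine, and the inverse-CDF sampler is a standard construction rather than code from the original experiments.

```python
import math
import numpy as np

def psi(x):
    # psi(x) = (1 + 1/x) * ln(1 + x) - 1, as in Lemma 2.7.
    return (1.0 + 1.0 / x) * math.log(1.0 + x) - 1.0

a, eta = 1.1, 2.2
print("eta * psi(2*eta/a) >= 2*a:", eta * psi(2 * eta / a) >= 2 * a)

def heavy_tailed_reward(mu_k, size, xi=0.33, sigma=0.32, rng=None):
    # Reward = mu_k + nu, where |nu| is generalized Pareto(xi, sigma) and the sign
    # of nu is +1 or -1 with equal probability (the Section 2.3.3 setup).
    rng = rng or np.random.default_rng()
    U = rng.random(size)
    gpd = sigma / xi * ((1.0 - U) ** (-xi) - 1.0)   # inverse-CDF sample of the GPD
    signs = rng.choice([-1.0, 1.0], size=size)
    return mu_k + signs * gpd

X = heavy_tailed_reward(0.3, 500_000, rng=np.random.default_rng(1))
print("empirical E[X^2]:", np.mean(X ** 2))          # stays below u^2 = 1
```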
Other effective policies include 𝜖-greedy [16]and deterministic sequencing of exploration and exploitation [25, 40, 41]. However, both these classes of algorithms require the knowledge of a lower bound on the minimum gap in mean rewards. In the context of minimizing the worst-case regret, Ménard and Garivier [80] adapted the MOSS algorithm with KL divergence-based uncertainty estimates and proposed a minimax algorithm kl- UCB++ that improves the algorithm performance. It also needs to be noticed that MOSS requires knowing horizon length, so it is not an anytime algorithm. Degenne and Perchet [81] extend MOSS to an any-time version called MOSS-anytime that can adapt to different horizon lengths. 28 CHAPTER 3 PIECE-WISE STATIONARY STOCHASTIC BANDITS In nonstationary stochastic MAB problem, the reward sequence {𝑋𝑡𝑘 }𝑇𝑡=1 at each arm 𝑘 ∈ {1, . . . , 𝐾 } is composed of independent samples from time-varying reward distributions { 𝑓𝑡𝑘 }𝑇𝑡=1 . Piece-wise stationary MAB is a special type of the nonstationary bandit problem in which 𝑓𝑡𝑘 switches at unknown time instants referred as breakpoints. Between consecutive breakpoints, 𝑓𝑡𝑘 remains the same for any 𝑘 ∈ {1, . . . , 𝐾 }. In this chapter, we assume each 𝑓𝑡𝑘 has bounded support [0, 1], and the total number of breakpoints until time 𝑇 is Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ), where 𝜈 ∈ [0, 1) and is known a priori. Similarly as the stationary Stochastic MAB problem, the decision-maker’s objective is to select  Í𝑇 𝜑𝑡  arms 𝜑𝑡 , . . . , 𝜑𝑇 that maximizes the expected of cumulative reward E 𝑡=1 𝑡 . Let the time- 𝑋 varying mean reward associated with arm 𝑘 be 𝜇𝑡𝑘 at time 𝑡 ∈ {1, . . . , 𝑇 }. Then, for the nonstationary MAB problem, the regret for a policy 𝜌 can be defined by 𝑇 " 𝑇 # " 𝑇 # Õ Õ Õ 𝜇𝑡∗ − E 𝜇𝑡∗ − 𝜇𝑡 𝑡 , 𝜌 𝜑 𝜑 𝑅𝑇 := 𝑋𝑡 𝑡 = E 𝜌 (3.1) 𝑡=1 𝑡=1 𝑡=1 where 𝜇𝑡∗ = max 𝑘∈{1,...,𝐾 } 𝜇𝑡𝑘 and the expectation is with respect to different realization of {𝜑𝑡 }𝑇𝑡=1 that depends on obtained rewards through policy 𝜌. This chapter is a slightly modified version of our published work on piece-wise stationary stochastic bandits, and it is reproduced here with the permission of the copyright holder1. In the following sections, two generic algorithms, namely Limited-Memory DSEE (LM-DSEE) algo- rithm and the Sliding-Window UCB# (SW-UCB#) algorithm, are presented and analyzed. These algorithms require parameters to be tuned based on environment characteristics. 3.1 Preliminaries We first recall two MAB policies since algorithms proposed in this chapter are developed based upon them. The first algorithm is Deterministic Sequencing of Exploration and Exploitation (DSEE) [40]. 1 ©2018 IEEE. Reprinted with permission from [82]. 29 It divides the set of natural numbers N into interleaving blocks of exploration and exploitation. In the exploration block all arms are played in a round-robin fashion, while in the exploitation block, the arm with the maximum statistical mean reward is played. For an appropriately defined 𝑤 ∈ R>0 , the DSEE algorithm at time 𝑡 exploits if the number of exploration steps until time 𝑡 − 1 are greater than or equal to 𝐾 d𝑤 log 𝑡e, otherwise it starts a new exploration block. Vakili et al. showed that the DSEE algorithm achieves efficient performance. It should be noted that tuning 𝑤 requires knowledge of a lower bound on the gap between the mean reward from the best arm and the second-best arm. This requirement can be relaxed at the cost of degraded performance. The second policy is Sliding-Window UCB (SW-UCB) [20]. 
It is a variation of UCB1 [16] that intends to solve the piecewise-stationary bandits. A sliding observation window is used to erase the outdated sampling history, and the UCB index is computed within it. Since the size of the sliding observation window in SW-UCB depends on the horizon length, it requires knowledge of the horizon length of the problem. The SW-UCB# proposed in this chapter intends to relax this assumption and enable the policy to adapt to different horizon lengths. 3.2 The LM-DSEE Algorithm The LM-DSEE algorithm comprises interleaving blocks of exploration and exploitation. In the 𝑛-th exploration epoch, each arm is sampled 𝐿 (𝑛) = d𝛾 ln(𝑛 𝜚 𝑙𝑏)e number of times. In the 𝑛-th exploitation epoch, the arm with the highest sample mean in the 𝑛-th exploration epoch is sampled d𝑎𝑛 𝜚 𝑙e − 𝐾 𝐿 (𝑛) times. Here, the parameters 𝜚, 𝛾, 𝑎, 𝑏, and 𝑙 are tuned based on the environment characteristics (see Algorithm 1 for details). In the following, we will set 𝑎 and 𝑏 to unity for the purposes of analysis. The LM-DSEE algorithm is similar in spirit to the DSEE algorithms [25, 40], wherein the length of the exploitation epoch increases exponentially with the epoch number, and all the data collected in the previous exploration epochs are used to estimate mean rewards. However, in a non-stationary environment using all the rewards from the previous exploration epochs may lead to a heavily biased estimate of mean rewards. Furthermore, an exponentially increasing exploitation 30 Algorithm 1: The LM-DSEE Algorithm Input : 𝜈 ∈ [0, 1),Δmin ∈ (0, 1), 𝑇 ∈ N,𝑎 ∈ R>0 , 𝑏 ∈ (0, 1]; Set : 𝛾 ≥ Δ22 , 𝑎𝑙 ∈ {𝐾 d𝛾 ln 𝑙𝑏e, . . . , +∞}, and 𝜚 = 1−𝜈 1+𝜈 ; min Output : sequence of arm selection; % Initialization: 1 Set batch index 𝑛 ← 1 and 𝑡 ← 1; 2 while 𝑡 ≤ 𝑇 do % Exploration 3 for 𝑘 ∈ {1, . . . , 𝐾 } do Pick arm 𝑘, 𝐿(𝑛) ← d𝛾 ln(𝑛 𝜚 𝑙𝑏)e times ; collect rewards {𝑋𝑖𝑘 (𝑛)}𝑖 ∈ {1,...,𝐿 (𝑛) } ; epch 1 Í 𝐿 (𝑛) compute sample mean 𝜇¯ 𝑘 (𝑛) ← 𝐿 (𝑛) 𝑖=1 𝑋𝑖𝑘 (𝑛); % Exploitation epch epch 4 Select the best arm 𝜑 𝑛 = arg max 𝑘 ∈ {1,...,𝐾 } 𝜇¯ 𝑘 (𝑛) ; epch 5 Pick arm 𝜑 𝑛 , d𝑎𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) times ; 6 Update 𝑡 ← 𝑡 + d𝑎𝑛 𝜚 𝑙e and batch index 𝑛 ← 𝑛 + 1; epoch length may lead to excessive exploitation based on an outdated estimate of the mean rewards. To address these issues, we modify the DSEE algorithm by using only the rewards from the current exploration epoch to estimate the mean rewards, and we increase the length of the exploitation epoch using a power law instead of an exponential function. 3.3 Analysis of the LM-DSEE Algorithm Before we analyze the LM-DSEE algorithm, we introduce the following notation for the piece-wise stationary environment. Let Δ 𝑘 = max{𝜇𝑡∗ − 𝜇𝑡𝑘 | 𝑡 ∈ {1, . . . , 𝑇 }}, Δmax = max{Δ 𝑘 | 𝑘 ∈ {1, . . . , 𝐾 }}, and Δmin = min{𝜇𝑡∗ − 𝜇𝑡𝑘 | 𝑡 ∈ {1, . . . , 𝑇 }, 𝑘 ∈ {1, . . . , 𝐾 }, 𝜇𝑡∗ − 𝜇𝑡𝑘 > 0}. Theorem 3.1 (Regret Upper Bound for LM-DSEE). For piece-wise stationary environment with 31 number of breakpoints Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ) and 𝜈 ∈ [0, 1), the regret for the LM-DSEE algorithm satisfies 1+𝜈 𝑅𝑇LM-DSEE ∈ 𝑂 (𝑇 2 ln 𝑇). Proof. Let 𝑁 be the index of the epoch containing the time-instant 𝑇, then the length of each epoch is at most d𝑁 𝜚 𝑙e. Since breakpoints are located in at most Υ𝑇 epochs, we can upper bound the regret from epochs containing breakpoints by 𝑅b ≤ Υ𝑇 d𝑁 𝜚 𝑙eΔmax . In the epochs containing no breakpoint, let 𝑅e and 𝑅i denote, respectively, the regret from exploration and exploitation epochs. Note that in such epochs, the mean reward from each arm does not change. 
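As a side note to Algorithm 1 above, a minimal single-run sketch of LM-DSEE in Python is given below. It assumes a reward oracle pull(k, t) for the nonstationary environment, sets a and b to unity by default as in the analysis, and adds a small guard on the logarithm for early epochs; the default value of l is illustrative and should in general satisfy the admissibility condition from the pseudo-code.

```python
import math
import numpy as np

def lm_dsee(pull, K, T, nu, delta_min, l=10, a=1.0, b=1.0):
    # Sketch of Algorithm 1. pull(k, t) returns the reward of arm k at time t.
    # gamma >= 2 / delta_min^2 and rho = (1 - nu) / (1 + nu) follow the pseudo-code;
    # l should satisfy a * l >= K * ceil(gamma * ln(l * b)).
    gamma = 2.0 / delta_min ** 2
    rho = (1.0 - nu) / (1.0 + nu)
    t, n, choices = 1, 1, []
    while t <= T:
        L_n = max(math.ceil(gamma * math.log(n ** rho * l * b)), 1)
        means = np.zeros(K)
        for k in range(K):                       # exploration: sample every arm L(n) times
            rewards = []
            for _ in range(L_n):
                if t > T:
                    return choices
                rewards.append(pull(k, t)); choices.append(k); t += 1
            means[k] = np.mean(rewards)
        best = int(np.argmax(means))             # exploitation: play the epoch's best arm
        for _ in range(math.ceil(a * n ** rho * l) - K * L_n):
            if t > T:
                return choices
            pull(best, t); choices.append(best); t += 1
        n += 1
    return choices
```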
In the 𝑛-th epoch with no breakpoint, we denote the maximum mean reward by ∗ 𝜇no-break (𝑛) and the set of arms with maximum mean reward by Kno-break ∗ (𝑛). Then, the regret in exploration epochs 𝑅e satisfies, Õ 𝑁 Õ 𝐾 Õ 𝐾 𝜚 𝜚 𝑅e ≤ d𝛾 ln(𝑛 𝑙)eΔ 𝑘 ≤ 𝑁 d𝛾 ln(𝑁 𝑙)e Δ𝑘 . 𝑛=1 𝑘=1 𝑘=1 In exploitation epochs, regret is incurred if a sub-optimal arm is selected, and consequently, the regret in exploitation epochs 𝑅i satisfies Õ𝑁 Õ 𝐾 ∗   epch 𝑅i ≤ d𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) P(𝜑𝑛 = 𝑘 ∉ Kno-break (𝑛))Δ 𝑘 , (3.2) 𝑛=1 𝑘=1 epch where 𝜑𝑛 is the arm selected in the 𝑛-th exploitation epoch. It follows from the Chernoff-Hoeffding inequality [83, Theorem 1] that epch epch  epch epch  P 𝜇¯ 𝑘 (𝑛) ≥ 𝜇 𝑘 (𝑛) + 𝛿 = P 𝜇¯ 𝑘 (𝑛) ≤ 𝜇 𝑘 (𝑛) − 𝛿 = exp(−2𝛿2 𝐿 (𝑛)), epch where 𝜇 𝑘 (𝑛) is the mean reward of arm 𝑘 in the 𝑛-th epoch and 𝐿 (𝑛) is the number of times an arm is selected in the 𝑛-th exploration epoch. Thus, we take 𝑗 ∗ ∈ Kno-break ∗ (𝑛) and get epch ∗  P 𝜑𝑛 = 𝑘 ∉ Kno-break (𝑛) epch epch Δmin  epch ∗ Δmin  ≤P 𝜇¯ 𝑘 (𝑛) ≥ 𝜇 𝑘 (𝑛) + + P 𝜇¯ 𝑗 ∗ (𝑛) ≤ 𝜇no-break (𝑘) − 2 2  Δ2min  ≤2 exp − 𝛾 ln(𝑛 𝜚 𝑙) . 2 32 epch ∗ (𝑛) ≤ 2(𝑛 𝜚 𝑙) −1 . Substituting it into (3.2), we have  Since 𝛾 ≥ Δ22 , P 𝜑𝑛 = 𝑘 ∉ Kno-break min Í 𝑅i ≤ 2𝑁 𝐾𝑘=1 Δ 𝑘 since d𝑛 𝜚 𝑙e − 𝐾 𝐿(𝑛) < 𝑛 𝜚 𝑙. Furthermore, it can be seen that 𝑙 𝑙 (𝑁 − 1) 1+𝜚 − 𝑁 ≤ 𝑇 ≤ (𝑁 + 1) 1+𝜚 + 𝑁, 1+ 𝜚 1+ 𝜚 1 and consequently 𝑁 ∈ 𝑂 (𝑇 1+ 𝜚 ). Therefore, it follows that Õ𝐾 LM-DSEE 𝜚 𝜚 𝑅 (𝑇) = 𝑅b + 𝑅e + 𝑅i ≤ Υ𝑇 𝑁 𝑙Δmax + 𝑁 (d𝛾 ln(𝑁 𝑙)e + 2) Δ𝑘 . 𝑘=1 1+𝜈 Thus, the regret 𝑅 LM-DSEE (𝑇) ∈ 𝑂 (𝑇 2 ln 𝑇), and this establishes the theorem.  3.4 The SW-UCB# Algorithm The SW-UCB# algorithm is an adaptation of the SW-UCB algorithm proposed and studied in [20]. At time 𝑡, SW-UCB# maintains an estimate of the mean reward 𝜇¯ 𝑘 (𝑡, 𝛼) at each arm 𝑘, using only the rewards collected within a sliding-window of observations. Let the width of the sliding-window at time 𝑡 ∈ {1, . . . , 𝑇 } be 𝜏(𝑡, 𝛼) = min{d𝜆𝑡 𝛼 e, 𝑡}, where parameters 𝛼 ∈ (0, 1], 𝜉 ∈ (1, 2] and 𝜆 ∈ R≥0 ∪{+∞} are tuned based on environment characteristics. Let Õ𝑡 𝑛 𝑘 (𝑡, 𝛼) = 1{𝜑 𝑠 = 𝑘 } 𝑠=𝑡−𝜏(𝑡,𝛼)+1 be the number of times arm 𝑘 has been selected in the time-window at time 𝑡, then we have 𝑡 1 Õ 𝜑 𝜇¯ 𝑘 (𝑡, 𝛼) = 𝑋𝑠 𝑠 1{𝜑 𝑠 = 𝑘 }. 𝑛 𝑘 (𝑡, 𝛼) 𝑠=𝑡−𝜏(𝑡,𝛼)+1 Based on the above estimate, the SW-UCB# algorithm at each time selects the arm 𝜑𝑡 = arg max 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼), (3.3) 𝑘 ∈{1,...,𝐾 } p where 𝑐 𝑘 (𝑡, 𝛼) = 𝜉 ln(𝑡)/𝑛 𝑘 (𝑡, 𝛼). The details of the algorithm are presented in Algorithm 2. In contrast to the SW-UCB algorithm [20], the SW-UCB# algorithm employs a time-varying width of the sliding-window. The tuning of the fixed window width in [20] requires a priori knowledge of the time horizon 𝑇 which is no longer needed for the SW-UCB# algorithm. 33 Algorithm 2: The SW-UCB# Algorithm Input : 𝜈 ∈ [0, 1), Δmin ∈ (0, 1), 𝜆 ∈ R>0 & 𝑇 ∈ N; Set : 𝛼 = 1−𝜈 2 Output : sequence of arm selection; % Initialization: 1 while 𝑡 ≤ 𝑇 do 2 if 𝑡 ∈ {1, . . . , 𝑁 } then Pick arm 𝜑𝑡 = 𝑡; 3 else Pick arm 𝜑𝑡 defined in (3.3) ; 3.5 Analysis of the SW-UCB# Algorithm We analyze the performance of the SW-UCB# algorithm (Algorithm 2) to get the following result. Theorem 3.2 (Regret Upper Boudn for SW-UCB#). For the piece-wise stationary environment with number of breakpoints Υ𝑇 = 𝑂 (𝑇 𝜈 ) and 𝜈 ∈ [0, 1), the regret for the SW-UCB# algorithm satisfies 1+𝜈 𝑅𝑇SW-UCB# ∈ 𝑂 (𝑇 2 ln 𝑇). Proof. We define set T̂ such that for all 𝑡 ∈ T̂ , 𝑡 is either a breakpoint or there exists a break point in its sliding-window of observations {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡 − 1}. 
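For reference, a minimal sketch of SW-UCB# (Algorithm 2) is given below, again assuming a reward oracle pull(k, t). The window width and the choice ξ = 1 + α follow the text, and the default λ is the value used later in Section 3.6; all helper names are mine.

```python
import math
import numpy as np

def sw_ucb_sharp(pull, K, T, nu, lam=12.3):
    # Sketch of Algorithm 2. Window width tau(t, alpha) = min(ceil(lam * t^alpha), t),
    # alpha = (1 - nu) / 2, and the exploration bonus uses xi = 1 + alpha.
    alpha = (1.0 - nu) / 2.0
    xi = 1.0 + alpha
    arms, rewards = [], []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                                   # initialization: play each arm once
        else:
            tau = min(math.ceil(lam * (t - 1) ** alpha), t - 1)
            a = np.array(arms[-tau:]); r = np.array(rewards[-tau:])
            ucb = np.full(K, np.inf)                    # arms unseen in the window get priority
            for j in range(K):
                n_j = int(np.sum(a == j))
                if n_j > 0:
                    ucb[j] = r[a == j].mean() + math.sqrt(xi * math.log(t - 1) / n_j)
            k = int(np.argmax(ucb))
        arms.append(k)
        rewards.append(pull(k, t))
    return arms
```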
For 𝑡 ∈ T̂ , the statistical means are corrupted. Since the maximum sliding-window width is d𝜆(𝑇 − 1) 𝛼 e, it can be shown that |T | ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e. Then, the regret can be upper bounded as follows. Õ𝐾 𝑅𝑇SW-UCB# 𝛼 ≤ Υ𝑇 d𝜆(𝑇 − 1) eΔmax + E[ 𝑁˜ 𝑘 (𝑇)]Δ 𝑘 , (3.4) 𝑘=1 where 𝑁˜ 𝑘 (𝑇) := 𝑇𝑡=1 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ }, and K𝑡∗ is the set of arms with maximum mean Í 34 reward at 𝑡. It can be seen that Õ 𝑇 𝑁˜ 𝑘 (𝑇) ≤ 1 + 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 𝑡=𝐾+1 𝑇 (3.5) Õ + 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1)}, 𝑡=𝐾+1 where 𝐴(𝑡) = 4𝜉 ln 𝑡/Δ2min . We first bound the second term on the right side of inequality (3.5). Let 𝐺 ∈ N be such that 1 1 [𝜆(1 − 𝛼)(𝐺 − 1)] 1−𝛼 < 𝑇 ≤ [𝜆(1 − 𝛼)𝐺] 1−𝛼 . (3.6) Then, consider the following partition of time indices n  1   1  o 1 + [𝜆(1 − 𝛼)(𝑔 − 1)] 1−𝛼 , . . . , [𝜆(1 − 𝛼)𝑔] 1−𝛼 . (3.7) 𝑔∈{1,...,𝐺} In the 𝑔-th epoch in the partition, either Õ 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} = 0, 𝑡∈𝑔-th epoch or there exist at least one time instant 𝑡 that 𝜑𝑡 = 𝑘 ∉ K𝑡∗ and 𝑛 𝑗 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1). Let the last time instant satisfying these conditions in the 𝑔-th epoch be 𝑡 𝑘 (𝑔) = max 𝑡 ∈ 𝑔-th epoch | 𝜑𝑡 = 𝑘 ∉ K𝑡∗ and 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡) .  We will now show that there exists at most one time index in the 𝑔-th epoch until 𝑡 𝑘 (𝑔) − 1 that is not covered by the time-window at 𝑡 𝑘 (𝑔). Towards this end, consider the increasing convex function 1 𝑓 (𝑥) = 𝑥 1−𝛼 with 𝛼 ∈ (0, 1). It follows that 𝑓 (𝑥 2 ) − 𝑓 (𝑥1 ) ≤ 𝑓 0 (𝑥2 )(𝑥 2 − 𝑥 1 ) if 𝑥2 ≥ 𝑥 1 . Let 𝑡˜ be 𝑡˜1−𝛼 a time index in the 𝑔-th epoch, and set 𝑥 1 = 𝑔 − 1 and 𝑥 2 = 𝜆(1−𝛼) . Then, substituting 𝑥1 and 𝑥 2 in the above inequality and simplifying, we get 1  𝑡˜1−𝛼  𝑡˜ − (𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 ≤ 𝜆𝑡˜𝛼 −𝑔+1 . (3.8) 𝜆(1 − 𝛼) 𝑡˜1−𝛼 Since by definition of the 𝑔-th epoch, 𝜆(1−𝛼) ≤ 𝑔, we have 1 𝑡˜ − b(𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 c ≤ min{𝑡˜ + 1, 𝜆d𝑡˜𝛼 e + 1} = 𝜏( 𝑡˜, 𝛼) + 1. (3.9) 35 Setting 𝑡˜ = 𝑡 𝑘 (𝑔) − 1 in (3.9), we obtain  1 𝑡 𝑘 (𝑔) − 𝜏 𝑡 𝑘 (𝑔) − 1, 𝛼 ≤ 2 + b𝜆(1 − 𝛼)(𝑔 − 1) 1−𝛼 c, i.e., the first time instant in the sliding-window at 𝑡 𝑗 (𝑔) is located at or to the left of the second time instant of the 𝑔-th epoch in the partition (3.7). Therefore, it follows that 1 b𝜆(1−𝛼)𝑔 Õ 1−𝛼 c 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 1 𝑡=1+b𝜆(1−𝛼) (𝑔−1) 1−𝛼 c ≤ 𝑛 𝑘 (𝑡 − 1, 𝛼) + 2 ≤ 𝐴(𝑡 𝑘 (𝑔) − 1) + 2. Now we have Õ 𝑇 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑛 𝑘 (𝑡 − 1, 𝛼) < 𝐴(𝑡 − 1)} 𝑡=𝐾+1 𝐺  Õ 4𝜉 ln 𝑇  ≤2𝐺 + 𝐴(𝑡 𝑘 (𝑔) − 1) ≤ 𝐺 2 + 2 . (3.10) 𝑔=1 Δmin Next, we upper-bound the expectation of the last term on the right-hand side of inequality (3.5). Taking 𝑗𝑡∗ ∈ K𝑡∗ , it can be shown that 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝜏) ≥ 𝐴(𝑡 − 1)} Õ e d𝜆(𝑡−1) d𝜆(𝑡−1) Õ e 𝛼 𝛼 ≤ 1{𝑛 𝑘 (𝑡 − 1, 𝛼) = 𝑠, 𝑛 𝑗𝑡∗ (𝑡 − 1, 𝛼) = 𝑠∗ } (3.11) 𝑠∗ =1 𝑠=𝐴(𝑡−1) × 1{ 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) > 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) + 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼), 𝑡 ∉ T̂ }. When 𝑡 ∉ T̂ , for each arm 𝑘 ∈ {1, . . . , 𝐾 }, 𝜇 𝑘 (𝑠) is a constant for all 𝑠 ∈ {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡}. Note that if 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) > 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) + 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼) is true, at least one of the following inequalities holds. 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) ≥ 𝜇𝑡𝑘 + 𝑐 𝑘 (𝑡 − 1, 𝛼), (3.12) 𝑗∗ 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) ≤ 𝜇𝑡 𝑡 − 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼), (3.13) 𝜇𝑡∗ − 𝜇𝑡𝑘 < 2𝑐 𝑘 (𝑡 − 1, 𝛼). (3.14) 36 Since 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1), (3.14) does not hold. Applying Chernoff-Hoeffding inequality [83, Theorem 1] to bound the probability of events (3.12) and (3.13), we obtain P( 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) ≥ 𝜇 𝑘 (𝑡) + 𝑐 𝑘 (𝑡 − 1, 𝛼)) ≤ (𝑡 − 1) −2𝜉 , P( 𝜇¯ 𝑗𝑡∗ (𝑡 − 1, 𝛼) ≤ 𝜇 𝑗𝑡∗ (𝑡) − 𝑐 𝑗𝑡∗ (𝑡 − 1, 𝛼)) ≤ (𝑡 − 1) −2𝜉 , where 𝜉 = 1 + 𝛼. 
Applying both probability inequalities in conjuncture with (3.11), we get h Õ 𝑇 i E 1{𝜑𝑡 = 𝑘 ∉ K𝑡∗ , 𝑡 ∉ T̂ , 𝑛 𝑘 (𝑡 − 1, 𝛼) ≥ 𝐴(𝑡 − 1)} 𝑡=𝐾+1 Õ𝑇 ≤ 2(𝑡 − 1) −2𝜉 [𝜆(𝑡 − 1) 𝛼 + 1] 2 𝑡=𝐾+1 ∞ Õ (𝜆 + 1) 2 𝜋 2 ≤ 2(𝜆 + 1) 2 𝑡 −2 = . (3.15) 𝑡=1 3 Therefore, it follows from (3.4), (3.5), (3.10), and (3.15) that 𝐾    SW-UCB# 𝛼 Õ 4𝜉 ln 𝑇  (𝜆 + 1) 2 𝜋 2 𝑅𝑇 ≤ Υ𝑇 d𝜆(𝑇 − 1) eΔmax + 𝐺 2+ 2 +1+ Δ𝑘 . 𝑘=1 Δmin 3 1+𝜈 From (3.6), we have 𝐺 = 𝑂 (𝑇 1−𝛼 ), and this yields 𝑅𝑇SW-UCB# ∈ 𝑂 (𝑇 2 ln 𝑇).  3.6 Numerical Illustration In this section, we present simulation results for the SW-UCB# and LM-DSEE algorithms. For each simulation, we consider a 10-armed bandit in which the reward at each arm is generated using Beta distribution. The breakpoints are introduced at time instants where the next element of the sequence {b𝑡 𝜈 c}𝑡∈{1,...,𝑇 } is different from the current element. At each breakpoint, the mean rewards at each arm were randomly selected from the set {0.05, 0.12, 0.19, 0.26, 0.33, 0.39, 0.46, 0.53, 0.6, 0.9}. We select the parameters (𝑎, 𝑏) equal to (1, 0.25) for LM-DSEE in Algorithm 1. For SW-UCB# in Algorithm 2, we select 𝜆 = 12.3. The parameters 𝜈 that describe characteristics of nonstationarity are varied to evaluate the performance of algorithms. Figure. 3.1 shows that both SW-UCB# and LM-DSEE are effective in the piece-wise stationary environment. 37 Figure 3.1: Comparison of LM-DSEE and SW-UCB#. It can be seen in Figs. 3.1 that for both algorithms, as expected, the ratio of the empirical regret to the order of the regret established in Sections 3.3 and 3.5 is upper bounded by a constant. The regret for the SW-UCB# is relatively smoother than the regret for the LM-DSEE algorithm. The saw-tooth behavior of the regret for LM-DSEE is attributed to the fixed exploration-exploitation structure, wherein the regret is certainly incurred during the exploration epochs. 3.7 Summary We studied the stochastic MAB problem in the piece-wise stationary environment and designed two novel algorithms, the LM-DSEE and the SW-UCB# for these problems. We analyzed these algorithms to show that these algorithms incur sublinear regret, i.e., the time average of the regret asymptotically converges to zero. The theoretical results are verified with numerical illustrations. While both the algorithms incur the same order of regret, compared with LM-DSEE, SW- UCB# has a better leading constant. This illustrates the cost of constraining the algorithm to have a deterministic structure. On the other hand, this deterministic structure can be very useful, for example, in the context of planning trajectories for a mobile robot performing surveillance or search using a MAB framework. Though both algorithms can balance the explore-exploit tradeoff, they are reactive in the sense that they select only one arm at a time, i.e., they only provide information about the next location to be visited by the robot. Certain motion constraints on the robots such as non- holonomicity may make such movements energetically demanding. Therefore, the deterministic and predictable structure of LM-DSEE can be leveraged to design a tour for the robot which can 38 be efficiently traversed even under motion constraints. There are several possible extensions of this work. In the next chapter, we’ll study the multiple decision-maker version of the problem in this chapter. Besides, extensions of the methodology developed in this paper to other classes on MAB problems such as the Markovian MAB problem [84] and the restless bandits [85] are also of interest. 
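To connect the simulation setup of Section 3.6 with the regret definition in (3.1), the following minimal sketch builds the piecewise-stationary environment (breakpoints where ⌊t^ν⌋ changes, with means redrawn from a fixed set) and evaluates the corresponding dynamic regret. The Beta parameterization used to draw rewards is an illustrative choice, since the text only states that rewards are Beta distributed.

```python
import numpy as np

def piecewise_means(T, K, nu, mean_set, rng=None):
    # Mean-reward matrix (T x K): a breakpoint occurs whenever floor(t^nu) changes,
    # and at every breakpoint the mean of each arm is redrawn from mean_set.
    rng = rng or np.random.default_rng()
    mu = np.zeros((T, K))
    current = rng.choice(mean_set, size=K)
    for t in range(1, T + 1):
        if t > 1 and int(t ** nu) != int((t - 1) ** nu):
            current = rng.choice(mean_set, size=K)
        mu[t - 1] = current
    return mu

def beta_reward(mean, rng, c=5.0):
    # Bounded reward in [0, 1] with the requested mean; the concentration c is an
    # assumed choice (the text only specifies a Beta distribution).
    return rng.beta(c * mean, c * (1.0 - mean))

def dynamic_regret(mu, choices):
    # Empirical counterpart of (3.1): sum over t of (max_k mu[t, k] - mu[t, choice_t]).
    mu = np.asarray(mu)
    picked = mu[np.arange(len(choices)), choices]
    return float(np.sum(mu.max(axis=1) - picked))
```

Together with the policy sketches given earlier, a call such as dynamic_regret(mu, sw_ucb_sharp(lambda k, t: beta_reward(mu[t - 1, k], rng), K, T, nu)) produces regret trajectories of the kind plotted in Figure 3.1.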
3.8 Bibliographic Remarks

In a non-stationary environment, achieving logarithmic expected cumulative regret may not be feasible, and the focus is on the design of algorithms that achieve sub-linear regret. Indeed, the lower bound for the piece-wise stationary stochastic bandit has been shown to be Ω(√(𝐾Υ_𝑇𝑇)) in [20]. Thus, both of the algorithms proposed in this chapter are near-optimal.

The approaches to handling nonstationary environments can be classified into active approaches and passive approaches. The former actively detect the breakpoints and accordingly remove the old sampling results, while the latter follow a predetermined rule and disregard information about breakpoints.

One of the passive approaches is proposed by Kocsis and Szepesvári [86], which uses a discounting factor to compute the UCB index. In subsequent work, Garivier and Moulines [20] provide a formal analysis of Discounted UCB (D-UCB) and propose SW-UCB, which is also a passive approach. They point out that if the number of change points Υ_𝑇 is available, both algorithms can be tuned to achieve a regret close to the Ω(√(𝐾Υ_𝑇𝑇)) regret lower bound.

The active approaches handle the change of reward distributions in an adaptive manner. Hartland et al. [87] actively detect the change point with the Page-Hinkley test and design two restarting strategies to prevent false alarms, namely 𝛾-Restart and Meta-Bandit. The 𝛾-Restart strategy discounts the sampling history when a breakpoint is detected, while the Meta-Bandit strategy models preserving or discarding old information as a new 2-armed bandit. Other change point detection techniques, such as the cumulative sum (CUSUM) and Generalized Likelihood Ratio (GLR) tests, are used in subsequent work to design CUSUM-UCB [88], GLR-klUCB [89], and M-UCB [90]. Change point detection has also been implemented together with non-UCB methods to design EXP3.R [91] and Change-Point Thompson sampling [92]. These policies either need the knowledge of Υ_𝑇 to tune their parameters or have regret upper bounds with a strong dependence on Υ_𝑇. Very recently, two parameter-free policies, AdSwitch [93] and Ada-ILTCB+ [94] for contextual MAB, have been proved to achieve Õ(√(𝐾Υ_𝑇𝑇)) regret.

CHAPTER 4
MULTI-PLAYER PIECEWISE STATIONARY STOCHASTIC BANDITS

In a variety of applications, including robotic swarming, opportunistic spectrum access, and the Internet of Things, achieving coordinated behavior of multiple decision-makers in unknown, uncertain, and non-stationary environments without any explicit communication among them is of immense interest. This multi-player decision-making in the face of uncertainty is embodied by the multi-player MAB problem, in which several decision-makers simultaneously play the bandit game in a decentralized fashion. For such a problem, we follow a common convention and assume a collision model: the reward from an arm is eliminated or shared when it is selected by multiple agents. The work in this chapter is slightly modified from our published paper on multi-player piecewise stationary stochastic bandits, and it is reproduced here with the permission of the copyright holder1.

To formally formulate the problem, we consider a multi-player MAB problem with 𝐾 arms and 𝑀 ∈ {1, . . . , 𝐾} players. Similarly to the single-player case in the last chapter, at each time 𝑡, there is a random reward 𝑋_𝑡^𝑘 ∈ [0, 1] associated with each arm 𝑘 ∈ {1, . . . , 𝐾}, and every agent 𝑗 ∈ {1, . . . , 𝑀} picks a particular arm 𝜑_𝑡(𝑗) ∈ {1, . . . , 𝐾} and observes 𝑋_𝑡^{𝜑_𝑡(𝑗)}.
We assume no communication between agents, so that 𝜑𝑡 ( 𝑗) is selected based only on agent 𝑗’s own  𝜑 ( 𝑗) 𝑡−1 observation and decision-making history 𝑋𝑠 𝑠 , 𝜑 𝑠 ( 𝑗) 𝑠=1 . 𝜑𝑡 ( 𝑗) With collision model M that eliminate rewards, agent 𝑗 receives the reward 𝑋𝑡 from arm 𝜑𝑡 ( 𝑗) if it is the only player to select arm 𝜑𝑡 ( 𝑗) at time 𝑡. Then, the group reward till time 𝑇 is Õ 𝑇 Õ 𝐾 𝑆𝑇 = 𝑋𝑡𝑘 O𝑡𝑘 , 𝑡=1 𝑘=1 where O𝑡𝑘 = 1 if arm 𝑘 is selected by only one player at time 𝑡 and is zero otherwise. If the collision model allows the reward to be shared, O𝑡𝑘 = 1 if arm 𝑘 is selected by a player. Since the algorithm design and analysis are similar in both cases, the discussion will only be made based on the collision model M in this chapter. 1 ©2018 IEEE. Reprinted with permission from [95]. 41 We assume the minimum difference in mean rewards between any pair of arms at any time is lower bounded by Δmin > 0. Let 𝜎𝑡 be a permutation of {1, . . . , 𝐾 } at time 𝑡 such that the mean rewards satisfy 𝜇𝑡𝜎𝑡 (1) > . . . > 𝜇𝑡𝜎𝑡 (𝐾) . Then, the group regret for a policy 𝜌 till time 𝑇 is defined by Õ𝑇 Õ 𝑀 Õ 𝑇 Õ 𝑀 Õ𝑇 Õ 𝐾  𝜇𝑡𝜎𝑡 (𝑘) 𝜇𝑡𝜎𝑡 (𝑘) 𝜌   𝑅𝑇 (M) = −E 𝜌 𝑆𝑇 = −E 𝜌 𝜇𝑡𝑘 O𝑡𝑘 , 𝑡=1 𝑘=1 𝑡=1 𝑘=1 𝑡=1 𝑘=1 where the second expectation is computed over different realizations of O𝑡𝑘 under policy 𝜌. Our 𝜌 main purpose here is to design a multi-player policy 𝜌 that minimizes 𝑅𝑇 (M). Like the last chapter, we study the above MAB problem in a piecewise stationary environment with the number of breakpoints until time 𝑇 to be Υ𝑇 ∈ 𝑂 (𝑇 𝜈 ), where 𝜈 ∈ [0, 1) is known a priori. 4.1 The RR-SW-UCB# Algorithm The Round Robin SW-UCB# (RR-SW-UCB#) algorithm is designed based upon SW-UCB# pre- sented in the last chapter. In the RR-SW-UCB# algorithm, at each time 𝑡, every agent 𝑗 maintains 𝑗 an estimate of the mean reward 𝜇¯ 𝑘 (𝑡, 𝛼) at each arm 𝑘, using only the rewards collected within a sliding-window of width 𝜏(𝑡, 𝛼) = min{d𝜆𝑡 𝛼 e, 𝑡}, where parameter 𝛼 ∈ (0, 1]. The number of times arm 𝑘 has been selected within the time-window at time 𝑡 is Õ 𝑡 𝑗 𝑛 𝑘 (𝑡, 𝛼) = 1{𝜑 𝑠 ( 𝑗) = 𝑘 }. 𝑠=𝑡−𝜏(𝑡,𝛼)+1 𝑗 Then, 𝜇¯ 𝑘 (𝑡, 𝛼) can be computed by 𝑡 𝑗 1 Õ 𝜑 ( 𝑗) 𝜇¯ 𝑘 (𝑡, 𝛼) = 𝑋𝑠 𝑠 1{𝜑 𝑠 ( 𝑗) = 𝑘 }. 𝑛 𝑘 (𝑡, 𝛼) 𝑠=𝑡−𝜏(𝑡,𝛼)+1 Using its own observations, each agent 𝑗 computes upper confidence bounds on the mean rewards 𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼), ∀𝑘 ∈ {1, . . . , 𝐾 }, q 𝑗 𝑗 where 𝑐 𝑘 (𝑡 − 1, 𝛼) = (1 + 𝛼) ln 𝑡/𝑛 𝑘 (𝑡 − 1, 𝛼). For initial 𝐾 iterations, i.e., 𝑡 ∈ {1, . . . , 𝐾 }, the player 𝑗 selects each arm once. Then, at time instants {𝐾 + 𝜂𝑀 + 1}𝜂∈Z ≥0 , it computes the set Ω 𝑗 42 Algorithm 3: The RR-SW-UCB# Algorithm Input : 𝜈 ∈ [0, 1), Δmin ∈ (0, 1), 𝜆 ∈ R>0 , 𝑇 ∈ N and player number 𝑗; Set : 𝛼 = 1−𝜈 2 ; Output : sequence of arm selections for each player 𝑗; % Initialization: 1 Set Ω 𝑗 ← ∅, ordered set G 𝑗 ← (), and 𝑡 ← 1; 2 while 𝑡 ≤ 𝑇 do % round-robin selection of each arm starting at arm 𝑗 3 if 𝑡 ∈ {1, . . . , 𝐾 } then Pick arm 𝜑𝑡 ( 𝑗) = mod(𝑡 + 𝑗 − 2, 𝐾) + 1; 4 else Compute Ω 𝑗 containing 𝑀 arms with 𝑀 largest values in  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } ; Ascending sort the arm indices in Ω 𝑗 , G 𝑗 ← sort↑ (Ω 𝑗 ); % round-robin selection of arms in G 𝑗 starting at G 𝑗 ( 𝑗) for round ∈ {1, . . . , 𝑀 } do Pick arm 𝜑𝑡 ( 𝑗) = G 𝑗 (mod(𝑡 − 𝐾 + 𝑗 − 2, 𝑀) + 1); 𝑡 ← 𝑡 + 1; containing 𝑀 arms with 𝑀 largest values in the set  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } . 
Let G 𝑗 be the ordered set that contains arms in Ω 𝑗 sorted in ascending value of their indices (not using the upper confidence bounds), and let G 𝑗 (𝑖) denote the 𝑖-th element in G 𝑗 . The player 𝑗 selects 𝑗 arms in G 𝑗 in a round-robin fashion starting with the arm G𝑡 ( 𝑗). It will be shown in the following section that the estimated set of 𝑀 best arms, denoted by Ω 𝑗 , will be the same for each player with high probability. Details of RR-SW-UCB# are shown in Algorithm 3. The free parameter 𝜆 in the algorithm can be used to refine the finite-time performance of the algorithm. 4.2 Analysis of the RR-SW-UCB# Algorithm Before the analysis, we introduce the following notation. Let Ω∗𝑀 (𝑡) denote the set of 𝑀 arms with the 𝑀 largest mean rewards at time 𝑡. Then, the total number of times Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡) until time 𝑇 43 can be defined as Õ 𝑇 N 𝑗 (𝑇) := 1{Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡)}. 𝑡=1 We now upper bound N 𝑗 (𝑇) in the following lemma. Lemma 4.1. For the RR-SW-UCB# algorithm and the multi-player MAB problem with 𝐾 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), the total number of times Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡) until time 𝑇 for any player 𝑘 satisfies h 𝑇 1−𝛼  4𝑀 (1 + 𝛼) ln 𝑇  𝜋 2  𝜆 + 𝑀 + 1  2 i N 𝑗 (𝑇) ≤(𝐾 − 𝑀) +1 1+ + 𝜆(1 − 𝛼) Δ2min 3 𝑀  + Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e + 𝑀 − 1 + 𝐾. Proof. We begin by separately analyzing windows with and without breakpoints. For the ease of 𝑗 𝑗 𝑗 notation, in the following, superscript 𝑗 is omitted in 𝜇¯ 𝑘 (𝑡, 𝛼), 𝑛 𝑘 (𝑡, 𝛼) and 𝑐 𝑘 (𝑡, 𝛼). Step 1: Let set T̂ such that for all 𝑡 ∈ T̂ , 𝑡 is either a breakpoint or there exists a break point in its sliding-window of observations {𝑡 − 𝜏(𝑡 − 1, 𝛼), . . . , 𝑡 − 1}. For 𝑡 ∈ T̂ , the statistical means are biased. It follows that | T̂ | ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e. Consequently, N 𝑗 (𝑇) can be upper-bounded as   N 𝑗 (𝑇) ≤ Υ𝑇 d𝜆(𝑇 − 1) 𝛼 e + 𝑀 − 1 + Ñ 𝑗 (𝑇), (4.1) Í𝑇 where Ñ 𝑗 (𝑇) := 𝑡=1 1{Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ }. The term 𝑀 − 1 in (4.1) is due to the fact that Ω 𝑗 is computed every 𝑀 steps. In the following steps, we will bound Ñ 𝑗 (𝑇). Step 2: If Ω 𝑗 (𝑡) ≠ Ω∗𝑀 (𝑡), there exists at least one arm 𝑖 such that 𝑖 ∈ Ω 𝑗 (𝑡) and 𝑖 ∉ Ω∗𝑀 (𝑡). Then, it follows that Õ𝑇 Õ 𝐾 Ñ 𝑗 (𝑇) ≤𝐾 + 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝐾+1 𝑖=1 𝑇 𝐾 (4.2) Õ Õ + 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)}, 𝑡=𝐾+1 𝑖=1 44 where we choose 𝑙 (𝑡, 𝛼) = 4(1 + 𝛼) ln 𝑡/Δ2min . We begin with bounding the second term on the right-hand side of inequality (4.2). First, we partition time instants into 𝐺 epochs. Let 𝐺 ∈ N be such that 1 1 [𝜆(1 − 𝛼)(𝐺 − 1)] 1−𝛼 < 𝑇 ≤ [𝜆(1 − 𝛼)𝐺] 1−𝛼 . (4.3) Then, we have the following epochs {1 + 𝜙(𝑔 − 1), . . . , 𝜙(𝑔)}𝑔∈{1,...,𝐺} , (4.4)  1  where 𝜙(𝑔) = [𝜆(1 − 𝛼)𝑔] 1−𝛼 . Let 𝑡˜ be any time instant other than the first instant in the 𝑔-th epoch. We will now show that all but one of the time instants in the 𝑔-th epoch until 𝑡˜ must be contained in the time-window at 𝑡˜. Towards this end, consider the increasing convex function 1 𝑓 (𝑥) = 𝑥 1−𝛼 with 𝛼 ∈ (0, 1). It follows that 𝑓 (𝑥 2 ) − 𝑓 (𝑥 1 ) ≤ 𝑓 0 (𝑥 2 )(𝑥 2 − 𝑥 1 ) if 𝑥2 ≥ 𝑥 1 . Then, 𝑡˜1−𝛼 substituting 𝑥 1 = 𝑔 − 1 and 𝑥 2 = 𝜆(1−𝛼) in the above inequality and simplifying, we get 1  𝑡˜1−𝛼  𝑡˜ − (𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 ≤ 𝜆𝑡˜𝛼 −𝑔+1 . 𝜆(1 − 𝛼) 𝑡˜1−𝛼 Since by definition of the 𝑔-th epoch, 𝜆(1−𝛼) ≤ 𝑔, we have 1 𝑡˜ − b(𝜆(1 − 𝛼)(𝑔 − 1)) 1−𝛼 c ≤ min{𝑡˜ + 1, 𝜆d𝑡˜𝛼 e + 1} = 𝜏( 𝑡˜, 𝛼) + 1. The only time instant in the 𝑔-th epoch that is possibly not contained in the time window at 𝑡˜ is 1 + 𝜙(𝑔 − 1). 
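For reference, a minimal sketch of a single RR-SW-UCB# player (Algorithm 3) is given below, assuming a reward oracle pull(k, t) for that player's own observations. The index bookkeeping follows the pseudo-code with zero-based arm indices; collisions and the other players are outside the scope of this snippet, and the default λ is the value used in the numerical study of Section 4.5.

```python
import math
import numpy as np

def rr_sw_ucb_sharp(pull, K, M, j, T, nu, lam=12.3):
    # Sketch of Algorithm 3 for player j (1-based player index). pull(k, t) returns
    # player j's reward observation for arm k (0-based) at time t.
    alpha = (1.0 - nu) / 2.0
    arms, rewards = [], []
    t = 1
    while t <= T:
        if t <= K:                                    # round-robin initialization
            k = (t + j - 2) % K
            arms.append(k); rewards.append(pull(k, t)); t += 1
            continue
        tau = min(math.ceil(lam * (t - 1) ** alpha), t - 1)
        a = np.array(arms[-tau:]); r = np.array(rewards[-tau:])
        ucb = np.full(K, np.inf)
        for i in range(K):
            n_i = int(np.sum(a == i))
            if n_i > 0:
                ucb[i] = r[a == i].mean() + math.sqrt((1 + alpha) * math.log(t) / n_i)
        G = np.sort(np.argsort(ucb)[-M:])             # M largest UCBs, sorted by arm index
        for _ in range(M):                            # round-robin over G, offset by j
            if t > T:
                break
            k = int(G[(t - K + j - 2) % M])
            arms.append(k); rewards.append(pull(k, t)); t += 1
    return arms
```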
Then for any arm 𝑖 ∈ {1, . . . , 𝐾 }, Õ 𝑡˜ 1{𝑖 ∈ Ω 𝑗 (𝑡)} ≤ 𝑀𝑛𝑖 ( 𝑡˜, 𝛼). (4.5) 2+𝜙(𝑔−1) Furthermore, in the 𝑔-th epoch in the partition, either Õ Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} = 0, 𝑡∈𝑔-th epoch 𝑖=1 or there exist at least one time-instant 𝑡 in the 𝑔-th epoch such that Õ𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} > 0. 𝑖=1 45 Let the last time instant satisfying this condition in the 𝑔-th epoch be  Õ𝐾  𝑡 (𝑔) = max 𝑡 ∈ 𝑔-th epoch 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} > 0 . 𝑖=1 Note that 𝑡 (𝑔) ∉ T̂ indicates, for each 𝑖 ∈ {1, . . . , 𝐾 }, 𝜇𝑖 (𝑠) is a constant for all 𝑠 ∈ {𝑡 (𝑔) − 𝜏(𝑡 (𝑔) − 1, 𝛼), . . . , 𝑡 (𝑔)}. Then, it follows from (4.5) that Õ Õ𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡∈𝑔-th epoch 𝑖=1 Õ𝑡 (𝑔) Õ𝐾 ≤𝐾−𝑀+ 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝜙(𝑔−1)+2 𝑖=1 Õ ≤𝐾−𝑀+ 𝑀𝑙 (𝑡𝑖 (𝑔), 𝛼) 𝑖∉Ω∗𝑀 (𝑡 (𝑔))  4𝑀 (1 + 𝛼) ln 𝑇  ≤ (𝐾 − 𝑀) 1 + , (4.6) Δ2min where 𝑡𝑖 (𝑔) = max{𝑡 ∈ 𝑔-th epoch | 𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} and 𝑡𝑖 (𝑔) ≤ 𝑡 (𝑔) for all 𝑖 ∈ {1, . . . , 𝐾 }. Therefore, from (4.4) and (4.6), we have Õ𝑇 Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗ (𝑡), 𝑛𝑖 (𝑡 − 1, 𝛼) < 𝑙 (𝑡, 𝛼)} 𝑡=𝑁+1 𝑖=1  4𝑀 (1 + 𝛼) ln 𝑇  ≤𝐺 (𝐾 − 𝑀) 1 + . (4.7) Δ2min Step 3: In this step, we bound the expectation of the last term in (4.2). It can be shown that Õ 𝐾 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)} 𝑖=1 Õ Õ Õℎ(𝑡) Õ ℎ(𝑡) ≤ 1{𝑛 𝜁 (𝑡 − 1, 𝛼) = 𝑠 𝜁 , 𝑛𝑖 (𝑡 − 1, 𝛼) = 𝑠𝑖 , 𝑡 ∉ T̂ } (4.8) 𝑖∉Ω∗𝑀 (𝑡) 𝜁 ∈Ω∗𝑀 (𝑡) 𝑠 𝜁 =1 𝑠𝑖 =𝑙 (𝑡,𝛼) × 1{ 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) + 𝑐 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇¯𝑖 (𝑡 − 1, 𝛼) + 𝑐𝑖 (𝑡 − 1, 𝛼), 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)},   where ℎ(𝑡) := d𝜆(𝑡 − 1) 𝛼 e/𝑀 is the maximum number of times an arm can be selected within the time window at 𝑡 − 1. Note that 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) + 𝑐 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇¯𝑖 (𝑡 − 1, 𝛼) + 𝑐𝑖 (𝑡 − 1, 𝛼) means 46 at least one of the following holds. 𝜇¯𝑖 (𝑡 − 1, 𝛼) ≥ 𝜇𝑖 (𝑡) + 𝑐𝑖 (𝑡 − 1, 𝛼), (4.9) 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇 𝜁 (𝑡) − 𝑐 𝜁 (𝑡 − 1, 𝛼), (4.10) 𝜇 𝜁 (𝑡) − 𝜇𝑖 (𝑡) < 2𝑐𝑖 (𝑡 − 1, 𝛼). (4.11) Since 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼), (4.11) does not hold. Applying Chernoff-Hoeffding inequality [83, Theorem 1] to bound the probability of events (4.9) and (4.10), we obtain P( 𝜇¯𝑖 (𝑡 − 1, 𝛼) ≥ 𝜇𝑖 (𝑡) + 𝑐𝑖 (𝑡 − 1, 𝛼)) ≤ 𝑡 −2(1+𝛼) , (4.12) P( 𝜇¯ 𝜁 (𝑡 − 1, 𝛼) ≤ 𝜇 𝜁 (𝑡) − 𝑐 𝜁 (𝑡 − 1, 𝛼)) ≤ 𝑡 −2(1+𝛼) . (4.13) Since Ω 𝑗 is only computed at time instants {𝐾 + 𝜂𝑀 + 1}𝜂∈Z ≥0 , it follows from (4.8), (4.12) and (4.13) that h Õ𝑇 Õ 𝐾 i E 1{𝑖 ∈ Ω 𝑗 (𝑡), 𝑖 ∉ Ω∗𝑀 (𝑡), 𝑡 ∉ T̂ , 𝑛𝑖 (𝑡 − 1, 𝛼) ≥ 𝑙 (𝑡, 𝛼)} 𝑡=𝐾+1 𝑖=1 𝑇 −𝑁 ℎ(Õ 𝑓 (𝜂)) ℎ(Õ𝑓 (𝜂)) d Õ 𝑀 e ≤ (𝐾 − 𝑀)𝑀 2𝑀 𝑓 (𝜂) −2(1+𝛼) 𝑠 𝜁 =1 𝑠𝑖 =𝑙 (𝑡,𝛼) 𝜂=0 −𝑁 d 𝑇Õ 𝑀 e ≤ (𝐾 − 𝑀)𝑀 2 2 𝑓 (𝜂) −2(1+𝛼) ℎ( 𝑓 (𝜂)) 2 𝜂=0  𝜆 + 𝑀 + 1 2 Õ ∞ ≤ (𝐾 − 𝑀) 2𝜂−2 𝑀 𝜂=1 𝜋2  𝜆 + 𝑀 + 1 2 = (𝐾 − 𝑀) , (4.14) 3 𝑀 where 𝑓 (𝜂) := 𝐾 + 𝜂𝑀 + 1. Therefore, it follows from (4.1), (4.2), (4.7), and (4.14) that h  4𝑀 (1 + 𝛼) ln 𝑇  𝜋 2  𝜆 + 𝑀 + 1  2 i 𝛼  N 𝑗 (𝑇) ≤ 𝐾 + (𝐾 − 𝑀) 𝐺 1 + + + Υ𝑇 d𝜆(𝑇 − 1) e + 𝑀 − 1 . Δ2min 3 𝑀 From (4.3), we have 𝐺 ≤ 𝑇 1−𝛼 /(𝜆 − 𝜆𝛼) + 1, and this yields the desired result.  Based on Lemma 4.1, we now establish the order of expected cumulative group regret of RR-SW-UCB# in the abruptly changing environment. 47 Theorem 4.2. For the RR-SW-UCB# algorithm and the multi-player MAB problem with 𝑁 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), under collision model M, the expected cumulative group regret satisfies 1+𝜈  𝑅𝑇RR-SW-UCB# (M) ∈ 𝑂 𝑇 2 ln 𝑇 . Proof. 
If all player identify Ω∗𝑀 (𝑡) correctly at time 𝑡, no expected regret is accrued. It follows 1+𝜈 from Lemma 4.1 that N 𝑗 (𝑇) ∈ 𝑂 (𝑇 2 ln 𝑇) for all 𝑗 ∈ {1, . . . , 𝑀 }. The total number of times Í that any player misidentifies Ω∗𝑀 (𝑡) until time 𝑇 can be upper bounded by 𝑀 𝑗=1 N 𝑗 (𝑇). Thus, we conclude the proof.  4.3 The SW-DLP Algorithm Distributed Learning with Prioritization (DLP) [23] is designed for the multi-player stochastic MAB problem in a stationary environment. The idea of DLP is to assign player 𝑗 to collect rewards from 𝑗-th best arm for most of circumstances. In a piecewise stationary environment, we extend the DLP algorithm to design SW-DLP using a sliding observation window. The upper confidence bounds on the mean rewards in SW-DLP are computed the same as SW-UCB#. SW-DLP employes an identical allocation rule as DLP, i.e., at each time instant 𝑡, player 𝑗 computes a set 𝐴 𝑗 (𝑡) containing 𝑗 arms with 𝑗 largest values in the set  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } , and selects arm 𝑗 𝑗 𝜑𝑡 ( 𝑗) = arg min { 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼)}. 𝑘∈𝐴 𝑗 (𝑡) Details of the SW-DLP is shown in Algorithm 4. The parameters in the SW-DLP algorithm are the 𝑗 𝑗 same as in the RR-SW-UCB# algorithm. In the following, we will refer to 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼) as the lower confidence bound on the estimate reward from arm 𝑘. 4.4 Analysis of the SW-DLP Algorithm We analyze the performance of the SW-DLP algorithm (Algorithm 4) to get the following result. 48 Algorithm 4: The SW-DLP Algorithm Input, output, and parameters are the same as RR-SW-UCB# 1 while 𝑡 ≤ 𝑇 do 2 if 𝑡 ∈ {1, . . . , 𝐾 } then Pick arm 𝜑𝑡 ( 𝑗) = mod(𝑡 + 𝑗 − 2, 𝑁) + 1; 3 else Compute 𝐴 𝑗 (𝑡) containing 𝑗 arms with 𝑗 largest values in  𝑗 𝑗 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) + 𝑐 𝑘 (𝑡 − 1, 𝛼) | 𝑘 ∈ {1, . . . , 𝐾 } ; Pick arm 𝑗 𝑗 𝜑𝑡 ( 𝑗) = arg min { 𝜇¯ 𝑘 (𝑡 − 1, 𝛼) − 𝑐 𝑘 (𝑡 − 1, 𝛼)}; 𝑘 ∈𝐴𝑗 Theorem 4.3. For the SW-DLP algorithm and the multi-player MAB problem with 𝐾 arms and 𝑀 players in the piecewise stationary environment with the number of break points Υ𝑇 = 𝑂 (𝑇 𝜈 ), 𝜈 ∈ [0, 1), under collision model M, the expected cumulative group regret satisfies 1+𝜈  𝑅𝑇SW-DLP (M) ∈ 𝑂 𝑇 2 ln 𝑇 . Proof. The proof is similar to the proof of Theorem 4.2 and we only present a sketch. Let 𝜃 𝑡 ( 𝑗) be the 𝑗-th best arm at time 𝑡. The total number of time instants that 𝜃 𝑡 ( 𝑗) is not selected by player 𝑗 with SW-DLP satisfies Õ𝑇 𝑇  𝛼  Õ N̂ 𝑗 = 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗)} ≤ Υ𝑇 𝜆(𝑇 − 1) + 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝑡 ∉ T̂ }. (4.15) 𝑡=1 𝑡=1 We partition the time horizon as in (4.4). Then, similarly to (4.5) in the proof of Lemma 4.1, it can be shown that Õ 𝑡˜ 𝑗 1{𝜑𝑡 ( 𝑗) = 𝑖} ≤ 𝑛𝑖 ( 𝑡˜, 𝛼), (4.16) 𝑡=2+𝜙(𝑔−1) for any arm 𝑖 ∈ {1, . . . , 𝐾 } and 𝑡˜ ∈ 𝑔-th epoch. We study the event that player 𝑗 does not select arm 𝜃 𝑡 ( 𝑗) at time 𝑡 under two scenarios: (i) 𝑗 𝐴 𝑗 (𝑡) ≠ Ω∗ (𝑡), and (ii) 𝐴 𝑗 (𝑡) = Ω∗𝑘 (𝑡), where Ω∗𝑘 (𝑡) is the set with 𝑘 best arms at time 𝑡. Then, we 49 have Õ𝑇 Õ𝑇 𝑗 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝑡 ∉ T̂ } ≤ 1{𝐴 𝑗 (𝑡) ≠ Ω∗ (𝑡), 𝑡 ∉ T̂ } 𝑡=1 𝑡=1 Õ𝑇 + 1{𝜑𝑡 ( 𝑗) ≠ 𝜃 𝑡 ( 𝑗), 𝐴 𝑗 (𝑡) = Ω∗𝑘 (𝑡), 𝑡 ∉ T̂ }. (4.17) 𝑡=1 Note that unlike RR-SW-UCB#, in SW-DLP, after initialization, 𝐴 𝑘 (𝑡) is computed every time instead of only at time instants {𝑁 + 𝜂𝑀 + 1}𝜂∈Z ≥0 . However, this difference does not change the order of the total number of times that Ω∗𝑘 (𝑡) is misidentified. Therefore, using (4.16), it follows similarly to the proof of Lemma 4.1 that Õ 𝑇 𝑗 1+𝜈  1{𝐴 𝑗 ≠ Ω∗ (𝑡)} ∈ 𝑂 𝑇 2 ln 𝑇 . 
(4.18)

The Chernoff-Hoeffding inequality is symmetric about the estimated mean, and the upper tail bound is identical to the lower tail bound. Hence, the second term on the right-hand side of inequality (4.17), which involves selecting 𝜑_𝑡(𝑗) using lower confidence bounds, can be bounded similarly to the first term. Thus, we have

$$\sum_{t=1}^{T} \mathbb{1}\{\varphi_t(j) \neq \theta_t(j),\; A_j(t) = \Omega_j^*(t),\; t \notin \hat{\mathcal{T}}\} \in O\big(T^{\frac{1+\nu}{2}} \ln T\big). \tag{4.19}$$

Substituting (4.18) and (4.19) into (4.17), and substituting (4.17) into (4.15), we conclude that $\hat{\mathcal{N}}_j \in O\big(T^{\frac{1+\nu}{2}} \ln T\big)$.

The number of times the group does not receive a reward from arm 𝜃_𝑡(𝑗) is upper bounded by the number of times player 𝑗 does not receive a reward from arm 𝜃_𝑡(𝑗). Player 𝑗 does not receive a reward from arm 𝜃_𝑡(𝑗) if one of the following conditions is true: (i) arm 𝜃_𝑡(𝑗) is not selected by player 𝑗, or (ii) arm 𝜃_𝑡(𝑗) is selected by another player 𝑗′ ≠ 𝑗. The total number of times either of these events occurs at any arm 𝜃_𝑡(𝑗), for all 𝑗 ∈ {1, . . . , 𝑀}, can be upper bounded by $\sum_{j=1}^{M} 2\hat{\mathcal{N}}_j$. Since $\hat{\mathcal{N}}_j \in O\big(T^{\frac{1+\nu}{2}} \ln T\big)$ for all 𝑗 ∈ {1, . . . , 𝑀}, we conclude the proof.

Remark 4.1 (Comparison of RR-SW-UCB# and SW-DLP). In multi-player MAB algorithms, the assignment of a player to a targeted arm is crucial to avoid collisions. In RR-SW-UCB#, the indices of arms and the indices of players are employed for this assignment. A round-robin policy ensures that all players select the 𝑀 best arms persistently and accurately estimate the associated mean rewards. In SW-DLP, by contrast, such accurate estimation by all players is driven by the lower-confidence-bound-based assignment of players to the arms.

Figure 4.1: Simulation of RR-SW-UCB# and SW-DLP in a piecewise stationary environment, for 𝜈 ∈ {0.15, 0.30, 0.45}. (a) RR-SW-UCB#; (b) SW-DLP and RR-SW-UCB#.

4.5 Numerical Illustration

In this section, we present simulation results for RR-SW-UCB# and SW-DLP in abruptly changing environments. In the simulations, we consider a multi-player MAB problem with 6 arms and 3 players. We consider three different values {0.15, 0.3, 0.45} of the parameter 𝜈, which determines the number of breakpoints, to evaluate the performance of both algorithms. The breakpoints are introduced at time instants where the next element of the sequence {⌊𝑡^𝜈⌋}_{𝑡∈{1,...,𝑇}} is different from the current element. We pick them at these time instants to make the number of breakpoints Υ_𝑡 ∈ 𝑂(𝑡^𝜈) uniformly for all 𝑡 ∈ {1, . . . , 𝑇}. At each breakpoint, the mean reward at each arm is randomly selected from {0.05, 0.22, 0.39, 0.56, 0.73, 0.90}. In both algorithms, we select 𝜆 = 12.3.

As shown in Figure 4.1, with either algorithm, the ratio of the empirical cumulative group regret to the order 𝑡^{(1+𝜈)/2} ln 𝑡 is upper bounded by a constant. The dashed lines in Figure 4.1(b) are taken directly from (a). The comparison shows that the cumulative regret of RR-SW-UCB# is much lower than that of SW-DLP. However, if the cost of switching between arms is considered, then the round-robin structure of RR-SW-UCB# would incur significant cost, and in such a scenario SW-DLP might be preferred.

4.6 Summary

We studied the multi-player stochastic MAB problem in abruptly changing environments under a collision model in which a player receives a reward by selecting an arm only if it is the only player to select that arm. We designed two novel algorithms, RR-SW-UCB# and SW-DLP, to solve this problem.
We analyzed these algorithms and characterized their performance in terms of group regret. In particular, we showed that these algorithms incur sublinear expected cumulative regret, i.e., the time average of the regret asymptotically converges to zero. It would be of interest to extend this work to a more general nonstationary environment in which the reward distributions can change at each time step. Another avenue of future research is the extension of these algorithms to the multi-player Markovian MAB problem. 4.7 Bibliographic Remarks Most of the studies on the multi-player MAB problem deal with a stationary environment. In [22], a lower bound on the expected cumulative group regret for a centralized policy is derived and algorithms that asymptotically achieve this lower bound are designed. Some works assume no communication among players in [4, 5, 23–25], whereas other works allow agents to communicate to improve their arm selection in [26–28]. One of the major generalizations in the multi-player MAB problem is to consider player- dependent rewards, i.e., an arm has different mean rewards for different players [24]. The optimal allocation of the players to arms can be computed using approaches for a famous combinatorial optimization problem known as the assignment problem [96]. To achieve a sublinear regret in a distributed manner, a distributed solution to the assignment problem is required [97]. Assuming collision results in no reward, implicit communication can be generated through collision. In [24], the distributed MAB problem is solved using distributed auction [97] and collision-based implicit 52 communication. The idea of implicit communication is used broadly in different distributed protocols for multi-player MAB problems [98, 99]. More recently, game-theoretic techniques have been used to design fully distributed multi-player MAB algorithms without implicit communication [100]. Specifically, using the payoff dynamics introduced in [101], the authors in [100] design an algorithm that plays, for a sufficiently large portion of time, a strategy profile that optimizes the sum of player-specific mean rewards. 53 CHAPTER 5 GENERAL NONSTATIONARY BANDITS WITH VARIATION BUDGET In this chapter, we study a more general non-stationary stochastic MAB problem proposed in [21]. The reward distributions are allowed to either change abruptly like the piece-wise stationary bandits or drift slowly. The nonstationarity of the environment is characterized by the cumulative maximum variation in mean rewards, which subjects to a variation budget. In order to minimize clutter, we denote the set of arms as K := {1, . . . , 𝐾 } and the sequence of time slots as T := {1, . . . , 𝑇 }. The reward sequence {𝑋𝑡𝑘 }𝑡∈T for each arm 𝑘 ∈ K is composed of independent samples from potentially time-varying probability distribution function sequence 𝑓T𝑘 := { 𝑓𝑡𝑘 (𝑥)}𝑡∈T . We refer to the set F𝑇K = { 𝑓T𝑘 | 𝑘 ∈ K} containing reward distribution sequences at all arms as the environment. Then, the total variation of mean rewards in F𝑇K is defined by 𝑇−1 Õ F𝑇K  𝑘 𝑣 := max 𝜇𝑡+1 − 𝜇𝑡𝑘 , (5.1) 𝑘∈K 𝑡=1 which captures the non-stationarity of the environment. We focus on the class of non-stationary environments that have the total variation within a variation budget 𝑉𝑇 ≥ 0 which is defined by E (𝑉𝑇 , 𝑇, 𝐾) := F𝑇K | 𝑣 F𝑇K ≤ 𝑉𝑇 .   The objective is still to design a policy 𝜌 to minimize the regret in a nonstationary environment 𝑅𝑇 defined in (3.1). 
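Before proceeding, the variation measure in (5.1) can be made concrete with a small helper; the T-by-K array layout for the time-varying mean rewards is my convention.

```python
import numpy as np

def total_variation(mu):
    # Total variation (5.1): sum over t of max_k |mu[t+1, k] - mu[t, k]|.
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(np.max(np.abs(np.diff(mu, axis=0)), axis=1)))
```

An environment then lies in E(𝑉_𝑇, 𝑇, 𝐾) exactly when total_variation(mu) does not exceed the budget 𝑉_𝑇.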
Note that the performance of a policy 𝜌 differs with different F𝑇K ∈ E (𝑉𝑇 , 𝑇, 𝐾). 𝜌 For a fixed variation budget 𝑉𝑇 and a policy 𝜌, the worst-case regret is the regret with respect to the worst possible choice of environment, i.e., 𝜌 𝜌 𝑅worst (𝑉𝑇 , 𝑇, 𝐾) = sup 𝑅𝑇 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) In this work, we aim at designing policies to minimize the worst-case regret. The optimal worst-case regret achieved by any policy is called the minimax regret, and is defined by 𝜌 inf sup 𝑅𝑇 . 𝜌 F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 54 We study the nonstationary MAB problem under the following two classes of reward distributions: Assumption 5.1 (Sub-Gaussian reward). For any 𝑘 ∈ K and any 𝑡 ∈ T , distribution 𝑓𝑡𝑘 (𝑥) is 1/2 sub-Gaussian, i.e., ! h i 𝜆 2 ∀𝜆 ∈ R : E exp(𝜆(𝑋𝑡𝑘 − 𝜇)) ≤ exp . 8   Moreover, for any arm 𝑘 ∈ K and any time 𝑡 ∈ T , E 𝑋𝑡𝑘 ∈ [𝑎, 𝑎 + 𝑏], where 𝑎 ∈ R and 𝑏 > 0.   Assumption 5.2 (Heavy-tailed reward). For any arm 𝑘 ∈ K and any time 𝑡 ∈ T , E (𝑋𝑡𝑘 ) 2 ≤ 1. 5.1 Lower Bound on Minimax Regret in Nonstationary Environment In this section, we review existing minimax regret lower bounds and minimax policies from literature. These results apply to both sub-Gaussian and heavy-tailed rewards. When 𝑉𝑇 = 0, the minimax regret lower bound is the same as the one for stochastic stationary bandit (2.1). We show how the minimax regret lower bound for 𝑉𝑇 = 0 can be extended to establish the minimax regret lower bound for 𝑉𝑇 > 0. In the later sections, we design a variety of policies that match with the minimax regret lower bound for 𝑉𝑇 > 0. In the setting of 𝑉𝑇 > 0, we recall here the minimax regret lower bound for nonstationary stochastic MAB problems. Lemma 5.1 (Minimax Lower Bound: 𝑉𝑇 > 0 [21]). For the non-stationary MAB problem with 𝐾 arms, time horizon 𝑇 and variation budget 𝑉𝑇 ∈ [1/𝐾, 𝑇/𝐾], 𝜌 1 2 inf sup 𝑅𝑇 ≥ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 , 𝜌 F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) where 𝐶 ∈ R>0 is some constant. To understand this lower bound, consider the following non-stationary environment. The  1 2 horizon T is partitioned into epochs of length 𝜏 = 𝐾 3 (𝑇/𝑉𝑇 ) 3 . In each epoch, the reward distribution sequences are stationary and all the arms have identical mean rewards except for the p unique best arm. Let the gap in the mean be Δ = 𝐾/𝜏. The index of the best arm switches at 55 the end of each epoch following some unknown rule. So, the total variation is no greater than Δ𝑇/𝜏, which satisfies the variation budget 𝑉𝑇 . Besides, for any policy 𝜌, we know from (2.1) that √ worst-case regret in each epoch is no less than 𝐶2 𝐾𝜏. Summing up the regret over all the epochs, √ minimax regret is lower bounded by 𝑇/𝜏 × 𝐶2 𝐾𝜏, which is consistent with Lemma 5.1. 5.2 UCB Algorithms for Sub-Gaussian Nonstationary Stochastic Bandits In this section, we extend UCB1 and MOSS to design nonstationary UCB policies for scenarios with 𝑉𝑇 > 0. Three different techniques are employed, namely periodic resetting, sliding observation window and discount factor, to deal with the remembering-forgetting tradeoff. The proposed algorithms are analyzed to provide guarantees on the worst-case regret. We show their performances match closely with the lower bound in Lemma 5.1.   The following notations are used in later discussions. Let 𝑁 = 𝑇/𝜏 , for some 𝜏 ∈ {1, . . . , 𝑇 }, and let {T1 , . . . , T𝑁 } be a partition of time slots T , where each epoch T𝑖 has length 𝜏 except possibly T𝑁 . In particular, n o T𝑖 = 1 + (𝑖 − 1)𝜏 , . . . , min (𝑖𝜏, 𝑇) , 𝑖 ∈ {1, . . . , 𝑁 }. Let the maximum mean reward within T𝑖 be achieved at time 𝜏𝑖 ∈ T𝑖 and arm 𝜅𝑖 , i.e., 𝜇𝜏𝜅𝑖𝑖 = max𝑡∈T𝑖 𝜇𝑡∗ . 
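The hard environment behind the lower bound can be written out explicitly, as in the sketch below; the base mean of 1/2 and the random choice of the switching best arm are illustrative assumptions, since the text only requires that the best arm changes at epoch boundaries by an unknown rule.

```python
import math
import numpy as np

def hard_instance(T, K, V_T, rng=None):
    # Epochs of length tau = ceil(K^(1/3) * (T / V_T)^(2/3)); within an epoch all
    # arms share a base mean except one best arm, which is Delta = sqrt(K / tau)
    # better and switches at every epoch boundary.
    rng = rng or np.random.default_rng()
    tau = math.ceil(K ** (1.0 / 3.0) * (T / V_T) ** (2.0 / 3.0))
    delta = math.sqrt(K / tau)
    mu = np.full((T, K), 0.5)
    for start in range(0, T, tau):
        best = int(rng.integers(K))
        mu[start:start + tau, best] += delta
    return mu
```

Its total variation is at most Δ𝑇/𝜏, which stays within the budget 𝑉_𝑇, matching the construction described above.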
We define the variation within T𝑖 as Õ 𝑘 𝑣 𝑖 := max 𝜇𝑡+1 − 𝜇𝑡𝑘 , 𝑘∈K 𝑡∈T𝑖 where we trivially assign 𝜇𝑇+1 𝑘 = 𝜇𝑇𝑘 for all 𝑘 ∈ K. Let 1 {·} denote the indicator function and |·| denote the cardinality of the set, if its argument is a set, and the absolute value if its argument is a real number. 5.2.1 Resetting MOSS Algorithm Periodic resetting is an effective technique to preserve the freshness and authenticity of the informa- tion history. It has been employed in [21] to modify Exp3 to design Rexp3 policy for nonstationary 56 Algorithm 5: The R-MOSS Algorithm Input : 𝑉𝑇 ∈l R ≥0 and 𝑇 ∈m N 1 2 Set : 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 Output : sequence of arm selection 1 while 𝑡 ≤ 𝑇 do 2 if mod (𝑡, 𝜏) = 0 then 3 Restart the MOSS policy; stochastic MAB problems. We extend this approach to MOSS and propose a nonstationary policy Resetting MOSS (R-MOSS). In R-MOSS, after every 𝜏 time slots, the sampling history is erased and MOSS is restarted. The pseudo-code is provided in Algorithm 5 and the performance in terms of the worst-case regret is established below. Theorem 5.2. For the sub-Gaussian nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, l 1 2m variation budget 𝑉𝑇 > 0, and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , the worst case regret of R-MOSS satisfies 1 2 sup 𝑅𝑇R-MOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Sketch of the proof. Note that one run of MOSS takes place in each epoch. For epoch T𝑖 , define the set of bad arms for R-MOSS by B𝑖R := {𝑘 ∈ K | 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 ≥ 2𝑣 𝑖 }. (5.2) Notice that for any 𝑡1 , 𝑡2 ∈ T𝑖 , 𝜇𝑡𝑘1 − 𝜇𝑡𝑘2 ≤ 𝑣 𝑖 , ∀𝑘 ∈ K. (5.3) Therefore, for any 𝑡 ∈ T𝑖 , we have 𝜇𝑡∗ − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 . 𝜑 𝜑 𝜑 Then, the regret from T𝑖 can be bounded as the following, Õ  Õ  𝜇𝑡∗ 𝜑 𝜑 E − 𝜇𝑡 𝑡 ≤ |T𝑖 | 𝑣 𝑖 + E 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 ≤ 3|T𝑖 | 𝑣 𝑖 + 𝑆𝑖 , (5.4) 𝑡∈T𝑖 𝑡∈T𝑖 57 Õ Õ    𝜑 where 𝑆𝑖 = E 1 𝜑𝑡 = 𝑘 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 − 2𝑣 𝑖 . 𝑡∈T𝑖 𝑘∈B R 𝑖 Now, we have decoupled the problem, enabling us to generalize the analysis of MOSS in the stationary environment [70] to bound 𝑆𝑖 . We will only specify the generalization steps and skip the details for brevity. First notice inequality (5.3) indicates that for any 𝑘 ∈ B𝑖R and any 𝑡 ∈ T𝑖 , 𝜇𝑡𝜅𝑖 ≥ 𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 and 𝜇𝑡𝑘 ≤ 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 . So, at any 𝑡 ∈ T𝑖 , 𝜇ˆ 𝜅𝑖 ,𝑛 𝜅𝑖 (𝑡) concentrate around a value no smaller than 𝜇𝜏𝜅𝑖𝑖 −𝑣 𝑖 , and 𝜇ˆ 𝑘,𝑛 𝑘 (𝑡) concentrate around a value no greater than 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 for any 𝑘 ∈ 𝐵𝑖R . Also 𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 ≥ 𝜇𝜏𝑘𝑖 + 𝑣 𝑖 due to the definition in (5.2). In the analysis of MOSS in stationary environment [70], the UCB of each suboptimal arm is compared with the best arm and each selection of suboptimal arm 𝑘 contribute Δ 𝑘 in regret. Here, we can apply a similar analysis by comparing the UCB of each arm 𝑘 ∈ 𝐵𝑖R with 𝜅𝑖 and each selection of arm 𝑘 ∈ 𝐵𝑖R contributes (𝜇𝜏𝜅𝑖𝑖 − 𝑣 𝑖 ) − (𝜇𝜏𝑘𝑖 + 𝑣 𝑖 ) in 𝑆𝑖 . Accordingly, we borrow the upper p bound in Lemma 2.3 to get 𝑆𝑖 ≤ 49 𝐾 |T𝑖 |. Substituting the upper bound on 𝑆𝑖 into (5.4) and summarizing over all the epochs, we conclude that Õ 𝑁 √ sup 𝑅𝑇R-MOSS ≤ 3𝜏𝑉𝑇 + 49 𝐾𝜏, F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 𝑖=1 which implies the theorem.  The upper bound in Theorem 5.2 is in the same order as the lower bound in Lemma 5.1. So, the worst-case regret for R-MOSS is order optimal. 5.2.2 Sliding-Window MOSS Algorithm We have shown that periodic resetting coarsely adapts the stationary policy to a nonstationary setting. However, it is inefficient to entirely remove the sampling history at the restarting points and the regret accumulates quickly close to these points. 
In [20], a sliding observation window 58 Algorithm 6: The SW-MOSS Algorithm Input : 𝑉𝑇 ∈l R>0 , 𝑇 ∈ N mand 𝜂 > 1/2 1 2 Set : 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 Output : sequence of arm selection 1 Pick each arm once. 2 while 𝑡 ≤ 𝑇 do  Compute statistics within W𝑡 = min(1, 𝑡 − 𝜏), . . . , 𝑡 − 1 : 1 Õ Õ 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) = 𝑋𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛 𝑘 (𝑡) = 1{𝜑 𝑠 = 𝑘 } 𝑛 𝑘 (𝑡) 𝑠 ∈W𝑡 𝑠 ∈W𝑡 v u t     max ln 𝐾 𝑛𝜏𝑘 (𝑡) , 0 Pick arm 𝜑𝑡 = arg max 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) + 𝜂 ; 𝑘 ∈K 𝑛 𝑘 (𝑡) is used to erase the outdated information smoothly and more efficiently utilize the information history. The authors proposed the SW-UCB algorithm that intends to solve the MAB problem with piece-wise stationary mean rewards. We show that a similar approach can also deal with the general nonstationary environment with a variation budget. In contrast to SW-UCB, we integrate the sliding window technique with MOSS instead of UCB1 and achieve the order optimal worst-case regret.  Let the sliding observation window at time 𝑡 be W𝑡 := min(1, 𝑡 − 𝜏), . . . , 𝑡 − 1 . Then, the associated mean estimator is given by 1 Õ Õ 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) = 𝑋𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛 𝑘 (𝑡) = 1{𝜑 𝑠 = 𝑘 }. 𝑛 𝑘 (𝑡) 𝑠∈W𝑡 𝑠∈W𝑡 For each arm 𝑘 ∈ K, define the UCB index for SW-MOSS by v u t     max ln 𝐾𝑛𝜏𝑘 (𝑡) , 0 𝑔𝑡𝑘 = 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) + 𝑐 𝑛 𝑘 (𝑘) , 𝑐 𝑛 𝑘 (𝑡) = 𝜂 , 𝑛 𝑘 (𝑡) where 𝜂 > 1/2 is a tunable parameter. With these notations, SW-MOSS is defined in Algorithm 6. To analyze it, we will use the following concentration bound for sub-Gaussian random variables. Fact 5.1 (Maximal Hoeffding inequality[83]). Let 𝑋1 , . . . , 𝑋𝑛 be a sequence of independent 1/2 59 sub-Gaussian random variables. Define 𝑑𝑖 := 𝑋𝑖 − 𝜇𝑖 , then for any 𝛿 > 0,  Õ𝑚    P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑑𝑖 ≥ 𝛿 ≤ exp −2𝛿2 /𝑛 , 𝑖=1  Õ 𝑚    and P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑑𝑖 ≤ −𝛿 ≤ exp −2𝛿2 /𝑛 . 𝑖=1 At time 𝑡, for each arm 𝑘 ∈ K define 1 Õ 𝑘 𝑀𝑡𝑘 := 𝜇 𝑠 1{𝜑𝑠 =𝑘 } . 𝑛 𝑘 (𝑡) 𝑠∈W𝑡 Now, we are ready to present concentration bounds for the sliding window empirical mean 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) . Lemma 5.3. For any arm 𝑘 ∈ K and any time 𝑡 ∈ T , if 𝜂 > 1/2, for any 𝑥 > 0 and 𝑙 ≥ 1, the  probability of event 𝐴 := 𝜇ˆ 𝑛𝑘 (𝑡) + 𝑐 𝑛 𝑘 (𝑡) ≤ 𝑀𝑡𝑘 − 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 is no greater than 𝑘 3 (2𝜂) 2 𝐾  2  exp −𝑥 𝑙/𝜂 . (5.5) ln(2𝜂) 𝜏𝑥 2  The probability of event 𝐵 := 𝜇ˆ 𝑛𝑘 𝑘 (𝑡) − 𝑐 𝑛 𝑘 (𝑡) ≥ 𝑀𝑡𝑘 + 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 is also upper bounded by (5.5). Proof. For any 𝑡 ∈ T , let 𝑢𝑖𝑘𝑡 be the 𝑖-th time slot when arm 𝑘 is selected within W𝑡 and let 𝑑𝑖𝑘𝑡 = 𝑋 𝑘𝑘𝑡 − 𝜇 𝑘 𝑘𝑡 . Note that 𝑢𝑖 𝑢𝑖  𝑚  1 Õ 𝑘𝑡 P ( 𝐴) ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝑑 ≤ −𝑥 − 𝑐 𝑚 , 𝑚 𝑖=1 𝑖 p Let 𝑎 = 2𝜂 such that 𝑎 > 1. We now apply a peeling argument [76, Sec 2.2] with geometric grid 𝑎 𝑠 𝑙 < 𝑚 ≤ 𝑎 𝑠+1 𝑙 over {𝑙, . . . , 𝜏}. Since 𝑐 𝑚 is monotonically decreasing in 𝑚,  𝑚  1 Õ 𝑘𝑡 P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝑑 ≤ −𝑥 − 𝑐 𝑚 𝑚 𝑖=1 𝑖 Õ  Õ𝑚  𝑠 𝑠+1 𝑘𝑡 𝑠  ≤ P ∃𝑚 ∈ [𝑎 𝑙, 𝑎 𝑙) : 𝑑𝑖 ≤ −𝑎 𝑙 𝑥 + 𝑐 𝑎 𝑠+1 𝑙 . 𝑠≥0 𝑖=1 60 According to Fact 5.1, the above summand is no greater than Õ  Õ𝑚  𝑠+1 𝑘𝑡 𝑠  P ∃𝑚 ∈ [1, 𝑎 𝑙) : 𝑑𝑖 ≤ −𝑎 𝑙 𝑥 + 𝑐 𝑎 𝑠+1 𝑙 𝑠≥0 𝑖=1 ! Õ 𝑎 2𝑠 𝑙 2   ≤ exp −2   𝑥 2 + 𝑐2𝑎 𝑠+1 𝑙 𝑠≥0 𝑎 𝑠+1 𝑙  ! Õ 𝑠−1 2 2𝜂 𝜏 ≤ exp −2𝑎 𝑙𝑥 − 2 ln 𝑠≥0 𝑎 𝐾𝑎 𝑠+1 𝑙 Õ 𝐾𝑙𝑎 𝑠   𝑠−2 2 = exp −2𝑎 𝑙𝑥 . 𝑠≥1 𝜏 Let 𝑏 = 2𝑥 2 𝑙/𝑎 2 . It follows that Õ 𝐾𝑙𝑎 𝑠 ∫ +∞ 𝑠 𝐾𝑙  exp −𝑏𝑎 ≤ 𝑎 𝑦+1 exp − 𝑏𝑎 𝑦 𝑑𝑦 𝑠≥1 𝜏 𝜏 0 ∫ +∞ 𝐾𝑙𝑎  = exp(−𝑏𝑧)𝑑𝑧 where we set 𝑧 = 𝑎 𝑦 𝜏 ln(𝑎) 1 𝐾𝑙𝑎𝑒 −𝑏 = , 𝜏𝑏 ln(𝑎) which concludes the bound for the probability of event 𝐴. By using upper tail bound, similar result exists for event 𝐵.  We now leverage Lemma 5.3 to get an upper bound on the worst-case regret for SW-MOSS. Theorem 5.4. 
For the nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, variation budget l 1 2m 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , the worst-case regret of SW-MOSS satisfies 1 2 sup 𝑅𝑇SW-MOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Proof. The proof consists of the following five steps. Step 1: Recall that 𝑣 𝑖 is the variation within T𝑖 . Here, we trivially assign T0 = ∅ and 𝑣 0 = 0. Then, for each 𝑖 ∈ {1, . . . , 𝑁 }, let Δ𝑖𝑘 := 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 − 2𝑣 𝑖−1 − 2𝑣 𝑖 , ∀𝑘 ∈ K. Define the set of bad arms for SW-MOSS in T𝑖 as B𝑖SW := {𝑘 ∈ K | Δ𝑖𝑘 ≥ 𝜖 }, 61 p where we assign 𝜖 = 4 𝑒𝜂𝐾/𝜏. Step 2: We decouple the regret in this step. For any 𝑡 ∈ T𝑖 , since 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖 for any 𝑘 ∈ K, it satisfies that 𝜇𝑡∗ − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 𝜑 𝜑 𝜑 n o SW 𝜑 ≤ 1 𝜑𝑡 ∈ B𝑖 (Δ𝑖 𝑡 − 𝜖) + 2𝑣 𝑖−1 + 3𝑣 𝑖 + 𝜖 . Then we get the following inequalities, Õ Õ 𝑁 Õ n o 𝜇𝑡∗ 𝜑 𝜑 − 𝜇𝑡 𝑡 ≤ 1 𝜑𝑡 ∈ B𝑖SW (Δ𝑖 𝑡 − 𝜖) + 2𝑣 𝑖−1 + 3𝑣 𝑖 + 𝜖 𝑡∈T 𝑖=1 𝑡∈T𝑖 Õ𝑁 Õ n o 𝜑 ≤5𝜏𝑉𝑇 + 𝑇𝜖 + 1 𝜑𝑡 ∈ B𝑖SW (Δ𝑖 𝑡 − 𝜖). (5.6) 𝑖=1 𝑡∈T𝑖 To continue, we take a decomposition inspired by the analysis of MOSS in [70] below, Õ n o  SW 𝜑𝑡 1 𝜑𝑡 ∈ B𝑖 Δ𝑖 − 𝜖 𝑡∈T𝑖 𝜑  Õ  SW 𝜅 𝑖 𝜅𝑖 Δ𝑖 𝑡 𝜑𝑡 ≤ 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 (5.7) 𝑡∈T𝑖 4 𝜑  Õ  SW 𝜅 𝑖 𝜅𝑖 Δ𝑖 𝑡  𝜑𝑡  + 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 ≤ 𝑀𝑡 − Δ𝑖 − 𝜖 , (5.8) 𝑡∈T 4 𝑖 where summands (5.7) describes the regret when arm 𝜅𝑖 is fairly estimated and summand (5.8) quantifies the regret incurred by underestimating arm 𝜅𝑖 . 𝜑 Step 3: In this step, we bound the expectation of (5.7). Since 𝑔𝑡 𝑡 ≥ 𝑔𝑡𝜅𝑖 , 𝜑  𝜑  Δ𝑖 𝑡 𝜑𝑡 Õ Δ𝑖 𝑡 𝜑𝑡 Õ   SW 𝜅 𝑖 𝜅𝑖 SW 𝜑𝑡 𝜅𝑖 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 ≤ 1 𝜑𝑡 ∈ B𝑖 , 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 𝑡∈T𝑖 4 𝑡∈T𝑖 4 Õ Õ  Δ𝑖𝑘 𝑘  𝑘 𝜅𝑖 = 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 . (5.9) SW 𝑡∈T 4 𝑘 ∈B𝑖 𝑖 Notice that for any 𝑡 ∈ T𝑖−1 ∪ T𝑖 , 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖−1 + 𝑣 𝑖 , ∀𝑘 ∈ K. 62 It indicates that an arm 𝑘 ∈ B𝑖SW is at least Δ𝑖𝑘 worse in mean reward than arm 𝜅𝑖 at any time slot 𝑡 ∈ T𝑖−1 ∪ T𝑖 . Since W𝑡 ⊂ T𝑖−1 ∪ T𝑖 , for any 𝑡 ∈ T𝑖 𝑀𝑡𝜅𝑖 − 𝑀𝑡𝑘 ≥ Δ𝑖𝑘 ≥ 𝜖, ∀𝑘 ∈ B𝑖SW . It follows from (5.9) that Δ𝑖𝑘 𝑘 3Δ𝑖𝑘 𝑘 Õ Õ   Õ Õ   𝑘 𝜅𝑖 𝑘 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 − Δ𝑖 ≤ 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 + Δ𝑖 . (5.10) SW 𝑡∈T 4 SW 𝑡∈T 4 𝑘∈B𝑖 𝑖 𝑘∈B𝑖 𝑖 SW 𝑠 be the 𝑠-th time slot when arm 𝑘 is selected within T𝑖 . Then, for any 𝑘 ∈ B𝑖 , Let 𝑡 𝑖𝑘 Õ  3Δ𝑖𝑘  𝑘 𝑘 1 𝜑𝑡 = 𝑘, 𝑔𝑡 > 𝑀𝑡 + 𝑡∈T𝑖 4 Õ  3Δ𝑖𝑘  𝑘 𝑘 = 1 𝑔𝑡 𝑖𝑘 > 𝑀𝑡 𝑖𝑘 + 𝑠≥1 𝑠 𝑠 4 3Δ𝑖𝑘 Õ   𝑘 𝑘 𝑘 ≤𝑙𝑖 + 1 𝑔𝑡 𝑖𝑘 > 𝑀𝑡 𝑖𝑘 + , (5.11) 𝑠 𝑠 4 𝑠≥𝑙 𝑖 +1 𝑘        2 𝑘 2 4 𝜏 Δ𝑖 where we set 𝑙𝑖𝑘 = 𝜂 Δ𝑘 ln 𝜂𝐾 4 . Since Δ𝑖𝑘 ≥ 𝜖, for 𝑘 ∈ B𝑖SW , we have 𝑖 2  𝜏  2 l  2 m  𝑙𝑖𝑘 ≥ 𝜂 4/Δ𝑖 ln 𝑘 𝜖/4 ≥ 𝜂 4/Δ𝑖𝑘 , 𝜂𝐾 p where the second inequality follows by substituting 𝜖 = 4 𝑒𝜂𝐾/𝜏. Additionally, since 𝑡1𝑖𝑘 , . . . , 𝑡 𝑖𝑘 𝑠−1 ∈ W𝑡 𝑠𝑖𝑘 , we get 𝑛 𝑘 (𝑡 𝑖𝑘 𝑠 ) ≥ 𝑠 − 1. Furthermore, since 𝑐 𝑚 is monotonically decreasing with 𝑚, v u t  2! 𝜂 𝜏 Δ𝑖𝑘 Δ𝑖𝑘 𝑐 𝑛 𝑘 (𝑡 𝑠𝑘 ) ≤ 𝑐 𝑙 𝑘 ≤ ln ≤ , 𝑖 𝑙𝑖𝑘 𝜂𝐾 4 4 for 𝑠 ≥ 𝑙𝑖𝑘 + 1. Therefore, we continue from (5.11) to get 3Δ 𝑘 Δ𝑖𝑘 Õ   Õ   𝑙𝑖𝑘 + 1 𝑔𝑡𝑘𝑖𝑘 > 𝑀𝑡𝑘𝑖𝑘 + 𝑖 ≤ 𝑙𝑖𝑘 + 1 𝑔𝑡𝑘𝑖𝑘 − 2𝑐 𝑛 𝑘 (𝑡 𝑠𝑖𝑘 ) > 𝑀𝑡𝑘𝑖𝑘 + . 𝑠 𝑠 4 𝑠 𝑠 4 𝑠≥𝑙𝑖𝑘 +1 𝑠≥𝑙𝑖𝑘 +1 63 By applying Lemma 5.3, considering 𝑛 𝑘 (𝑡 𝑖𝑘 𝑠 ) ≥ 𝑠 − 1, Δ𝑘 Õ   P 𝑔𝑡𝑘𝑖𝑘 − 2𝑐 𝑛 𝑘 (𝑡 𝑠𝑖𝑘 ) > 𝑀𝑡𝑘𝑖𝑘 + 𝑖 𝑠 𝑠 4 𝑠≥𝑙𝑖𝑘 +1 ! Õ (2𝜂) 23 𝐾  4  2  2 𝑠 Δ𝑖𝑘 ≤ exp − 𝑘 ln(2𝜂) 𝜏 Δ𝑖𝑘 𝜂 4 𝑠≥𝑙𝑖 +∞ 3  2  2! 𝑦 Δ𝑖𝑘 ∫ (2𝜂) 2 𝐾 4 ≤ 𝑘 exp − 𝑑𝑦 𝑙𝑖𝑘 −1 ln(2𝜂) 𝜏 Δ𝑖 𝜂 4 3  4 (2𝜂) 2 𝜂𝐾 4 ≤ . (5.12) ln(2𝜂) 𝜏 Δ𝑖𝑘   p Let ℎ(𝑥) = 16𝜂/𝑥 ln 𝜏𝑥 2 /16𝜂𝐾 which achieves maximum at 4𝑒 𝜂𝐾/𝜏. Combining (5.12), (5.11), (5.10), and (5.9), we obtain Õ (2𝜂) 23 𝜂𝐾 256 𝑘 𝑘 E [(5.7)] ≤ ln(2𝜂) 𝜏   3 + 𝑙𝑖 Δ𝑖 𝑘 ∈B𝑖 Δ𝑖𝑘 Õ (2𝜂) 32 𝜂𝐾 256 𝑘 𝑘 ≤   3 + ℎ(Δ𝑖 ) + Δ𝑖 ln(2𝜂) 𝜏 𝑘 ∈B𝑖 Δ𝑖𝑘 Õ (2𝜂) 32 𝜂𝐾 256  p  ≤ + ℎ 4𝑒 𝜂𝐾/𝜏 + 𝑏 𝑘 ∈B𝑖 ln(2𝜂) 𝜏 𝜖 3 √ √   2.6𝜂 ≤ + 3 𝜂 𝐾𝜏 + 𝐾𝑏. 
ln(2𝜂)  𝜑 Step 4: In this step, we bound expectation of (5.8). When event 𝜑𝑡 ∈ B𝑖SW , 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − Δ𝑖 𝑡 /4 happens, we know 𝜑 𝜖 Δ𝑖 𝑡 ≤ 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 and 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − . 4 Thus, we have 𝜑  Δ𝑖 𝑡  𝜑𝑡   1 𝜑𝑡 ∈ B𝑖SW , 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − Δ𝑖 − 𝜖 4   𝜖  ≤1 𝑔𝑡𝜅𝑖 ≤ 𝑀𝑡𝜅𝑖 − × 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 − 𝜖 := 𝑌 . 4 64 Since 𝑌 is a nonnegative random variable, its expectation can be computed involving only its cumulative density function: ∫ +∞ E [𝑌 ] = P (𝑌 > 𝑥) 𝑑𝑥 0 ∫ +∞   ≤ P 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 − 𝜖 ≥ 𝑥 𝑑𝑥 ∫ 0 +∞   = P 4𝑀𝑡𝜅𝑖 − 4𝑔𝑡𝜅𝑖 > 𝑥 𝑑𝑥 𝜖 ∫ +∞ 3 3 16(2𝜂) 2 𝐾 16(2𝜂) 2 𝐾 ≤ 2 𝑑𝑥 = . 𝜖 ln(2𝜂) 𝜏𝑥 ln(2𝜂) 𝜏𝜖 3  Hence, E [(5.8)] ≤ 16(2𝜂) 2 𝐾 |T𝑖 | / ln(2𝜂)𝜏𝜖 . Step 5: With bounds on E [(5.7)] and E [(5.8)] from previous steps, 3 √ √   2.6𝜂 16(2𝜂) 2 𝐾𝑇 1 2 E [(5.6)] ≤5𝜏𝑉𝑇 + 𝑇𝜖 + 𝑁 + 3 𝜂 𝐾𝜏 + 𝑁𝐾𝑏 + ≤ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 , ln(2𝜂) ln(2𝜂) 𝜏𝜖 for some constant 𝐶, which concludes the proof.  We have shown that SW-MOSS also enjoys order optimal worst-case regret. One drawback of the sliding window method is that all sampling history within the observation window needs to be  1 2 stored. Since window size is selected to be 𝜏 = 𝐾 3 (𝑇/𝑉𝑇 ) 3 , large memory is needed for large horizon length 𝑇. The next policy resolves this problem. 5.2.3 Discounted UCB Algorithm The discount factor is widely used in estimators to forget old information and put more attention on recent information. In [20], such an estimation is used together with UCB1 to solve the piecewise stationary MAB problem, and the policy designed is called Discounted UCB (D-UCB). Here, we tune D-UCB to work in the nonstationary environment with variation budget 𝑉𝑇 . Specifically, the mean estimator used is discounted empirical average given by 𝑡−1 𝑡−1 𝑘 1 Õ 𝑡−𝑠 𝑘 Õ 𝜇ˆ 𝛾,𝑡 = 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 }𝑋𝑠 , 𝑛𝛾,𝑡 = 𝛾 𝑡−𝑠 1{𝜑 𝑠 = 𝑘 }, 𝑛𝛾,𝑡 𝑠=1 𝑠=1 65 Algorithm 7: The D-UCB Algorithm 1 Input : 𝑉𝑇 ∈ R>0 , 𝑇 ∈ N and 𝜉 > 2 1 2 Set : 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 Output : sequence of arm selection 1 for 𝑡 ∈ {1, . . . , 𝐾 } do Pick arm 𝜑𝑡 = 𝑡 and set 𝑛𝑡 ← 𝛾 𝐾 −𝑡 and 𝜇ˆ 𝑡 ← 𝑋𝑡𝑡 ; 2 while 𝑡 ≤ 𝑇 do r 𝑘 𝜉 ln(𝜏) Pick arm 𝜑𝑡 = arg max 𝜇ˆ + 2 ; 𝑘 ∈K 𝑛𝑘 For each arm 𝑘 ∈ K, set 𝑛 𝑘 ← 𝛾𝑛 𝑘 ; 1 𝜑 Set 𝑛 𝜑𝑡 ← 𝑛 𝜑𝑡 + 1 & 𝜇ˆ 𝜑𝑡 ← 𝜇ˆ 𝜑𝑡 + 𝑛 𝜑𝑡 (𝑋𝑡 𝑡 − 𝑋¯ 𝜑𝑡 ); 1 2 where 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 is the discount factor. Besides, the UCB is designed as 𝑔𝑡𝑘 = 𝜇ˆ 𝑡𝑘 + 2𝑐 𝑡𝑘 , q 𝑘 = where 𝑐 𝛾,𝑡 𝜉 ln(𝜏)/𝑛𝛾,𝑡𝑘 for some constant 𝜉 > 1/2. The pseudo code for D-UCB is reproduced in Algorithm 7. It can be noticed that the memory size is only related to the number of arms, so D-UCB requires small memory. To proceed the analysis, we review the concentration inequality for discounted empirical average, which is an extension of Chernoff-Hoeffding bound. Let 𝑡−1 1 Õ 𝑡−𝑠 𝑘 𝑀𝛾,𝑡 := 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 }𝜇 𝑠𝑘 . 𝑛𝛾,𝑡 𝑠=1 Then, the following fact is a corollary of [20, Theorem 18]. Fact 5.2 (A Hoeffding-type inequality for discounted empirical average with a random number of  q  summands). For any 𝑡 ∈ T and for any 𝑘 ∈ K, the probability of event 𝐴 = 𝜇ˆ 𝛾,𝑡 − 𝑀𝛾,𝑡 ≥ 𝛿/ 𝑛𝛾,𝑡 𝑘 𝑘 𝑘 is no greater than     2 2 log1+𝜆 (𝜏) exp −2𝛿 1 − 𝜆 /16 (5.13)  q  for any 𝛿 > 0 and 𝜆 > 0. The probability of event 𝐵 = 𝜇ˆ 𝛾,𝑡 − 𝑀𝛾,𝑡 ≤ −𝛿/ 𝑛𝛾,𝑡 𝑘 𝑘 𝑘 is also upper bounded by (5.13). 66 Theorem 5.5. For the nonstationary MAB problem with 𝐾 arms, time horizon 𝑇, variation budget 1 2 𝑉𝑇 > 0, and 𝛾 = 1 − 𝐾 − 3 (𝑇/𝑉𝑇 ) − 3 , if 𝜉 > 1/2, the worst case regret of D-UCB satisfies 1 2 sup 𝑅𝑇D-UCB ≤ 𝐶 ln(𝑇)(𝐾𝑉𝑇 ) 3 𝑇 3 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Proof. We establish the theorem in four steps. 𝑘 − 𝑀 𝑘 at some time slot 𝑡 ∈ T . 
Let 𝜏0 = log (1 − 𝛾)𝜉 ln(𝜏)/𝑏 2  Step 1: In this step, we analyze 𝜇 𝛾,𝑡 𝛾,𝑡 𝑖 𝛾 and take 𝑡 − 𝜏0 as a dividing point, then we obtain 𝑡−1 1 Õ 𝑡−𝑠 𝜇𝜏𝑘𝑖 − 𝑀𝛾,𝑡 𝑘 ≤ 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 𝑛𝛾,𝑡 𝑠=1 1 Õ 𝑡−𝑠 ≤ 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 (5.14) 𝑛𝛾,𝑡 𝑠≤𝑡−𝜏 0 𝑡−1 1 Õ 𝑡−𝑠 + 𝑘 𝛾 1{𝜑 𝑠 = 𝑘 } 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 . (5.15) 𝑛𝛾,𝑡 𝑠≥𝑡−𝜏 0 Since 𝜇𝑡𝑘 ∈ [𝑎, 𝑎 + 𝑏] for all 𝑡 ∈ T , we have (5.14) ≤ 𝑏. Also, 0 1 Õ 𝑡−𝑠 𝑏𝛾 𝜏 𝜉 ln(𝜏) (5.14) ≤ 𝑘 𝑏𝛾 ≤ = . 𝑛𝛾,𝑡 𝑠≤𝑡−𝜏 0 (1 − 𝛾)𝑛𝛾,𝑡 𝑘 𝑘 𝑏𝑛𝛾,𝑡 Accordingly, we get ! s 𝜉 ln(𝜏) 𝜉 ln(𝜏) (5.14) ≤ min 𝑏, 𝑘 ≤ 𝑘 . 𝑏𝑛𝛾,𝑡 𝑛𝛾,𝑡 Furthermore, for any 𝑡 ∈ T𝑖 , Õ 𝑖 (5.15) ≤ max 𝜇𝜏𝑘𝑖 − 𝜇 𝑠𝑘 ≤ 𝑣𝑗, 𝑠∈[𝑡−𝜏 0 ,𝑡−1] 𝑗=𝑖−𝑛 0 where 𝑛0 = d𝜏0/𝜏e and 𝑣 𝑗 is the variation within T𝑗 . So we conclude that for any 𝑡 ∈ T𝑖 , Õ 𝑖 𝑘 𝜇 𝜅𝑘𝑖 − 𝑘 𝑀𝛾,𝑡 ≤ 𝑐 𝛾,𝑡 + 𝑣𝑗, ∀𝑘 ∈ K. (5.16) 𝑗=𝑖−𝑛 0 Step 2: Within partition T𝑖 , let Õ 𝑖 Δ̂𝑖𝑘 = 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 −2 𝑣𝑗, 𝑗=𝑖−𝑛 0 67 and define a subset of bad arms as   0 B𝑖D = 𝑘∈K| Δ̂𝑖𝑘 ≥𝜖 , p where we select 𝜖 0 = 4 𝜉𝛾 1−𝜏 𝐾 ln(𝜏)/𝜏. Since 𝜇𝑡𝑘 − 𝜇𝜏𝑘𝑖 ≤ 𝑣 𝑖 for any 𝑡 ∈ T𝑖 and for any 𝑘 ∈ K Õ Õ𝑁 Õ 𝜇𝑡∗ 𝜑 𝜑 − 𝜇𝑡 𝑡 ≤ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑖𝑡 + 𝑣 𝑖 𝑡∈T 𝑖=1 𝑡∈T𝑖 Õ 𝑁 Õ n o Õ𝑖  D 𝜑𝑡 0 ≤𝜏𝑉𝑇 + 1 𝜑𝑡 ∈ B𝑖 Δ̂𝑖 + 2 𝑣𝑗 + 𝜖 𝑖=1 𝑡∈T𝑖 𝑗=𝑖−𝑛 0 Õ 𝑁 Õ Õ  0 0 ≤(2𝑛 + 3)𝜏𝑉𝑇 + 𝑁𝜖 𝜏+ Δ̂𝑖𝑘 1 𝜑𝑡 = 𝑘 . (5.17) 𝑖=1 𝑘∈B𝑖D 𝑡∈T𝑖  Í   Step 3: In this step, we bound E Δ̂𝑖𝑘 𝑡∈T𝑖 1 𝜑𝑡 = 𝑘 for an arm 𝑘 ∈ B𝑖D . Let 𝑡𝑖𝑘 (𝑙) be the 𝑙-th 𝜑 time slot arm 𝑘 is selected within T𝑖 . From arm selection policy, we get 𝑔𝑡 𝑡 ≥ 𝑔𝑡𝜅𝑖 , which result in Õ  Õ n o 1 𝜑𝑡 = 𝑘 ≤ 𝑙𝑖𝑘 + 1 𝑔𝑡𝑘 ≥ 𝑔𝑡𝜅𝑖 , 𝑡 > 𝑡𝑖𝑘 (𝑙𝑖𝑘 ) , (5.18) 𝑡∈T𝑖 𝑡∈T𝑖 l  2m where we pick 𝑙𝑖 = 16𝜉𝛾 ln(𝜏)/ Δ̂𝑖 . Note that 𝑔𝑡𝑘 ≥ 𝑔𝑡𝜅𝑖 is true means at least one of the 𝑘 1−𝜏 𝑘 followings holds, 𝑘 𝑘 𝑘 𝜇ˆ 𝛾,𝑡 ≥ 𝑀𝛾,𝑡 + 𝑐 𝛾,𝑡 , (5.19) 𝜅𝑖 𝜅𝑖 𝜅𝑖 𝜇ˆ 𝛾,𝑡 ≤ 𝑀𝛾,𝑡 − 𝑐 𝛾,𝑡 , (5.20) 𝜅𝑖 𝜅𝑖 𝑘 𝑘 𝑀𝛾,𝑡 + 𝑐 𝛾,𝑡 < 𝑀𝛾,𝑡 + 3𝑐 𝛾,𝑡 . (5.21) For any 𝑡 ∈ T𝑖 , since every sample before 𝑡 within T𝑖 has a weight greater than 𝛾 𝜏−1 , if 𝑡 > 𝑡𝑖𝑘 (𝑙𝑖𝑘 ), s s 𝑘 𝜉 ln(𝜏) 𝜉 ln(𝜏) Δ̂𝑖𝑘 𝑐 𝛾,𝑡 = 𝑘 ≤ ≤ . 𝑛𝛾,𝑡 𝛾 𝜏−1 𝑙𝑖𝑘 4 Combining it with (5.16) yields Õ 𝑖 𝜅𝑖 𝑘 𝜅𝑖 𝑀𝛾,𝑡 − 𝑀𝛾,𝑡 ≥ 𝜇𝜏𝜅𝑖𝑖 − 𝜇𝜏𝑘𝑖 − 𝑐 𝛾,𝑡 − 𝑐 𝛾,𝑡𝑘 −2 𝑣𝑗 𝑗=𝑖−𝑛 0 𝜅𝑖 𝜅𝑖 ≥ Δ̂𝑖𝑘 − 𝑐 𝛾,𝑡 𝑘 − 𝑐 𝛾,𝑡 ≥ 3𝑐 𝛾,𝑡 𝑘 − 𝑐 𝛾,𝑡 , 68 p which indicates (5.21) is false. As 𝜉 > 1/2, we select 𝜆 = 4 1 − 1/(2𝜉) and apply Fact 5.2 to get     −2𝜉 (1−𝜆2 /16) log1+𝜆 (𝜏) P((5.19) is true) ≤ log1+𝜆 (𝜏) 𝜏 ≤ . 𝜏 The probability of (5.20) to be true shares the same bound. Then, it follows from (5.18) that  Í   E Δ̂𝑖𝑘 𝑡∈T𝑖 1 𝜑𝑡 = 𝑘 is upper bounded by Õ Δ̂𝑖𝑘 𝑙𝑖𝑘 + Δ̂𝑖𝑘 P ((5.19) or (5.20) is true) 𝑡∈T𝑖 16𝜉𝛾 1−𝜏 ln(𝜏)   ≤ + Δ̂𝑖𝑘 + 2Δ̂𝑖𝑘 log1+𝜆 (𝜏) Δ̂𝑖𝑘 16𝜉𝛾 1−𝜏 ln(𝜏)   ≤ 0 + 𝑏 + 2𝑏 log1+𝜆 (𝜏) , (5.22) 𝜖 where we use 𝜖 0 ≤ Δ̂𝑖𝑘 ≤ 𝑏 in the last step. Step 4: From (5.17) and (5.22), and plugging in the value of 𝜖 0, an easy computation results in q 0 𝑅𝑇D-UCB ≤(2𝑛 + 3)𝜏𝑉𝑇 + 8𝑁 𝜉𝛾 1−𝜏 𝐾𝜏 ln(𝜏) + 2𝑁 𝑏 + 2𝑁 𝑏 log1+𝜆 (𝜏) , where the dominating term is (2𝑛0 + 3)𝜏𝑉𝑇 . Considering 2  2  ln (1 − 𝛾)𝜉 ln(𝜏)/𝑏 − ln (1 − 𝛾)𝜉 ln(𝜏)/𝑏 𝜏0 = ≤ , ln 𝛾 1−𝛾 we get 𝑛0 ≤ 𝐶 0 ln(𝑇) for some constant 𝐶 0. Hence there exists some absolute constant 𝐶 such that 1 2 𝑅𝑇D-UCB ≤ 𝐶 ln(𝑇)(𝐾𝑉𝑇 ) 3 𝑇 3 .  Although the discount factor method requires less memory, there exists an extra factor ln(𝑇) in the upper bound on the worst-case regret for D-UCB comparing with the minimax regret. This is due to the fact that the discount factor method does not entirely cut off outdated sampling history like periodic resetting or sliding window techniques. 
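To illustrate why D-UCB needs only O(K) memory, the following sketch maintains the discounted counts and means recursively, as in Algorithm 7. The initialization is simplified to a single pull of each arm, reward_fn is a user-supplied reward oracle, and the choice ξ = 0.6 is merely an example of a constant exceeding 1/2.

import numpy as np

def d_ucb(reward_fn, n_arms, horizon, variation_budget, xi=0.6):
    # Discounted UCB (Algorithm 7): only the K discounted counts and means are stored.
    gamma = 1.0 - n_arms ** (-1 / 3) * (horizon / variation_budget) ** (-2 / 3)
    tau = 1.0 / (1.0 - gamma)                  # effective window, sets the ln(tau) term
    n = np.zeros(n_arms)                       # discounted counts n^k_{gamma,t}
    mu = np.zeros(n_arms)                      # discounted empirical means
    choices = []
    for t in range(horizon):
        if t < n_arms:                         # simplified initialization: one pull per arm
            arm = t
        else:
            arm = int(np.argmax(mu + 2.0 * np.sqrt(xi * np.log(tau) / n)))
        x = reward_fn(arm, t)
        n *= gamma                             # discount every arm's count
        n[arm] += 1.0
        mu[arm] += (x - mu[arm]) / n[arm]      # incremental discounted-mean update
        choices.append(arm)
    return choices

The means of the unplayed arms need no update because discounting scales the numerator and denominator of the discounted empirical average by the same factor 𝛾, so only the selected arm's mean changes.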
69 5.3 UCB Policies for Heavy-tailed Nonstationary Stochastic MAB Problems In this section, we propose and analyze UCB algorithms for the non-stationary stochastic MAB problem with heavy-tailed rewards defined in Assumption 5.2. For the stationary heavy-tailed MAB problem, we have shown Robust MOSS in chapter 2 achieve order optimal worst-case regret. We extend it to the nonstationary setting and design resetting robust MOSS algorithm and sliding-window robust MOSS algorithm. 5.3.1 Resetting robust MOSS for the non-stationary heavy-tailed MAB problem Like R-MOSS, Resetting Robust MOSS (R-RMOSS) restarts Robust MOSS after every 𝜏 time slots. For a stationary heavy-tailed MAB problem, it has been shown in theorem 2.10 that the √ worst-case regret of Robust MOSS belongs to O ( 𝐾𝑇). This result along with an analysis similar to the analysis for R-MOSS in Theorem 5.2 yield the following theorem for R-RMOSS. For brevity, we skip the proof. Theorem 5.6. For the nonstationary heavy-tailed MAB problem with 𝐾 arms, horizon 𝑇, variation l 1 2m budget 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁, the worst-case regret of R-RMOSS satisfies 1 2 sup 𝑅𝑇R-RMOSS ∈ O ((𝐾𝑉𝑇 ) 3 𝑇 3 ). F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) 5.3.2 SW-RMOSS for the non-stationary heavy-tailed MAB problem In Sliding-Window Robust MOSS (SW-RMOSS), 𝑛 𝑘 (𝑡) and 𝜇¯ 𝑛 𝑘 (𝑡) are computed from the sampling q  history within W𝑡 , and 𝑐 𝑛 𝑘 (𝑡) = ln+ 𝐾𝑛𝜏𝑘 (𝑡) /𝑛 𝑘 (𝑡). To analyze SW-RMOSS, we want to establish a similar property as Lemma 5.3 to bound the probability about an arm being under or over estimated. Toward this end, we need the following properties for truncated random variable. 70 Lemma 5.7. Let 𝑋 be a random variable with expected value 𝜇 and E [𝑋 2 ] ≤ 1. Let 𝑑 := sat(𝑋, 𝐵) − E [sat(𝑋, 𝐵)]. Then for any 𝐵 > 0, it satisfies (i) |𝑑| ≤ 2𝐵 (ii) E [𝑑 2 ] ≤ 1 (iii) E [sat(𝑋, 𝐵)] − 𝜇 ≤ 1/𝐵. Proof. Property (i) follows immediately from definition of 𝑑 and property (ii) follows from     E [𝑑 2 ] ≤ E sat2 (𝑋, 𝐵) ≤ E 𝑋 2 . To see property (iii), since     𝜇 = E 𝑋 1 |𝑋 | ≤ 𝐵 + 1 |𝑋 | > 𝐵 , one have " # h   i h  i 𝑋2 E [sat(𝑋, 𝐵)] − 𝜇 ≤ E |𝑋 | − 𝐵 1 |𝑋 | > 𝐵 ≤ E |𝑋 | 1 |𝑋 | > 𝐵 ≤ E . 𝐵  Moreover, we will also use a maximal Bennett type inequality as shown in the following. Lemma 5.8 (Maximal Bennett’s inequality [75]). Let {𝑋𝑖 }𝑖∈{1,...,𝑛} be a sequence of bounded random variables with support [−𝐵, 𝐵], where 𝐵 ≥ 0. Suppose that E [𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] = 𝜇𝑖 and Í𝑚 Var[𝑋𝑖 |𝑋1 , . . . , 𝑋𝑖−1 ] ≤ 𝑣. Let 𝑆 𝑚 = 𝑖=1 (𝑋𝑖 − 𝜇𝑖 ) for any 𝑚 ∈ {1, . . . , 𝑛}. Then, for any 𝛿 ≥ 0  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≥ 𝛿 ≤ exp − 𝜓 , 𝐵 𝑛𝑣  !  𝛿 𝐵𝛿 P ∃𝑚 ∈ {1, . . . , 𝑛} : 𝑆 𝑚 ≤ −𝛿 ≤ exp − 𝜓 . 𝐵 𝑛𝑣 Now, we are ready to establish a concentration property for saturated sliding window empirical mean. Lemma 5.9. For any arm 𝑘 ∈ {1, . . . , 𝐾 } and any 𝑡 ∈ {𝐾 + 1, . . . , 𝑇 }, if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁,   the probability of either event 𝐴 = 𝑔𝑡𝑘 ≤ 𝑀𝑡𝑘 − 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 or event 𝐵 = 𝑔𝑡𝑘 − 2𝑐 𝑛 𝑘 (𝑡) ≥ 𝑀𝑡𝑘 + 𝑥, 𝑛 𝑘 (𝑡) ≥ 𝑙 , for any 𝑥 > 0 and any 𝑙 ≥ 1, is no greater than 2𝑎 𝐾 p  p  (𝛽𝑥 ℎ(𝑙)/𝑎 + 1) exp −𝛽𝑥 ℎ(𝑙)/𝑎 , 𝛽2 ln(𝑎) 𝜏𝑥 2  where 𝛽 = 𝜓 2𝜁/𝑎 /(2𝑎). 71 Proof. Recall that 𝑢𝑖𝑘𝑡 is the 𝑖-th time slot when arm 𝑘 is selected within W𝑡 . Since 𝑐 𝑚 is a monotonically decreasing in 𝑚, 1/𝐵𝑚 = 𝑐 ℎ(𝑚) ≤ 𝑐 𝑚 due to ℎ(𝑚) ≥ 𝑚. Then, it follows from property (iii) in Lemma 5.7 that  Õ 𝑚 𝜇 𝑘 𝑘𝑡  𝑢 P( 𝐴) ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : 𝜇¯ 𝑚𝑘 ≤ 𝑖 − (1 + 𝜁)𝑐 𝑚 − 𝑥 𝑖=1 𝑚  𝑚 ¯𝑘𝑡  Õ 𝑑𝑖𝑚 1 ≤ P ∃𝑚 ∈ {𝑙, . . . , 𝜏} : ≤ − (1 + 𝜁)𝑐 𝑚 − 𝑥 𝑖=1 𝑚 𝐵 𝑚  𝑚  1 Õ ¯𝑘𝑡 ≤ P ∃𝑚 ∈ {𝑙, . . . 
, 𝜏} : 𝑑 ≤ −𝑥 − 𝜁 𝑐 𝑚 , (5.23) 𝑚 𝑖=1 𝑖𝑚    where 𝑑¯𝑖𝑚 𝑘𝑡 = sat 𝑋 𝑘 , 𝐵 𝑘𝑡 𝑚 − E sat 𝑋 𝑘𝑘𝑡 , 𝐵𝑚 . Recall we select 𝑎 > 1. Again, we apply a peeling 𝑢𝑖 𝑢𝑖   argument with geometric grid 𝑎 𝑠 ≤ 𝑚 < 𝑎 𝑠+1 over time interval {𝑙, . . . , 𝜏}. Let 𝑠0 = log𝑎 (𝑙) . Since 𝑐 𝑚 is monotonically decreasing with 𝑚, we continue from (5.23) to get Õ Õ 𝑚  𝑠 𝑠+1 𝑘𝑡 𝑠  P( 𝐴) ≤ P ∃𝑚 ∈ [𝑎 , 𝑎 ): 𝑑¯𝑖𝑚 ≤ −𝑎 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 . (5.24) 𝑠≥𝑠0 𝑖=1   For all 𝑚 ∈ [𝑎 𝑠 , 𝑎 𝑠+1 ), since 𝐵𝑚 = 𝐵𝑎 𝑠 , from Lemma 5.7 we know 𝑑¯𝑖𝑚 𝑘𝑡 ≤ 2𝐵 𝑠 and Var 𝑑¯𝑘𝑡 ≤ 1. 𝑎 𝑖𝑚 Continuing from (5.24), we apply Maximal Bennett’s inequality in Lemma 2.7 to get   ! Õ 𝑎 𝑠 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 2𝐵𝑎 𝑠  P( 𝐴) ≤ exp − 𝜓 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 𝑠≥𝑠 2𝐵𝑎 𝑠 𝑎 0  since 𝜓(𝑥) is monotonically increasing   ! Õ 𝑎 𝑠 𝑥 + 𝜁 𝑐 𝑎 𝑠+1 2𝜁 ≤ exp − 𝜓 𝐵𝑎 𝑠 𝑐 𝑎 𝑠+1 𝑠≥𝑠 2𝐵 𝑎 𝑠 𝑎 0 (substituting 𝑐 𝑎 𝑠+1 , 𝐵𝑎 𝑠 and using ℎ(𝑎 𝑠 ) = 𝑎 𝑠+1 )   ! Õ 𝑥 𝜓 2𝜁/𝑎 = exp −𝑎 𝑠 + 𝜁 𝑐2𝑎 𝑠 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0  since 𝜁𝜓(2𝜁/𝑎) ≥ 2𝑎 ! 𝐾 Õ 𝑠 𝑥 𝜓 2𝜁/𝑎 ≤ 𝑎 exp −𝑎 𝑠 . 𝜏 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0 72  Let 𝑏 = 𝑥𝜓 2𝜁/𝑎 /(2𝑎). Since ln+ (𝑥) ≥ 1 for all 𝑥 > 0, ! 𝐾 Õ 𝑠 𝑥 𝜓 2𝜁/𝑎 𝑎 exp −𝑎 𝑠 𝜏 𝑠≥𝑠 +1 𝐵𝑎 𝑠−1 2𝑎 0 𝐾 Õ 𝑠  √  ≤ 𝑎 exp −𝑏 𝑎 𝑠 𝜏 𝑠≥𝑠 +1 0 ∫ +∞ 𝐾 p  ≤ 𝑎 𝑦 exp − 𝑏 𝑎 𝑦−1 𝑑𝑦 𝜏 𝑠0 +1 ∫ +∞ 𝐾 √  = 𝑎 𝑎 𝑦 exp − 𝑏 𝑎 𝑦 𝑑𝑦 𝜏 𝑠0 ∫ +∞ 𝐾 2𝑎  √ = √ 𝑧 exp − 𝑧 𝑑𝑧 (where 𝑧 = 𝑏 𝑎𝑦) 𝜏 ln(𝑎)𝑏 2 𝑏 𝑎 𝑠0 𝐾 2𝑎 √ √ ≤ (𝑏 𝑎 𝑠0 + 1) exp(−𝑏 𝑎 𝑠0 ), 𝜏 ln(𝑎)𝑏 2 which concludes the proof.  With Lemma 5.9, the upper bound on the worst-case regret for SW-RMOSS in the nonstationary heavy-tailed MAB problem can be analyzed similarly as Theorem 5.4. Theorem 5.10. For the nonstationary heavy-tailed MAB problem with 𝐾 arms, time horizon 𝑇, l 1 2m variation budget 𝑉𝑇 > 0 and 𝜏 = 𝐾 3 𝑇/𝑉𝑇 3 , if 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁, the worst-case regret of SW-RMOSS satisfies 1 2 sup 𝑅𝑇SW-RMOSS ≤ 𝐶 (𝐾𝑉𝑇 ) 3 𝑇 3 . F𝑇K ∈E (𝑉𝑇 ,𝑇,𝐾) Sketch of the proof. The procedure is similar as the proof of Theorem 5.4. The key difference is due to the nuance between the concentration properties on mean estimator. Neglecting the leading constants, the probability upper bound in Lemma 5.3 has a factor exp(−𝑥 2 𝑙/𝜂) comparing with p  p  (𝛽𝑥 ℎ(𝑙)/𝑎 + 1) exp −𝛽𝑥 ℎ(𝑙)/𝑎 in Lemma 5.9. Since both factors are no greater than 1, by simply replacing 𝜂 with (1+ 𝜁) 2 and taking similar calculation in every step except inequality (5.12), comparable bounds that only differs in leading constants can be obtained. Applying Lemma 5.9, 73 we revise the computation of (5.12) as the following, Δ𝑘 Õ   P 𝑔𝑡𝑘𝑠 − 2𝑐 𝑛 𝑘 (𝑡 𝑠 ) > 𝑀𝑡𝑘𝑠 + 𝑖 4 𝑠≥𝑙𝑖𝑘 +1 r ! r ! Õ 0 𝛽Δ𝑖𝑘 ℎ(𝑙) 𝛽Δ𝑖𝑘 ℎ(𝑙) ≤ 𝐶 + 1 exp − 𝑘 4 𝑎 4 𝑎 𝑠≥𝑙𝑖 ∫ +∞ ! ! 𝛽Δ 𝑘r 𝛽Δ 𝑘r 𝑦 𝑦 ≤ 𝐶0 𝑖 + 1 exp − 𝑖 𝑑𝑦 𝑙𝑖 −1 𝑘 4 𝑎 4 𝑎  4 6𝑎 2𝑎 𝐾 4 ≤ 2 2 . (5.25) 𝛽 𝛽 ln(𝑎) 𝜏 Δ𝑖𝑘 2 where 𝐶 0 = 2𝑎𝐾 4/Δ𝑖𝑘 / 𝛽2 ln(𝑎)𝜏 .The second inequality is due to the fact that (𝑥 + 1) exp(−𝑥)  is monotonically decreasing in 𝑥 for 𝑥 ∈ [0, ∞) and ℎ(𝑙) > 𝑙. In the last inequality, we change the lower limits of the integration from 𝑙𝑖𝑘 − 1 to 0 since 𝑙𝑖𝑘 ≥ 1 and plug in the value of 𝐶 0. Comparing with (5.12), this upper bound only varies in constant multiplier. So is the worst-regret upper bound.  Remark 5.1. The benefit of the discount factor method is that it is memory-friendly. This advantage is lost if the truncated empirical mean is used. As 𝑛 𝑘 (𝑡) could both increase and decrease with time, the truncated point could both grow and decline, so all sampling history needs to be recorded. It remains an open problem how to effectively use the discount factor in a nonstationary heavy-tailed MAB problem. 
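To make the modification for heavy-tailed rewards concrete, the sketch below maintains the saturated (truncated) sliding-window statistics on which SW-RMOSS builds its index. Two simplifications are made for illustration: the window stores the last 𝜏 samples of a single arm rather than the samples falling in the last 𝜏 global time slots, and the truncation level is B_m = 1/c_m instead of 1/c_{h(m)} used in the Robust MOSS construction of Chapter 2. These simplifications and the class and method names are assumptions of this sketch, not the analyzed policy.

import numpy as np
from collections import deque

def sat(x, level):
    # Saturation operator sat(x, B): clip the sample to [-B, B].
    return float(np.clip(x, -level, level))

class SlidingRobustStats:
    # Saturated sliding-window statistics for one arm, as used by SW-RMOSS.
    def __init__(self, tau, n_arms, zeta=2.2):
        self.tau, self.K, self.zeta = tau, n_arms, zeta
        self.samples = deque(maxlen=tau)       # raw samples of this arm in the window

    def c(self, m):
        # Exploration width c_m = sqrt(ln+(tau / (K m)) / m), with ln+(x) = max(ln x, 1).
        return np.sqrt(max(np.log(self.tau / (self.K * m)), 1.0) / m)

    def add(self, x):
        self.samples.append(x)

    def robust_mean(self):
        # Truncated empirical mean: every sample is saturated at B_m = 1 / c_m.
        m = len(self.samples)
        level = 1.0 / self.c(m)
        return float(np.mean([sat(x, level) for x in self.samples]))

    def index(self):
        # Upper confidence index: truncated mean plus (1 + zeta) c_m.
        m = len(self.samples)
        return self.robust_mean() + (1.0 + self.zeta) * self.c(m)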
5.4 Numerical Experiments We complement the theoretical results in the previous section with two Monte-Carlo experiments. For the light-tailed setting, we compare R-MOSS, SW-MOSS, and D-UCB with other state-of-art policies. For the heavy-tailed setting, we test the robustness of R-RMOSS and SW-RMOSS against both heavy-tailed rewards and nonstationarity. Each result in this section is derived by running designated policies 500 times. And parameter selections for compared policies are strictly coherent with referred literature. 74 5.4.1 Bernoulli Nonstationay Stochastic MAB Experiment To evaluate the performance of different policies, we consider two nonstationary environments as shown in Figs. 5.1a and 5.1b, which both have 3 arms with nonstationary Bernoulli rewards. The success probability sequence at each arm is a Brownian motion in environment 1 and a sinusoidal function of time 𝑡 in environment 2. And the variation budget 𝑉𝑇 is 8.09 and 3 respectively. (a) Environment 1 (b) Environment 2 (c) Regrets for environment 1 (d) Regrets for environment 2 Figure 5.1: Comparison of different policies. The growths of regret in Figs. 5.1c and 5.1d show that UCB based policies (R-MOSS, SW- MOSS, and D-UCB) maintain their superior performance against adversarial bandit-based policies (Rexp3 and Exp3.S) for stochastic bandits even in nonstationary settings, especially for R-MOSS and SW-MOSS. Besides, DTS outperforms other policies when the best arm does not switch. While each switch of the best arm seems to incur larger regret accumulation for DTS, which results in 75 larger regret compared with SW-MOSS and R-MOSS. 5.4.2 Heavy-tailed Nonstationary Stochastic MAB Experiment Again we consider the 3-armed bandit problem with sinusoidal mean rewards. In particular, for each arm 𝑘 ∈ {1, 2, 3},  𝜇𝑡𝑘 = 0.3 sin 0.001𝜋𝑡 + 2𝑘 𝜋/3 , 𝑡 ∈ {1, . . . , 5000}. Thus, the variation budget is 3. Besides, mean reward is contaminated by additive sampling noise 𝜈, where |𝜈| is a generalized Pareto random variable and the sign of 𝜈 has equal probability to be “+" and “−". So the probability distribution for 𝑋𝑡𝑘 is − 𝜉1 −1 1 © 𝜉𝑥− 𝜇𝑡𝑘 ª 𝑓𝑡𝑘 (𝑥) = ­1 + ® for 𝑥 ∈ (−∞, +∞). 2𝜎 ­ 𝜎 ® « ¬ We select 𝜉 = 0.4 and 𝜎 = 0.23 such that Assumption 5.2 is satisfied. We select 𝑎 = 1.1 and 𝜁 = 2.2 for both R-RMOSS and SW-RMOSS such that condition 𝜓(2𝜁/𝑎) ≥ 2𝑎/𝜁 is met. (a) Regret (b) Histogram of 𝑅𝑇 Figure 5.2: Performances with heavy-tailed rewards. Figure. 5.2a show RMOSS based polices and slightly outperform MOSS-based polices in heavy-tailed settings. While by comparing the estimated histogram of 𝑅𝑇 for different policies in Figure. 5.2b, R-RMOSS and SW-RMOSS have a better consistency and a smaller possibility of a particular realization of the regret deviating significantly from the mean value. 76 5.5 Summary We studied the general nonstationary stochastic MAB problem with variation budget and provided three UCB based policies for the problem. Our analysis showed that the proposed policies enjoy the worst-case regret that is within a constant factor of the minimax regret lower bound. Besides, the sub-Gaussian assumption on reward distributions is relaxed to define the nonstationary heavy-tailed MAB problem. We show the order optimal worst-case regret can be maintained by extending the previous policies to robust versions. There are several possible avenues for future research. In this work, we relied on passive methods to balance the remembering-versus-forgetting tradeoff. 
The general idea is to keep taking in new information and removing outdated information. Parameter-free active approaches that adaptively detect and react to environment changes are promising alternatives and may result in better experimental performance. Also, extensions from the single decision-maker to distributed multiple decision-makers are of interest. Another possible direction is the nonstationary version of rested and restless bandits. 5.6 Bibliographic Remarks The adversarial MAB [19] is a paradigmatic nonstationary problem. In this model, the bounded reward sequence at each arm is arbitrary. The performance of a policy is evaluated using the weak regret, which is the difference in the cumulated reward of a policy compared with the best √ single action policy. A Ω( 𝐾𝑇) lower bound on the weak regret and a near-optimal policy Exp3 is also presented in [19]. While being able to capture the nonstationarity, the generality of the reward model in the adversarial MAB makes the investigation of globally optimal policies very challenging. The nonstationary stochastic MAB can be viewed as a compromise between the stationary stochastic MAB and the adversarial MAB. It maintains the stochastic nature of the reward sequence while allowing some degree of nonstationarity in reward distributions. Instead of the weak regret analyzed in adversarial MAB, a strong notion of regret defined with respect to the best arm at 77 each time step is studied in these problems. As a result, this problem can be studied from two perspectives by extending ideas from adversarial bandits or stochastic bandits. After formulating the nonstationary stochastic MAB problem in [21], the authors tune the Exp3.S policy for adversarial bandits [19] to achieve a near-optimal worst-case regret in their subsequent work [102]. Discounted Thomson Sampling (DTS) [103] has also been shown to have a good experimental performance within this general framework. However, we are not aware of any analytic regret bounds for the DTS algorithm. The variation budget idea has already been extended to more general problem settings such as nonstationary linear contextual bandits, and the ideas of using periodic resetting, discounting factor, and sliding observation windows have been shown to be applicable therein [104–106]. Nevertheless, to achieve exact order optimal worst-case regret remains unsolved for those generalized problem setups 78 CHAPTER 6 MULTI-TARGET SEARCH VIA MULTI-FIDELITY GAUSSIAN PROCESSES The robotic target search problems have a natural connection with MAB problems discussed in the previous section. In particular, the class of robotic search problems in which a robot team searches for a target from a set of view-points (arms), or monitors an environment from a set of viewpoints, maps directly to the MAB problems. In this chapter, we focus on a class of search problems involving the search of an unknown number of targets in a large or continuous space instead of a small number of viewpoints. We consider a scenario in which an autonomous vehicle equipped with a downward-facing camera operates in a 3D environment, and the task is to search for an unknown number of sta- tionary targets on the 2D floor of the environment. For such a problem, there exists an intrinsic fidelity-vs-coverage trade-off: sensing at a higher altitude provides more global but less accurate information compared with sensing at a lower altitude. 
To capture this phenomenon, we model the sensing information available at different altitudes from the floor using a multi-fidelity Gaussian process [12]. The key idea to address the fidelity-vs-coverage trade-off is to use the low-fidelity information to remove regions unlikely to contain targets. This enables the robot to quickly shift its focus to areas likely to contain targets, thus expediting the search process. This chapter is a slightly modified version of our published work on multi-target search with a multi-fidelity Gaussian process sensing model, and it is reproduced here with the permission of the copyright holder (©2020 IEEE; reprinted with permission from [107]). The proposed multi-target search strategy leverages information-theoretic techniques to efficiently explore the environment, and employs Bayesian techniques to accurately identify targets and construct an occupancy map. The target search accuracy and efficiency are established through theoretical analysis and verified by simulation results.

6.1 Multi-target Search Problem Description

We consider an autonomous vehicle that moves in a 3D environment, e.g., an aerial or an underwater vehicle. We assume that the vehicle either moves with unit speed or hovers at a location. The vehicle is tasked with searching for multiple targets on the 2D floor of the environment. Let 𝐷 ⊂ R² be the area of the floor in which the targets may be present. The vehicle is equipped with a fixed camera that points towards the floor. The vehicle travels across the environment and collects images/videos of the floor (samples) from different sampling points. These sampling points may be located at different altitudes relative to the floor of the environment. We assume that no sample is collected during the movement between sampling points, to avoid misleading low-quality sensing information. The collected samples are processed with a computer vision algorithm that outputs a score, which corresponds to the likelihood of a target being present, for each frame. An example of such a computer vision algorithm is the state-of-the-art deep neural network YOLOv3 [108]. The score is used to update the estimate of the sensing output, i.e., the estimated score function 𝑓 : 𝐷 → [0, 1], which in turn is used to determine the locations of the targets. The stochastic model for 𝑓 is introduced below.

6.1.1 Multi-fidelity Sensing Model

GPs are widely used models for spatially distributed sensing outputs. In [52], a GP is used to model the target detection output of a computer vision algorithm. While target presence is a binary event, computer vision algorithms such as YOLOv3 yield a score which is a function of the saliency and location of the target in the image. GPs are appropriate models for such score functions. So far in the literature, GPs have been used in the context of single-fidelity measurements. To characterize the inherent fidelity-coverage trade-off in sensing the floor scene by an autonomous vehicle operating in 3D space, we employ a novel multi-fidelity GP model. The two key physical sensing characteristics the model seeks to capture are: (i) there is some information that can only be accessed at lower altitudes, and (ii) the sensing outputs are more spatially correlated at higher altitudes, since the fields of view at neighboring locations overlap more. We assume that the vehicle can collect samples of the floor from 𝑀 possible heights above the floor, 𝑧_1 > 𝑧_2 > · · · > 𝑧_𝑀.
We refer to these heights as the fidelity level of the measurement, with 𝑀 (resp. 1) corresponding to the highest (resp. lowest) level of fidelity. Let the score function 𝑔𝑚 : 𝐷 → [0, 1] be defined by the output of the computer vision algorithm for an ideal noise-free image collected at fidelity level 𝑚 ∈ {1, . . . , 𝑀 } with the field of view of the camera centered at 𝒙 ∈ 𝐷. We assume that the score functions for a location 𝒙 obtained from different altitudes (fidelity levels) are related to each other in an autoregressive manner as follows 𝑔 𝑚 (𝒙) = 𝑎 𝑚−1 𝑔 𝑚−1 (𝒙) + 𝑏 𝑚 (𝒙), (6.1) where 𝑎 𝑚−1 is a scale parameter and 𝑏 𝑚 is the bias term that captures the information that can Î  𝑀−1 be only be accessed at fidelities levels greater than 𝑚. Let 𝑓 𝑚 (𝒙) = 𝑖=𝑚 𝑖 𝑔 (𝒙) and 𝑎 𝑚 Î  𝑀−1 ℎ𝑚 (𝒙) = 𝑖=𝑚 𝑖 𝑏 (𝒙). Then, equation (6.1) reduces to 𝑎 𝑚 𝑓 𝑚 (𝒙) = 𝑓 𝑚−1 (𝒙) + ℎ𝑚 (𝒙), (6.2) where 𝑓 0 (𝒙) = 0 and 𝑓 (𝒙) := 𝑓 𝑀 (𝒙) is the score function at the highest fidelity level which we treat as ground truth. We model the influence of systemic errors in sample collection and environmental uncertainty on the output of the computer vision algorithm for an input at fidelity level 𝑚 through an additive zero mean Gaussian random variable 𝜖 𝑚 with variance 𝑠2𝑚 , i.e., 𝜖 𝑚 ∼ 𝑁 (0, 𝑠2𝑚 ). Consequently, the (scaled) score obtained by collecting a sample at location 𝒙 is a random variable 𝑦 = 𝑓𝑚 (𝒙) + 𝜖 𝑚 . We assume that each ℎ𝑚 is a realization of a Gaussian process with a constant mean 𝜇𝑚 and a squared exponential kernel function 𝑘 𝑚 (𝒙, 𝒙 0) expressed as ! k𝒙 − 𝒙 0 k2 𝑘 𝑚 (𝒙, 𝒙 0) = 𝑣 2𝑚 exp − 2 , (6.3) 2𝑙 𝑚 where 𝑙 𝑚 is the length scale parameter, and 𝑣 𝑚 is the variance parameter that satisfies 𝑣 1 > 𝑣 2 > · · · > 𝑣 𝑀 . This kernel function describes the spatial correlation of score function at neighboring 81 locations at each fidelity level. Since the fields of view are more overlapped at lower fidelity levels, it results in 𝑙 1 > 𝑙2 > · · · > 𝑙 𝑀 . We make the following assumptions about the highest-fidelity sample. If the target is not in the field of view at (𝒙, 𝑧 𝑀 ), the mean score of the computer vision algorithm 𝑓 (𝒙) is smaller than a threshold th. If a target is at the center of image collected at (𝒙, 𝑧 𝑀 ), 𝑓 (𝒙) ≥ th + Δ, for some constant Δ > 0. Here, 1/Δ can be viewed as a measure of detection difficulty that depends both on the quality of the computer vision algorithm and the environment complexity. 6.1.2 Objective of the Multi-target Search Algorithm Our objective is to design an algorithm for sequentially determining sampling points that lead to expedited detection and localization of targets within desired misclassification rate 𝛿 ∈ (0, 1/2). In particular, the algorithm should estimate the region containing targets 𝐷 𝑡 ⊆ 𝐷 such that (i)   ∀𝒙 ∈ 𝐷 𝑡 : P 𝑓 (𝒙) < th ≤ 𝛿 and (ii) ∀𝒙 ∈ 𝐷 \ 𝐷 𝑡 : P 𝑓 (𝒙) ≥ th + Δ ≤ 𝛿. The requirements about both false alarm and mis-detection rate are set by above two conditions. Let 𝑡 (Δ, 𝛿) be the total (traveling and sampling) time to finish the search task with misclassi- fication rate smaller than 𝛿. Then, the objective of the algorithm is to determine the sequence of sampling points that minimize 𝑡 (Δ, 𝛿). 6.2 Expedited Multi-target Search Algorithm The proposed Expedited Multi-target Search (EMTS) algorithm is illustrated in Figure. 6.1. It operates using an epoch-based structure. In each epoch, the sampling and fidelity planner computes a set of sampling points and the path planner optimizes a TSP tour going through those points. 
The vehicle follows the TSP tour to collect measurements at sampling points and the inference algorithm uses these measurements to update the estimate of the score function 𝑓 . Then, the Bayesian classification uses these estimates to compute an occupancy map of the floor and the region elimination module removes regions with no target with sufficiently high probability from the search space. In the following, we describe each of these modules in detail. 82 Figure 6.1: Architecture of EMTS. 6.2.1 Inference Algorithm for Multi-fidelity GPs The Bayesian inference method for multi-fidelity GPs discussed in this section is an extension of the inference procedure in [12] for the case of no sampling noise. Let the set of sampling location-  score-fidelity tuples after 𝑛 observations be P𝑛 = { 𝒙 𝑖 , 𝑦𝑖 , 𝑚𝑖 | 𝑖 ∈ {1, . . . , 𝑛}}. For each fidelity 𝑚, define a subset of P𝑛 ,  𝑃𝑛𝑚 = { 𝒙 𝑖 , 𝑦𝑖 , 𝑚𝑖 ∈ P𝑛 | 𝑚𝑖 = 𝑚}, and |𝑃𝑛𝑚 | denote the cardinality of 𝑃𝑛𝑚 . Recall that 𝑘 𝑖 (𝒙, 𝒙 0) is the kernel function for the GP ℎ𝑖 at 𝑖-th 0 0 0 fidelity level. Let 𝑲 𝑖0 𝑃𝑛𝑚 , 𝑃𝑛𝑚 be a |𝑃𝑛𝑚 | × |𝑃𝑛𝑚 | matrix with entries 𝑘 𝑖 (𝒙, 𝒙 0), 𝒙 ∈ 𝑃𝑛𝑚 , 𝒙 0 ∈ 𝑃𝑛𝑚 and 𝑲 𝑖0 (𝑃𝑛𝑚 , 𝒙) be a |𝑃𝑛𝑚 | dimensional vector with entries 𝑘 0𝑖 (𝒙 0, 𝒙), 𝒙 0 ∈ 𝑃𝑛𝑚 . Let 𝑲 be a 𝑀 × 𝑀 block matrix with (𝑚, 𝑚0) block submatrix 0 Õ ) min(𝑚,𝑚 0  𝑲 𝑚,𝑚 0 = 𝑲 𝑖 𝑃𝑛(𝑚) , 𝑃𝑛(𝑚 ) . 𝑖=1 Let 𝒌 (𝒙) be a |P𝑛 | dimensional vector constructed by concatenating 𝑀 sub-vectors 𝒌 (𝒙) =  𝒌 1 (𝒙), . . . , 𝒌 𝑀 (𝒙) , where Õ 𝑚 𝑚 𝒌 (𝒙) = 𝑲 𝑖 (𝑃𝑛𝑚 , 𝒙), ∀𝑚 ∈ {1, . . . , 𝑀 }. (6.4) 𝑖=1 Denoted by 𝚯 is the 𝑀 × 𝑀 diagonal matrix with the variance of sampling noise at diagonal entries n o 𝚯 = diag 𝑠2𝑚 𝑰 |𝑃𝑛𝑚 | . 𝑚={1,...,𝑀 } 83  Let 𝝂 𝑛 = [𝜈1 , . . . , 𝜈𝑛 ] be the a priori mean of the sample 𝒚 𝑛 = 𝑦 1 , . . . , 𝑦 𝑛 . In particular, if Í𝑚 𝑦 𝑗 is a sample at fidelity 𝑚, then 𝜈 𝑗 = 𝑖=1 𝜇𝑖 . The a priori covariance of 𝒚 𝑛 is 𝑲 + 𝚯. In the training process with training dataset P𝑛 , the hyperparameters {𝜇𝑚 , 𝑣 𝑚 , 𝑙 𝑚 , 𝑠𝑚 } 𝑚=1 𝑀 and {𝑎 } 𝑀−1 𝑚 𝑚=1 in the multi-fidelity GP can be learned by maximizing a log marginal likelihood function 1   1 𝒚 − 𝝂 𝑛 (𝑲 + 𝚯) −1 𝒚 − 𝝂 𝑛 . 𝑇  − log det 2𝜋 (𝑲 + 𝚯) − 2 2 Such training can be performed using the GP toolbox [109]. Due to the multi-fidelity structure described in (6.1) and (6.2), the prior mean and covariance of 𝑓 are Õ𝑀 Õ𝑀 𝜇0 (𝒙) = 𝜇𝑚 , 𝑘 0 (𝒙, 𝒙 0) = 𝑘 𝑚 (𝒙, 𝒙 0). 𝑚=1 𝑚=1 When running EMTS with learned hyperparameters, it can be shown that the posterior mean and covariance functions of 𝑓 after 𝑛 measurements are 𝜇𝑛 (𝒙) = 𝜇0 (𝒙) + 𝒌 𝑇 (𝒙) (𝑲 + 𝚯) −1 𝒚 − 𝝂 𝑛  (6.5) 𝑘 𝑛 𝒙, 𝒙 0 = 𝑘 0 𝒙, 𝒙 0 − 𝒌 𝑇 (𝒙) (𝑲 + 𝚯) −1 𝒌 (𝒙 0).   Note that the posterior variance 𝜎𝑛2 (𝒙) = 𝑘 𝑛 (𝒙, 𝒙) is a measure of uncertainty that will be utilized to classify 𝒙. It should be noted that the measurements collected at different fidelity levels are appropriately scaled in inference (6.5). 6.2.2 Multi-fidelity Sampling & Path Planning For each epoch 𝑗, we seek to design an efficient sampling tour through sampling locations {(𝒙 𝑛 𝑗 +1 , 𝑧 𝑛 𝑗 +1 ), . . . , (𝒙 𝑛 𝑗+1 , 𝑧 𝑛 𝑗+1 )} to ensure  max 𝜎𝑛 𝑗+1 (𝒙) max 𝜎𝑛 𝑗 (𝒙) ≤ 𝛼, 𝒙∈𝐷 𝒙∈𝐷 where 𝑛 𝑗 is the number of samples collected before the beginning of the 𝑗-th epoch and the selection of uncertainty reduction threshold 𝛼 is discussed in Section 6.2.3. Notice that the posterior variance update in (6.5) depends only on the location of the observations 𝒚 𝑛 , but not on the realized value of 𝒚 𝑛 . Therefore, the sequence of sampling location-fidelity 84 tuples can be computed before physically visiting the locations. 
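A compact sketch of the two-fidelity special case of the update (6.5) is given below; it makes the preceding point explicit, since the posterior variance is returned even when no measurement vector is supplied. The hyperparameter values, the function names, and the restriction to M = 2 are illustrative assumptions, not the released implementation; X_train and X_test are arrays of planar locations and fid is an integer array of fidelity labels.

import numpy as np

def sq_exp(X1, X2, v, length):
    # Squared exponential kernel (6.3) with variance v^2 and length scale `length`.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return v ** 2 * np.exp(-d2 / (2.0 * length ** 2))

def mf_posterior(X_train, fid, X_test, y=None, v=(0.5, 0.3), length=(8.0, 3.0),
                 s=(0.05, 0.05), mean_h=(0.4, 0.0)):
    # Posterior of the highest-fidelity score f at X_test for M = 2 fidelity levels.
    # fid[i] is in {1, 2}; entry (i, j) of the prior sample covariance is the sum of
    # k_m(x_i, x_j) over m <= min(fid[i], fid[j]), as in the block matrix K of 6.2.1.
    n = len(X_train)
    K = np.zeros((n, n))
    k_vec = np.zeros((n, len(X_test)))
    for m in (1, 2):
        Km = sq_exp(X_train, X_train, v[m - 1], length[m - 1])
        K += (np.minimum(fid[:, None], fid[None, :]) >= m) * Km
        k_vec += (fid[:, None] >= m) * sq_exp(X_train, X_test, v[m - 1], length[m - 1])
    Theta = np.diag([s[f - 1] ** 2 for f in fid])
    A = np.linalg.solve(K + Theta, k_vec)                   # (K + Theta)^{-1} k(x)
    prior_var = sum(vi ** 2 for vi in v)
    var = prior_var - np.einsum('ij,ij->j', k_vec, A)       # needs no measurements
    if y is None:
        return None, var
    nu = np.array([sum(mean_h[:f]) for f in fid])           # prior means of the samples
    mean = sum(mean_h) + A.T @ (y - nu)
    return mean, var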
Such deterministic evolution of the variance has also been leveraged within the context of single-fidelity GP planning to design efficient sampling tours [110]. Sampling Point Selection. The vehicle follows a greedy sampling policy at each fidelity level, i.e., at each sampling round the vehicle selects the most uncertain point as the next sampling point 𝒙 𝑛 = arg max 𝜎𝑛−1 (𝒙). (6.6) 𝒙∈𝐷 In the information theoretic view [58], the greedy policy is near-optimal in terms of maximizing an appropriate measure of uncertainty reduction. Fidelity Selection. For each sampling point 𝒙 𝑛 , a fidelity level (or sampling altitude) needs to be assigned. We let the vehicle start at fidelity level 1 and successively visit all fidelity levels from the lowest to the highest. Since sampling 𝑓 𝑚 is not able to reduce the uncertainty about 𝑓 introduced by the subsequent bias terms ℎ𝑚+1 , . . . , ℎ 𝑀 , we define the inaccessible uncertainty at fidelity level Í𝑀 𝑚 as 𝜉𝑚 = 𝑖=𝑚+1 𝑣 𝑖2 . Accordingly, we define the accessible uncertainty about 𝑓 at fidelity level 𝑚 by 𝑟 𝑛𝑚 = max𝒙∈𝐷 𝜎𝑛2 (𝒙) − 𝜉𝑚 . The assigned fidelity level to sample point 𝒙 𝑛 is designed to change from fidelity 𝑚 to 𝑚 + 1 when 𝑟 𝑛𝑚 ≤ 𝑣 2𝑚+1 𝑙 𝑚+1 2 2 /𝑙 𝑚 . Notice that before the vehicle begins to sample at fidelity level 𝑚, 𝑟 𝑛𝑚 ≥ 𝑣 2𝑚 ≥ 𝑣 2𝑚+1 𝑙 𝑚+1 2 /𝑙 2 , where 𝑚 the second inequality is due to the assumption that 𝑣 𝑚 > 𝑣 𝑚+1 and 𝑙 𝑚 > 𝑙 𝑚+1 . This ensures that all fidelity levels are visited from the lowest to the highest successively. Path Planning. Since the order of sampling locations does not influence the eventual posterior mean and variance, the path going through the sampling location can be optimized by computing an approximate TSP tour using packages, such as Concorde [111]. Such a tour-based sampling policy allows for energy and time efficient operation of the vehicle. If all measurements within epoch 𝑗 are collected at the same fidelity level, the vehicle traverses the TSP tour TSP(𝒙 𝑛 𝑗 +1 , . . . , 𝒙 𝑛 𝑗+1 ) to collect measurements from sampling points and update posterior distribution of 𝑓 . Otherwise, a TSP tour is designed at each fidelity level. 85 6.2.3 Classification and Region Elimination The classification and elimination of regions follow a confidence-bound-based rule, which has been widely used in pure exploration multi-armed bandit algorithms [112] and robotic source seeking [113]. We extend these ideas to the case of multi-fidelity GP setting. Conditioned on P𝑛 , the distribution of 𝑓 (𝒙) is Gaussian with mean function 𝜇𝑛 (𝒙) and vari- ance 𝜎𝑛2 (𝒙). Let (𝐿 𝑛 (𝒙, 𝜀), 𝑈𝑛 (𝒙, 𝜀)) be the Bayesian confidence interval containing 𝑓 (𝒙) with probability greater than (1 − 2𝜀). Here, the lower confidence bound 𝐿 𝑛 and upper confidence bound 𝑈𝑛 are defined by 𝐿 𝑛 (𝒙, 𝜀) = 𝜇𝑛 (𝒙) − 𝑐(𝜀)𝜎𝑛 (𝒙) , 𝑈𝑛 (𝒙, 𝜀) = 𝜇𝑛 (𝒙) + 𝑐(𝜀)𝜎𝑛 (𝒙) , with q  𝑐(𝜀) = 2 ln 1/(2𝜀) . Given the desired maximum misclassification rate 𝛿, at the end of epoch 𝑗, a location 𝒙 is   classified as target, if 𝐿 𝑛 𝑗 𝒙, 𝛿/2 𝑗 ≥ th, and is added to 𝐷 𝑡 ; while it is classified as empty, if   𝑈𝑛 𝑗 𝒙, 𝛿/2 𝑗 < th, and is added to the set 𝐷 𝑒 . Note that the confidence parameter 𝜀 = 𝛿/2 𝑗 defining the lower and upper bounds is decreased exponentially with epochs, and we will show that it ensures a misclassification rate smaller than 𝛿. The locations in the set 𝐷 𝑒 are removed from sampling space 𝐷 at the end of each epoch. EMTS is terminated if max𝒙∈𝐷 2𝜎𝑛 𝑗 (𝒙) ≤ Δ/𝑐(𝛿/2 𝑗 ). 
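The two decision rules that drive an epoch, the fidelity-switch test and the confidence-bound classification, can be summarized by the following sketch. The list indexing convention (v[0] corresponding to fidelity level 1) and the function names are ours; the thresholds themselves follow the rules stated above.

import numpy as np

def conf_width(eps):
    # c(eps) = sqrt(2 ln(1 / (2 eps))) from the Bayesian confidence interval.
    return np.sqrt(2.0 * np.log(1.0 / (2.0 * eps)))

def next_fidelity(m, max_post_var, v, length, n_levels):
    # Switch from level m to m + 1 once the accessible uncertainty r_n^m is small.
    # v and length are 0-indexed lists, so v[0] and length[0] describe fidelity level 1.
    if m == n_levels:
        return m
    xi_m = sum(vi ** 2 for vi in v[m:])                     # inaccessible uncertainty
    r_m = max_post_var - xi_m                               # accessible uncertainty
    switch = r_m <= (v[m] ** 2) * (length[m] ** 2) / (length[m - 1] ** 2)
    return m + 1 if switch else m

def classify_epoch(mu_n, sigma_n, th, delta, epoch):
    # Label each candidate location: +1 target (D_t), -1 empty (D_e), 0 still uncertain.
    c = conf_width(delta / 2.0 ** epoch)                    # epsilon = delta / 2^j
    labels = np.zeros(len(mu_n), dtype=int)
    labels[mu_n - c * sigma_n >= th] = 1
    labels[mu_n + c * sigma_n < th] = -1
    return labels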
The selection of 𝛼 depends on the balance between the efficiency of the TSP path planer and region elimination. TSP path planer is more effective with smaller 𝛼 since each exploration tour includes more sample points. While region elimination favors bigger 𝛼 so that regions not likely to contain targets are removed more frequently. 6.3 An Illustrative Example In this section, we illustrate EMTS using the Unmanned Underwater Vehicle Simulator [114], which is a ROS package designed for Gazebo robot simulation environment. We integrate it with YOLOv3 [108] for image classification and Concorde solver [111] to compute TSP tours. We use 2 fidelity levels situated at 11m and 5m from the water floor, respectively. In Figure. 6.2, the left figure shows our simulation setup, where an underwater vehicle is equipped with a downward camera and a flashlight to facilitate the searching task in a dark underwater environment. The middle figure 86 Figure 6.2: Underwater victim search simulation setups. and right figure in Figure 6.2 are the detection results with YOLOv3 at a high fidelity level and a low fidelity level, respectively. There are 3 victims located at different unknown locations on a 40m × 40m water floor. At each sampling point, the vehicle takes 20 images and YOLOv3 returns an average score about the confidence level of the existence of victims in the view. The first three subplots of Figure. 6.3 show the classification of regions before each epoch, the sampling points selected by the greedy policy and the planned path. Classifications of the environment are represented by 3 colors: red means target exist, blue means no target, and green means uncertain. The dark green points and lines are the planned sampling locations and paths at the low fidelity level and red points and lines are sampling locations and paths at the high fidelity level. At the beginning of epoch 1, all regions are classified as uncertain. After each epoch, the region of targets is narrowed down. The search task is terminated after three epochs. Notice that the vehicle switches to the high fidelity level at epoch 2. The tours at low and high fidelity levels are plotted using two different colors. The vehicles do not sample in blue regions since they have been classified as empty. In the final result, the regions with target are successfully found. A video of the simulation is available online2. Figure. 6.4a shows the heat map of posterior variance for the whole region at the end of simulations. It reflects the nature of uncertainty reduction with EMTS, i.e., the posterior variance is low only at areas that likely contain a target. The regions classified as empty have larger posterior variance since they have been eliminated from sampling space in the early phase. This shows that 2 https://mediaspace.msu.edu/media/EMTS/1_phbul7ui 87 (a) Epoch 1 (b) Epoch 2 (c) Epoch 3 (d) Final result Figure 6.3: Simulation result of EMTS. EMTS is able to put more focus on areas likely to contain victims. The uncertainty reduction, i.e. the decrease in maximum posterior variance, for multi-fidelity greedy sampling and single- fidelity greedy sampling, are compared in Figure. 6.4b. It shows that greedy multi-fidelity sampling can reduce uncertainty much faster at the beginning stage, which will enable EMTS to eliminate unoccupied regions quickly, and hence, accelerate target search. (a) Final posterior variance (b) Convergence of 𝜎𝑛2 Figure 6.4: Uncertainty reduction results. 
88 6.4 Analysis of the EMTS Algorithms In this section, we analyze the modules of the EMTS algorithm and use these analyses to derive an upper bound on the expected detection time for the overall algorithm. 6.4.1 Analysis of the classification algorithm We first characterize the Bayesian confidence interval for 𝑓 (𝒙), and then use this result to establish that the EMTS algorithm ensures the desired classification accuracy.   Lemma 6.1 (Bayesian confidence interval). For 𝑓 (𝒙) | P𝑛 ∼ 𝑁 𝜇𝑛 (𝒙), 𝜎𝑛2 (𝒙) and 𝜀 ∈ (0, 1/2),   P 𝑓 (𝒙) ≤ 𝐿 𝑛 (𝒙, 𝜀) = P 𝑓 (𝒙) ≥ 𝑈𝑛 (𝒙, 𝜀) ≤ 𝜀.  q  Proof. To normalize 𝑓 (𝒙), let 𝑟 = 𝑓 (𝒙) − 𝜇(𝒙) /𝜎(𝒙) and 𝑐(𝜀) = 2 ln 1/(2𝜀) . Now 𝑟 ∼ 𝑁 (0, 1), and from tail-inequality for standard normal distribution [115] ! 1 𝑐2 P (𝑟 ≥ 𝑐) ≤ exp − = 𝜀, 2 2  which prove the P 𝑓 (𝒙) ≥ 𝑈𝑛 (𝒙, 𝜀) ≤ 𝜀. Similar result holds for lower confidence bound.  Theorem 6.2 (Misclassification Rate). For the classification strategy in the EMTS algorithm, a location 𝒙 ∈ 𝐷 is misclassified with probability at most equal to 𝛿. Proof. Consider a location 𝒙 such that 𝑓 (𝒙) ≤ th, i.e., the true classification of 𝒙 is empty. Since at the end of epoch 𝑗, the lower and upper confidence bounds used for classification employ 𝜀 = 𝛿/2 𝑗 , we apply a union bound to show the probability of classifying 𝒙 as a target satisfies Õ∞   Õ ∞   𝑗 P 𝐿 𝑛 𝑗 (𝒙, 𝛿/2 ) > th ≤ P 𝐿 𝑛 𝑗 (𝒙, 𝛿/2 𝑗 ) > 𝑓 (𝒙) . 𝑗=1 𝑗=1 Í∞ Then, it follows from Lemma 6.1 that the misclassification probability is no greater than 𝑗=1 𝛿/2 𝑗 = 𝛿. The case of location 𝒙 being occupied by a target follows similarly.  89 6.4.2 Analysis of the Sampling and Fidelity Planner We now analyze the information gain and uncertainty reduction properties for our sampling and fidelity planner. We first recall some results for the single fidelity planner and then extend them to the case of the multi-fidelity planner. Consider a single-fidelity GP 𝑓 that is sampled with additive Gaussian noise with variance 𝑠2 . Let 𝑋𝑛 be the set of first 𝑛 sampling points and let the vector of associated observations be 𝒚 𝑋𝑛 . It is shown in [44, Lemma 5.3] that the mutual information between 𝒚 𝑋𝑛 and 𝑓 is   1Õ 𝑛   𝐼 𝒚 𝑋𝑛 ; 𝑓 = log 1 + 𝑠−2 𝜎𝑖−1 2 (𝒙 𝑖 ) , (6.7) 2 𝑖=1 where 𝒇 𝑋𝑛 is the vector of 𝑓 (𝒙) calculated at points in 𝑋𝑛 . Let the maximal mutual information gain with 𝑛 samples be  𝛾𝑛 := max 𝐼 𝒚𝑍 ; 𝑓 . 𝑍 ∈𝐷:|𝑍 |=𝑛 Let 𝐼greedy be the total mutual information gain using a greedy policy that maximizes the summand   in (6.7) at each sampling step. It follows, due to submodularity [116] of 𝐼 𝒚 𝑋𝑛 ; 𝑓 , that     1 1− 𝛾𝑛 ≤ 𝐼greedy 𝒚 𝑋𝑛 ; 𝑓 ≤ 𝛾𝑛 , 𝑒 While giving an exact value of 𝛾𝑛 is difficult, an upper bound on 𝛾𝑛 for squared exponential kernel derived in [44] is presented in the following Lemma 6.3. Lemma 6.3 (Information gain for squared exp. kernel). Let a GP 𝑓 be defined on domain 𝐷 ⊂ R2 . If 𝑓 has squared exponential kernel with length scale 𝑙, then the maximum mutual information satisfies 𝛾𝑛 (𝑙) ∈ 𝑂 (𝑙 −2 (log 𝑛) 3 ). Proof. For a GP defined on 𝐷 ∈ [0, 1] 2 with squared exponential kernel function 𝑘 (𝒙, 𝒙 0) = exp(−k𝒙 − 𝒙 0 k 2 /2), 𝛾𝑛 ∈ 𝑂 ((log 𝑛) 3 ) [44]. It is shown in [117] that 𝛾𝑛 scales with the area of 𝐷.   Thus, if the diameter of 𝐷 is 𝑑, then 𝛾𝑛 ∈ 𝑂 𝑑 2 (log 𝑛) 3 . Note that having length scale 𝑙 in kernel   function is equivalent to scale 𝐷 by 1/𝑙. Accordingly, 𝛾𝑛 ∈ 𝑂 𝑑 2 𝑙 −2 (log 𝑛) 3 . For fixed 𝐷, we omit diameter 𝑑 from the order notation and write 𝛾𝑛 (𝑙) ∈ 𝑂 (𝑙 −2 (log 𝑛) 3 ).  90 Lemma 6.3 provides a bound on the mutual information gain at the first fidelity level. 
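Before moving to higher fidelity levels, the following small sketch shows how (6.7) is evaluated along a greedy maximum-variance sampling sequence, using the standard rank-one Gaussian conditioning update; the prior covariance matrix over a finite set of candidate points and the noise variance are assumed to be given.

import numpy as np

def greedy_information_gain(K_prior, noise_var, n_samples):
    # Evaluate (6.7) along a greedy maximum-variance sampling sequence over a finite
    # set of candidate points with prior covariance K_prior and noise variance s^2.
    K = K_prior.copy()
    info = 0.0
    for _ in range(n_samples):
        i = int(np.argmax(np.diag(K)))                      # most uncertain point
        var_i = K[i, i]
        info += 0.5 * np.log(1.0 + var_i / noise_var)
        k_i = K[:, i].copy()
        K -= np.outer(k_i, k_i) / (var_i + noise_var)       # rank-one conditioning update
    return info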
For higher fidelity levels, the Gaussian process is composed of the summation of independent GPs. We now establish that the information gained by sampling the sum of GPs is smaller than the information gained by sampling them independently. and then use this result to establish the bound on information gain for multi-fidelity GPs. Lemma 6.4 (Information gain for sum of GPs). Let ℎ1 ∼ 𝐺𝑃(𝜇1 (𝒙), 𝑘 1 (𝒙, 𝒙 0)) and ℎ2 ∼ 𝐺𝑃(𝜇2 (𝒙), 𝑘 2 (𝒙, 𝒙 0)) be independent GPs. Consider a measurement 𝑦 = ℎ1 (𝒙) + ℎ2 (𝒙) + 𝜖 at point 𝒙, where 𝜖 is additive measurement noise independent of ℎ1 and ℎ2 . Let 𝒚 𝑋 = 𝒉1,𝑋 + 𝒉2,𝑋 + 𝝐 be the vector of such measurements at sampling points in a set 𝑋, where 𝝐 is the vector of i.i.d. measurement noise. Then, 𝐼 (𝒚 𝑋 ; ℎ1 + ℎ2 ) ≤ 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 ) + 𝐼 (𝒉2,𝑋 + 𝝐; ℎ2 ). Proof. The data processing inequality [118, Theorem 2.8.1] indicates 𝐼 (𝒚 𝑋 ; ℎ1 + ℎ2 ) ≤ 𝐼 ( 𝒚 𝑋 ; ℎ1 , ℎ2 ) = 𝐼 ( 𝒚 𝑋 ; ℎ1 ) + 𝐼 ( 𝒚 𝑋 ; ℎ2 | ℎ1 ). Applying the data processing inequality again, we get 𝐼 ( 𝒚 𝑋 ; ℎ1 ) ≤ 𝐼 (𝒉1,𝑋 + 𝝐, 𝒉2,𝑋 ; ℎ1 ) = 𝐼 (𝒉2,𝑋 ; ℎ1 ) + 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 | 𝒉2,𝑋 ) = 𝐼 (𝒉1,𝑋 + 𝝐; ℎ1 ), where in the last step follows due to the independence of ℎ1 , ℎ2 and 𝝐. Similarly, it can be shown 𝐼 ( 𝒚 𝑋 ; ℎ2 | ℎ1 ) = 𝐼 (𝒉1,𝑋 + 𝒉2,𝑋 + 𝝐; ℎ2 | ℎ1 ) = 𝐼 (𝒉2,𝑋 + 𝝐; ℎ2 ). This establishes the lemma.  Let 𝛾𝑛𝑚 be the maximal mutual information gain at fidelity 𝑚. It follows from Lemma 6.4 Í𝑚 and the multi-fidelity GP model in (6.2) that 𝛾𝑛𝑚 ≤ 𝑖=1 𝛾𝑛 (𝑙𝑖 ). Combining this inequality with Lemma 6.3, we obtain the following result. 91 Corollary 6.5 (Information gain for multi-fidelity GPs). The maximal mutual information gain at fidelity 𝑚 satisfies Õ 𝑚  𝛾𝑛𝑚 ∈ 𝑂 𝑙𝑖−2 (log 𝑛) 3 . 𝑖=1 This corollary gives us an insight on the size of 𝛾𝑛𝑚 at different fidelity levels. It follows that 𝛾𝑛(𝑚) grows faster at higher fidelity levels. We now derive a bound on the posterior variance for the multi-fidelity GP in terms of the maximum mutual information gain. Lemma 6.6 (Uncertainty reduction for multi-fidelity GPs). Let 𝑓 ∼ 𝐺𝑃 𝜇0 (𝒙), 𝑘 0 (𝒙, 𝒙 0) and  𝜎02 (𝒙) ≤ 𝜎 2 , for each 𝒙 ∈ 𝐷. An additive sampling noise 𝜖 ∼ 𝑁 (0, 𝑠2 ) is incurred every time 𝑓 is accessed. Under the greedy sampling policy the posterior variance after 𝑛 sampling rounds satisfies 2𝜎 2 𝛾𝑛 max 𝜎𝑛2 (𝒙) ≤ −2  . 𝒙∈𝐷 log 1 + 𝑠 𝜎 𝑛 2 Proof. For any 𝒙 ∈ 𝐷, 𝜎𝑛2 (𝒙) is monotonically non-increasing in 𝑛. So we get max 𝜎𝑛2 (𝒙) = 𝜎𝑛2 (𝒙 𝑛+1 ) ≤ 𝜎𝑛−1 2 2 (𝒙 𝑛+1 ) ≤ 𝜎𝑛−1 (𝒙 𝑛 ), (6.8) 𝒙∈𝐷 where the second inequality is due to the fact 𝒙 𝑛 = arg max 𝒙∈𝐷 𝜎𝑛−1 2 (𝒙). Again since 𝒙 𝑛+1 = arg max 𝒙∈𝐷 𝜎𝑛2 (𝒙), inequality (6.8) also indicates that 𝜎𝑛−1 2 (𝒙 ) is monotonically non-increasing. 𝑛     Hence, from (6.7), log 1 + 𝑠−2 𝜎𝑛−1 2 (𝒙 ) ≤ 2𝐼  2 2 is 𝑛 greedy 𝒚 𝑋 ; 𝑓 /𝑛 ≤ 2𝛾𝑛 /𝑛. Since 𝑠 /log 1 + 𝑠 an increasing function on [0, ∞), 2 𝜎2  −2 2  𝜎𝑛−1 (𝒙 𝑛 ) ≤  log 1 + 𝑠 𝜎𝑛−1 (𝒙 𝑛 ) . log 1 + 𝑠−2 𝜎 2 Substituting (6.8) into it, we conclude that 2𝜎 2 𝛾𝑛 max 𝜎𝑛2 (𝒙) ≤ 2 𝜎𝑛−1 (𝒙 𝑛 ) ≤ −2  . 𝒙∈𝐷 log 1 + 𝑠 𝜎 𝑛 2  Lemma 6.6 indicates that the smaller and the more slowly growing 𝛾𝑛 is, the faster max𝒙∈𝐷 𝜎𝑛 (𝒙) converges. This result explains our idea of using a multi-fidelity model. 92 6.4.3 Analysis of Expected Detection Time We now derive an upper bound on the number of samples needed to classify a location using the EMTS algorithm and then use this result to compute the total sampling and travel time required for classification. Lemma 6.7 (Sample complexity for uncertainty reduction). 
In the autoregressive multi-fidelity model (6.3), if each ℎ (𝑚) has a squared exponential kernel, then 3! 𝜎02  𝜎0 min{𝑛 ∈ N | max 𝜎𝑛 (𝒙) ≤ Δ} ∈ 𝑂 2 ln . 𝒙∈𝐷 Δ Δ Proof. It follows from Lemma 6.6 that 𝑛 2𝜎02 ≤ . 𝛾𝑛 max𝒙∈𝐷 𝜎𝑛2 (𝒙) Since 𝑣 𝑚 , 𝑠𝑚 and 𝑙 𝑚 for all fidelity levels are finite, it follows from Corollary 6.5 that 𝛾𝑛 ∈ 𝑂 ((ln 𝑛) 3 ). Combining these results, the lemma follows by inspection.  Lemma 6.8 (Sample complexity for EMTS). For a given misclassification tolerance 𝛿, let 𝑛(𝒙, 𝛿) be the number of samples required to classify 𝒙 ∈ 𝐷. Then, the expected number of samples satisfies  3 E [𝑛(𝒙, 𝛿) | Δ(𝒙)] ∈ 𝑂 𝜑(Δ(𝒙), 𝛿) ln 𝜑(Δ(𝒙), 𝛿) , 𝜎02   3𝜎0 where Δ(𝒙) = 𝑓 (𝒙) − th and 𝜑(Δ(𝒙), 𝛿) = Δ2 (𝒙) ln 𝛿Δ(𝒙) .  𝑗+1 Proof. Since 𝛿 < 1/2, function 𝑐(𝛿/2 𝑗 ) 3/4 is monotonically decreasing for 𝑗 ≥ 2. We define  s     3𝜎 0 3𝜎0 ª 𝐽 = log4/3 ­ ® + 1. ©  2 ln  Δ(𝒙) 𝛿Δ(𝒙)   « ¬ It can be shown that the choice of 𝐽 ensures, for 𝑗 ≥ 𝐽,  𝑗+1  𝐽+1 𝑈 (𝒙, 𝛿/2 𝑗 ) − 𝐿(𝒙, 𝛿/2 𝑗 ) ≤2𝑐(𝛿/2 𝑗 ) 3/4 𝜎0 ≤ 2𝑐(𝛿/2𝐽 ) 3/4 𝜎0 v u u   u u q 3𝜎 t 𝛼 ln 𝛿Δ(𝒙) 2 ln 𝛿Δ(𝒙) u 3𝜎 Δ(𝒙) ≤ 3𝜎 < Δ(𝒙) (6.9) 2 ln 𝛿Δ(𝒙) 93 where 𝛼 = log4/3 2 and the second inequality is due to the fact ln(𝑥 ln(𝑥))/ln(𝑥) ≤ (1 + 𝑒)/𝑒. For a point 𝒙 at which 𝑐∗ (𝒙) = 1 and Δ(𝒙) > 0, based on (6.9), the number of sampling rounds to classify 𝒙 satisfies Õ∞ 𝑛(𝒙, 𝛿) ≤ 𝑛 𝐽 + 1𝐿(𝒙, 𝛿/2 𝑗 ) < th ≤ 𝑈 (𝒙, 𝛿/2 𝑗 ) 𝑗=𝐽+1 Õ∞ ≤ 𝑛𝐽 + 1𝐿(𝒙, 𝛿/2 𝑗 ) < th 𝑗=𝐽+1 Õ∞ ≤ 𝑛𝐽 + 1𝑈 (𝒙, 𝛿/2 𝑗 ) < 𝑓 (𝒙), 𝑗=𝐽+1 where 𝑛 𝐽 is the number of samples collected in the first 𝐽 epochs. Then the expected sampling rounds can be bounded as Õ∞   ¯ 𝛿) ≤ 𝑛 𝐽 + 𝑛(𝒙, P 𝐿 (𝒙, 𝛿/2 𝑗 ) ≥ th 𝑗=𝐽+1 Õ∞   𝑗 ≤ 𝑛𝐽 + P 𝐿 (𝒙, 𝛿/2 ) ≥ th 𝑗=𝐽+1 ∞ Õ 𝑛𝑗 ≤ 𝑛𝐽 + . 𝑗=1 2𝑗 From Lemma 6.7, we has 𝑛𝑖 ∈ 𝑂˜ ((16/9) 𝑗 ). Therefore ∞ Í 𝑗=1 𝑛 𝑗 /2 is finite. So we conclude 𝑗  3 ¯ 𝛿) ∈ 𝑂 𝜑(Δ(𝒙), 𝛿) ln 𝜑(Δ(𝒙), 𝛿) . 𝑛(𝒙,  Remark 6.1 (Comparison with sample complexity of multiarmed bandits). Notice that   1 E [𝑛(𝒙, 𝛿) | Δ(𝒙)] ∈ 𝑂˜ 2 Δ (𝒙) describes the complexity to of classification of 𝒙, i.e., for a point with 𝑓 (𝒙) close to th more time is needed. This term is similar to the sampling complexity [73] in a pure-exploration multi-armed bandit problem. This result is based on the assumption that GPs all have squared exponential kernel. For kernels characterizing less correlations, e.g. Matérn kernels, more sampling rounds are expected.  94 We now derive an upper bound on detection time for EMTS. Theorem 6.9 (Target search time for EMTS). For a given misclassification tolerance 𝛿 and detection difficulty measure 1/Δ, the target search time satisfies  3 𝑡 (Δ, 𝛿) ∈ 𝑂 𝑑 2 𝜑(Δ, 𝛿) ln 𝜑(Δ, 𝛿) . Proof. Since we assume unit sampling time, the total sampling time is in the same order as 𝑛(𝒙, 𝛿). Then we consider the traveling time spent in order to collected those samples. Since EMTS requires the vehicle to search from low fidelity level to high fidelity level, the total number of altitude switches is no greater than 𝑀 − 1. As presented in [119], for 𝑛 points in [0, 1] 2 , the length of the shortest TSP √ p  Tour < 0.984 2𝑛 + 11. Therefore, the expected traveling time belongs to 𝑂 𝑑 𝑛(𝒙, ¯ 𝛿) , where 𝑑 is the diameter of 𝐷. Thus, the expected traveling time belongs to 𝑜( 𝑛(𝒙, ¯ 𝛿)). Considering both  2 3 sampling and traveling time, we conclude 𝑡 (Δ, 𝛿) ∈ 𝑂 𝑑 𝜑(Δ, 𝛿) ln 𝜑(Δ, 𝛿) .  Theorem 6.9 illustrates the efficiency of the EMTS algorithm, we conjecture it to be near- optimal. 
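Up to the unspecified constants, the dominant term of this bound is easy to evaluate numerically, which helps visualize the dependence on Δ and 𝛿; the values of 𝜎_0 and the environment diameter below are placeholders.

import numpy as np

def phi(gap, delta, sigma0=1.0):
    # phi(Delta, delta) = (sigma0^2 / Delta^2) ln(3 sigma0 / (delta Delta)) from Lemma 6.8.
    return sigma0 ** 2 / gap ** 2 * np.log(3.0 * sigma0 / (delta * gap))

def detection_time_order(gap, delta, diam=1.0, sigma0=1.0):
    # Dominant term d^2 phi (ln phi)^3 of the Theorem 6.9 bound, constants dropped.
    p = phi(gap, delta, sigma0)
    return diam ** 2 * p * np.log(p) ** 3

for gap in (0.5, 0.2, 0.1):
    print(gap, detection_time_order(gap, delta=0.05))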
This upper bound has a natural implication that the target search time increases with the detection difficulty 1/Δ and with the desired classification accuracy 1 − 𝛿.

6.5 Summary

We studied the autonomous robotic search of an unknown number of targets located on the 2D floor of an unknown and uncertain 3D environment. The novelty of this work lies in using autoregressive multi-fidelity GPs [12, 117] to model the likelihood of the presence of a target at a location, which is computed by a computer vision algorithm using the sample collected at that location at a given altitude. The multi-fidelity GP sensing model captures the fact that a high-altitude (low-fidelity) sample provides more global but less accurate information compared with a low-altitude (high-fidelity) sample. We designed a multi-target search algorithm, EMTS, that leverages multi-fidelity GPs to capture the fidelity-coverage trade-off, information-theoretic techniques to efficiently explore the environment, and Bayesian techniques to accurately identify targets and construct an occupancy map. With rigorous analysis, we establish formal guarantees on the target detection accuracy and the expected detection time.

6.6 Bibliographic Remarks

Autonomous multi-target search requires the autonomous system to quickly and accurately locate multiple targets of interest in an unknown and uncertain environment. Examples include search and rescue missions, mineral exploration, and tracking natural phenomena. To improve the target search efficiency, the trajectory should be designed to balance the explore-exploit tension: the robot should spend more time at target locations while still learning where the targets are. There have been some efforts to address such explore-exploit tension within the context of informative path planning [9, 11, 36, 45–53]. Gaussian processes (GPs) are the most widely used models for capturing spatiotemporal sensing fields in robotics [42, 43]. Informative path planning using such models of the environment has been studied in [51, 53, 58, 120–122]. In a broader class of search problems, robot trajectories are designed to maximize the information collected along the waypoints while ensuring that the distance traveled is within a prescribed budget. Such informative path planning problems are studied in [54–58]. While GP-based approaches have been used extensively, most of them rely on single-fidelity measurements. Besides, most of these works focus on maximizing the reduction in uncertainty of the estimates instead of the efficiency of the target search. The multi-target search can also be viewed as a hot-spot identification problem in which, instead of the global maximum of the field, all locations with values greater than a threshold need to be identified. Such problems have been studied in the multi-armed bandit literature [123, 124]; however, we are not aware of any such studies in the GP setting. Furthermore, all these works focus on single-fidelity measurements, while we consider multiple fidelities of measurements. The multi-target search policy in this chapter can be viewed as a combination of informative path planning and MAB methods in an environment with multi-fidelity sensing information.

CHAPTER 7
ONLINE ESTIMATION AND COVERAGE CONTROL

An intuitive idea to extend the single-robot search policy to 𝑁 robots is to partition the environment into 𝑁 regions and allocate each robot to one partition. Ideally, the workload needs to be equitably distributed across all regions to maximize time efficiency.
Coverage control focuses on such equitable partitioning problems for a team of robots providing service to a large or continuous environment. The workload is typically referred to as the service demand in the coverage control literature. In the coverage problem [13], a demand function 𝜙 is defined over the environment, specifying the degree to which a robot is "needed" at each point. The team of agents aims to partition the environment and achieve a configuration that minimizes the coverage cost, defined by the sum of the 𝜙-weighted distances from every point in the environment to the nearest agent. Intuitively, more robots should concentrate in the regions with higher demand in order to reduce the coverage cost. This chapter is a slightly modified version of our published work on adaptive coverage control, and it is reproduced here with the permission of the copyright holder (©2021 IEEE; reprinted with permission from [125]). Unlike classic coverage control, which assumes the demand function 𝜙 to be known, we model it as a realization of a Gaussian process that can be learned by taking samples.

7.1 Online Estimation and Coverage Problem

We consider a team of 𝑁 agents tasked with providing coverage to a finite set of points in an environment represented by an undirected graph. The team is required to navigate within the graph to learn an unknown demand function while maintaining a near-optimal configuration. In this section, we present the preliminaries of the estimation and coverage problem.

7.1.1 Graph Representation of Environment

We consider a discrete environment modeled by an undirected graph 𝐺 = (𝑉, 𝐸), where the vertex set 𝑉 contains the finite set of points to be covered and the edge set 𝐸 ⊆ 𝑉 × 𝑉 is the collection of physically adjacent pairs of vertices that can be reached from each other without passing through other vertices. Let the weight map 𝑤 : 𝐸 → R_{>0} indicate the distance between connected vertices. We assume 𝐺 is connected. Following the standard definition of a weighted undirected graph, a path in 𝐺 is an ordered sequence of vertices such that there exists an edge between consecutive vertices. The distance between vertices 𝑣_𝑖 and 𝑣_𝑗 in 𝐺, denoted by 𝑑_𝐺(𝑣_𝑖, 𝑣_𝑗), is defined as the minimum over all paths between 𝑣_𝑖 and 𝑣_𝑗 of the sum of the edge weights along the path. Suppose there exists an unknown demand function 𝜙 : 𝑉 → R_{>0} that assigns a nonnegative weight to each vertex in 𝐺. Intuitively, 𝜙(𝑣_𝑖) could represent the intensity of a signal of interest, such as brightness or the volume of a sound. A robot at vertex 𝑣_𝑖 is capable of measuring 𝜙(𝑣_𝑖) by collecting a sample 𝑦 = 𝜙(𝑣_𝑖) + 𝜖, where 𝜖 ∼ N(0, 𝜎²) is additive zero-mean Gaussian noise.

7.1.2 Nonparametric Estimation

Let 𝝓 be the vector with 𝑖-th entry 𝜙(𝑣_𝑖), 𝑖 ∈ {1, . . . , |𝑉|}, where | · | denotes set cardinality. We assume a multivariate Gaussian prior for 𝝓 such that 𝝓 ∼ N(𝝁_0, 𝚲_0^{-1}), where 𝝁_0 is the mean vector and 𝚲_0 is the inverse covariance matrix. Let 𝑛_𝑖(𝑡) be the number of samples and 𝑠_𝑖(𝑡) the sum of the sampling results collected at 𝑣_𝑖 until time 𝑡. Then, the posterior distribution of 𝝓 at time 𝑡 is N(𝝁(𝑡), 𝚲^{-1}(𝑡)) [126, Chapter 10], where

\Lambda(t) = \Lambda_0 + \sum_{i=1}^{|V|} \frac{n_i(t)}{\sigma^2}\, \boldsymbol{e}_i \boldsymbol{e}_i^\top, \qquad \boldsymbol{\mu}(t) = \Lambda^{-1}(t) \left( \Lambda_0 \boldsymbol{\mu}_0 + \sum_{i=1}^{|V|} \frac{s_i(t)}{\sigma^2}\, \boldsymbol{e}_i \right). \tag{7.1}

Here, 𝒆_𝑖 is the standard unit vector whose 𝑖-th entry is 1.

7.1.3 Voronoi Partition and Coverage Problem

We define an 𝑁-partition of the graph 𝐺 as a collection 𝑃 = {𝑃_𝑖}_{𝑖=1}^𝑁 of 𝑁 nonempty subsets of 𝑉 such that ∪_{𝑖=1}^𝑁 𝑃_𝑖 = 𝑉 and 𝑃_𝑖 ∩ 𝑃_𝑗 = ∅ for any 𝑖 ≠ 𝑗.
7.1.3 Voronoi Partition and Coverage Problem
We define an 𝑁-partition of the graph 𝐺 as a collection 𝑃 = {𝑃𝑖}ᵢ₌₁ᴺ of 𝑁 nonempty subsets of 𝑉 such that ∪ᵢ₌₁ᴺ 𝑃𝑖 = 𝑉 and 𝑃𝑖 ∩ 𝑃𝑗 = ∅ for any 𝑖 ≠ 𝑗. 𝑃 is said to be connected if the subgraph induced by 𝑃𝑖, denoted by 𝐺[𝑃𝑖], is connected for each 𝑖 ∈ {1, . . . , 𝑁}. Here, 𝐺[𝑃𝑖] being an induced subgraph means that its vertex set is 𝑃𝑖 and its edge set contains all edges of 𝐺 whose end vertices both belong to 𝑃𝑖. The configuration of the robot team is a vector of 𝑁 vertices 𝜼 ∈ 𝑉ᴺ occupied by the robot team, where the 𝑖-th entry 𝜂𝑖 corresponds to the position of the 𝑖-th robot. The 𝑖-th robot is tasked to cover the vertices in 𝑃𝑖. Then, the coverage cost corresponding to configuration 𝜼 and connected 𝑁-partition 𝑃 is defined as
\[
\mathcal{H}(\boldsymbol{\eta}, P) = \sum_{i=1}^{N} \sum_{v' \in P_i} d_{G[P_i]}(\eta_i, v')\, \phi(v'). \tag{7.2}
\]
In a coverage problem, the objective is to minimize this coverage cost by selecting an appropriate configuration 𝜼 and connected 𝑁-partition 𝑃. However, how to efficiently find the optimal configuration-partition pair in a large graph with an arbitrary demand function 𝜙 remains an open problem. There are two intermediate results about the optimal selection of the configuration or the partition when the other is fixed [127].
Optimal Partition with Fixed Configuration: For a fixed configuration 𝜼 with distinct entries, an optimal connected 𝑁-partition 𝑃 minimizing the coverage cost is the Voronoi partition, denoted by V(𝜼). Formally, for each 𝑃𝑖 ∈ V(𝜼) and any 𝑣′ ∈ 𝑃𝑖,
\[
d_G(v', \eta_i) \le d_G(v', \eta_j), \quad \forall\, j \in \{1, \dots, N\}.
\]
Optimal Configuration with Fixed Partition: For a fixed connected 𝑁-partition 𝑃, the centroid of the 𝑖-th region 𝑃𝑖 ∈ 𝑃 is defined by
\[
c_i := \arg\min_{v \in P_i} \sum_{v' \in P_i} d_{G[P_i]}(v, v')\, \phi(v'),
\]
and the optimal configuration is to place one robot at the centroid of each 𝑃𝑖 ∈ 𝑃. We denote the vector of centroids of 𝑃 by 𝒄(𝑃), with 𝑐𝑖 as its 𝑖-th element.
Building upon the above two properties, the classic Lloyd algorithm iteratively moves each robot to the centroid of its region in the current Voronoi partition and then computes the new Voronoi partition using the updated configuration. It is known that the robot team eventually converges to a class of partitions called centroidal Voronoi partitions, defined below.
Definition 7.1 (Centroidal Voronoi Partition, [128]). An 𝑁-partition 𝑃 is a centroidal Voronoi partition of 𝐺 if 𝑃 is a Voronoi partition generated by some configuration 𝜼 with distinct entries, i.e., 𝑃 = V(𝜼), and 𝒄(V(𝜼)) = 𝜼.
It needs to be noted that an optimal partition and configuration pair minimizing the coverage cost H(𝜼, 𝑃) is of the form (𝜼*, V(𝜼*)), where 𝜼* has distinct entries and V(𝜼*) is a centroidal Voronoi partition. A configuration-partition pair (𝜼′, V(𝜼′)) is considered an efficient solution to the coverage problem if V(𝜼′) is a centroidal Voronoi partition, even though it may be suboptimal [128].
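To make these quantities concrete, the sketch below computes graph distances with Dijkstra's algorithm, forms the Voronoi partition V(𝜼) of a weighted graph for a given configuration, and evaluates a coverage cost. It is a minimal illustration with names of our choosing, and for brevity it measures distances in 𝐺 rather than in the induced subgraphs 𝐺[𝑃𝑖] appearing in (7.2).

    import heapq

    def graph_distances(adj, source):
        """Dijkstra over a weighted adjacency dict {u: {v: w, ...}}; returns d_G(source, .)."""
        dist = {source: 0.0}
        pq = [(0.0, source)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        return dist

    def voronoi_partition(adj, config):
        """Assign every vertex to the nearest agent in `config` (ties broken by lower index)."""
        dists = [graph_distances(adj, eta_i) for eta_i in config]
        parts = [set() for _ in config]
        for v in adj:
            i = min(range(len(config)), key=lambda k: dists[k].get(v, float("inf")))
            parts[i].add(v)
        return parts

    def coverage_cost(adj, config, parts, phi):
        """Approximate coverage cost, cf. (7.2), with distances taken in G instead of G[P_i]."""
        return sum(graph_distances(adj, eta_i)[v] * phi[v]
                   for eta_i, P_i in zip(config, parts) for v in P_i)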
7.1.4 Coverage Performance Evaluation
To achieve efficient coverage, the agents need to balance the trade-off between sampling the environment to learn 𝝓 (exploration) and achieving a centroidal Voronoi configuration defined using the estimated 𝜙 (exploitation). To characterize this trade-off, we introduce a notion of coverage regret.
Definition 7.2 (Coverage Regret). At each time 𝑡, let the team configuration be 𝜼𝑡 and the connected 𝑁-partition be 𝑃𝑡. The coverage regret until time 𝑇 is defined by Σ_{𝑡=1}^{𝑇} 𝑅𝑡(𝜙), where 𝑅𝑡(𝜙) is the instantaneous coverage regret with respect to the demand function 𝜙, defined by
\[
R_t(\phi) = 2\mathcal{H}(\boldsymbol{\eta}_t, P_t) - \mathcal{H}(\boldsymbol{c}(P_t), P_t) - \mathcal{H}(\boldsymbol{\eta}_t, \mathcal{V}(\boldsymbol{\eta}_t)),
\]
which is the sum of the two terms H(𝜼𝑡, 𝑃𝑡) − H(𝒄(𝑃𝑡), 𝑃𝑡) and H(𝜼𝑡, 𝑃𝑡) − H(𝜼𝑡, V(𝜼𝑡)). The former (resp., latter) term is the regret induced by the deviation of the current configuration (resp., partition) from the optimal configuration (resp., partition) for the current partition (resp., configuration). Accordingly, no regret is incurred at time 𝑡 if and only if 𝑃𝑡 is a centroidal Voronoi 𝑁-partition and 𝜼𝑡 = 𝒄(𝑃𝑡). There are two sources contributing to the coverage regret: first, the estimation error in the demand function 𝜙; second, the deviation from a centroidal Voronoi partition while sampling the environment to learn 𝜙.

7.2 Deterministic Sequencing of Learning and Coverage Algorithm
In this section, we describe the Deterministic Sequencing of Learning and Coverage (DSLC) algorithm (Algorithm 8). It operates with an epoch-wise structure, and each epoch consists of an exploration (learning) phase and an exploitation (coverage) phase. The exploration phase comprises two sub-phases: estimation and information propagation.

Algorithm 8: Deterministic Sequencing of Learning and Coverage (DSLC)
Input: environment graph 𝐺, prior 𝝁0, 𝚲0; Set: 𝛼 ∈ (0, 1) and 𝛽 > 1
for epoch 𝑗 = 1, 2, . . . do
    Exploration phase: the robot team samples at vertices in 𝑉 until max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) ≤ 𝛼^𝑗 𝜎0²;
    Information propagation phase: each robot propagates its sampling results to the team and updates the estimated demand function 𝜙̂;
    Coverage phase: for 𝑡𝑗 = 1, 2, . . . , ⌈𝛽^𝑗⌉, run the pairwise partitioning algorithm based on 𝜙̂.

7.2.1 Estimation Phase
Let 𝜎𝑖²(𝑡) be the marginal posterior variance of 𝜙(𝑣𝑖) at time 𝑡, i.e., the 𝑖-th diagonal entry of 𝚲⁻¹(𝑡). Suppose the marginal prior variance satisfies 𝜎𝑖²(0) ≤ 𝜎0² for each 𝑖. Within each epoch 𝑗, agents first determine the points to be sampled in order to reduce max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) below a threshold 𝛼^𝑗 𝜎0², where 𝛼 ∈ (0, 1) is a prespecified parameter. Note that the posterior covariance computed in (7.1) depends only on the number of samples at each vertex and does not require the actual sampling results. Therefore, the sequence of sampling locations can be computed before physically visiting the locations. Leveraging this deterministic evolution of the covariance, we adopt a greedy sampling policy that repeatedly selects the point 𝑣𝑖𝑡 with maximum marginal posterior variance, i.e.,
\[
i_t = \arg\max_{i \in \{1,\dots,|V|\}} \sigma_i^2(t), \tag{7.3}
\]
for 𝑡 ∈ {𝑡𝑗, . . . , 𝑡̄𝑗}, where 𝑡𝑗 and 𝑡̄𝑗 are the starting and ending times of the estimation phase in the 𝑗-th epoch. It has been shown that the greedy sampling policy is near-optimal in terms of maximizing the mutual information between the sampling results and the demand function 𝜙 [58].
Let the set of points to be sampled during epoch 𝑗 be 𝑋^𝑗, and let 𝑋𝑟^𝑗 = 𝑋^𝑗 ∩ 𝑃𝑡𝑗,𝑟 be the set of sampling points that belong to 𝑃𝑡𝑗,𝑟, the region assigned to agent 𝑟 at time 𝑡𝑗. Each agent 𝑟 computes a path through the sampling points in 𝑋𝑟^𝑗 and collects noisy measurements from those points. The traveling path can be optimized by solving a Traveling Salesperson Problem (TSP).
Remark 7.1. With 𝚲0 as common knowledge, the set of sampling points 𝑋^𝑗 for each epoch 𝑗 can be computed independently by each robot following the greedy sampling policy. If the same tie-breaking rule is followed, the computed 𝑋^𝑗 and the number of samples at each sampling point are the same for all agents.
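Because the covariance update in (7.1) is independent of the measured values, the sampling set for epoch 𝑗 can be planned in advance by simulating the greedy rule (7.3). The sketch below is a minimal version of that planning step; the function name and the dense-matrix implementation are ours.

    import numpy as np

    def plan_epoch_samples(Lambda0, sigma, threshold):
        """Greedily pick vertices of maximal marginal posterior variance, cf. (7.3),
        until every marginal variance is at most `threshold` (i.e., alpha^j * sigma0^2).
        Returns the list of planned sampling locations; no measurements are needed."""
        Lambda = Lambda0.astype(float).copy()
        plan = []
        while True:
            variances = np.diag(np.linalg.inv(Lambda))      # marginal variances sigma_i^2
            i = int(np.argmax(variances))
            if variances[i] <= threshold:
                return plan
            plan.append(i)
            Lambda[i, i] += 1.0 / sigma**2                  # effect of one future sample at v_i

Every agent running this procedure with the same 𝚲0 and tie-breaking rule obtains the same plan, which is the content of Remark 7.1; the planned points that fall in an agent's region are then visited along a TSP tour.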
7.2.2 Information Propagation Phase
After the estimation phase, the sampling results from each agent must be passed to all other agents. There are several mechanisms to accomplish this in a finite number of steps. For example, agents can communicate with their neighboring agents and use flooding algorithms [129] to relay their sampling results to every agent. Alternatively, the agents may be able to send their sampling results to a cloud and receive global estimates after a finite delay. Another possibility is for the agents to use finite-time consensus protocols [130] in the distributed inference algorithm discussed in [28]. For any of the above mechanisms, the sampling results from the entire robot team can be propagated to each robot in finite time. Then, each agent has an identical posterior distribution N(𝝁(𝑡), 𝚲⁻¹(𝑡)) of 𝝓, and 𝜙̂ := 𝝁(𝑡) will be used as the estimate of the demand function.

7.2.3 Coverage Phase
After the estimation and information propagation phases, agents have the same estimate 𝜙̂ of the demand function. The coverage phase involves no environmental sampling, and its length is designed to grow exponentially with epochs, i.e., the number of time steps in the coverage phase of the 𝑗-th epoch is ⌈𝛽^𝑗⌉ for some 𝛽 > 1. We use a distributed coverage algorithm proposed in [127], called pairwise partitioning, with the estimated demand function 𝜙̂. In a connected 𝑁-partition 𝑃, 𝑃𝑖 and 𝑃𝑗 are said to be adjacent if there exists a vertex pair 𝑣 ∈ 𝑃𝑖 and 𝑣′ ∈ 𝑃𝑗 such that there is an edge in 𝐸 connecting 𝑣 and 𝑣′. At each time, a random pair of agents (𝑖, 𝑗), with 𝑃𝑖 and 𝑃𝑗 adjacent, computes an optimal pair of vertices (𝑎*, 𝑏*) within 𝑃𝑖 ∪ 𝑃𝑗 that minimizes
\[
\sum_{v' \in P_i \cup P_j} \hat{\phi}(v') \min\big\{ d_{G[P_i \cup P_j]}(a, v'),\; d_{G[P_i \cup P_j]}(b, v') \big\}.
\]
Then, agents 𝑖 and 𝑗 move to 𝑎* and 𝑏*. Subsequently, 𝑃𝑖 and 𝑃𝑗 are updated to
\[
P_i \leftarrow \{ v \in P_i \cup P_j \mid d_{G[P_i \cup P_j]}(\eta_i, v) \le d_{G[P_i \cup P_j]}(\eta_j, v) \}, \qquad
P_j \leftarrow \{ v \in P_i \cup P_j \mid d_{G[P_i \cup P_j]}(\eta_i, v) > d_{G[P_i \cup P_j]}(\eta_j, v) \}.
\]
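A minimal, self-contained sketch of one pairwise repartitioning step is given below. It computes all-pairs distances on the induced subgraph G[P_i ∪ P_j] with Floyd-Warshall and brute-forces the minimizing pair (a*, b*), which is adequate for the small regions exchanged in a gossip step; the helper and variable names are ours.

    from itertools import product

    def all_pairs(adj, verts):
        """Floyd-Warshall distances on the induced subgraph G[verts] of a weighted adjacency dict."""
        INF = float("inf")
        d = {u: {v: (0.0 if u == v else adj[u].get(v, INF)) for v in verts} for u in verts}
        for k in verts:
            for u in verts:
                for v in verts:
                    if d[u][k] + d[k][v] < d[u][v]:
                        d[u][v] = d[u][k] + d[k][v]
        return d

    def pairwise_update(adj, P_i, P_j, phi_hat):
        """One gossip step of pairwise partitioning between adjacent regions P_i and P_j (sets)."""
        union = P_i | P_j
        d = all_pairs(adj, union)                            # distances in G[P_i ∪ P_j]
        cost = lambda a, b: sum(phi_hat[v] * min(d[a][v], d[b][v]) for v in union)
        a_star, b_star = min(product(union, repeat=2), key=lambda ab: cost(*ab))
        new_Pi = {v for v in union if d[a_star][v] <= d[b_star][v]}
        return a_star, b_star, new_Pi, union - new_Pi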
7.3 Analysis of the DSLC Algorithm
In this section, we analyze DSLC to provide a performance guarantee on the expected cumulative coverage regret. To this end, we leverage the information gain from the estimation phase to analyze the convergence rate of the uncertainty. Then, we recall the convergence properties of the pairwise partitioning algorithm used in DSLC. Based on these results, we establish the main result of this chapter, i.e., an upper bound on the expected cumulative coverage regret.

7.3.1 Mutual Information and Uncertainty Reduction
Let 𝑋^𝑔 = (𝑣𝑖₁, . . . , 𝑣𝑖ₙ) be a sequence of 𝑛 vertices selected by the greedy policy. With a slight abuse of notation, we denote the marginal posterior variance of 𝜙(𝑣𝑖) after sampling at 𝑣𝑖₁, . . . , 𝑣𝑖ₖ by 𝜎𝑖²(𝑘). We now present a bound on the maximal posterior variance after sampling at the vertices in 𝑋^𝑔. The following lemma is adapted from Lemma 6.6 to incorporate the discrete environment. Since the proof steps are similar, we skip them for brevity.
Lemma 7.1 (Uncertainty reduction). Under the greedy sampling policy, the maximum posterior variance after 𝑛 sampling rounds satisfies
\[
\max_{i \in \{1,\dots,|V|\}} \sigma_i^2(n) \le \frac{2 \sigma_0^2 \gamma_n}{n \log\!\big(1 + \sigma^{-2} \sigma_0^2\big)},
\]
where 𝛾𝑛 is the maximal mutual information gain that can be achieved with 𝑛 samples.
Typically, it is hard to characterize 𝛾𝑛 for a general 𝚺0. Therefore, we assume a squared exponential kernel for 𝝓.
Assumption 7.1. Vertices in 𝑉 lie in a convex and compact set 𝐷 ⊂ ℝ², and the covariance of any pair 𝜙(𝑣𝑖) and 𝜙(𝑣𝑗) is determined by a squared exponential kernel function
\[
k\big(\phi(v_i), \phi(v_j)\big) = \sigma_v^2 \exp\!\left( - \frac{d_{\mathrm{eu}}^2(v_i, v_j)}{2 l^2} \right),
\]
where 𝑑eu(𝑣𝑖, 𝑣𝑗) is the Euclidean distance between 𝑣𝑖 and 𝑣𝑗, 𝑙 is the length scale, and 𝜎𝑣² is the variability parameter.
We now recall an upper bound on 𝛾𝑛 from [44].
Lemma 7.2 (Information gain for squared exp. kernel). With Assumption 7.1, the maximum mutual information satisfies 𝛾𝑛 ∈ 𝑂((log(|𝑉|𝑛))³).
Remark 7.2. If the correlation information is ignored, i.e., 𝜙(𝑣𝑖), 𝑖 ∈ {1, . . . , |𝑉|}, are treated as independent, it can be seen that max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) ∈ 𝑂(|𝑉|/𝑛) under the greedy sampling policy. In contrast, if the correlation information is considered, then substituting the result of Lemma 7.2 into Lemma 7.1 yields max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) ∈ 𝑂((log(|𝑉|𝑛))³/𝑛), which shows a clear advantage in reducing uncertainty when |𝑉| is large (i.e., the environment is finely discretized).

7.3.2 Convergence within Coverage Phase
Before each coverage phase, since the sampling results of each agent are relayed to the entire team, the agents have a consensus estimate 𝜙̂ of the demand function. It has been shown in [127] that, using the pairwise partitioning algorithm, the 𝑁-partition 𝑃 of the team converges almost surely to a class of near-optimal partitions defined below.
Definition 7.3 (Pairwise-optimal Partition). A connected 𝑁-partition 𝑃 is pairwise-optimal if, for each pair of adjacent regions 𝑃𝑖 and 𝑃𝑗,
\[
\sum_{v' \in P_i} d_G\big(c(P_i), v'\big) \phi(v') + \sum_{v' \in P_j} d_G\big(c(P_j), v'\big) \phi(v')
= \min_{a, b \in P_i \cup P_j} \sum_{v' \in P_i \cup P_j} \phi(v') \min\big\{ d_G(a, v'),\, d_G(b, v') \big\}.
\]
This means that, within the induced subgraph generated by any pair of adjacent regions, the 2-partition is optimal. It is proved in [127] that if a connected 𝑁-partition 𝑃 is pairwise-optimal, then it is also a centroidal Voronoi partition. The following result on the convergence time of the pairwise partitioning algorithm is established in [127].
Lemma 7.3 (Expected Convergence Time). Using the pairwise partitioning algorithm, the expected time to converge to a pairwise-optimal 𝑁-partition is finite.
For each coverage phase, Lemma 7.3 implies that the expected time for the instantaneous regret 𝑅𝑡(𝜙̂) to converge to 0 is finite.

7.3.3 An Upper Bound on Expected Coverage Regret
We now present the main result of this chapter.
Theorem 7.4. For DSLC and any time horizon 𝑇, if Assumption 7.1 holds and 𝛼 = 𝛽^{-2/3}, then the expected cumulative coverage regret with respect to the demand function 𝜙 satisfies
\[
\mathbb{E}\Big[ \sum_{t=1}^{T} R_t(\phi) \Big] \in O\big( T^{2/3} (\log T)^3 \big).
\]
Proof. We establish the theorem using the following four steps.
Step 1 (Regret from estimation phases): Let the total number of sampling steps before the end of the 𝑗-th epoch be 𝑠𝑗. By applying Lemma 7.1, we get 𝑠𝑗 ∈ 𝑂((log 𝑇)³/𝛼^𝑗). Thus, the coverage regret incurred in the estimation phases until the end of the 𝑗-th epoch belongs to 𝑂((log 𝑇)³/𝛼^𝑗).
Step 2 (Regret from information propagation phases): The sampling information of each robot propagates to all team members in finite time. Thus, before the end of the 𝑗-th epoch, the coverage regret from the information propagation phases can be bounded by 𝑐1 𝑗 for some constant 𝑐1 > 0.
Step 3 (Regret from coverage phases): According to Lemma 7.3, in each coverage phase, the expected time before converging to a pairwise-optimal partition is finite. Thus, before the end of the 𝑗-th epoch, the expected coverage regret from the converging steps can be upper bounded by 𝑐2 𝑗 for some constant 𝑐2 > 0.
Also note that the robot team converges to a pairwise-optimal partition with respect to the estimated demand function 𝜙̂, which may deviate from the actual 𝜙. The instantaneous coverage regret 𝑅𝑡(𝜙) caused by this estimation error can be expressed as
\[
R_t(\phi) = 2\mathcal{H}(\boldsymbol{\eta}_t, P_t) - \mathcal{H}(\boldsymbol{c}(P_t), P_t) - \mathcal{H}(\boldsymbol{\eta}_t, \mathcal{V}(\boldsymbol{\eta}_t)) =: A_t^\top \boldsymbol{\phi},
\]
for some 𝐴𝑡 ∈ ℝ^{|𝑉|}. Moreover, the posterior distribution of 𝑅𝑡(𝜙) is N(𝐴𝑡ᵀ𝝁(𝑡), 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡), where 𝚺(𝑡) = 𝚲⁻¹(𝑡) is the posterior covariance matrix. Since a pairwise-optimal partition 𝑃 is also a centroidal Voronoi partition and 𝜙̂ = 𝝁(𝑡), 𝑅𝑡(𝜙̂) = 0 implies 𝐴𝑡ᵀ𝝁(𝑡) = 0. Now, we get 𝑅𝑡(𝜙) ∼ N(0, 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡) and
\[
\mathbb{E}[R_t(\phi)] \le \mathbb{E}\big[ |R_t(\phi)| \big] = \sqrt{\frac{2}{\pi} A_t^\top \boldsymbol{\Sigma}(t) A_t}.
\]
Note that 𝐴𝑡ᵀ𝚺(𝑡)𝐴𝑡 is a weighted sum of the eigenvalues of 𝚺(𝑡). At any time 𝑡 in the coverage phase of the 𝑘-th epoch, max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑡) ≤ 𝛼^𝑘 𝜎0², and it follows that the sum of the eigenvalues of 𝚺(𝑡) equals trace(𝚺(𝑡)) ≤ |𝑉| 𝛼^𝑘 𝜎0². Thus, we get
\[
\mathbb{E}\Big[ \sum_{t \in \mathcal{T}_k^{\mathrm{cov}}} R_t(\phi) \Big] \le c_3 (\beta \sqrt{\alpha})^k,
\]
for some constant 𝑐3 > 0, where 𝒯ₖᶜᵒᵛ denotes the time slots in the coverage phase of the 𝑘-th epoch and we have used the fact that |𝒯ₖᶜᵒᵛ| = ⌈𝛽^𝑘⌉.
Step 4 (Summary): Summing up the expected coverage regret from the above steps, the expected cumulative coverage regret at the end of the 𝑗-th epoch, denoted 𝑇𝑗, satisfies
\[
\mathbb{E}\Big[ \sum_{t=1}^{T_j} R_t(\phi) \Big] \le C_1 j + C_2 s_j + \sum_{k=1}^{j} c_3 (\beta \sqrt{\alpha})^k, \tag{7.4}
\]
where 𝐶1, 𝐶2 > 0 are some constants. The theorem statement follows by plugging in 𝛼 = 𝛽^{-2/3}, using 𝑗 ∈ 𝑂(log 𝑇), and some simple calculations. ∎
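For completeness, the final calculation can be spelled out as follows; this expansion is ours and only makes explicit how the choice 𝛼 = 𝛽^{-2/3} yields the stated rate. Since the coverage phase of the 𝑗-th epoch alone has ⌈𝛽^𝑗⌉ ≥ 𝛽^𝑗 steps, 𝛽^𝑗 ≤ 𝑇𝑗, hence 𝑗 ∈ 𝑂(log 𝑇) and 𝛽^{2𝑗/3} ≤ 𝑇^{2/3}. With 𝛼 = 𝛽^{-2/3}, the three terms in (7.4) satisfy
\[
C_1 j \in O(\log T), \qquad
C_2 s_j \in O\big( (\log T)^3 \beta^{2j/3} \big) \subseteq O\big( T^{2/3} (\log T)^3 \big), \qquad
\sum_{k=1}^{j} c_3 (\beta\sqrt{\alpha})^k = c_3 \sum_{k=1}^{j} \beta^{2k/3} \in O\big( \beta^{2j/3} \big) \subseteq O\big( T^{2/3} \big),
\]
and summing the three bounds gives the 𝑂(𝑇^{2/3}(log 𝑇)³) rate in Theorem 7.4.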
7.4 Simulation Results
To illustrate the empirical performance of the proposed algorithm, we simulate its execution on a uniform grid graph superimposed on the unit square. We present numerical results which show that DSLC achieves sublinear regret, and we compare our algorithm to those proposed in [13] and [68]. Motivated by environmental applications, we construct the demand function 𝜙 over a discrete 21 × 21 point grid in [0, 1]² by performing kernel density estimation on a subset of the geospatial distribution of Australian wildfires observed by NASA in 2019 [131]. Intuitively, 𝜙 represents the probability that a wildfire may occur at a particular point of the unit square, and it is used to model the demand for an autonomous sensing agent at that point. The ground truth 𝜙 is shown on the right in Figure 7.1.

Figure 7.1: Distributed implementation of DSLC.

In each simulation, nine agents are placed uniformly at random over the grid and execute 3 epochs of length 16, 46, and 128 to achieve adaptive coverage of the environment. Partitions are initialized by iterating over the grid and assigning each point to the nearest agent. During the exploration phase of each epoch, partitions are fixed; during the exploitation phase of each epoch, partitions are updated according to the protocol established in [127], where pairwise gossip-based repartitioning occurs between randomly selected neighbors. Coverage cost, regret, and maximum variance are computed throughout using (7.2), Definition 7.2, and the maximum diagonal entry of 𝚲⁻¹(𝑡) from (7.1), respectively. From left to right, Figure 7.1 shows (i) agent positions 𝜼𝑡 and partitions 𝑃𝑡, (ii) TSP sampling tours, (iii) the posterior mean of 𝜙̂, and (iv) the posterior variance of 𝜙̂. Pairwise partition updates between gossiping agents are denoted by magenta lines in the leftmost column of panels. Points along TSP tours in the second-from-leftmost column of panels are plotted in magenta prior to sampling, and in black after sampling. A simulation video is available online.²
The demand function 𝜙 is normalized to the range [0, 1] and sampled by agents with Gaussian noise with mean 𝜇 = 0 and standard deviation 𝜎 = 0.1. A global Gaussian process model is assumed to simplify the estimation of 𝜙̂ throughout the simulation, though in practice the estimation of 𝜙̂ could be implemented in a fully distributed manner by having each agent maintain its own model of 𝜙̂ and employing an information propagation phase as described in Section 7.2. Setting the parameter 𝛼 = 0.5 to reduce the uncertainty by half within each epoch, 𝛽 = 𝛼^{-3/2} is fully determined by Theorem 7.4. Figure 7.1 visualizes the simulation of DSLC.

² https://youtu.be/nalwrZC6GiI

Figure 7.2: Comparison of DSLC, Todescato and Cortes.

Figure 7.2 compares the evolution of the coverage regret, coverage cost, and uncertainty reduction in DSLC with that of the algorithms proposed in [13] and [68], denoted Cortes and Todescato, respectively. Agents in [13] are assumed to have perfect knowledge of 𝜙 and simply go to the centroid of their cell at each iteration; in [68], agents follow a stochastic sampling approach with the probability of exploration proportional to the posterior variance in the estimate 𝜙̂ at each iteration. All results are averaged over 16 simulations of 190 iterations, aligned with the three-epoch implementation of DSLC with epoch lengths 16, 46, and 128. It can be noticed that DSLC empirically achieves sublinear regret. Spikes in regret occur during the exploration phase of each epoch, before agents converge to a pairwise-optimal coverage configuration with respect to 𝜙̂ during the exploitation phase. Though we do not include the algorithm in our simulations, it is worth noting that DSLC operates in a manner similar to that proposed in [64], where agents spend a number of iterations sampling 𝜙 to reduce the maximum posterior variance max_{𝑖∈{1,...,|𝑉|}} 𝜎𝑖²(𝑛) below a prespecified threshold and then transition to performing coverage for all remaining iterations. Indeed, that algorithm is essentially a special case of DSLC with one epoch and can therefore be expected to perform similarly from an empirical perspective.

7.5 Summary
We propose an adaptive coverage algorithm, DSLC, that balances the exploration versus exploitation trade-off between learning 𝜙 and achieving environmental coverage. Our algorithm schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Most importantly, we introduce a novel coverage regret that characterizes the deviation of agent configurations and partitions from a centroidal Voronoi partition, and we derive analytic bounds on the expected cumulative regret for DSLC. In particular, we prove that DSLC achieves sublinear expected cumulative regret under mild assumptions. The efficacy of DSLC is illustrated through extensive simulation and comparison with existing state-of-the-art approaches to adaptive coverage.

7.6 Bibliographic Remarks
Classical approaches to coverage control [13, 61–63] assume a priori knowledge of 𝜙 and employ Lloyd's algorithm [132] to drive agents to a local minimum of the coverage cost. In these algorithms, each agent communicates with the agents in the neighboring partitions at each time step and updates its partition. Distributed gossip-based coverage algorithms [133] address potential communication bottlenecks in classical approaches by updating partitions pairwise between the agents in neighboring partitions.
Much of the work on coverage control considers continuous convex environments, where the global convergence property remains an open problem. The asymptotic convergence to a local minimum is normally based on an unproven assumption that there exist finitely many locally optimal points [13, 134]. A discrete graph representation of the environment is considered in [127], which not only enables finite-time convergence but also allows for non-convex environments. As mentioned earlier in this chapter, the gossip-based coverage algorithm in the graph environment has been proved to converge almost surely to pairwise-optimal partitions in finite time [127].
Recent works have put more focus on the problem of adaptive coverage, in which agents are not assumed to know 𝜙 a priori. Parametric estimation approaches to adaptive coverage [135, 136] model 𝜙 as a linear combination of basis functions and propose algorithms to learn the weights of the basis functions, while non-parametric approaches [64–69] model 𝜙 as the realization of a Gaussian process and make predictions by conditioning on observed values of 𝜙 sampled over the operating environment. Alternative approaches to adaptive coverage [137, 138] have also been considered.
A non-parametric adaptive coverage algorithm with provable regret guarantees was presented in this chapter. Similar adaptive coverage algorithms with formal performance guarantees are also developed in [68, 69]. Todescato et al. [68] use a Bernoulli random variable for each robot to decide between learning and coverage steps, and the distribution of this random variable is designed to ensure the convergence of the algorithm. In contrast, we leverage the so-called “doubling trick” from the MAB literature to design a deterministic schedule of learning and coverage, which allows us to derive formal regret bounds for our adaptive coverage algorithm. The most closely related work to the ideas presented in this chapter is by Benevento et al. [69], which uses a Gaussian process optimization-based approach [44] to design an adaptive coverage algorithm. They derived an upper bound on a notion of coverage regret different from Definition 7.2 in this chapter. Their result is based on a strong assumption that the coverage control algorithm can drive the system to the global minimum of the coverage cost. In contrast, our coverage regret is defined with respect to local minima, which can be achieved by many state-of-the-art coverage control policies, including the classic Lloyd's algorithm. By analyzing the coverage regret defined in this chapter, the convergence of adaptive coverage control can still be shown without requiring the global optimality assumption.

CHAPTER 8
CONCLUSIONS AND FUTURE DIRECTIONS
This dissertation has focused on optimal decision-making in the face of uncertainty. In particular, we addressed the exploration versus exploitation dilemma in the MAB setup as well as in robotic problems including target search and adaptive coverage control. All proposed algorithms are accompanied by a rigorous analysis of their convergence properties.
Since the MAB problem provides a concise mathematical formulation of the exploration versus exploitation dilemma, we have investigated a variety of MAB problem variations that capture different properties of the stochastic environment in real-world problems. For the heavy-tailed bandit, we proposed the Robust MOSS algorithm, which is the first to achieve order-optimal worst-case regret while maintaining logarithmic asymptotic regret.
For the nonstationary bandits, we studied both the piece-wise stationary bandit and the more general nonstationary bandit with a variation budget. Order-optimal or near order-optimal algorithms for these problem setups are proposed, analyzed, and compared extensively in simulations. As an extension of the single-player MAB, we studied the distributed multi-player bandit in a piece-wise stationary environment. By modifying the single-player policy, novel multi-player policies are designed and proved to maintain group regrets matching the standard single-player regret, even without communication between agents.
For the robotic target search problem, we have considered a scenario in which a robot operates in a 3D environment to search for targets on a 2D floor. The target search task is modeled as a hot-spots identification problem in which the sensing information is compromised by measurement noise. Since sensing at a location farther from the floor provides better coverage of the environment but less accurate results, we have modeled the sensing field with a multi-fidelity Gaussian process that captures the coverage-accuracy trade-off. Leveraging this novel sensing model, we established a new informative path planning strategy that allows for jointly planning the sampling locations and the associated fidelity levels, and thus reduces the target search time.
For the adaptive coverage problem, the demand for robotic service within the environment is modeled as a realization of a Gaussian process. With Bayesian techniques, we have devised a policy that balances the tradeoff between learning the demand and covering the environment. To provide analytical rigor, we have defined the coverage regret and, based on it, analyzed the convergence properties of the proposed online estimation and coverage algorithm.
There are several possible avenues of future study on the problems addressed in this dissertation. The distributed multi-player MAB problem in a general nonstationary environment is a challenging problem and is expected to find application in opportunistic spectrum access, wherein the stochastic nature of the channel vacancy changes with time. We intend to design a multi-player policy that actively detects drift in the stochastic reward processes such that the players require no prior information about the nonstationary environment. It can be foreseen that the nonstationarity will make reward estimation more difficult, so that achieving coordinated behavior to reduce collisions becomes a trickier task. To deal with this, we may allow communication among agents. For example, the agents can perform cooperative reward estimation through a bi-directional communication network by running consensus algorithms as in [27]. Since communicating sampling results may require relatively large bandwidth, to reduce the communication requirement, it is also possible to only require the players to share the ranking of arms, as mentioned in [139]. Other issues for such a problem include privacy and defense against adversarial attacks.
We are also interested in extending the single-robot target search policy to cooperative multi-robot search scenarios. As has been mentioned, coverage control is a potential method to balance the workload. Note that the workload distribution in the environment changes as the search mission progresses, so the robots need to cover dynamic demands of service. For such a problem, providing analytical rigor would be of interest.
Other workload-balancing ideas include using multi-robot path planning methods that solve a vehicle routing problem [140] or an orienteering problem [141]. Also of interest would be the implementation of the target search algorithms on underwater multi-target search testbeds.
It would be worthwhile to pursue the adaptive coverage problem from a variety of new directions. The proposed online estimation and coverage algorithm requires an information propagation phase to maintain a consistent estimate of the demands across agents, while we envision a fully distributed policy that allows for small differences in the demand estimates. Besides, the problem setup can be generalized by considering heterogeneous agents that can provide multiple types of services. The quality of service could depend on both the servicing agent and the service type. Considering both the inter-service dependencies and the spatial correlations, the multi-task Gaussian process might be a good fit to model the demands of different service types. Another interesting direction could be to use a time-varying Gaussian process to model a dynamic environment in which the demands change with time.
Like adaptive coverage control, adaptive multi-robot patrolling is also an interesting problem with an explore-exploit tradeoff. In the multi-robot patrolling problem [142], a team of robots circles around a set of important locations (viewpoints) with different known priorities. The objective is to minimize the weighted refresh time, which is the longest time interval between any two consecutive visits of a viewpoint, weighted by the viewpoint's priority. We envision addressing this problem by considering the priorities of the viewpoints to be unknown and time-varying, so that they need to be learned. Since the LM-DSEE algorithm for the piece-wise stationary MAB problem has a block allocation structure that benefits path planning, its multi-player extension and the associated distributed control methods could be promising for solving this problem.

BIBLIOGRAPHY
[1] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933. [2] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002. [3] M. Vidyasagar, “Law of large numbers, heavy-tailed distributions, and the recent financial crisis,” in Perspectives in Mathematical System Theory, Control, and Signal Processing. Springer, 2010, pp. 285–295. [4] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010. [5] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011. [6] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 66–81. [7] J. R. Krebs, A. Kacelnik, and P. Taylor, “Test of optimal sampling by foraging great tits,” Nature, vol. 275, no. 5675, pp. 27–31, 1978. [8] V. Srivastava, P. Reverdy, and N. E. Leonard, “On optimal foraging and multi-armed bandits,” in Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2013, pp. 494–499. [9] V. Srivastava, P.
Reverdy, and N. E. Leonard, “Surveillance in an abruptly changing world via multiarmed bandits,” in IEEE Conference on Decision and Control, 2014, pp. 692–697. [10] M. Y. Cheung, J. Leighton, and F. S. Hover, “Autonomous mobile acoustic relay positioning as a multi-armed bandit with switching costs,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, Nov. 2013, pp. 3368–3373. [11] Y. Sung, D. Dixit, and P. Tokekar, “Environmental hotspot identification in limited time with a uav equipped with a downward-facing camera,” arXiv preprint arXiv:1909.08483, 2019. [12] M. C. Kennedy and A. O’Hagan, “Predicting the output from a complex computer code when fast approximations are available,” Biometrika, vol. 87, no. 1, pp. 1–13, 2000. 117 [13] J. Cortés, S. Martínez, T. Karataş, and F. Bullo, “Coverage control for mobile sensing networks,” IEEE Transactions on Robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004. [14] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952. [15] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985. [16] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002. [17] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proceedings of the 24th Conference on Learning Theory, vol. 19, Budapest, Hungary, 2011, pp. 359–376. [18] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi, “Bandits with heavy tail,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7711–7717, 2013. [19] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM journal on computing, vol. 32, no. 1, pp. 48–77, 2002. [20] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in International Conference on Algorithmic Learning Theory. Springer, 2011, pp. 174–188. [21] O. Besbes and Y. Gur, “Stochastic multi-armed-bandit problem with non-stationary rewards,” in Advances in neural information processing systems, 2014, pp. 199–207. [22] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part I: IID rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968–976, 1987. [23] Y. Gai and B. Krishnamachari, “Distributed stochastic online learning policies for oppor- tunistic spectrum access,” IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6184–6193, 2014. [24] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014. [25] N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decentralized multiplayer multiarmed bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2018. 118 [26] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Multi-armed bandits in multi-agent net- works,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. [27] P. Landgren, V. Srivastava, and N. E. Leonard, “On distributed cooperative decision-making in multiarmed bandits,” in 2016 European Control Conference, Aalborg, Denmark, 2016, pp. 243–248. [28] P. Landgren, V. 
Srivastava, and N. E. Leonard, “Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms,” in IEEE Conference on Decision and Control, Las Vegas, NV, USA, Dec. 2016, pp. 167–172. [29] A. B. H. Alaya-Feki, E. Moulines, and A. LeCornec, “Dynamic spectrum access with non-stationary multi-armed bandit,” in IEEE Workshop on Signal Processing Advances in Wireless Communications, 2008, pp. 416–420. [30] Y. Li, Q. Hu, and N. Li, “A reliability-aware multi-armed bandit approach to learn and select users in demand response,” Automatica, vol. 119, p. 109015, 2020. [31] D. Kalathil and R. Rajagopal, “Online learning for demand response,” in Annual Allerton Conference on Communication, Control, and Computing, 2015, pp. 218–222. [32] D. Agarwal, B. C. Chen, P. Elango, N. Motgi, S. T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah, “Online models for content optimization,” in Advances in Neural Information Processing Systems, vol. 21. Curran Associates, Inc., 2009, pp. 17–24. [33] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in International Conference on World Wide Web, 2010, pp. 661–670. [34] C. Baykal, G. Rosman, S. Claici, and D. Rus, “Persistent surveillance of events with unknown, time-varying statistics,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2017, pp. 2682–2689. [35] R. Dimitrova, I. Gavran, R. Majumdar, V. S. Prabhu, and S. E. Z. Soudjani, “The robot routing problem for collecting aggregate stochastic rewards,” arXiv preprint arXiv:1704.05303, 2017. [36] V. Srivastava, F. Pasqualetti, and F. Bullo, “Stochastic surveillance strategies for spatial quickest detection,” The International Journal of Robotics Research, vol. 32, no. 12, pp. 1438–1458, 2013. [37] R. Agrawal, M. V. Hedge, and D. Teneketzis, “Asymptotically efficient adaptive alloca- tion rules for the multi-armed bandit problem with switching cost,” IEEE Transactions on Automatic Control, vol. 33, no. 10, pp. 899–906, 1988. 119 [38] P. Reverdy, V. Srivastava, and N. E. Leonard, “Modeling human decision making in general- ized Gaussian multiarmed bandits,” Proceedings of the IEEE, vol. 102, no. 4, pp. 544–571, 2014. [39] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg, “Batched bandit problems,” The Annals of Statistics, vol. 44, no. 2, pp. 660–681, 2016. [40] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013. [41] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013. [42] C. K. Williams and C. E. Rasmussen, Gaussian processes for Machine Learning. MIT press Cambridge, MA, 2006, vol. 2, no. 3. [43] S. Vasudevan, F. Ramos, E. Nettleton, and H. Durrant-Whyte, “Gaussian process modeling of large-scale terrain,” Journal of Field Robotics, vol. 26, no. 10, pp. 812–840, 2009. [44] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Information-theoretic regret bounds for Gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012. [45] G. A. Hollinger, B. Englot, F. S. Hover, U. Mitra, and G. S. 
Sukhatme, “Active planning for underwater inspection and the benefit of adaptivity,” The International Journal of Robotics Research, vol. 32, no. 1, pp. 3–18, 2013. [46] D. E. Soltero, M. Schwager, and D. Rus, “Generating informative paths for persistent sensing in unknown environments,” in IEEE/RSJ Int Conf on Intelligent Robots and Systems, Vilamoura, Algarve, Portugal, Oct. 2012, pp. 2172–2179. [47] J. Yu, M. Schwager, and D. Rus, “Correlated orienteering problem and its application to informative path planning for persistent monitoring tasks,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014, pp. 342–349. [48] G. A. Hollinger and G. S. Sukhatme, “Sampling-based robotic information gathering al- gorithms,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1271–1287, 2014. [49] G. Hitz, E. Galceran, M.-È. Garneau, F. Pomerleau, and R. Siegwart, “Adaptive continuous- space informative path planning for online environmental monitoring,” Journal of Field Robotics, vol. 34, no. 8, pp. 1427–1449, 2017. 120 [50] G. Hitz, A. Gotovos, M.-É. Garneau, C. Pradalier, A. Krause, R. Y. Siegwart et al., “Fully au- tonomous focused exploration for robotic environmental monitoring,” in IEEE International Conference on Robotics and Automation, 2014, pp. 2658–2664. [51] N. Atanasov, J. Le Ny, K. Daniilidis, and G. J. Pappas, “Information acquisition with sensing robots: Algorithms and error bounds,” in IEEE International Conference on Robotics and Automation, 2014, pp. 6447–6454. [52] A. A. Meera, M. Popović, A. Millane, and R. Siegwart, “Obstacle-aware adaptive informative path planning for uav-based target search,” in IEEE International Conference on Robotics and Automation, 2019, pp. 718–724. [53] X. Lan and M. Schwager, “Planning periodic persistent monitoring trajectories for sensing robots in Gaussian random fields,” in IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, May 2013, pp. 2415–2420. [54] N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. E. Davis, “Collective motion, sensor networks, and ocean sampling,” Proceedings of the IEEE, vol. 95, no. 1, pp. 48–74, 2007. [55] S. L. Smith, M. Schwager, and D. Rus, “Persistent robotic tasks: Monitoring and sweeping in changing environments,” IEEE Transactions on Robotics, vol. 28, no. 2, pp. 410–426, 2012. [56] C. G. Cassandras, X. Lin, and X. Ding, “An optimal control approach to the multi-agent persistent monitoring problem,” IEEE Transactions on Automatic Control, vol. 58, no. 4, pp. 947–961, 2013. [57] R. N. Smith, M. Schwager, S. L. Smith, B. H. Jones, D. Rus, and G. S. Sukhatme, “Persistent ocean monitoring with underwater gliders: Adapting sampling resolution,” Journal of Field Robotics, vol. 28, no. 5, pp. 714–741, 2011. [58] A. Krause and C. E. Guestrin, “Near-optimal nonmyopic value of information in graphical models,” in Proceedings of the 21st Conference Conference on Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 2005, pp. 324–331. [59] P. Auer and R. Ortner, “UCB revisited: Improved regret bounds for the stochastic multi- armed bandit problem,” Periodica Mathematica Hungarica, vol. 61, no. 1-2, pp. 55–65, 2010. [60] S. Kalyanakrishnan and P. Stone, “Efficient selection of multiple bandit arms: Theory and practice,” in ICML, 2010. [61] J. Cortés and F. Bullo, “Coordination and Geometric Optimization via Distributed Dynamical Systems,” SIAM Journal on Control and Optimization, vol. 44, no. 5, pp. 1543–1574, 2005. 
121 [62] F. Lekien and N. E. Leonard, “Nonuniform coverage and cartograms,” IEEE Conference on Decision and Control, pp. 5518–5523, 2010. [63] I. I. Hussein and D. M. Stipanovic, “Effective coverage control for mobile sensor networks with guaranteed collision avoidance,” IEEE Transactions on Control Systems Technology, vol. 15, no. 4, pp. 642–657, 2007. [64] J. Choi, J. Lee, and S. Oh, “Swarm intelligence for achieving the global maximum using spatio-temporal Gaussian processes,” Proceedings of the American Control Conference, pp. 135–140, 2008. [65] Y. Xu and J. Choi, “Adaptive sampling for learning Gaussian processes using mobile sensor networks,” Sensors, vol. 11, no. 3, pp. 3051–3066, 2011. [66] W. Luo and K. Sycara, “Adaptive Sampling and Online Learning in Multi-Robot Sensor Coverage with Mixture of Gaussian Processes,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2018, pp. 6359–6364. [67] W. Luo, C. Nam, G. Kantor, and K. Sycara, “Distributed environmental modeling and adaptive sampling for multi-robot sensor coverage,” in International Joint Conference on Autonomous Agents and Multiagent Systems, 2019, pp. 1488–1496. [68] M. Todescato, A. Carron, R. Carli, G. Pillonetto, and L. Schenato, “Multi-robots Gaus- sian estimation and coverage control: From client–server to peer-to-peer architectures,” Automatica, vol. 80, pp. 284–294, 2017. [69] A. Benevento, M. Santos, G. Notarstefano, K. Paynabar, M. Bloch, and M. Egerstedt, “Multi-robot coordination for estimation and coverage of unknown spatial fields,” in IEEE International Conference on Robotics and Automation, 2020, pp. 7740–7746. [70] J. Audibert and S. Bubeck, “Minimax policies for adversarial and stochastic bandits,” in Proceedings of the 22nd conference on learning theory, 2009, pp. 217–226. [71] A. N. Burnetas and M. N. Katehakis, “Optimal adaptive policies for sequential allocation problems,” Advances in Applied Mathematics, vol. 17, no. 2, pp. 122–142, 1996. [72] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in IEEE Annual Foundations of Computer Science, 1995, pp. 322–331. [73] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, no. Jun, pp. 623–648, 2004. [74] L. Wei and V. Srivastava, “Minimax policy for heavy-tailed bandits,” IEEE Control Systems Letters, vol. 5, no. 4, pp. 1423–1428, 2021. 122 [75] X. Fan, I. Grama, and Q. Liu, “Hoeffding’s inequality for supermartingales,” Stochastic Processes and their Applications, vol. 122, no. 10, pp. 3545–3559, 2012. [76] S. Bubeck, “Bandits games and clustering foundations,” Ph.D. dissertation, Université des Sciences et Technologie de Lille - Lille I, 2010. [77] O. C. E. Kaufmann and A. Garivier, “On bayesian upper confidence bounds for bandit problems,” in Artificial Intelligence and Statistics, 2012, pp. 592–600. [78] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, 2012, pp. 39–1. [79] N. K. E. Kaufmann and R. Munos, “Thompson sampling: An asymptotically optimal finite- time analysis,” in International Conference on Algorithmic Learning Theory. Springer, 2012, pp. 199–213. [80] P. Ménard and A. Garivier, “A minimax and asymptotically optimal algorithm for stochastic bandits,” in Algorithmic Learning Theory, vol. 76, 2017. [81] R. Degenne and V. 
Perchet, “Anytime optimal algorithms in stochastic multi-armed bandits,” in International Conference on Machine Learning, 2016, pp. 1587–1595. [82] L. Wei and V. Srivastava, “On abruptly-changing and slowly-varying multiarmed bandit problems,” in American Control Conference, Milwaukee, WI, June 2018, pp. 6291–6296. [83] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963. [84] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 148–164, 1979. [85] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010. [86] L. Kocsis and C. Szepesvári, “Discounted UCB,” in 2nd PASCAL Challenges Workshop, vol. 2, 2006. [87] C. Hartland, N. Baskiotis, S. Gelly, M. Sebag, and O. Teytaud, “Change Point Detection and Meta-Bandits for Online Learning in Dynamic Environments,” in Conférence Francophone Sur L’Apprentissage Automatique, Grenoble, France, July 2007, pp. 237–250. [88] F. Liu, J. Lee, and N. Shroff, “A change-detection based framework for piecewise-stationary multi-armed bandit problem,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 123 [89] L. Besson and E. Kaufmann, “The generalized likelihood ratio test meets klucb: an improved algorithm for piece-wise non-stationary bandits,” arXiv preprint arXiv:1902.01575, 2019. [90] Y. Cao, Z. Wen, B. Kveton, and Y. Xie, “Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit,” in International Conference on Artificial Intelli- gence and Statistics, 2019, pp. 418–427. [91] R. Allesiardo and R. Féraud, “Exp3 with drift detection for the switching bandit problem,” in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2015, pp. 1–7. [92] J. Mellor and J. Shapiro, “Thompson sampling in switching environments with bayesian online change detection,” in Artificial Intelligence and Statistics, 2013, pp. 442–450. [93] P. Auer, P. Gajane, and R. Ortner, “Adaptively tracking the best bandit arm with an unknown number of distribution changes,” in Annual Conference on Learning Theory, 2019, pp. 138–158. [94] Y. Chen, C. W. Lee, H. Luo, and C. Y. Wei, “A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free,” in Proceedings of the 32nd Conference on Learning Theory, vol. 99, Phoenix, USA, 2019, pp. 696–726. [95] L. Wei and V. Srivastava, “On distributed multi-player multiarmed bandit problems in abruptly changing environment,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 5783–5788. [96] R. Burkard, M. Dell’Amico, and S. Martello, Assignment problems: revised reprint. SIAM, 2012. [97] D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of operations research, vol. 14, no. 1, pp. 105–123, 1988. [98] E. Boursier and V. Perchet, “SIC-MMAB: Synchronisation involves communication in multi- player multi-armed bandits,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 2249–2257. [99] C. Shi and C. Shen, “On no-sensing adversarial multi-player multi-armed bandits with collision communications,” IEEE Journal on Selected Areas in Information Theory, 2021. [100] I. Bistritz and A. 
Leshem, “Distributed multi-player bandits-a game of thrones approach,” in Advances in Neural Information Processing Systems, vol. 31, 2018, pp. 7222–7232. [101] J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving pareto optimality through distributed learning,” SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014. 124 [102] O. Besbes, Y. Gur, and A. Zeevi, “Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards,” Stochastic Systems, vol. 9, no. 4, pp. 319–337, 2019. [103] V. Raj and S. Kalyani, “Taming non-stationary bandits: A Bayesian approach,” arXiv preprint arXiv:1707.09727, 2017. [104] W. C. Cheung, D. Simchi-Levi, and R. Zhu, “Learning to optimize under non-stationarity,” in Proceedings of Machine Learning Research, vol. 89, 16–18 Apr 2019, pp. 1079–1087. [105] P. Zhao, L. Zhang, Y. Jiang, and Z.-H. Zhou, “A simple approach for non-stationary linear bandits,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 746–755. [106] Y. Russac, C. Vernade, and O. Cappé, “Weighted linear bandits for non-stationary en- vironments,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 12 017–12 026. [107] L. Wei, X. Tan, and V. Srivastava, “Expedited multi-target search with guaranteed perfor- mance via multi-fidelity Gaussian processes,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV (Virtual), Oct. 2020, pp. 7095–7100. [108] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018. [109] P. Perdikaris, “Gaussian processes a hands-on tutorial,” 2017. [Online]. Available: https://github.com/paraklas/GPTutorial [110] S. Kemna, J. G. Rogers, C. Nieto-Granda, S. Young, and G. S. Sukhatme, “Multi-robot coordination through dynamic Voronoi partitioning for informative adaptive sampling in communication-constrained environments,” in IEEE International Conference on Robotics and Automation, 2017, pp. 2124–2130. [111] D. Applegate, R. Bixby, V. Chvatal, and W. Cook, “Concorde TSP solver,” 2006. [112] J. Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in Proceed- ings of the 23rd Conference on Learning Theory, 2010, pp. 41–53. [113] E. Rolf, D. Fridovich-Keil, M. Simchowitz, B. Recht, and C. Tomlin, “A successive- elimination approach to adaptive robotic sensing,” ArXiv e-prints, 2018. [114] M. M. M. Manhães, S. A. Scherer, M. Voss, L. R. Douat, and T. Rauschenbach, “UUV simulator: A Gazebo-based package for underwater intervention and multi-robot simulation,” in OCEANS 2016 MTS/IEEE Monterey. IEEE, 2016, pp. 1–8. 125 [115] M. Abramowitz and I. A. Stegun, Eds., Handbook of Mathematical Functions: with Formu- las, Graphs, and Mathematical Tables. Dover Publications, 1964. [116] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions,” Mathematical programming, vol. 14, no. 1, pp. 265–294, 1978. [117] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos, “Gaussian process bandit optimisation with multi-fidelity evaluations,” in Advances in Neural Information Processing Systems, 2016, pp. 992–1000. [118] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012. [119] H. J. Karloff, “How long can a euclidean traveling salesman tour be?” SIAM Journal on Discrete Mathematics, vol. 2, no. 1, pp. 91–99, 1989. [120] A. Singh, A. Krause, C. Guestrin, and W. J. 
Kaiser, “Efficient informative sensing using multiple robots,” Journal of Artificial Intelligence Research, vol. 34, no. 2, p. 707, 2009. [121] A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in gaussian pro- cesses: Theory, efficient algorithms and empirical studies,” Journal of Machine Learning Research, vol. 9, no. Feb, pp. 235–284, 2008. [122] J. L. Ny and G. J. Pappas, “On trajectory optimization for active sensing in Gaussian process models,” in IEEE Conf on Decision and Control and Chinese Control Conference, Shanghai, China, Dec. 2009, pp. 6286–6292. [123] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen, “Combinatorial pure exploration of multi-armed bandits,” in Advances in Neural Information Processing Systems, 2014, pp. 379–387. [124] P. Reverdy, V. Srivastava, and N. E. Leonard, “Satisficing in multi-armed bandit problems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3788 – 3803, 2017. [125] L. Wei, A. McDonald, and V. Srivastava, “Multi-robot Gaussian process estimation and coverage: a deterministic sequencing algorithm and regret analysis,” in Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, CN (Virtual), 202. [126] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I : Estimation Theory. Prentice Hall, 1993. [127] J. W. Durham, R. Carli, P. Frasca, and F. Bullo, “Discrete partitioning and coverage control for gossiping robots,” IEEE Transactions on Robotics, vol. 28, no. 2, pp. 364–378, 2012. 126 [128] F. Bullo, J. Cortés, and S. Martínez, Distributed Control of Robotic Networks, ser. Ap- plied Mathematics Series. Princeton University Press, 2009, electronically available at http://coordinationbook.info. [129] H. Lim and C. Kim, “Flooding in wireless ad hoc networks,” Computer Communications, vol. 24, no. 3-4, pp. 353–363, 2001. [130] L. Wang and F. Xiao, “Finite-time consensus problems for networks of dynamic agents,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 950–955, 2010. [131] C. Paradis, “Fires from Space: Australia,” 2019. [Online]. Available: https: //www.kaggle.com/carlosparadis/fires-from-space-australia-and-new-zeland [132] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982. [133] F. Bullo, R. Carli, and P. Frasca, “Gossip coverage control for robotic networks: Dynamical systems on the space of partitions,” SIAM Journal on Control and Optimization, vol. 50, no. 1, pp. 419–447, 2012. [134] Q. Du, M. Emelianenko, and L. Ju, “Convergence of the lloyd algorithm for computing centroidal voronoi tessellations,” SIAM journal on numerical analysis, vol. 44, no. 1, pp. 102–119, 2006. [135] M. Schwager, D. Rus, and J. J. Slotine, “Decentralized, adaptive coverage control for net- worked robots,” International Journal of Robotics Research, vol. 28, no. 3, pp. 357–375, 2009. [136] M. Schwager, M. P. Vitus, S. Powers, D. Rus, and C. J. Tomlin, “Robust adaptive coverage control for robotic sensor networks,” IEEE Transactions on Control of Network Systems, vol. 4, no. 3, pp. 462–476, 2017. [137] P. Davison, N. E. Leonard, A. Olshevsky, and M. Schwemmer, “Nonuniform Line Coverage from Noisy Scalar Measurements,” IEEE Transactions on Automatic Control, vol. 60, no. 7, pp. 1975–1980, 2015. [138] J. Choi and R. Horowitz, “Learning coverage control of mobile sensing agents in one- dimensional stochastic environments,” IEEE Transactions on Automatic Control, vol. 55, no. 3, pp. 804–809, 2010. [139] M. 
Agarwal, V. Aggarwal, and K. Azizzadenesheli, “Multi-agent multi-armed bandits with limited communication,” arXiv preprint arXiv:2102.08462, 2021. [140] P. Toth and D. Vigo, The vehicle routing problem. SIAM, 2002. 127 [141] A. Gunawan, H. C. Lau, and P. Vansteenwegen, “Orienteering problem: A survey of recent variants, solution approaches and applications,” European Journal of Operational Research, vol. 255, no. 2, pp. 315–332, 2016. [142] F. Pasqualetti, J. W. Durham, and F. Bullo, “Cooperative patrolling via weighted tours: Performance analysis and distributed algorithms,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1181–1188, 2012. 128