OPTIMAL & GAME THEORETIC FEEDBACK DESIGN FOR EFFICIENT HUMAN PERFORMANCE IN HUMAN-SUPERVISED AUTONOMY By Piyush Gupta A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy 2023 ABSTRACT Human-in-the-loop systems play a pivotal role in numerous safety-critical applications, ensur- ing both safety and efficiency in complex operational environments. However, these systems face a significant challenge stemming from the inherent variability in human performance, influenced by factors such as workload, fatigue, task learning, expertise, and individual dif- ferences. Therefore, effective management of human cognitive resources is paramount in designing efficient human-in-the-loop systems. To address this challenge, it is critical to design robust and adaptive systems capable of continuously adapting models of human performance, and subsequently providing tailored feedback to enhance it. Effective feedback mechanisms play a pivotal role in improving the overall system performance by optimizing human workload, fostering skill development, and facilitating efficient collaboration among individuals within diverse human teams, each with their unique skill sets and expertise. In this dissertation, the primary focus lies in exploring optimal and game-theoretic ap- proaches for feedback design to enhance system performance, particularly in scenarios where humans are integral components. We begin by studying the problem of optimal fidelity selec- tion for a human operator servicing a stream of homogeneous tasks, where fidelity refers to the degree of exactness and precision while servicing the task. Initially, we assume a known human service time distribution model, later relaxing this assumption. We design a human decision support system that recommends optimal fidelity levels based on the operator’s cognitive state and queue length. We evaluate our methods through human experiments involving participants searching for underwater mines. We extend the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. This extension involves the development of a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimization seeks to avoid these regions due to poor worst-case performance. To address this trade-off, we introduce an efficient Determin- istic Sequencing of Exploration and Exploitation (DSEE) algorithm for model-based rein- forcement learning. DSEE interleaves exploration and exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. In addition to cognitive resource management, enhancing human performance can also be achieved through tutoring for skill development. In this context, we study the impact of evaluative feedback on human learning in sequential decision-making tasks. We conduct experiments on Amazon Mechanical Turk, where participants engage with the Tower of Hanoi puzzle and receive AI-generated feedback during their problem-solving. We examine how this feedback influences their learning and skill transfer to related tasks. 
Additionally, we explore computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expand our focus from a single human operator to a team of heterogeneous agents, each with diverse skill sets and expertise. Within this context, we delve into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leverages a game theoretic framework, where we design utility functions to incentivize decentralized collaboration among these agents. Copyright by PIYUSH GUPTA 2023 This dissertation is dedicated to my beloved family. v ACKNOWLEDGEMENTS This dissertation would not have been possible without the support and encouragement of several people. First and foremost, my advisor, Dr. Vaibhav Srivastava, deserves special recognition. He has been there for me every step of the way, guiding me and supporting me in my academic journey. He has helped me improve my writing and analytical skills significantly. I am thankful for his unwavering support and for always making time for me during my PhD. I would also like to extend my gratitude to my graduate committee members, Dr. Xiaobo Tan, Dr. Hayder Radha, and Dr. Vishnu Boddeti, for their constant encouragement and guidance throughout my time in graduate school. They have played a significant role in my academic development. I extend my gratitude to Dr. Shaunak D. Bopardikar for his unwavering support during my PhD journey and invaluable insights into common resource pool games and analysis. Additionally, I would like to thank Dr. Joshua E. Siegel for our engaging discussions on various aspects of autonomous vehicles. These discussions inspired me to pursue internship opportunities at the Honda Research Institute (HRI), US, in 2022-2023. During my three internships at HRI, I had the privilege of working on problems related to interaction-aware planning and applying the skills and knowledge I gained during my PhD in the field of autonomous driving. I am deeply appreciative of my mentors, Sangjae Bae and David Isele from HRI, for their unwavering support throughout these internships. Their guidance not only broadened my research horizons but also played a significant role in refining my coding and writing abilities. I would like to express my gratitude to my friends and lab mates, Ankur, Dong Hae, Pradyumna Reddy, and Abhishek, for the enduring sense of camaraderie in the lab and the close friendships we’ve cultivated. Additionally, I want to thank my friends in California, Faizan Tariq, Sophia Newton, Aakriti Kumar, Snehal Dikhale, Faizan Siddiqui, and Avinash Singh with whom I formed a special bond during my internships at HRI. Our weekend hiking adventures, shared laughter, and the wisdom I gained from their experiences have continually vi inspired me to aim higher in life. No amount of words can adequately convey my gratitude for the unwavering love and wholehearted support of my family. I want to express my heartfelt thanks to my parents for instilling in me the values that have guided my path. Each day, I gain a deeper understanding of the sacrifices they made while raising me. My elder sister has been a constant source of guidance and support throughout the years, and from my younger brother, I have learned the importance of taking breaks and finding positivity during tough times. I am immensely grateful for their unwavering presence during challenging moments. 
My family has been a guiding light in my life, inspiring me on my journey. I can never thank them enough for everything they’ve done for me. I am deeply grateful to both my high school mathematics teacher, Mr. Sanjay Singh, and my IIT Delhi professor, Dr. Supreet Singh Bahga, for their early influence that ignited my appreciation for mathematics. They have collectively been unwavering sources of support and inspiration throughout my academic journey. I want to express my heartfelt thanks to both of them for consistently being there to provide guidance whenever I needed it. The work reported in this dissertation was supported in parts by the NSF awards IIS- 1734272, CMMI-1940950, ECCS-2024649, and ONR award N00014-22-1-2813. vii TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Organization and Contribution . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 OPTIMAL FIDELITY SELECTION FOR HUMAN-IN-THE- LOOP QUEUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 2.2 Numerical Illustration of Optimal Fidelity Selection . . . . . . . . . . . . 2.3 Structural Properties of the Optimal Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Conclusions and Future Directions CHAPTER 3 OPTIMAL FIDELITY SELECTION FOR IMPROVED PERFORMANCE IN HUMAN-IN-THE-LOOP QUEUES FOR UNDERWATER SEARCH . . . . . . . . . . . . . . . . . . . . . . 3.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 3.2 Human Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 4 ROBUST AND ADAPTIVE FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES . . . . . . . . . . . . . . . . . 4.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 4.2 Convergence of Robust Adaptive SMDP . . . . . . . . . . . . . . . . . . 4.3 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 5 DETERMINISTIC SEQUENCING OF EXPLORATION AND EXPLOITATION FOR REINFORCEMENT LEARNING . . . . . 5.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 5.2 DSEE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Analysis of DSEE algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 6 FOSTERING HUMAN LEARNING IN SEQUENTIAL DECISION-MAKING: UNDERSTANDING THE ROLE OF EVALUATIVE FEEDBACK . . . . . . . . . . . . . . . . . . . 6.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Human Experiments 6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 7 INCENTIVIZING COLLABORATION IN HETEROGENEOUS TEAMS VIA COMMON-POOL RESOURCE GAMES . . . . . . 7.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 
1 4 10 16 16 23 24 30 31 31 36 42 47 48 48 53 55 58 59 59 61 63 70 72 72 78 81 95 96 96 viii 7.2 Existence and Uniqueness of PNE . . . . . . . . . . . . . . . . . . . . . . 101 7.3 Convergence to the Nash Equilibrium . . . . . . . . . . . . . . . . . . . . 105 7.4 Social Welfare and Inefficiency of PNE . . . . . . . . . . . . . . . . . . . 106 7.5 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 . . . . . . . . . . . . . . . . . . . . . 111 7.6 Conclusions and Future Directions CHAPTER 8 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 APPENDIX A: CHAPTER 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 APPENDIX B: CHAPTER 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 APPENDIX C: CHAPTER 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 ix CHAPTER 1 INTRODUCTION Modern-day computers are getting faster and more efficient; allowing automation to perform complex tasks [1]. Nevertheless, in numerous safety-critical domains, human involvement remains essential to guarantee both safety and efficiency. These human-in-the-loop systems have become ubiquitous, finding application in fields such as search and rescue [2–4], semi- autonomous vehicle systems [5, 6], robot-assisted surgery [7], and flight control [2]. The role of humans within these systems is notably application-dependent, ranging from mere supervision to direct teleoperation of robots. For instance, in a search and rescue scenario [2], human operators might be tasked with remotely controlling robots to conduct search operations, aid in victim rescue efforts, and verify the findings of autonomous agents. Nonetheless, these systems encounter a noteworthy challenge rooted in the inherent vari- ability of human performance, which can be influenced by factors including workload, fa- tigue, task learning, expertise, and individual differences. Furthermore, the prevailing goal of maximizing the ratio of robots to human operators, often driven by cost considerations, frequently results in increased workload conditions for human operators [8]. As a conse- quence, the effective management of human cognitive resources becomes paramount in the design of efficient human-in-the-loop systems. To address this challenge, it is imperative to develop robust and adaptive systems that can continuously learn and adapt their models of human performance. These systems should utilize these learned models to actively provide customized feedback to humans, thus en- hancing their performance. Effective feedback mechanisms can elevate the overall system performance by optimizing the human workload, fostering the development of skills, and facilitating efficient collaboration among individuals within diverse human teams, each with their unique skill sets and expertise. This dissertation primarily focuses on the exploration of optimal and game-theoretic approaches for feedback design, with a specific emphasis on enhancing system performance, 1 particularly in scenarios where humans are integral components. The key objectives include: (i) Feedback design for managing human cognitive resources: A central aspect of this re- search focuses on designing feedback strategies that enable human operators to manage their cognitive resources efficiently. 
Specifically, these strategies offer optimal recom- mendations, based on the system state and cognitive state of the human operator, that optimize the utilization of human cognitive resources for improved system performance. (ii) Adaptive Algorithms for Robust Feedback under Model Uncertainty: Human perfor- mance is inherently agent-specific, influenced by a variety of factors such as workload, fatigue, and trust. Therefore, we delve into the design of robust and adaptable al- gorithms. These algorithms possess the ability to continuously learn and refine their model of human behavior, adapting and changing feedback based on updated human models. Importantly, the proposed adaptive recommendations optimize worst-case per- formance, and consequently, remain robust in the face of uncertainties in the human model. (iii) Role of feedback in fostering skill-development: Enhancing system performance can be achieved through effective tutoring and skill development in individuals performing the task. Therefore, we explore the role of feedback, extending beyond its immediate impact on current performance, to nurture skill development among human operators. Our objective is to use feedback as a means of providing effective tutoring, thereby fostering continuous growth and improvement over time. (iv) Feedback for Collaboration Across Skill sets: In scenarios where multiple humans with diverse skill sets are involved, we explore how feedback incentives can be harnessed to improve system performance. This involves achieving efficient team collaboration, ensuring that each individual’s unique expertise contributes to the collective success of the system. 2 To achieve these objectives, we begin by studying the problem of optimal fidelity selection for a human operator servicing a stream of homogeneous tasks, where fidelity refers to the degree of exactness and precision while servicing the task. Initially, we assume a known human service time distribution model, later relaxing this assumption. We design a human decision support system that recommends optimal fidelity levels based on the operator’s cognitive state and queue length. We evaluate our methods through human experiments involving participants searching for underwater mines. We extend the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. This extension involves the development of a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimizes worst-case performance and consequently avoids these regions. To address this trade-off, we introduce an efficient Deterministic Se- quencing of Exploration and Exploitation (DSEE) algorithm for model-based reinforcement learning (RL). DSEE interleaves exploration and robust exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. In addition to cognitive resource management, enhancing human performance can also be achieved through tutoring for skill development. In this context, we study the impact of evaluative feedback on human learning in sequential decision-making tasks. 
We conduct experiments on Amazon Mechanical Turk, where participants engage with the Tower of Hanoi puzzle and receive AI-generated feedback during their problem-solving. We examine how this feedback influences their learning and skill transfer to related tasks. Additionally, we explore computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expand our focus from a single human operator to a team of heterogeneous 3 agents, each with diverse skill sets and expertise. Within this context, we delve into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leverages a game-theoretic framework, where we design utility functions to incentivize decentralized collaboration among these agents. We show the existence of a unique Pure Nash Equilibrium (PNE) and establish the convergence of the best response dynamics to this unique PNE. Additionally, we establish an analytical upper bound on measures of PNE inefficiency, shedding light on the effectiveness of our collaborative strategies. In summary, this dissertation seeks to advance feedback design techniques, particularly in the context of human-in-the-loop systems, by introducing optimal and game-theoretic approaches. By focusing on enhancing human performance, fostering skill development, and optimizing collaboration, this research aims to make significant contributions to the design and implementation of more efficient and effective human-machine systems. 1.1 Literature review In this section, we review the literature in areas relevant to this dissertation. We organize the literature according to the broad topics of interest in this dissertation. 1.1.1 Control of Queues with State-dependent Servers We start by studying the problem of optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. We model it as a control of a queue problem, where the human acts as a server with its own cognitive dynamics. Recent years have seen significant efforts in integrating human knowledge and perception skills with autonomy [9]. A key research theme within this area concerns the systematic allocation of human cognitive resources for efficient performance. Therein, some of the fun- damental questions studied include optimal scheduling of the tasks to be serviced by the operator [10], enabling shorter operator reaction times by controlling the task release [11], and determining optimal operator attention allocation [12]. In contrast to the aforementioned works, we consider a semi-Markov decision process (SMDP) [13] formulation to deal with 4 general (non-memoryless) service time distributions of the human operator. Furthermore, while the above works propose heuristic algorithms, we focus on establishing the structural properties of the optimal policy. Some interesting recent studies with state-dependent queues are considered in [14, 15]. In these works, authors design scheduling policies that stabilize a queueing system and de- crease the utilization rate of a non-preemptive server that measures the proportion of time the server is working. The performance of the server degrades with the increase in server utilization and improves when the server is allowed to rest. In contrast to monotonic server performance with the utilization rate in [14, 15], we model the service time of the human operator as a unimodal function of its cognitive state. 
Our model for service time is inspired by experimental psychology literature [16] and incorporates the influence of cognitive state and fidelity level on service time. The optimal control of queueing systems [17] is a classical problem in queueing theory. Of particular interest are the works [18, 19], where authors study the optimal policies for an M/G/1 queue using an SMDP formulation and describe its qualitative features. In contrast to a standard control of queues problem, the server in our problem is a human operator with its cognitive dynamics that must be incorporated into the problem formulation. Our mathematical techniques to establish the structural properties of the optimal pol- icy are similar to [20]. In [20], the authors establish structural properties of an optimal transmission policy for transmitting packets over a point-to-point channel in communication networks. The optimal policy of their Markov decision process (MDP) [21] depends on the queue length, the number of packet arrivals, and the channel fading state. In [22], authors study structural properties of the optimal resource allocation policy for a single-queue sys- tem in which a central decision-maker assigns servers to each job. In contrast to [20, 22], a major challenge in our problem arises due to SMDP formulation for non-memoryless service time distribution and its unimodal dependence on the cognitive state. 5 1.1.2 Robust Control Policies with Uncertain Dynamics In this dissertation, we also study the optimal fidelity selection problem in the presence of uncertainty in human models. Specifically, we assume that the service time distribu- tion of the human operator is unknown a priori and therefore, formulate a robust adaptive SMDP. An SMDP accounts for the system uncertainty through probabilistic state transi- tions. However, the obtained policy is sensitive to errors in the stochastic models [23, 24]. The large uncertainty in the service time models, especially in the initial stage with limited observation data may lead to sub-optimal policies. Existing methods formulate such con- trol problems with model uncertainty as an MDP with an uncertain state transition model. In [25], authors propose a constrained MDP framework with risk constraints. They opti- mize Conditional Value-at-Risk (CVaR) [26] and propose an iterative offline algorithm to find the risk-constrained optimal control policy. In [27], a chance-constrained MDP [28] is proposed that provides a probabilistic framework for handling uncertainty in the transition probabilities. Robust MDPs are studied in [29, 30] that solve for policies that optimize a min-max criterion when the unknown stochastic model is assumed to lie within an uncer- tainty set. The policies obtained from solving robust MDPs can be overly conservative when implemented on the nominal system [27]. In our work, we consider a human agent with non-memoryless service time distribution and thus formulate a robust adaptive SMDP to deal with general service time distributions [18, 19] and learn a policy that is robust in the transient learning phase and converges to the optimal policy asymptotically. In this dissertation, we show that the solution of the synchronous and asynchronous value iteration (VI) methods [31] for robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. While there exists a convergence analysis for the robust [32, 33] and adaptive [34] MDPs, such an analysis is missing for robust adaptive SMDPs to the best of our knowledge. 
A key challenge that we address in comparison to MDPs is the time dependence of the robust adaptive Bellman operator for SMDPs that requires a careful comparison between optimal value functions for intermediate SMDPs at different time steps, 6 and the optimal value function for the uncertainty-free SMDP. 1.1.3 Efficient Algorithms for Model-Based Reinforcement Learning In this dissertation, we introduce an efficient DSEE algorithm for model-based RL. RL is used in solving complex sequential decision-making tasks in uncertain environments such as motion planning for robots [35, 36], personalized web services [37, 38], and the design of decision-support systems for human-supervisory control [39–41]. MDPs [31] provide a nat- ural framework for optimal decision-making under uncertainty and are used to model and solve numerous model-based RL problems. The objective of these problems is to simultane- ously learn the system model and the optimal policy. While MDP formulation accounts for environment uncertainty by using stochastic models, MDP policies are known to be sensitive to errors in these stochastic models [17, 42]. In many safety-critical systems, robust MDPs [29, 43] are used to mitigate performance degradation due to uncertainty in the learned MDP. However, to reduce the system uncer- tainty, the agent must explore the environment and visit parts of the state space associated with high estimation uncertainty. Most often, RL algorithms use simple randomized meth- ods to explore the environment, e.g. applying ϵ-greedy policies [44, 45] or adding random noise to continuous actions [46]. The objective of the robust MDPs conflicts with the ex- ploration objective. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimizes worst-case performance and consequently avoids these regions. Therefore, to balance the trade-off between learning the MDP and de- signing a robust policy, we design a DSEE algorithm, in which exploration and exploitation epochs of increasing lengths are interleaved. There exist efficient algorithms for solving RL problems with provable bounds on the sample complexity [47, Definition 1]. In [47], authors analyze the Model-based Interval Es- timation (MBIE) algorithm that applies confidence bounds to compute an optimistic policy and show that the algorithm is PAC-optimal [47, Definition 2]. They provide an upper bound on the algorithm’s sample complexity given by O (cid:16) |S|2|A| (cid:17) (1−γ)6ϵ3 log(δ−1) which is the maximum 7 number of time steps until when the MBIE policy is not ϵ−optimal with at least probability 1 − δ, where |S|, |A|, are the cardinality of the state space and action space, respectively, γ is the discount factor, and ϵ, δ ∈ (0, 1) are pre-defined constants. A similar bound on the sample complexity is obtained for the R-max algorithm [48] which distinguishes the “known" and “unknown" states based on how often they have been visited. It explores by acting to maximize rewards under the assumption that unknown states deliver the maximum reward. UCRL2 algorithm [49] relies on optimistic bounds on the reward functions and probability density functions and enjoys near-optimal regret bounds. A review of model-based RL al- gorithms with provable finite time guarantees can be found in [50, Chapter 38]. A major drawback of these algorithms is that they consider optimism in the face of uncertainty and hence, are not robust to the estimation uncertainties. 
Furthermore, these algorithms with random exploration might lead to a bad user experience in applications in which the RL agent seeks to learn human preferences for system optimization. To address these shortcomings, we propose a DSEE algorithm for model-based RL in which we design a deterministic sequence of exploration and exploitation epochs. The DSEE approach has been used in multi-arm bandit problems [51–54] and multi-robot coordination problems [55]. It allows for differentiation between exploration and exploitation epochs. The announced exploration may lead to a better user experience for the agents (especially for human agents) than random exploration at any time. For example, many personalized web services calibrate their recommendations intermittently by announced exploration, i.e., through surveys and user selection. Another advantage of the DSEE algorithm is that it allows for efficient exploration of the environment in multi-agent systems. Specifically, in multi-agent systems, exploration can be well-planned to cover all regions of the state-space through agent coordination which can be easily arranged due to the deterministic structure of exploration and exploitation. 8 1.1.4 Resource Sharing Games Another key focus of this dissertation is to incentivize collaboration in a team of het- erogeneous agents. We connect the class of problems involving human-team-supervised au- tonomy [3] with the common-pool resource (CPR) games [56, 57], and design utilities that yield the desired behavior. CPR games [56, 57] is a class of resource-sharing games in which players jointly manage a common pool of resource and make strategic decisions to maximize their utilities. Our CPR formulation has features similar to the CPR game studied in [57,58]. In these works, authors utilize prospect theory to capture the risk aversion behavior of the players investing in a fragile CPR [59] that fails if there is excessive investment in the CPR. In the case of CPR failure, no player receives any return from the CPR. While our design of the common review pool is similar to the fragile CPR, our failure model incorporates the constraint that only serviced tasks can be reviewed. In contrast to the agent heterogeneity due to prospect-theoretic risk preferences in [57], heterogeneity in our model arises due to differences in the agents’ mean service and review times. While we use human-team-supervised autonomy as a motivating example, our problem formulation can be applied to a broad range of problems involving tandem queues [60], where servicing and reviewing of tasks can be considered as the subsequent stages of the queueing network. Tandem queues are utilized to study problems such as resource allocation, inven- tory management, process optimization, and quality control [61]. Existing game theoretic approaches [62, 63] to service rate control in tandem queues assume that a single server is present at each stage of the tandem queue and each server has its independent resources. In contrast, in our setup, multiple heterogeneous agents allocate their time at different stations based on their skill sets and maximize the system throughput. Additionally, our mathe- matical techniques are applicable to many problems involving the dual-screening process. For example, in human-in-the-loop systems which are pervasive in areas such as search-and- rescue, semi-autonomous vehicle systems, surveillance, etc., humans often supervise (review) the actions (service) performed by the autonomous agents. 
In such settings, our framework 9 incentivizes collaboration among heterogeneous agents. Game-theoretic approaches have been utilized for problems in distributed control [64], wherein the overall system is driven to an efficient equilibrium strategy that is close to the social optimum through an appropriate design of utility functions [65]. Price of Anarchy (PoA) [66] is often used to characterize efficiency of the equilibrium strategies in a game. As- sociated analysis techniques utilize smoothness property of the utility functions [67], leverage submodularity of the welfare function [68], or solve an auxiliary optimization problem [69,70]. These approaches do not immediately apply to our setup. Instead, we follow a new line of analysis to obtain bounds on PoA by constructing a homogeneous CPR game, for which we show that the equilibrium strategy is also the social optimum (PoA=1), and relating its utility to the original game. 1.2 Organization and Contribution In this section, we present the organization of the remainder of the thesis and the con- tributions of the work in each chapter. Chapter 2: In this chapter, we study optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. The agent can service a task with a normal or high fidelity level, where fidelity refers to the degree of exactness and precision while servicing the task. Therefore, high-fidelity servicing results in higher-quality service but leads to larger service times and increased operator tiredness. We treat the cognitive state of the human operator as a lumped parameter that captures psychological factors such as workload, and fatigue. The service time distribution of the human operator depends on her cognitive dy- namics and the level of fidelity selected for servicing the task. Her cognitive dynamics evolve as a Markov chain in which the cognitive state increases with high probability whenever she is busy, and decreases while resting. The tasks arrive according to a Poisson process and each task waiting in the queue loses its value at a fixed rate. We address the trade-off be- tween high-quality service of the task and consequent loss in value of subsequent tasks using an SMDP framework. We numerically determine an optimal policy and the corresponding 10 optimal value function. Finally, we establish the structural properties of an optimal fidelity policy and provide conditions under which the optimal policy is a threshold-based policy. The material in this chapter is from [41] and [39]. The major contributions of this work are threefold. First, we pose the fidelity selection problem in an SMDP framework and compute an optimal policy. We formulate a control of queue problem, where in contrast to a standard queue, the server is a human operator with her own cognitive dynamics. Our model for service time distribution of the human operator incorporates the influence of cognitive state and fidelity level on the service time. Second, we numerically show the influence of cognitive dynamics on the optimal policy. In particular, we show that servicing the tasks with high fidelity is not always optimal due to larger service times and increased tiredness of the human operator. In fact, we determine the optimal policy as a function of the queue length as well as the cognitive state of the human operator. Our results provide insight into the efficient design of human decision support systems. 
Third, we establish structural properties of the optimal fidelity selection policy and provide sufficient conditions under which, for each cognitive state, there exist thresholds on queue lengths at which optimal policy switches fidelity levels. Chapter 3: In this chapter, we study the problem of optimal fidelity selection for a human operator performing an underwater visual search task. Human performance depends on various cognitive factors such as workload and fatigue. We perform human experiments in which participants perform two tasks simultaneously: a primary task, which was subject to evaluation, and a secondary task to estimate their workload. The primary task requires participants to search for underwater mines in videos, while the secondary task involves a simple visual test where they respond when a green light displayed on the side of their screens turns red. Videos arrive as a Poisson process and are stacked in a queue to be serviced by the human operator. The operator can choose to watch the video with either normal or high fidelity, with normal fidelity videos playing at three times the speed of high fidelity ones. Participants receive rewards for their accuracy in mine detection for each primary task and 11 penalties based on the number of videos waiting in the queue. We consider the workload of the operator as a hidden state and model the workload dynamics as an Input-Output Hidden Markov Model (IOHMM). We use a Partially Observable Markov Decision Process (POMDP) to learn an optimal fidelity selection policy, where the objective is to maximize total rewards. Our results demonstrate improved performance when videos were serviced based on the optimal fidelity selection policy compared to a baseline where humans chose the fidelity level themselves. The material in this chapter is from [71]. The major contributions of this work are threefold. First, we address the optimal fidelity selection challenge by framing it as a control of queue problem, incorporating a hidden workload state. We employ IOHMM and POMDP to derive the optimal fidelity selection policy. Second, we compare the human fidelity selection policy with the optimal policy and draw valuable insights into human behavioral patterns. Third, we illustrate that by recommending the optimal policy, autonomous systems can effectively aid human decision- making, leading to a substantial improvement in system performance. Chapter 4: In this chapter, we relax the assumption of the known human service time model in the optimal fidelity selection problem. We assume the parameters of the human’s service time distribution depend on the selected fidelity level and her cognitive state and are assumed to be unknown a priori. These parameters are learned online through Bayesian parameter estimation. We formulate a robust adaptive SMDP to solve our optimal fidelity selection problem. We extend the results on the convergence of robust-adaptive MDP to robust-adaptive SMDPs and show that the solution of the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Furthermore, we numerically illus- trate the convergence of the synchronous and asynchronous robust adaptive policy to the uncertainty-free optimal policy. The material in this chapter is from [17]. The major contributions of this work are fourfold. First, we pose the optimal fidelity se- lection problem with uncertain human service time distribution in a robust adaptive SMDP framework. 
Second, we continuously improve the service time distribution estimates using 12 Bayesian parametric estimation [72] and utilize it to obtain a robust policy. Third, we for- mally show that the solution of both synchronous and asynchronous value iteration methods for the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Fourth, we provide numerical illustrations that show the convergence of the robust adaptive SMDP solution to the uncertainty-free SMDP. Chapter 5: In this chapter, we propose a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., an MDP, and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates for expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub- linear function of time. The material in this chapter is from [73]. The major contributions of this work are twofold: (i) we propose a DSEE algorithm for model-based RL problems and (ii) we design the lengths of the exploration and exploitation epochs such that the cumulative regret for the DSEE algorithm grows as a sub-linear function of time. Chapter 6: In this chapter, we investigate the role of feedback in fostering human learning in sequential decision-making tasks. Cognitive rehabilitation, STEM skill acquisi- tion, and coaching games such as chess often require tutoring decision-making strategies. The advancement of AI-driven tutoring systems for facilitating human learning requires an understanding of the impact of evaluative feedback on human decision-making and skill de- velopment. To this end, we conduct human experiments using Amazon Mechanical Turk to study the influence of evaluative feedback on human decision-making in sequential tasks. In these experiments, participants solve the Tower of Hanoi puzzle and receive AI-generated feedback while solving it. We examine how this feedback affects their learning and skill transfer to related tasks. We also explore various computational models to understand how 13 people incorporate evaluative feedback into their decision-making processes. The material in this chapter is from [74]. There are three major contributions of this work. (i) We investigate the impact of different evaluative feedback strategies on the performance of individuals learning to solve ToH, a widely studied sequential decision-making task. Furthermore, we explore how individuals trained with different feedback strategies transfer their skills to a more challenging task. (ii) Treating humans as noisy optimal agents, we study how various evaluative feedback strategies affect their reward functions. Our research highlights the influence of different forms of evaluative feedback on the implicit reward structure that explains human decisions. (iii) We create a set of candidate computational models that may explain how humans integrate evaluative feedback into their sequential decision-making processes. Our goal is to identify the model that best explains human decision-making under evaluative feedback conditions. 
Chapter 7: In this chapter, we consider a team of heterogeneous agents that is collec- tively responsible for servicing, and subsequently reviewing, a stream of homogeneous tasks. Each agent has an associated mean service time and a mean review time for servicing and reviewing the tasks, respectively. Agents receive a reward based on their service and review admission rates. The team objective is to collaboratively maximize the number of “serviced and reviewed" tasks. We formulate a Common-Pool Resource (CPR) game and design utility functions to incentivize collaboration among heterogeneous agents in a decentralized manner. We show the existence of a unique Pure Nash Equilibrium (PNE), and establish convergence of best response dynamics to this unique PNE. Finally, we establish an analytic upper bound on three inefficiency measures of the PNE, namely the price of anarchy (PoA), the ratio of the total review admission rate (TRI), and the ratio of latency (LI). The material in this chapter is from [75] and [76]. The major contributions of this work are fivefold. First, we present a novel formula- tion of team backup behavior and design incentives, within the CPR game formalism, to 14 facilitate such behavior. Second, we show that there exists a unique PNE for the proposed game. Third, we show that the proposed game is a best response potential game as defined in [77], for which both sequential best response dynamics [78] and simultaneous best reply dynamics [79] converge to the PNE. Thus, the best response of self-interested agents in a decentralized team converge to the PNE. Fourth, we provide the structure of the social wel- fare solution and numerically quantify different measures of the inefficiency for the PNE, namely the PoA, the ratio of the total review admission rate (TRI), and the ratio of latency (LI), as a function of a measure of heterogeneity. While PoA is a widely used inefficiency metric, we define TRI and LI as other relevant measures for our setup based on the total review admission rate and latency (inverse of throughput), respectively. Finally, we provide an analytic upper bound for all three measures of the inefficiency. 15 CHAPTER 2 OPTIMAL FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES In this chapter, we study optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. The agent can service a task with a normal or high fidelity level, where fidelity refers to the degree of exactness and precision while servicing the task. Therefore, high-fidelity servicing results in higher-quality service but leads to larger service times and increased operator tiredness. We treat the cognitive state of the human operator as a lumped parameter that captures psychological factors such as workload, and fatigue. The service time distribution of the human operator depends on her cognitive dynamics and the level of fidelity selected for servicing the task. Her cognitive dynamics evolve as a Markov chain in which the cognitive state increases with high probability whenever she is busy, and decreases while resting. The tasks arrive according to a Poisson process and each task waiting in the queue loses its value at a fixed rate. We address the trade-off between high-quality service of the task and consequent loss in value of subsequent tasks using an SMDP framework. We numerically determine an optimal policy and the corresponding optimal value function. 
Finally, we establish the structural properties of an optimal fidelity policy and provide conditions under which the optimal policy is a threshold-based policy. 2.1 Background and Problem Formulation We now discuss our problem setup, formulate it as an SMDP, and solve it to obtain an optimal fidelity selection policy. 2.1.1 Problem Setup We consider a human supervisory control system in which a human operator is servicing a stream of homogeneous tasks. The human operator may service these tasks with different levels of fidelity. The servicing time of the operator depends on the fidelity level with which she services the task as well as her cognitive state. We assume that the mean service time of the operator increases with the selected fidelity level. For example, when the operator 16 Figure 2.1 Overall schematic of the problem setup. The incoming tasks arrive as a Poisson process with rate λ. The tasks are serviced by the human operator based on the recommended fidelity level by the decision support system. Each task loses its value at a fixed rate while waiting in the queue. services the task with high fidelity, she may look into deeper details of the task, and conse- quently take a longer time to service. In addition to the fidelity level, the human service time may depend on their cognitive state. We treat the cognitive state as a lumped parameter that can capture various physiological measures. It can be a function of stress, workload, arousal rate, operator utilization ratio, etc. Such lumped representation can be obtained by classifying these psychological measure- ments into different service time distribution parameters. Inspired by the Yerkes-Dodson law, for a fixed level of fidelity, we model the service time as a unimodal function of the human cognitive state. Specifically, the mean service time is minimal corresponding to an intermediate optimal cognitive state (later referred to as the optimal cognitive state cog∗) as shown in Fig. 2.2c. We are interested in the optimal fidelity selection policy for the human operator. To this end, we formulate a control of queue problem, where in contrast to a standard queue, the server is a human operator with her cognitive dynamics. The incoming tasks arrive according to a Poisson process at a given rate λ ∈ R>0 and are serviced by the operator based on the fidelity level recommended by a decision support system (Fig. 2.1). We consider a dynamic queue of homogeneous tasks with a maximum capacity L ∈ N. The operator is penalized for each task waiting in the queue at a constant rate c ∈ R>0 per unit delay in its servicing. The set of possible actions available for the operator corresponds to (i) Waiting (W), when the queue is empty, (ii) Resting (R), which allows the operator to rest and reach the optimal 17 cognitive state, (iii) Skipping (S), which allows the operator to skip a task to reduce the queue length and thereby focus on newer tasks, (iv) Normal Fidelity (N) for servicing the task with normal fidelity, and (v) High Fidelity (H) for servicing the task more carefully with high precision. The skipping action ensures the stability of the queue by allowing the operator to reduce the queue length by skipping some tasks. Ideally, through appropriate control of the arrival rate, the system designer should ensure that skipping is not an optimal action. Let s ∈ S be the state of the system and As be the set of admissible actions in state s, which we define formally in Section 2.1.2. 
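To make the unimodal, Yerkes–Dodson-inspired service-time model described above concrete, the following sketch generates a hypothetical mean service time that is minimized at an intermediate cognitive state cog* and inflated for high-fidelity servicing. The quadratic form and all numerical constants are illustrative assumptions, not the hypergeometric parameterization used later in the numerical results.

```python
import numpy as np

def mean_service_time(cog, fidelity, cog_star=0.6):
    """Hypothetical unimodal mean service time (arbitrary time units).

    cog      : cognitive state in [0, 1].
    fidelity : 'N' (normal) or 'H' (high); high fidelity takes longer.
    cog_star : cognitive state at which servicing is fastest.
    """
    base = 4.0 if fidelity == 'H' else 2.0              # high fidelity is slower
    return base * (1.0 + 3.0 * (cog - cog_star) ** 2)   # minimum at cog_star

for cog in np.linspace(0.0, 1.0, 6):
    print(f"cog={cog:.1f}  normal={mean_service_time(cog, 'N'):.2f}  "
          f"high={mean_service_time(cog, 'H'):.2f}")
```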
The human receives a reward $r : S \times A_s \to \mathbb{R}_{\geq 0}$ defined by
$$
r(s, a) =
\begin{cases}
r_H, & \text{if } a = H, \\
r_N, & \text{if } a = N, \\
0, & \text{if } a \in \{W, R, S\},
\end{cases}
\qquad (2.1)
$$
where $r_H, r_N \in \mathbb{R}_{\geq 0}$ and $r_H > r_N$. We intend to design a decision support system that assists the operator by recommending the optimal fidelity level to service each task¹. The recommendation is based on the queue length and the operator's cognitive state, to which we assume real-time access using, e.g., Electroencephalogram (EEG) measurements (see [80] for measures of cognitive load from EEG data) or eye-tracking and pupillometry [81]. We assume that the noisy data from these devices can be clustered into a finite number of bins to estimate the cognitive state. We study the optimal policy under perfect knowledge of the cognitive state².

¹We assume compliance of the operator with the recommendations. To account for non-compliance, we can introduce p as the probability of compliance and 1 − p as the probability that the operator will deviate and follow a different behavioral policy. This deviation can be incorporated by using a mixed service time distribution with probabilities p and 1 − p for the recommended and behavioral actions, respectively.
²If the cognitive state is not perfectly known, then our policy can be used within algorithms such as QMDP [82] to derive approximate solutions to the associated partially observable Markov decision process [83].

Figure 2.2 Service time distribution of the human operator with (a) varying cognitive state and high fidelity, and (b) varying action and fixed cognitive state, cog = 0.9. (c) The mean and variance of the service time distribution are unimodal functions of the cognitive state. (d) The mean sojourn time takes on different forms based on the selected action.

Action   Forward probability^a (λ_f δt)   Backward probability^b (λ_b δt)   Stay probability^c (1 − λ_f δt − λ_b δt)
W        λ_f = 0.02 (noise)               λ_b = 0.5                         1 − 0.52 δt
R        λ_f = 0.02 (noise)               λ_b = 0.5                         1 − 0.52 δt
N        λ_f = 0.6                        λ_b = 0.02 (noise)                1 − 0.62 δt
H        λ_f = 1.1                        λ_b = 0.02 (noise)                1 − 1.12 δt
S        λ_f = 0                          λ_b = 0                           1
^a The forward probability does not exist for cog = 1 (reflective boundary). ^b The backward probability does not exist for cog = 0 (reflective boundary). ^c The stay probability is 1 − λ_f δt for cog = 0 and 1 − λ_b δt for cog = 1.
Table 2.1 Cognitive dynamics modeled as a Markov chain.

2.1.2 Mathematical Modeling

We formulate the control of queue problem as a discrete-time SMDP Γ defined by the following six components:

(i) A finite state space $S := \{(q, cog) \mid q \in \{0, 1, \dots, L\},\ cog \in C := \{i/N\}_{i \in \{0, \dots, N\}}\}$, for some $N \in \mathbb{N}$, where q is the queue length and cog represents the lumped cognitive state, which increases (decreases) when the operator is busy (idle).

(ii) A set of admissible actions $A_s$ for each state $s \in S$, given by: (a) $A_s := \{W\}$ when the queue is empty (q = 0); (b) $A_s := \{R, S, N, H\}$ when the queue is non-empty and cog > cog∗, where cog∗ ∈ C is the optimal cognitive state associated with the minimum mean service time; and (c) $A_s := \{S, N, H\}$ when the queue is non-empty and cog ≤ cog∗.

(iii) A state transition distribution $P(s' \mid \tau, s, a)$ from state s to s′ for each action $a \in A_s$, conditioned on the discrete sojourn time $\tau \in \mathbb{R}_{>0}$ (the time spent in state s before transitioning into the next state s′). The state transition from s = (q, cog) to s′ = (q′, cog′) consists of two independent transition processes: (a) a Poisson process for the transition from q to q′, and (b) the human cognitive dynamics for the transition from cog to cog′. We model the cognitive dynamics of the human operator as a Markov chain in which, while servicing the task, the probability of an increase in the cognitive state in a small time $\delta t \in \mathbb{R}_{>0}$ is greater than the probability of a decrease; furthermore, the probability of an increase grows with the level of fidelity selected for servicing the task. Similarly, while waiting or resting, the probability of a decrease in the cognitive state in a small time δt is higher than the probability of an increase. Sample parameters of the model used in our numerical simulations are shown in Table 2.1. This model of cognitive state dynamics is a stochastic equivalent of the deterministic models of the utilization ratio considered in [11]. It is assumed that the cognitive state remains unchanged when the human operator chooses to skip the task. A simulation sketch of these dynamics is given after this list.
(iv) A sojourn time distribution $P(\tau \mid s, a)$ of the (discrete) time $\tau \in \mathbb{R}_{>0}$ spent in state s until the next action is chosen, which takes on different forms depending on the selected action (Fig. 2.2d). The sojourn time is the service time while servicing the task (normal or high fidelity), the resting time while resting, a constant skipping time $t_s \in \mathbb{R}_{>0}$ while skipping, and the time until the next task arrival while waiting in the case of an empty queue. We model the rest time as the time required to reach the optimal cognitive state cog∗ from the current cognitive state. In our numerical illustrations, we model the service time distribution while servicing the task using a hypergeometric distribution (Figs. 2.2a and 2.2b), where the parameters of the distribution are chosen such that the mean service time has the desired characteristics, i.e., it increases with the fidelity level (Fig. 2.2d) and is a unimodal function of the cognitive state (Fig. 2.2c). While resting, the sojourn time distribution is the first passage time (FPT) distribution for transitioning from the current cognitive state cog to cog∗. We determine this distribution using matrix methods [84] applied to the Markov chain used to model the cognitive dynamics. Finally, to ensure the stability of the queue, we assume that the constant skip time is less than $1/\lambda$, i.e., the queue length decreases on average while skipping tasks.

(v) For selecting action a in state s, the human receives the bounded reward r(s, a) defined in (2.1). Additionally, the human incurs a penalty at a constant cost rate c for each task waiting in the queue, and consequently, the cumulative expected cost for choosing action a in state s = (q, cog) is
$$
\sum_{\tau} P(\tau \mid s, a)\, c\, \tau\, \mathbb{E}\!\left[\frac{q + q'}{2} \,\middle|\, \tau, s, a\right]
= \sum_{\tau} P(\tau \mid s, a)\, c\, \tau \left(\frac{2q + \lambda \tau}{2}\right),
$$
which is obtained by using $\mathbb{E}[q \mid \tau, s, a] = q$ and $\mathbb{E}[q' \mid \tau, s, a] = q + \lambda \tau$. The expected net immediate reward received by the operator for selecting action a in state s is
$$
R(s, a) = r(s, a) - \sum_{\tau} P(\tau \mid s, a)\, c \left(\frac{2q + \lambda \tau}{2}\right) \tau
= r(s, a) - c\, \mathbb{E}[\tau \mid s, a]\, q - \frac{c \lambda}{2}\, \mathbb{E}[\tau^2 \mid s, a], \qquad (2.2)
$$
where $\mathbb{E}[\tau \mid s, a]$ and $\mathbb{E}[\tau^2 \mid s, a]$ are the first and second conditional moments of the sojourn time distribution, respectively. A numerical sketch of (2.2), together with the cognitive dynamics of Table 2.1, is given after this list.

(vi) A discount factor γ ∈ [0, 1), which we choose as 0.96 in our numerical illustration.
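Two pieces of the specification above lend themselves to a compact sketch: the cognitive-state Markov chain of item (iii) with the rates in Table 2.1, and the expected net immediate reward (2.2) of item (v), which only requires the first two moments of the sojourn time. The snippet below is a minimal illustration; the time step, bin count, reward values, penalty rate, and the discrete sojourn-time distribution used in the demo are hypothetical placeholders, not the hypergeometric model used in the numerical results.

```python
import random

# Forward/backward rates (per unit time) for each action, from Table 2.1.
RATES = {'W': (0.02, 0.5), 'R': (0.02, 0.5),
         'N': (0.6, 0.02), 'H': (1.1, 0.02), 'S': (0.0, 0.0)}

def step_cognitive_state(cog, action, dt=0.1, n_bins=10):
    """One delta-t step of the cognitive-state Markov chain on {0, 1/N, ..., 1}."""
    lam_f, lam_b = RATES[action]
    p_up = lam_f * dt if cog < 1.0 else 0.0     # reflective boundary at cog = 1
    p_down = lam_b * dt if cog > 0.0 else 0.0   # reflective boundary at cog = 0
    u = random.random()
    if u < p_up:
        return round(cog + 1.0 / n_bins, 10)
    if u < p_up + p_down:
        return round(cog - 1.0 / n_bins, 10)
    return cog                                  # stay in the current bin

def expected_net_reward(r_sa, q, lam, c, tau_pmf):
    """R(s,a) = r(s,a) - c E[tau] q - (c lam / 2) E[tau^2], cf. (2.2)."""
    m1 = sum(p * t for t, p in tau_pmf.items())        # E[tau | s, a]
    m2 = sum(p * t ** 2 for t, p in tau_pmf.items())   # E[tau^2 | s, a]
    return r_sa - c * m1 * q - 0.5 * c * lam * m2

# Demo: high-fidelity servicing drives the cognitive state up ...
cog = 0.5
for _ in range(100):
    cog = step_cognitive_state(cog, 'H')
# ... and the net reward for that choice at queue length q = 3 (hypothetical pmf).
tau_pmf = {2: 0.25, 3: 0.5, 4: 0.25}
print(round(cog, 1), expected_net_reward(r_sa=2.0, q=3, lam=0.5, c=0.1, tau_pmf=tau_pmf))
```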
Remark 1. Although we assume a finite skip time, an alternative approach is to incorporate a penalty for the skip action. Note that, unlike a fixed penalty, a finite skip time results in a penalty that increases with the queue length (see (2.2)). Consequently, the current approach is less inclined to skip tasks as the queue length increases compared with a model that uses a constant penalty.

Remark 2. The reward R(s, a) formulation can be interpreted as the unconstrained SMDP corresponding to a constrained SMDP that maximizes r(s, a) subject to a constraint on the average queue length for the stability of the queue. Therefore, the penalty rate c acts as the Lagrange multiplier for the unconstrained problem and, hence, can be obtained by primal-dual methods that use dual ascent to find the Lagrange multiplier [20].

2.1.3 Solving the SMDP for an Optimal Policy

For SMDP Γ, the optimal value function $V^* : S \to \mathbb{R}$ satisfies the following Bellman equation [85]:
$$
V^*(s) = \max_{a \in A_s} \left[ R(s, a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a)\, V^*(s') \right], \qquad (2.3)
$$
where $P(s', \tau \mid s, a)$, the joint probability that a transition from state s to state s′ occurs after time τ when action a is selected, can be rewritten as
$$
P(s', \tau \mid s, a) = P(s' \mid \tau, s, a)\, P(\tau \mid s, a), \qquad (2.4)
$$
where $P(s' \mid \tau, s, a)$ and $P(\tau \mid s, a)$ are given by the state transition probability distribution and the sojourn time probability distribution, respectively. An optimal policy $\pi^* : S \to A_s$ selects, at each state s, an action that achieves the maximum in (2.3). We utilize the value iteration algorithm [31] to compute an optimal policy.

2.2 Numerical Illustration of Optimal Fidelity Selection

We now numerically illustrate the optimal value function and an optimal policy for SMDP Γ.

Figure 2.3 (a) Optimal policy π∗ and (b) optimal value function V ∗ for SMDP Γ when the time required to skip a task is not too small compared with the mean service time required to process it.

Figure 2.4 Optimal policy π∗ for different values of the arrival rate λ. In cases (a) and (b), the action S in the optimal policy does not have a unique threshold for some cognitive states; similarly, in case (c), actions S and N do not have unique thresholds.

Figs. 2.3a and 2.3b show an optimal policy π∗ and the optimal value function V ∗, respectively, for the case in which the skip time is not too small compared with the mean service time. If the skip time is too small, the action S is optimal almost everywhere in order to reduce the queue length. For a sufficiently high arrival rate λ, such that there is always a task in the queue after servicing the current task, we observe that, for any given cog, V ∗ is monotonically decreasing in q. Additionally, for a given q, V ∗ is a unimodal function of cog, with its maximum at the optimal cognitive state (cog∗ = 0.6 in the numerical illustrations). We observe that π∗ selects the high fidelity level around cog∗ for low queue lengths and transitions to the normal fidelity level for higher queue lengths. In low cognitive states, the optimal policy is to keep skipping tasks until the queue length becomes small, and then to start servicing tasks. In higher cognitive states, resting is the optimal action at smaller queue lengths, while skipping tasks is optimal at larger queue lengths. We also observe the effect of cog on π∗: in particular, π∗ switches from H to N, N to R, and R to S at certain thresholds on q, and these thresholds appear to be a unimodal function of cog. This behavior can be attributed to the mean service time being unimodal with respect to cog. Fig. 2.4 shows examples of π∗ for certain parameters. We observe that, for some cognitive states, π∗ does not have a unique threshold and the same action reappears after switching to another action; for example, in Fig. 2.4a, the action S is observed between actions H and N, as well as after action N. In the following section, we provide sufficient conditions under which π∗ has unique transition thresholds at which actions switch, and the previous action does not reappear for the same cog.
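Before turning to the structural analysis, the following sketch indicates how an optimal policy such as the one in Figs. 2.3 and 2.4 can be computed by value iteration on the Bellman equation (2.3), once the joint kernel P(s′, τ | s, a) and the net reward R(s, a) have been tabulated. The tiny two-state example is purely illustrative and is not the model used to generate the figures.

```python
def smdp_value_iteration(R, P, gamma=0.96, tol=1e-8, max_iter=10_000):
    """Value iteration for an SMDP (Bellman equation (2.3)).

    R[s][a]          : expected net immediate reward.
    P[s][a][(s2, t)] : joint probability of next state s2 and sojourn time t.
    Returns the optimal value function V and a greedy policy.
    """
    states = list(R.keys())
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new, policy = {}, {}
        for s in states:
            # Q(s, a) = R(s, a) + sum over (s', tau) of gamma^tau * P * V(s')
            q_sa = {a: R[s][a] + sum(p * gamma ** t * V[s2]
                                     for (s2, t), p in P[s][a].items())
                    for a in R[s]}
            policy[s] = max(q_sa, key=q_sa.get)
            V_new[s] = q_sa[policy[s]]
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new, policy
        V = V_new
    return V, policy

# Tiny synthetic example with two states and two fidelity actions.
R = {0: {'N': 1.0, 'H': 1.5}, 1: {'N': 0.5, 'H': 1.2}}
P = {0: {'N': {(0, 1): 0.7, (1, 1): 0.3}, 'H': {(0, 2): 0.4, (1, 2): 0.6}},
     1: {'N': {(0, 1): 0.5, (1, 1): 0.5}, 'H': {(0, 3): 0.8, (1, 3): 0.2}}}
V, pi = smdp_value_iteration(R, P)
print(V, pi)
```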
The unimodal dependence of these thresholds on cog can be attributed to the mean service time being unimodal w.r.t. cog. Fig. 2.4 shows some examples of π∗ for certain parameters. We observe that for some cognitive states, π∗ does not have a unique threshold and the same action reappears after switching to another action. For example, in Fig. 2.4a, action S is observed between actions H and N, as well as after action N. In the following section, we provide sufficient conditions under which π∗ has unique transition thresholds at which actions switch, and the previous action does not re-appear for the same cog.

2.3 Structural Properties of the Optimal Policy

We establish the structural properties of the optimal infinite-horizon value function by considering the finite-horizon case and then extending the results to the infinite horizon by taking the infinite-step limit. Let V∗_n(s0), n ≥ 0, be the discounted n-step optimal expected reward when the initial state is s0, where V∗_0(q, cog) = −Cq is the terminal cost for the finite-horizon case for a non-negative constant C. Each step k ∈ {0, . . . , n − 1} is based on the sojourn time τk spent in state sk when action ak is selected. Let J_{n,π}(s0) denote the discounted n-step expected reward with initial state s0 under a given policy π. Henceforth, for brevity of notation, we denote the conditional expectation E[·|s0, π] by Eπ[·]. J_{n,π}(s0) is given by:

J_{n,π}(s0) = Eπ[ Σ_{i=0}^{n−1} γ^{ζi} R(si, ai) − γ^{ζn} C qn ], (2.5)

where ζi := Σ_{j=0}^{i−1} τj for i > 0 and ζ0 := 0. The discounted n-step optimal expected reward V∗_n(s0) is given by:

V∗_n(s0) = J_{n,π∗}(s0), (2.6)

where π∗ is the optimal policy that maximizes J_{n,π}(s0) at each s0. Let µ1 : S × As ↦ R>0 and µ2 : S × As ↦ R>0 be functions defined by µ1(s, a) = E[τ|s, a] and µ2(s, a) = E[τ²|s, a], where τ is the sojourn time. We study the structural properties of the optimal policy for a large queue capacity, i.e., in the limit L → +∞, and under the following assumptions:

(A1) The task arrival rate λ is sufficiently high so that the queue is never empty with high probability³.

(A2) For any state s = (q, cog)⁴:

µ1(s, S) < µ1(s, R) < µ1(s, N) < µ1(s, H), and µ2(s, S) < µ2(s, R) < µ2(s, N) < µ2(s, H). (2.7)

(A3) We assume that Eπ[γ^τ] ≤ f(Eπ[τ], Varπ(τ)) < 1, where Varπ(τ) = Var(τ|s0, a = π(s0)) is the variance of τ in any initial state s0 under a given policy π, and f is a monotonic function such that f(·, Varπ(τ)) is monotonically decreasing and f(Eπ[τ], ·) is monotonically increasing.

³Given the service time distributions and the Poisson arrival rate, we can precisely determine the distribution of the number of arrivals that occur between each state transition. Therefore, Chernoff bounds [86] can be utilized to characterize the high probability that the number of arrivals between state transitions while servicing a task exceeds one.

⁴The action R is only available for states with cog > cog∗.

We make assumption (A1) for convenience. Indeed, if the queue is allowed to be empty, then we would need to deal with an extra “waiting" action. Also, high arrival rates are the most interesting setting in which to study optimal fidelity selection. Assumption (A2) is true for a broad range of interesting parameters that define the sojourn time distributions; a quick numerical check of this ordering is sketched below.
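A minimal sketch of such a check, assuming the per-action sojourn-time distributions at a fixed state are given as dictionaries mapping sojourn times to probabilities (placeholder data structures, not those of the numerical study):

import numpy as np

def satisfies_A2(pmfs):
    """Check the moment ordering (2.7) at one state.

    pmfs : dict mapping each action in ['S', 'R', 'N', 'H'] to a dict {tau: prob}.
    Returns True if both the first and second moments are strictly increasing
    in the order S < R < N < H.
    """
    m1, m2 = [], []
    for action in ['S', 'R', 'N', 'H']:
        taus = np.array(list(pmfs[action].keys()), dtype=float)
        probs = np.array(list(pmfs[action].values()), dtype=float)
        m1.append(np.sum(probs * taus))
        m2.append(np.sum(probs * taus ** 2))
    return bool(np.all(np.diff(m1) > 0) and np.all(np.diff(m2) > 0))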
Assumption (A3) holds for a class of light-tail distributions with non-negative support for τ , for example, when the moment generating function (MGF) of τ is upper bounded by the MGF of Gamma distribution, i.e., Eπ[etτ ] ≤ (cid:18) 1 − Varπ(τ )t Eπ[τ ] (cid:19) − Eπ [τ ]2 Varπ (τ ) , for all t < Eπ[τ ] Varπ(τ ) . In this scenario, substituting t = ln(γ) < 0 < Eπ[τ ] Varπ(τ ) , we get Eπ[γτ ] ≤ (cid:18) 1 − Varπ(τ ) ln(γ) Eπ[τ ] (cid:19) − Eπ [τ ]2 Varπ (τ ) =: f (Eπ[τ ], Varπ(τ )). Let ρ := maxcog,a f (E[τ |cog, a], Var(τ |cog, a)). Therefore, Eπ[γτ ] ≤ ρ. For the class of distributions of τ satisfying assumption (A3), and any initial state s0 and policy π, we have Eπ[γζk] (1)∗ = k−1 (cid:89) i=0 Eπ[γτi] ≤ k−1 (cid:89) i=0 f (Eπ[τi], Varπ(τi)) ≤ ρk, where (1)∗ follows from the independence of τi and τj, for i ̸= j. Therefore, we have lim n→∞ n−1 (cid:88) k=0 Eπ[γζk] ≤ lim n→∞ n−1 (cid:88) k=0 ρk = 1 1 − ρ . We will now establish that the optimal policy for SMDP Γ is a threshold-based policy if the following condition holds for each cognitive state cog: min{E[τ |cog, H] − E[τ |cog, N ], E[τ |cog, N ] − E[τ |cog, R], E[τ |cog, R] − ts}+ tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ max a∈As E[γτ |cog, a], (2.8) where tmax = E[τ |cog = 1, a = H] is the maximum expected sojourn time (assuming the largest mean service time in the highest cognitive state), and ts is the constant time for the skip. 26 Remark 3. For tasks with large differences in expected sojourn times, i.e., 0 ≪ ts ≪ E[τ |cog, R] ≪ E[τ |cog, N ] ≪ E[τ |cog, H], maxa∈As E[γτ |cog, a] → 0, and (2.8) always holds. We introduce the following notation. Let q∗ j : C (cid:55)→ Z≥0 ∪{+∞}, for j ∈ {1, 2, 3} be some functions of cog. Theorem 1 (Structure of optimal policy ). For SMDP Γ under assumptions (A1-A3) and an associated optimal policy π∗, if the difference in the expected sojourn times is sufficiently large such that (2.8) holds for any cognitive state cog, then the following statements hold: (i) there exists unique threshold functions q∗ 1(cog), q∗ 2(cog), and q∗ 3(cog) such that for each cog > cog∗: π∗(s = (q, cog)) =    H, N, R, S, q ≤ q∗ 1(cog), 1(cog) < q ≤ q∗ q∗ 2(cog), 2(cog) < q ≤ q∗ q∗ 3(cog), q > q∗ 3(cog); (ii) there exists unique threshold functions q∗ 1(cog) and q∗ 2(cog) such that for any cog ≤ cog∗: π∗(s = (q, cog)) =    H, N, S, q ≤ q∗ 1(cog), 1(cog) < q ≤ q∗ q∗ 2(cog), q > q∗ 2(cog). We prove Theorem 1 using the following lemmas. Lemma 1. (Immediate Reward): For SMDP Γ, the immediate expected reward R(s, a), for each a ∈ As (i) is linearly decreasing with queue length q for any fixed cognitive state cog; 27 (ii) is a unimodal function5 of the cognitive state cog for any fixed queue length q with its maximum value achieved at the optimal cognitive state cog∗. Proof. The proof follows by noting that (2.2) is linearly decreasing in q and the coefficients E [τ | s, a] and E [τ 2|s, a] are unimodal w.r.t cog. Interested readers can refer to [87] for detailed proof. We now provide important mathematical results in Lemma 2 which we use to establish Lemma 3. Lemma 2. For the SMDP Γ, the following equations hold for any initial state s0 and policy π: (i) Eπ (cid:2)γζk E[τ 2 k |cogk, ak](cid:3) = Eπ (cid:2)γζkτ 2 k (cid:3); (ii) Eπ (cid:2)γζk E[τk|cogk, ak]qk (cid:3) = Eπ (cid:2)γζkτk Eπ [qk|s0, ζk](cid:3) Proof. The proof utilizes the properties of the expectation operator, and independence of the transition processes for qk and cogk proof. . 
Interested readers can refer to [87] for detailed Lemma 3. (Value function bounds): For SMDP Γ under assumptions (A1-A3), for any ˜q0 ≥ q0, 0 ≤ cts∆q 1−γtmax ≤ V ∗(q0, cog0) − V ∗(˜q0, cog0) ≤ ctmax∆q 1−ρ , where ∆q = ˜q0 − q0, ρ is an upper bound on Eπ[γτ ], tmax = E[τ |cog = 1, a = H] is the maximum expected sojourn time, and ts is the constant time for skip. Proof. See Appendix A: Chapter 2 for the proof. Remark 4. It follows from Lemma 3, that for SMDP Γ under assumptions (A1-A3), the optimal value function V ∗(q, ·) is monotonically decreasing with queue length q. 5The expected immediate reward under action S is a constant, which we treat as a unimodal function. 28 Lemma 4. (Thresholds for low cognitive states): For the SMDP Γ under assumptions (A1-A3), and an associated optimal policy π∗, the following statements hold for each cog ≤ cog∗: (i) there exists a threshold function q∗ 1(cog), such that N strictly dominates H, for each q > q∗ 1(cog) if E[τ |cog, H] − E[τ |cog, N ] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, N ]; (ii) there exists a threshold function q∗ 2(cog), such that for each q > q∗ 2(cog), action S is optimal if E[τ |cog, N ] − ts + tsγE[τ |cog,H] 1 − γtmax ≥ γts tmax 1 − ρ . Proof. See Appendix A: Chapter 2 for the proof. Lemma 5. (Thresholds for high cognitive states): For the SMDP Γ under assumptions (A1-A3), and an associated optimal policy π∗, the following statements hold for each cog > cog∗: (i) there exists a threshold function q∗ 1(cog), such that N strictly dominates H, for each q > q∗ 1(cog) if E[τ |cog, H] − E[τ |cog, N ] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, N ]; (ii) there exists a threshold function q∗ 2(cog), such that R strictly dominates H & N, for each q > q∗ 2(cog) if E[τ |cog, N ] − E[τ |cog, R] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, R]. (iii) there exists a threshold function q∗ 3(cog), such that for each q > q∗ 3(cog), action S is optimal if E[τ |cog, R] − ts + tsγE[τ |cog,H] 1 − γtmax ≥ γts tmax 1 − ρ . 29 Proof. Recall that As := {{R, S, N, H }| s ∈ S, q ̸= 0} when queue is non-empty and cog > cog∗. The proof of Lemma 5 follows analogously to the proof of Lemma 4. Proof of Theorem 1: The proof follows by finding the intersection of the sufficient condi- tions from Lemmas 4 and 5 to obtain condition (2.8) for a threshold-based π∗. 2.4 Conclusions and Future Directions We studied optimal fidelity selection for a human operator servicing a stream of homo- geneous tasks using an SMDP framework. In particular, we studied the influence of human cognitive dynamics on an optimal fidelity selection policy. We presented numerical illus- trations of the optimal policy and established its structural properties. These structural properties can be leveraged to tune the design parameters, deal with the model uncertainty, or determine a minimally parameterized policy for specific individuals and tasks. There are several possible avenues for future research. An interesting direction is to conduct experiments with human subjects, measure EEG signals to assess their cognitive state and test the benefits of recommending optimal fidelity levels. It is of interest to extend this work to a team of human operators servicing a stream of heterogeneous tasks. A preliminary setup is considered in [75, 88], where authors study a game-theoretic approach to incentivize col- laboration in a team of heterogeneous agents. In such a setting, finding the optimal routing and scheduling strategies for these heterogeneous tasks is also of interest. 
30 CHAPTER 3 OPTIMAL FIDELITY SELECTION FOR IMPROVED PERFORMANCE IN HUMAN-IN-THE-LOOP QUEUES FOR UNDERWATER SEARCH In this chapter, we study the problem of optimal fidelity selection for a human operator per- forming an underwater visual search task. Human performance depends on various cognitive factors such as workload and fatigue. We perform human experiments in which participants perform two tasks simultaneously: a primary task, which was subject to evaluation, and a secondary task to estimate their workload. The primary task requires participants to search for underwater mines in videos, while the secondary task involves a simple visual test where they respond when a green light displayed on the side of their screens turns red. Videos arrive as a Poisson process and are stacked in a queue to be serviced by the human operator. The operator can choose to watch the video with either normal or high fidelity, with normal fidelity videos playing at three times the speed of high fidelity ones. Participants receive rewards for their accuracy in mine detection for each primary task and penalties based on the number of videos waiting in the queue. We consider the workload of the operator as a hidden state and model the workload dynamics as an Input-Output Hidden Markov Model (IOHMM). We use a Partially Observable Markov Decision Process (POMDP) to learn an optimal fidelity selection policy, where the objective is to maximize total rewards. Our results demonstrate improved performance when videos were serviced based on the optimal fidelity selection policy compared to a baseline where humans chose the fidelity level themselves. 3.1 Background and Problem Formulation We now discuss our problem setup, formulate it as POMDP, and solve it to obtain the optimal fidelity selection policy. 3.1.1 Problem Setup We study the problem of optimal fidelity selection for a human operator performing a visual search task. The human operator performs two tasks simultaneously, a primary task and a secondary task. The primary task involves searching for underwater mines in videos 31 Figure 3.1 Human experiment interface. The participants press the spacebar key whenever a new mine is detected in the primary task video. Additionally, the green light (secondary task) randomly turns red once for each primary task and the participant responds by pressing the Enter key as early as possible. The queue length (tasks waiting in the queue) is displayed on top of the primary task. generated from an underwater simulation designed using Gazebo [89] and ROS [90]. The operator watches videos and responds by pressing a key whenever a mine is spotted. On the other hand, the secondary task involves a simple visual exercise, where participants press a key as early as possible when a green light located at the side of the screen changes to red. During each primary task, the green light undergoes this transition randomly, occurring between the 25% and 75% mark of the video. We record the participants’ reaction time in the secondary tasks, and in a rare event when a participant misses the red light, the reaction time is set to the total time the light stays red until the end of the primary task. Fig. 3.1 shows the experiment interface. The videos for the primary task arrive as a Poisson process with an arrival rate of λ ∈ R>0 and get stacked in a queue awaiting service by the human operator. The operator has the option to select either high or normal fidelity levels for servicing each video. 
In normal fidelity, the video is presented at a speed of three times faster compared to high-fidelity processing. Additionally, operators can choose to delegate a task for autonomous processing, even though the accuracy of the autonomous system may be lower. This delegation or “skip" action serves as a means to maintain queue stability, especially in situations with large queue lengths. 32 Figure 3.2 Input-output hidden Markov model. The input a represents the fidelity level, the hidden state w signifies the workload, and the three observations o1, o2, and o3 correspond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the secondary task, respectively. The performance of a human operator relies on their workload, making it a crucial factor in our problem formulation. To address this, we formulate the problem as a POMDP, with the workload of the operator treated as a latent or hidden variable. We estimate this hidden workload through a combination of reaction time measurements from the secondary task and performance metrics obtained from the primary task. Specifically, we model the workload dynamics of the human operator using an IOHMM (see Fig. 3.2). In this model, the fidelity level a serves as the input, the workload w operates as the hidden state, and we observe three distinct output measures o1, o2, o3. These output measures correspond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the secondary task, respectively. This modeling approach helps us understand how fidelity, workload, and task performance are interconnected in one unified framework. We utilize the extended Baum-Welch algorithm [91] to train the IOHMM model, which provides the transition probabilities p(w′|w, a), observation probabilities p(o|w, a), and the initial state distribution p(w0) through expectation maximization. These probabilities are utilized in the POMDP formulation to obtain the optimal fidelity selection policy. To determine the most suitable number of hidden workload states, we use the Akaike information criterion (AIC) [92] and Bayesian information criterion (BIC) [93] for model 33 selection. For a given number of hidden states, the AIC and BIC are defined as: AIC = 2p − 2 log( ˆL), BIC = p log(no) − 2 log( ˆL), (3.1) where p represents the number of learned parameters, no stands for the number of observation trajectories, and ˆL signifies the maximized value of the log-likelihood function derived from the trained model. The model with the lowest AIC (or BIC) value is considered the optimal choice, as determined by the AIC (or BIC) criteria. Based on these criteria (presented in Section 3.2.3), we train an IOHMM model with two hidden states, which we refer to as normal and high workload states. 3.1.2 Mathematical Modeling We formulate our problem as a POMDP P = {S, A, Ω, T , O, r, γ}, where S is the state space, A in the action space, Ω is the set of observations, T is the set of conditional transition probabilities between states, O is the set of conditional observation probabilities, r : S ×A → R is the reward function, and γ ∈ [0, 1) is the discount factor. The state space of the system is defined using a tuple S = (q, w), where q ∈ {0, 1, . . . 
, L} and w ∈ W = {0, 1} denotes the number of tasks waiting in the queue (with maximum queue length L) and the hidden discrete workload of the human operator, respectively. We define w = 0 as the normal workload state and w = 1 as the high workload state. The action space is defined as A = {N, H, D}, with N and H representing normal and high-fidelity task processing, respectively. The action D corresponds to task delegation, allowing the task to be handled by autonomous systems. This delegation action is particularly useful for maintaining queue stability when dealing with a large queue length. To discourage excessive use of the delegation action, we assume that the autonomy’s accuracy in mine detection is significantly lower which results in a lower immediate reward. The observation space is defined by a tuple Ω = (o1, o2, o3), where o1, o2, and o3 cor- respond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the sec- 34 ondary task, respectively. The state transition probability p(s′|s, a) ∈ T is derived from the queue dynamics p(q′|q, a) given by the Poisson distribution with arrival rate λt, where t is the duration of the task, and workload dynamics p(w′|w, a) obtained by training the IOHMM. The observation probabilities p(o|s′, a) = p(o1, o2, o3|w′, a) ∈ O are also obtained from the trained IOHMM model. The observation probabilities are assumed to be indepen- dent of the queue length, and therefore, only depend on the workload of the operator, i.e., p(o|s′, a) = p(o|w′, a). We define the reward function as r(s, a) = α1o1 − α2o2 − α3q, where αi for i ∈ {1, 2, 3} are positive constants, which rewards high accuracy in the primary task and penalizes for the number of tasks waiting in the queue. We convert the POMDP to a belief MDP defined by M = {B, A, τ, r, γ}. Here, B := {(q, bH)| q ∈ {0, 1, . . . , L}, bH ∈ ∆D} is the new state space, where q is the original queue length, bH is the discrete belief probability for being in the high workload state, and ∆D is a discretization of the interval [0, 1]. Therefore, the belief probability for being in the normal workload state is given by 1 − bH. For our experiments, we discretize [0, 1] with a step size of 0.1, i.e., ∆D = {0, 0.1, . . . , 1}. Note that the discretization of bH results in a finite state space B. Let b : W → ∆D denote the belief vector, where b(0) = 1 − bH, and b(1) = bH. From a current belief b(w), taking an action a and observing o, the updated belief b′(w′) is given by: b′(w′) = ηp(o|w′, a) (cid:88) p(w′|w, a)b(w), where η = 1 p(o|b,a) w∈W is the normalizing constant with p(o|b, a) = (cid:88) w′∈W p(o|w′, a)) (cid:88) w∈W p(w′|w, a)b(w). (3.2) (3.3) The updated belief probabilities in b′(w′), where b′(0) = 1 − b′ H and b′(1) = b′ H obtained from (3.2) are mapped to the closest discrete states such that b′ H ∈ ∆D. The action set A is the original action space. The transition probabilities τ (q′, b′|q, b, a) = p(q′|q, a)p(b′|b, a) is composed of the Poisson process for queue dynamics p(q′|q, a), and the workload dynamics 35 p(b′|b, a), which is given by: where p(b′|b, a) = (cid:88) o∈O p(b′|b, a, o)p(o|b, a), p(b′|b, a, o) =    1, if belief update in (3.2) returns b′, 0, otherwise, and p(o|b, a) is defined in (3.3). 
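A minimal sketch of the belief update in (3.2)-(3.3), assuming the trained IOHMM probabilities are available as dense arrays; the variable names, array shapes, and the discretization helper are illustrative assumptions.

import numpy as np

def belief_update(b, a, o, p_trans, p_obs):
    """Workload-belief update, cf. (3.2)-(3.3).

    b       : array (W,); current belief over the hidden workload states
    a       : action index
    o       : index of the discretized observation (o1, o2, o3)
    p_trans : array (W, A, W); p_trans[w, a, w2] = p(w2 | w, a) from the IOHMM
    p_obs   : array (W, A, O); p_obs[w2, a, o] = p(o | w2, a) from the IOHMM
    Returns the updated belief b'.
    """
    predicted = b @ p_trans[:, a, :]            # sum_w p(w' | w, a) b(w)
    unnormalized = p_obs[:, a, o] * predicted   # multiply by p(o | w', a)
    return unnormalized / unnormalized.sum()    # eta = 1 / p(o | b, a)

def snap_to_grid(b_high, step=0.1):
    """Map the updated high-workload belief to the closest point of Delta_D."""
    return round(b_high / step) * step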
The reward function r : B × A → R is given by: r(q, bH, a) = (cid:88) s∈S|w=0 r(s, a)(1 − bH) + (cid:88) s∈S|w=1 r(s, a)bH, (3.4) (3.5) (3.6) where r(s, a) is the original reward function for the POMDP. The discount factor γ is the original discount factor of the POMDP. For the belief MDP M, the expected value for policy π starting from an initial state (q0, bH,0) is defined as: V π(q0, bH,0) = = ∞ (cid:88) t=0 ∞ (cid:88) t=0 γtr(qt, bH,t, at) γtE [r(st, at)|q0, bH,0, π] , (3.7) where the expectation is computed over (qt, bH,t, wt). The optimal fidelity selection policy maximizes the value in each belief state, i.e., We utilize the value iteration algorithm to solve the belief MDP and obtain the optimal π∗ = arg max π V π(q0, bH,0). fidelity selection policy. 3.2 Human Experiments In this section, we discuss the design of our human experiments conducted using Prolific (www.prolific.com). 3.2.1 Experimental Setup We developed an underwater mine search experiment within the Robot Operating System (ROS) framework using Gazebo models. For our simulation, we employed the “UUV simula- tor" [94], a comprehensive package encompassing Gazebo plugins and ROS nodes specifically 36 (a) (b) (c) (d) Figure 3.3 Example image frames containing mines recorded from ROV. The mines’ positions are highlighted within red bounding boxes in the frames. tailored for simulating unmanned underwater vehicles, including remotely operated vehicles (ROVs) and autonomous underwater vehicles (AUVs). In our experimental setup, we designed an underwater scenario where we placed under- water mines randomly throughout the environment following a uniform distribution within a predefined area. To increase the complexity of the task, we also randomly introduced a significant amount of underwater vegetation, which added an extra layer of difficulty in detecting the mines. We captured underwater videos by deploying an ROV into the environment. The ROV followed a predetermined circular trajectory as it moved through the underwater space. Equipped with a downward-facing camera, the ROV recorded images. To ensure optimal recording conditions, we maintained complete darkness in the environment. The sole source of illumination was a single light attached to the ROV, directing its beam vertically down- ward. Consequently, only the area directly beneath the ROV was within the recording scope, resulting in a focused and well-illuminated view. We recorded a total of 5600 images which are used as 56 videos, with each video consisting of 100 frames. Fig. 3.3 shows example image frames containing underwater mines recorded from ROV. The mines’ positions are highlighted within red bounding boxes in the frames. We performed 5 set of experiments, with each experiment group consisting of 20 partici- pants. In each experiment, the participants first performed 8 practice tasks followed by 48 main tasks, where each task consists of a primary and a secondary task. Following is the list 37 of experiments. • Experiment 1: The first experiment was the base experiment where we recorded data to train the IOHMM model and learn the optimal fidelity selection policy by solving the POMDP. In this experiment, the 48 videos were evenly divided, with half of them being displayed at normal fidelity and the other half at high fidelity. To maintain con- sistency and balance, the tasks were presented in alternating blocks of four tasks each. Specifically, the sequence of 48 tasks was organized as {N, N, N, N, H, H, H, H, . . 
.} for half of the participants and {H, H, H, H, N, N, N, N, . . .} for another half of the participants. This reversal was thereby used to prevent ordering bias in our estimates. • Experiment 2: In this experiment, the participants were allowed to choose a fidelity level for servicing each task. Before releasing each task, the participant decided on their desired fidelity level by pressing a key. This served as a baseline, where the human operator received no decision support. In this experiment, we do not allow task delegation, and the participants were only allowed to choose to service each task with either a normal or high fidelity level. • Experiment 3: In this experiment, the decision support system determined the appro- priate fidelity level for each task by considering both the current operator performance and the queue length. Specifically, the decision support system monitored the opera- tor’s performance in each task, updated its belief using (3.2), and selected the optimal action in accordance with the optimal fidelity selection policy. For this experiment, we utilized an optimal policy that comprises only two available actions: normal and high fidelity. • Experiment 4: This experiment was similar to Experiment 2, with the modification that the participants were provided an additional action of delegating the task to the autonomous system. 38 • Experiment 5: This experiment was similar to Experiment 3, with the modification that the optimal policy consisting of three actions: normal fidelity, high fidelity, and task delegation was used to choose the fidelity level for each task. 3.2.2 Methods After receiving the IRB consent (MSU IRB #9452) from Michigan State University’s IRB office, we recruited 100 participants using Prolific for the study. Inclusion criteria were established as having completed a minimum of 500 prior studies and maintaining a 99% approval rate on the platform. Participants were compensated with a base payment of $6 and had the opportunity to earn additional performance-based bonuses ranging from $0−$4. 3.2.3 IOHMM results We use the data from Experiment 1 to train IOHMM models with different numbers of hidden states. Table 3.1 illustrates the AIC and BIC values (normalized with the number of observation trajectories) for the trained IOHMM models with 2, 3, and 4 hidden workload states, respectively. Based on the AIC and BIC criterion, we utilize a trained IOHMM model with two hidden states, which we refer to as normal and high workload states. Hidden States AIC BIC 2 3 735.29 735.91 736.39 737.85 4 736.28 739.26 Table 3.1 AIC and BIC values for model selection. Fig. 3.4a and 3.4b show the workload transition diagram for the trained IOHMM model under normal and high fidelity, respectively. Under high fidelity, due to the slower speed of videos, there is a higher probability of transitioning from the high workload state to a lower workload state as compared to under normal fidelity. Similarly, the probability of transitioning into a high workload state from the normal workload state is higher under normal fidelity as compared to the high fidelity servicing. In the case of task delegation action, the task is instantaneously removed from the queue and hence, we assume that the workload remains the same under task delegation. 39 (a) Normal fidelity. Figure 3.4 Workload transition diagram under (a) normal and (b) high fidelity servicing of tasks. (b) High fidelity. (a) Normal fidelity. (b) High fidelity. 
Figure 3.5 Reaction time diagram under (a) normal and (b) high fidelity servicing of tasks. For observations o1, o2, and o3, we learn a normal distribution with unknown mean and standard deviation for each state-action pair. Fig. 3.5a and 3.5b present the reaction time distributions for normal and high fidelity, respectively. We identify state 0 as a normal workload state, while state 1 represents the high workload state. The mean and the variance of the reaction time in the secondary task are larger in the high workload state than in the normal workload state. Fig. 3.6a and 3.6b depict the distributions for the fraction of detected mines in the primary task for normal and high fidelity, respectively. Notably, we observe a substantial difference in the means of the fraction of detected mines between normal and high workload states under normal fidelity. In contrast, the means are relatively similar between these 40 (a) Normal fidelity. (b) High fidelity. Figure 3.6 Fraction of mines detected in primary task under (a) normal and (b) high fidelity servicing of tasks. states under high fidelity. This observation suggests that when we do not take into account the penalty associated with the queue length, choosing high fidelity (slower videos) could be a suitable action during high workload conditions. (a) Normal fidelity. (b) High fidelity. Figure 3.7 Number of false alarms in primary task under (a) normal and (b) high fidelity servicing of tasks. Fig. 3.7a and 3.7b illustrate the distributions for the number of false alarms in the primary task for normal and high fidelity, respectively. Notably, in the normal workload state, no false alarms were recorded. Therefore, in the normal workload state, we replace the normal distribution with a Dirac delta function that yields a probability of 1 at 0 false alarms and 0 everywhere else. Finally, the initial distribution for the workload was determined as [0.662, 0.338], where 0.662 is the probability of starting in a normal workload state. To solve the POMDP, we discretize the distributions of the reaction time, fraction of mines detected, and the false alarms, with a step size of 25 ms, 0.05, and 0.5, respectively. 41 3.2.4 Optimal Policy (a) Task delegation unavailable. (b) Task delegation available. Figure 3.8 Optimal fidelity selection policy where the action space (a) does not include task delegation and (b) includes task delegation. Using the distributions from trained IOHMM, we convert the POMDP into a belief MDP as detailed in Sec. 3.1.2. We utilize the following reward function: r(s, a) =    100o1 − 30o2 − 2q, for a ∈ {N, H}, 30 − 2(q − 1), for a = D, (3.8) where we assumed the accuracy of autonomous servicing (task delegation) to be just 30%. We employed the value iteration algorithm to derive an optimal fidelity selection policy. We developed two optimal policies: one with only two available actions and another with three available actions, including task delegation. Fig. 3.8a and 3.8b illustrate the optimal policy with two and three available actions, respectively. In situations characterized by low queue lengths and a higher level of belief that workload is high, opting for high-fidelity servicing emerges as the optimal action. For other regions of the state space, normal fidelity servicing is the optimal action. When task delegation is a viable option, it becomes the optimal action only when there is a near certainty of being in the high workload state, as indicated by a belief close to 1, and the queue length is substantial. 
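As an illustration of how (3.6) and (3.8) combine, the sketch below evaluates the expected reward in a belief state. The dictionary of mean observations per workload-action pair stands in for the expectations implied by the trained IOHMM; its numerical entries are placeholders, not estimates from the experiments.

def belief_mdp_reward(q, b_high, action, mean_obs):
    """Expected reward in belief state (q, b_high), combining (3.6) and (3.8).

    mean_obs[(w, action)] = (expected fraction of detected mines, expected
    number of false alarms) in workload state w under the given action.
    """
    if action == 'D':
        # Delegated tasks are removed from the queue and serviced by autonomy.
        return 30 - 2 * (q - 1)
    r = 0.0
    for w, weight in ((0, 1.0 - b_high), (1, b_high)):
        o1, o2 = mean_obs[(w, action)]
        r += weight * (100 * o1 - 30 * o2 - 2 * q)
    return r

# Placeholder usage:
mean_obs = {(0, 'N'): (0.85, 0.0), (1, 'N'): (0.55, 0.6),
            (0, 'H'): (0.90, 0.0), (1, 'H'): (0.85, 0.3)}
print(belief_mdp_reward(q=4, b_high=0.7, action='H', mean_obs=mean_obs))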
3.3 Results We now discuss the results of the experiments. 42 (a) Experiment 2 policy: Empirical human policy. (b) Learned optimal policy. (c) Experiment 3 policy: Empirical optimal policy. Figure 3.9 (a) Empirical human policy in Experiment 2, (b) Learned optimal policy by solving POMDP with two available actions, and (c) Empirical policy obtained using data from Experiment 3 that deploys optimal fidelity selection policy. Fig. 3.9 compares the policy utilized in Experiments 2 and 3 that only allowed servicing a task with either normal or high fidelity. Fig. 3.9a, 3.9b, and 3.9c show the empirical human policy obtained from Experiment 2 data, complete optimal policy learned from POMDP, and the empirical policy obtained using data from Experiment 3 that deploys optimal fi- delity selection policy, respectively. The two columns of the plots show the probability of choosing high-fidelity action and normal-fidelity action, respectively. The dark purple region in the plots represents the portions of the state space that have not been visited during the 43 Figure 3.10 Box plots for the participants’ scores in Experiment 2 (human policy) and Ex- periment 3 (optimal policy). Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th per- centiles, respectively. Whiskers extend to encompass the most extreme data points. The p value in the two-sample t-test comparing the outcomes of Experiments 2 and 3 was computed as 0.017, indicating a high level of statistical significance. experiment. By comparing the human policy with the optimal policy, we can gain important insights into human behavioral patterns. It is evident that the human policy exhibits minimal vari- ation in fidelity with respect to the queue length. This suggests that humans struggle to effectively balance the trade-off between their ongoing tasks and the pending tasks in the queue. Furthermore, human policy reveals a tendency for high-fidelity servicing in low-workload states and normal-fidelity servicing in high-workload states. This implies that humans tend to have difficulty switching between actions. Participants who prioritize high accuracy in primary tasks opt for high-fidelity servicing, resulting in normal workload conditions. Con- versely, those who prioritize speed choose normal fidelity servicing, which can lead to op- erating under high workload conditions. Consequently, humans appear to have difficulty accurately assessing their workload and performance, hampering their ability to switch ac- tions effectively. Lastly, the empirical policy obtained using data from Experiment 3 indicates that the optimal policy effectively manages the queue length, thereby keeping the region of high queue length unexplored during the experiment. 44 Fig. 3.10 presents the comparative box plots for the scores obtained in Experiment 2 (human policy) and Experiment 3 (optimal policy). Here, the score refers to the cumulative reward accrued by a participant over tasks, i.e., Score = (cid:80)48 t=1 rt, where rt denotes the reward obtained for servicing the task t given by (3.8). It can be seen that the performance under optimal policy is much better than the performance under human policy. Furthermore, under the optimal policy, we observe an improvement of 26.54% in the average total score as compared to the human policy. To assess the statistical significance of these findings, we conducted a two-sample t-test comparing the outcomes of Experiments 2 and 3. 
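A two-sample t-test of this kind can be run with standard tools; the score vectors below are synthetic placeholders (one cumulative score per participant), not the recorded data summarized in Fig. 3.10.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_exp2 = rng.normal(loc=1000.0, scale=250.0, size=20)   # human-chosen fidelity (placeholder)
scores_exp3 = rng.normal(loc=1260.0, scale=230.0, size=20)   # optimal fidelity policy (placeholder)

result = stats.ttest_ind(scores_exp2, scores_exp3)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")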
Remarkably, the p value was computed as 0.017, indicating a notably high level of significance. In line with the widely accepted significance threshold of 0.05, a p value below this threshold prompts us to reject the null hypothesis. This implies that the data from the two experiments do not stem from the same distribution at a 5% significance level. These results underscore the substantial impact of the optimal policy in enhancing human performance. Next, we allow an additional action of task delegation under which a task is instanta- neously removed from the queue to be serviced by the autonomy. Fig. 3.11 compares the policy utilized in Experiments 4 and 5 that allowed all three actions. Fig. 3.11a, 3.11b, and 3.11c show the empirical human policy obtained using data from Experiment 4, com- plete optimal policy learned from POMDP, and the empirical policy obtained using data from Experiment 3 that deploys optimal fidelity selection policy, respectively. The three columns of the plots show the probability of choosing high-fidelity action, normal-fidelity action, and task delegation, respectively. The dark purple region in the plots represents the portions of the state space that have not been visited during the experiment. Similar observations to those depicted in Fig. 3.9 can also be made for high and normal fidelity servicing in Fig. 3.11. Moreover, it is noticeable that the human policy exhibits a somewhat random utilization of task delegation. This observation suggests that humans struggle with workload management and effective task delegation. Lastly, the optimal policy 45 (a) Experiment 4 policy: Empirical human policy. (b) Learned optimal policy. (c) Experiment 5 policy: Empirical optimal policy. Figure 3.11 (a) Empirical human policy in Experiment 4, (b) Learned optimal policy by solving POMDP with three available actions, and (c) Empirical policy obtained using data from Experiment 5 that deploys optimal fidelity selection policy. in Experiment 5 keeps the queue length under check, and therefore, regions of higher queue lengths remain unvisited in the experiment. Fig. 3.12 shows the box plots for the participants’ scores in Experiment 4 (human policy) and Experiment 5 (optimal policy). It can be seen that the performance under optimal policy is much better than the performance under human policy. Furthermore, under the optimal policy, we observe an improvement of 50.3% in the average total score as compared to the human policy. Lastly, the p-value in the two-sample t-test is given by 0.001, showing that the results are statistically significant. Remark 5. The average score achieved with the optimal policy in Experiment 5 was marginally 46 Figure 3.12 Box plots for the participants’ scores in Experiment 4 (human policy) and Ex- periment 5 (optimal policy). Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th per- centiles, respectively. Whiskers extend to encompass the most extreme data points. The p value in the two-sample t-test comparing the outcomes of Experiments 4 and 5 was computed as 0.001, indicating a high level of statistical significance. less than that obtained with the optimal policy in Experiment 3. This discrepancy could be due to the relatively smaller reward obtained from task delegation, as opposed to servicing the task with normal fidelity. 
Although the optimal policy in Experiment 5 may default to task delegation under the presumption of diminished performance during periods of high workload, certain participants may demonstrate the capability to execute tasks with high accuracy even under high workload conditions in Experiment 3. 3.4 Conclusions We studied the problem of optimal fidelity selection for a human operator engaged in a visual search task. The human participants performed two tasks simultaneously: a primary mine search task and a secondary visual task used for estimating the human workload. We treated the human workload as a hidden state and modeled the workload dynamics using an Input-Output Hidden Markov Model (IOHMM). Leveraging the probability distributions derived from the IOHMM, we formulated a Partially Observable Markov Decision Process (POMDP) and solved it to derive an optimal fidelity selection policy. The results of our experiments offer valuable insights into human behavioral patterns and underscore the sub- stantial performance enhancements achievable through the application of the optimal policy. 47 CHAPTER 4 ROBUST AND ADAPTIVE FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES In this chapter, we relax the assumption of the known human service time model in the optimal fidelity selection problem. We assume the parameters of the human’s service time distribution depend on the selected fidelity level and her cognitive state and are assumed to be unknown a priori. These parameters are learned online through Bayesian parameter estimation. We formulate a robust adaptive SMDP to solve our optimal fidelity selection problem. We extend the results on the convergence of robust-adaptive MDP to robust- adaptive SMDPs and show that the solution of the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Furthermore, we numerically illustrate the convergence of the synchronous and asynchronous robust adaptive policy to the uncertainty- free optimal policy. 4.1 Background and Problem Formulation We now present our problem setup, and formulate the optimal fidelity selection problem as a robust adaptive SMDP. Figure 4.1 Schematic of our problem setup. The incoming tasks arrive at a constant arrival rate λ and gets stored in a queue. The decision support learns the human service-time distribution based on the online observations and recommends an optimal fidelity level to the human agent based on the system state (queue length and cognitive state). 48 4.1.1 Problem Setup We consider a human agent with unknown service time distribution servicing a queue of homogeneous tasks. We assume the availability of approximate model for human service time distribution from human subject experiments, which we adapt online to estimate the agent’s distribution using Bayesian parametric estimation. The homogeneous tasks arrive according to a Poisson process with a constant arrival rate λ ∈ R>0, and are stored in a dynamic queue with a maximum capacity L ∈ N, until serviced by the human agent in a first-come-first-serve discipline. These tasks continuously lose value at a constant rate c ∈ R>0 while waiting in the queue. The agent can choose to service the tasks with normal or high fidelity level. When the agent services a task with high fidelity, she meticulously looks into the details, which results in higher-quality service but leads to larger service times and increased operator tiredness. 
We treat the cognitive state of the agent as a lumped parameter that captures psychological factors such as fatigue, stress, and situational awareness. We assume that the unknown mean service time of the agent increases with the fidelity level and is a unimodal function of the cognitive state, an assumption inspired by the experimental psychology literature. For example, according to the Yerkes-Dodson law [16], excessive stress overwhelms the operator, while too little stress leads to a reduction in vigilance. Hence, human performance is optimal at some intermediate cognitive state. Fig. 4.1 shows a schematic of our problem setup. We are interested in the design of a decision support system that continuously learns the human service-time distribution and assists the agent by recommending an optimal fidelity level for servicing each task. The recommendation is based on the robust adaptive policy learned by the decision support system for a given queue length q ∈ Z≥0 and the human cognitive state. We assume real-time access to the human cognitive state using, e.g., Electroencephalogram (EEG) measurements (see [80] for measures of cognitive load from EEG data).

4.1.2 Robust Adaptive SMDP Formulation

We now model our problem as a robust adaptive SMDP ΓRA. We focus on the unknown service time distributions and refer the interested readers to [41] for more details on the modeling of the uncertainty-free distributions. We consider a finite state space S := {(q, cog) | q ∈ {0, 1, . . . , L}, cog ∈ C}, where C := {i/N}_{i∈{0,...,N}} for some N ∈ N, and cog represents the lumped cognitive state. We consider five possible actions for the agent: (i) Waiting (W), when the queue is empty; (ii) Resting (R), which provides resting time for the human operator to reach the optimal cognitive state when tired; (iii) Skipping (S), which allows the operator to skip a task to reduce the queue length and thereby focus on newer tasks; (iv) Normal Fidelity (N), for servicing the task with normal fidelity; and (v) High Fidelity (H), for servicing the task more carefully with high precision. Hence, the set of admissible actions As for each state s ∈ S is given by: (i) As := {W | s ∈ S, q = 0} when the queue is empty; (ii) As := {{R, S, N, H} | s ∈ S, q ̸= 0} when the queue is non-empty and cog > cog∗, where cog∗ ∈ C is the optimal cognitive state; and (iii) As := {{S, N, H} | s ∈ S, q ̸= 0} when the queue is non-empty and cog ≤ cog∗. Let τ be the sojourn time spent in state s. The sojourn time distribution P(τ | s, a) represents the service time while servicing the task with normal or high fidelity, the resting time, the constant time of skip, and the time until the next task arrival while waiting. We model the service time distributions in Section 4.1.3 and refer the readers to [41] for details on the resting and waiting times. We define a state transition distribution P(s′ | τ, s, a) from state s to s′ conditioned on an action a ∈ As and sojourn time τ spent in state s. This distribution involves the transition in queue length, which is given by a Poisson distribution, and the cognitive dynamics, which we model as a Markov chain such that the cognitive state cog increases with high probability when the operator is busy (a ∈ {N, H}) and decreases when the operator is idle (a ∈ {R, W}). For a detailed description of the modeling of the cognitive dynamics, we refer the interested readers to [41]. For each task, the human agent receives a high (low) immediate reward for servicing the task with high (normal) fidelity, and no reward for not servicing the task.
Furthermore, the agent incurs a penalty at a constant rate c ∈ R>0 for each task waiting in the queue. Hence, it can be shown that the expected net immediate reward received by the agent for selecting an action a in state s is given by: R(s, a) = r(s, a) − P(τ |s, a)c (cid:16) 2q + λτ 2 (cid:17) τ, (cid:88) τ (4.1) where r : S × As (cid:55)→ R≥0 is the reward defined by: (i) r(s, a) = rH, if a = H; (ii) r(s, a) = rN , if a = N ; and r(s, a) = 0, if a ∈ {W, R, S}, with rH, rN ∈ R≥0 and rH > rN , and (cid:80) τ is the expected penalty due to tasks waiting in the queue. P(τ |s, a)cτ (cid:104) q+q′ 2 (cid:105)(cid:17) E (cid:16) (cid:12) (cid:12) (cid:12) τ, s, a 4.1.3 Modeling human service time and uncertainity set There are many approximate models used for modeling service time distribution P(τ |s, a) for the human agents, most common being lognormal [95] and inverse Gaussian distribu- tion [96]. We assume that the agent’s service time distribution follows a log-normal distribu- tion Lognormal(µ, σ2) with unknown parameters µ and σ2, that are the functions of the cog- nitive state and fidelity-level. We utilize the Bayesian parameter estimation with a normal- inverse-chi-squared prior [97] to estimate the distribution parameters using online observa- tions. Using the prior distribution N Iχ2(µ0, κ0, ν0, σ2 0) = N (µ|µ0, σ2/κ0) × χ−2(σ2|ν0, σ2 0) for the parameters µ and σ2, and n ∈ N realizations from Lognormal(µ, σ2), the posterior distribution of (µ, σ2) is given by: p(µ, σ2) = N Iχ2(µn, κn, νn, σ2 n), where (4.2) κ0µ0 + nx κn (cid:32) ν0σ2 0 + µn = σ2 n = 1 νn , κn = κ0 + n, νn = ν0 + n, (cid:88) (xi − x)2 + i (cid:33) (µ0 − x)2 , nκ0 κ0 + n xi, i ∈ {1, . . . , n} are the service time samples and ¯x is the sample mean. Hence, the posterior distribution at any time t can be computed to recursively estimate the model parameters µ and σ2 by using the online observations. Let ˆPt(τ |s, a) be the estimate of the 51 service time distribution at time t. Note that the estimate ˆPt(τ |s, a) can be used to estimate P(s′, τ |s, a) and R(s, a), resulting in estimates ˆPt(s′, τ |s, a) and ˆRt(s, a) at time t. The uncertainty in the human service time models could be large, especially in the ini- tial stage with limited observation data, which may lead to suboptimal policies. This can be mitigated through the use of robust SMDP. The robust SMDP optimizes the worst- case performance to obtain robust policy when the joint distribution P(s′, τ |s, a) lies in an uncertainty set P a, i.e, P(s′, τ |s, a) ∈ P a. Note that the robust SMDP formulation is not in- herently adaptive in nature and does not explicitly use improved transition models that can be learned using online observations to obtain less conservative policy. The robust adaptive SMDP utilizes the latest improved estimates ˆPt and ˆRt for the joint probability P(s′, τ |s, a) and reward R(s, a) at time t, respectively. The choice of uncertainty set P a is critical for the performance of the robust algorithm. A poor modeling of the uncertainty set increases the computational complexity and could lead to highly conservative robust policy. Hence, a choice of uncertainty set P a is typically made such that the robust policy is not overly conservative and the optimization can be performed in a computationally tractable manner. Let D be the observation data up to time t. 
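The update in (4.2) can be implemented directly. The sketch below applies it to the logarithms of the observed service times, since the service time is modeled as Lognormal(µ, σ²); the prior hyperparameters in the example are placeholders.

import numpy as np

def nix2_posterior(mu0, kappa0, nu0, sigma0_sq, service_times):
    """Normal-inverse-chi-squared posterior update for (mu, sigma^2), cf. (4.2).

    service_times : observed service times; their logs are treated as samples
                    from N(mu, sigma^2).
    Returns the posterior hyperparameters (mu_n, kappa_n, nu_n, sigma_n_sq).
    """
    x = np.log(np.asarray(service_times, dtype=float))
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    ss = np.sum((x - xbar) ** 2)
    sigma_n_sq = (nu0 * sigma0_sq + ss
                  + (n * kappa0 / (kappa0 + n)) * (mu0 - xbar) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq

# Placeholder prior and data:
print(nix2_posterior(mu0=1.0, kappa0=1.0, nu0=1.0, sigma0_sq=0.5,
                     service_times=[2.1, 3.4, 2.8, 4.0]))

Samples of (µ, σ²) drawn from this posterior are what populate the set ∆t used to build the uncertainty set in (4.3).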
To construct an uncertainty set P a t at time t, random samples are generated from the posterior distribution of (µ, σ2) to construct a set ∆t comprised of matrices ˆPt(s′, τ |s, a) for each s ∈ S and a ∈ AS. Finally, a Ψt-confidence level subset of the transition probabilities ∆t defined by: P a t (Ψt) = {psa t ∈ ∆t : ∥psa t − psa t ∥1 ≤ Ψt, s ∈ S}, (4.3) where psa t is the nominal transition given by psa t |D], is used as a choice for the uncertainty set. We seek to find Ψt-confidence sets for state transition probability vector for t = E[psa every state-action pair at time t. We choose Ψt = 6α |S∥AS |π2t2 such that union bounds applied over each state-action pair and time yield that all state transition probabilities belong to respective confidence sets with at least probability α. In the following we choose α = 0.95. 52 Using the uncertainty set P a t constructed at time t based on the latest improved estimates ˆPt, the robust adaptive SMDP solves the following robust Bellman equation (4.4), V ∗(s) = max a∈AS min ˆPt∈P a t (cid:40) ˆRt(s, a) + (cid:88) (cid:88) τ s′ γτ ˆPt(s′, τ |s, a)V ∗(s′) , (4.4) (cid:41) where 0 < γ < 1 is the discount factor, to obtain a robust policy π∗ = argmaxa∈AS which optimizes the worst-case performance through minimization with respect to the uncer- V (s), tainty set P a t . In the next section, we show that the learned policy through robust adaptive SMDP converges to an optimal policy for the uncertainty-free formulation. 4.2 Convergence of Robust Adaptive SMDP We study the convergence properties of the robust adaptive SMDP under the following assumptions. (A1) State space S and actiona space AS are finite. (A2) ˆPt and ˆRt remains bounded for any t. (A3) ˆPt and ˆRt converges to their true values P and R, respectively, with probability 1. (A4) The uncertainty set P a t converges to a singleton estimate P with probability 1. (A5) Each admissible action is executed from every state infinitely often. Since agent’s service time distribution is independent of the queue length, assump- tion (A5) can be relaxed to executing each action at every cognitive state. Furthermore, assumptions (A3) and (A5) can be satisfied by adopting exploration strategies such as Gibbs/Boltzman distribution method for action selection [31], wherein an action is selected with probability proportional to the current estimate of the state-action value function di- vided by a temperature parameter. The temperature parameter is annealed to ensure that the action selection rule converges over time to a greedy policy with respect to the value function estimate, while ensuring each cognitive state-action pair is selected infinitely often. 53 Let T : R|S| (cid:55)→ R|S| be the Bellman operator for an uncertainty-free SMDP Γ defined by: (cid:40) T (V (s)) = max a∈AS R(s, a) + (cid:88) (cid:88) τ s′ γτ P(s′, τ |s, a)V (s′) . (4.5) (cid:41) We first show the convergence of the adaptive asynchronous VI method followed by the convergence of the robust approach. In asynchronous VI, an effective approach to deal with large state spaces, the value of only a subset of states s ∈ Bt ⊆ S are updated at any time t. The adaptive VI method adapts to the latest estimates of the model to compute the control policy, which is continuously improved through online observations. 
Hence, the adaptive asynchronous VI update at time t is given by: Vt+1(s) =    ˆTt(Vt), if s ∈ Bt, Vt(s), otherwise, (4.6) where Bt ⊆ S is the subset of states that are updated at time t, and ˆTt is the Bellman operator for the SMDP estimate ˆΓt at time t, that utilizes the estimates ˆPt and ˆRt in (4.5). The set Bt can be chosen using prioritized sweeping [98] for improved computational performance. Theorem 2. Under Assumptions A1-A5, the adaptive asynchronous VI converges to the optimal value function V ∗ for the uncertainty-free SMDP Γ with probability 1. Proof. See Appendix B: Chapter 4 for the proof. Let P a be the uncertainty set for the probability P(s′, τ |s, a) for a given action a ∈ AS. Then, the robust adaptive asynchronous VI update at time t is given by: Vt+1(s) =    maxa∈AS minP∈P a {J a(Vt)} , if s ∈ Bt, (4.7) Vt(s), otherwise, where for a given a ∈ AS, J a : R|S| (cid:55)→ R|S| is given by J a(V (s)) = R(s, a) + (cid:88) (cid:88) τ s′ 54 γτ P(s′, τ |s, a)V (s′). (4.8) Let Tr : R|S| (cid:55)→ R|S| be the robust Bellman operator for SMDP Γ defined by: Tr(V (s)) = max a∈AS min P∈P a {J a(V (s))} . (4.9) Theorem 3. Under Assumptions A1-A5, the robust adaptive asynchronous VI converges to the optimal value function V ∗ for the uncertainty-free SMDP Γ with probability 1. In addition, for Ψt = |S∥AS |π2t2 , the union bounds applied over each state-action pair and time yield that at any time, the obtained policy is robust with respect to uncertainty in service 6α time distributions with at least probability α. Proof. See Appendix B: Chapter 4 for the proof. 4.3 Numerical Illustrations Fig. 4.2a and 4.2b shows an optimal policy and the optimal value function obtained using VI algorithm for the uncertainty-free SMDP Γ with cog∗ = 0.6 as the optimal cognitive state. The optimal policy selects high fidelity around the cog∗ for small queue lengths, and then transitions to normal fidelity as the queue length increases in sub-optimal cognitive states. The optimal action is to rest when the queue length is large and cognitive state is high. Similarly, skipping of tasks is an optimal action for large queue lengths in sub-optimal cognitive states. The corresponding optimal value function is a decreasing function of q and is a uni-modal function of the cognitive state, the maximum for which occurs at cog∗ for each q; see [41] for more details. Fig. 4.2c and 4.2d shows a robust adaptive policy, and the corresponding optimal value function for the uncertain SMDP ΓRA obtained from asynchronous VI algorithm. In our nu- merical illustrations, we only estimate the parameter µ of the Lognormal(µ, σ2) distribution using Bayesian estimation. In case when σ2 is unknown, Markov chain Monte Carlo (MCMC) methods [99] can be used to sample from the posterior inverse-chi-squared-distribution to create the uncertainty set P a t at time t. An identical policy and value function is obtained in case of synchronous VI algorithm. Hence, the solution of the synchronous and asynchronous VI converges to the optimal solution of uncertainty-free SMDP. 55 (a) (c) (b) (d) Figure 4.2 (a) Optimal policy and (b) optimal value function for the uncertainty-free SMDP Γ. (c) Robust adaptive optimal policy, and corresponding (d) optimal value function for the uncertain SMDP ΓRA. Fig. 4.3 shows the policy updates for the synchronous and asynchronous robust adaptive algorithm after 4, 8, 32 and 76 iterations, respectively. 
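For reference in the comparison that follows, here is a minimal sketch of one sweep of the robust asynchronous update in (4.7). It assumes the uncertainty set is represented by a finite collection of sampled transition models (as obtained from the posterior samples above), so the inner minimization becomes a minimum over that collection; all names and shapes are illustrative.

import numpy as np

def async_robust_vi_step(V, R_hat, P_samples, gamma_pow, B_t):
    """One asynchronous robust VI sweep, cf. (4.7).

    V         : array (S,); current value function estimate
    R_hat     : array (S, A); latest reward estimates
    P_samples : array (M, S, A, S, T); M sampled joint distributions, with
                P_samples[m, s, a, s2, k] approximating P(s2, tau_k | s, a)
    gamma_pow : array (T,); gamma**tau_k for each discrete sojourn time
    B_t       : iterable of state indices to update at this step
    """
    # Q under every sampled model, then the worst case over the samples.
    Q_all = R_hat[None, :, :] + np.einsum('msajt,t,j->msa', P_samples, gamma_pow, V)
    Q_robust = Q_all.min(axis=0)
    V_new = V.copy()
    for s in B_t:
        V_new[s] = Q_robust[s].max()
    return V_new

# Synchronous update: B_t = range(S); asynchronous: a randomly chosen subset of states.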
The synchronous robust adaptive algorithm performs a VI update at each time step in (4.6) for all states, i.e., Bt = S, while in the asynchronous robust adaptive algorithm, the VI update is performed on a randomly chosen subset Bt ⊆ S at each time step. The asynchronous robust adaptive algorithm converges much faster (80 iterations) than the synchronous robust adaptive algorithm (404 iterations), while both eventually converge to the optimal policy for the uncertainty-free SMDP Γ as shown in Fig. 4.2.

Figure 4.3 Policy updates for the synchronous ((a)-(d)) and asynchronous ((e)-(h)) robust adaptive algorithm after 4, 8, 32 and 76 iterations, respectively.

4.4 Conclusions

We studied the optimal fidelity selection problem for a human agent servicing a stream of homogeneous tasks. The parameters of the agent's service time distribution are unknown a priori and are learned online through Bayesian parametric estimation. We utilize the robust adaptive SMDP approach, which adapts to the latest estimates of the distribution model while obtaining a policy that is robust with respect to worst-case performance. We formally extend the convergence results of the robust adaptive MDP to the robust adaptive SMDP and show that the solution of the robust adaptive SMDP converges to the optimal solution of the uncertainty-free SMDP. Furthermore, we numerically illustrate the convergence of the synchronous and asynchronous robust adaptive policies to the uncertainty-free optimal policy.

CHAPTER 5 DETERMINISTIC SEQUENCING OF EXPLORATION AND EXPLOITATION FOR REINFORCEMENT LEARNING

In this chapter, we propose a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., an MDP, and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates of the expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time.

5.1 Background and Problem Formulation

We focus on model-based RL problems which aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. We seek to design policies that are robust to uncertainty in the learned MDP. However, learning the MDP requires visiting parts of the state space associated with high uncertainty in the estimates, which has exactly the opposite effect of a robust policy. Therefore, to balance the trade-off between learning the MDP and designing a robust policy, we design a DSEE algorithm in which exploration and exploitation epochs of increasing lengths are interleaved. In exploration epochs, the algorithm learns the MDP, while in exploitation epochs, it uses a robust policy based on the learned MDP and the associated uncertainty. Consider an MDP (S, A, R, P, γ), where S is the state space, A is the action space, the reward R(s, a), for each (s, a) ∈ S × A, is a random variable with support [0, Rmax], P : S × A → ∆|S| is the transition distribution, and γ ∈ (0, 1) is the discount factor. Here, ∆|S| represents the probability simplex in R|S| and |·| represents the cardinality of a set. Let R̄(s, a) denote the expected value of R(s, a). We consider a finite MDP setting in which |S| and |A| are finite.
We assume that the rewards R and the state transition distribution P are unknown a priori. Hence, during exploration, we estimate R and P using online observations. Let (s, a) be any state-action pair with s ∈ S and a ∈ A. At any time t, let n_t(s, a) be the number of times the state-action pair (s, a) has been observed until time t. For each (s, a), the empirical mean estimates R̂_t(s, a) and P̂_t(s′|s, a), s′ ∈ S, are:

R̂_t(s, a) = (1 / n_t(s, a)) Σ_{i=1}^{n_t(s,a)} r_i(s, a), and  (5.1)

P̂_t(s′|s, a) = n_t(s, a, s′) / n_t(s, a),  (5.2)

respectively, where r_i(s, a) is the immediate reward obtained at (s, a) during observation i ∈ {1, . . . , n_t(s, a)} until time t, and n_t(s, a, s′) is the number of times the next state s′ is observed from (s, a) out of the n_t(s, a) visits.

Oftentimes, the uncertainty in the probability transition matrices and the mean reward function can be large, especially in the initial stages of learning due to limited observation data, which may lead to sub-optimal policies. Robust MDPs [29] mitigate the sub-optimal performance arising from this uncertainty by optimizing the worst-case performance over given uncertainty sets for the reward function and probability transition matrices to obtain a robust policy. Given uncertainty sets R^U_t and P^U_t at time t containing R and P, respectively, the robust MDP solves the following robust Bellman equation:

V^R_t(s) = max_{a ∈ A} min_{R̃_t ∈ R^U_t, P̃_t ∈ P^U_t} { R̃_t(s, a) + γ Σ_{s′} P̃_t(s′|s, a) V^R_t(s′) },  (5.3)

to obtain a robust policy π̂^R_t = argmax_{a ∈ A} V^R_t, which optimizes the worst-case performance through minimization with respect to the uncertainty sets R^U_t and P^U_t, where V^R_t is the robust value function.

The choice of these uncertainty sets is critical for the performance of the robust algorithm. A poor modeling choice can increase the computational complexity and result in a highly conservative policy [17, 30]. To avoid these issues, during the exploitation epochs of DSEE, we construct these uncertainty sets based on the estimates R̂_t and P̂_t from the previous exploration epochs and Hoeffding bounds [100] for R̂_t (Lemma 6) and P̂_t (Lemma 7). Subsequently, we utilize the robust MDP to learn a policy that is robust to the estimation uncertainties with high probability. The convergence of the robust MDP with uncertain transition matrices to the uncertainty-free MDP can be shown under the assumption that the uncertainty sets converge to singleton estimates almost surely [17, 32].

Definition 1 (Instantaneous and Cumulative Regret). For a discounted and ergodic RL problem [21], consider an algorithm A that, at the end of the (t−1)-th step, returns a policy π_t to be applied in the t-th step. For any state s ∈ S, let V*(s) and V^{π_t}(s) be the optimal value of the state and its value under the policy π_t, respectively. At any time t, the instantaneous regret R(t) of the algorithm A is given by:

R(t) = ∥V*(s) − V^{π_t}(s)∥_∞,  (5.4)

where ∥·∥_∞ denotes the L_∞-norm of a vector, and the cumulative regret R_T until time horizon T is given by:

R_T = Σ_{t=1}^{T} R(t) = Σ_{t=1}^{T} ∥V*(s) − V^{π_t}(s)∥_∞.  (5.5)

We design the exploration and exploitation epochs of the DSEE algorithm such that its cumulative regret grows as a sub-linear function of time. In the next section, we provide an overview of the DSEE algorithm.

5.2 DSEE Algorithm

We design the DSEE algorithm for model-based RL under the following assumptions:

(A1) State space S and action space A are finite sets.
(A2) The MDP is ergodic under the uniform policy π, i.e., under a policy π that, in every state s, selects an action from A uniformly at random, the MDP admits a unique stationary distribution ϕ_π, with ϕ_π(s) > 0 for all s ∈ S.

An ergodic MDP [21] (assumption (A2)) is a common assumption. It ensures that the stationary distribution is independent of the initial distribution and that all states are recurrent, i.e., each state s is visited infinitely often and ϕ_π(s) > 0. We use this assumption to estimate the number of times each state is visited in N time steps.

Algorithm 5.1 Deterministic Sequencing of Exploration and Exploitation (DSEE).

Input: set of states S, set of actions A, initial state s_0;
Set: η > 1, sequences {ϵ_j}_{j∈N} and {δ_j}_{j∈N}, s_end^0 = s_0, s = s_0;
Set: t = 0, n(s, a) = 0, n(s, a, s′) = 0, S(s, a) = 0, for all s, a, s′;
for epoch j = 1, 2, . . . do
  % Exploration phase:
  ρ_j ← ϵ_j / (4 + 2 R_max γ / (1 − γ)²);  µ ← 2 [ log(2|S| − 2) + log(2|S||A| / δ_j) ];
  U_j ← max{ (R_max)² log(4|S||A| / δ_j) / (2ρ_j²), µ / ρ_j² };
  while n(s, a) < U_j for some (s, a) do
    t ← t + 1; pick a ∼ UNIF(A) in the current state s;
    observe the reward R and the next state s′;
    if s_end^{j−1} has been visited in epoch j then
      n(s, a) ← n(s, a) + 1; n(s, a, s′) ← n(s, a, s′) + 1; S(s, a) ← S(s, a) + R;
    end if
    s ← s′;
  end while
  s_end^j ← s;
  R̂_t(s, a) = S(s, a) / n(s, a) and P̂_t(s′|s, a) = n(s, a, s′) / n(s, a), for all (s, a);
  % Exploitation phase:
  construct the uncertainty sets R^U_t and P^U_t using (5.11);
  compute V^R_t(s) and π̂^R_t using (5.3);
  implement π̂^R_t for ⌈η^j⌉ time steps; t ← t + ⌈η^j⌉;
end for

Algorithm 5.1 shows an overview of the DSEE algorithm. In the DSEE algorithm, we design a sequence of alternating exploration and exploitation epochs. Let α_i and β_i be the lengths of the i-th exploration and exploitation epoch, respectively, where i ∈ N. During an exploration epoch, we uniformly sample the action in the current state and update the estimates R̂_t and P̂_t. For given sequences {ϵ_i}_{i∈N} and {δ_i}_{i∈N}, which we design in Section 5.3, the length of the exploration epoch α_i is chosen to reduce the estimation uncertainty such that P(∥V*(s) − V^{π̂^R_t}(s)∥_∞ ≤ ϵ_i) ≥ 1 − δ_i after the epoch, where P(·) denotes the probability measure and V^{π̂^R_t}(s) is the value of state s under the robust policy π̂^R_t. In DSEE, we choose exponentially increasing lengths for the exploitation epochs β_i. During an exploitation epoch, we utilize the estimates R̂_t and P̂_t from the previous exploration epochs and construct the uncertainty sets R^U_t and P^U_t at time t. We use these uncertainty sets with the robust Bellman equation to learn a policy that is robust to the estimation uncertainties with high probability. In the next section, we analyze the DSEE algorithm and design the sequences {ϵ_i}_{i∈N} and {δ_i}_{i∈N} such that the cumulative regret (5.5) grows as a sub-linear function of time.

5.3 Analysis of DSEE algorithm

We now characterize the regret of the DSEE algorithm under assumptions (A1)-(A2) and design the exploration and exploitation epochs. The optimal value V*(s_t) of the state s_t is given by:

V*(s_t) = R(s_t, π*(s_t)) + γ E[V*(s_{t+1}) | s_t, π*(s_t)],  (5.6)

where π* is an optimal policy that satisfies:

π*(s_t) = argmax_{a_t} { R(s_t, a_t) + γ E[V*(s_{t+1}) | s_t, a_t] }.  (5.7)

We define an approximate optimal value function V̂_t that utilizes the estimates R̂_t and P̂_t at time t.
Therefore, ˆVt(st) is given by: ˆVt(st) = ˆRt(st, ˆπt(st)) + γ ˆE (cid:104) ˆVt(st+1)|st, ˆπt(st) (cid:105) , (5.8) where ˆE (cid:104) ˆVt(st+1)|st, ˆπt(st) (cid:105) is used to denote (cid:80) ˆPt(st+1|st, ˆπt(st)) ˆVt(st+1) and ˆπt is an st+1 optimal policy for the approximate optimal value function given by: ˆπt(st) = argmax at (cid:110) ˆRt(st, at) + γ ˆE (cid:104) ˆVt(st+1)|st, at (cid:105)(cid:111) . (5.9) Theorem 4 (Concentration of robust value function). Let ∥ · ∥1 denote the L1-norm of a vector. For any given ϵt, δt ∈ (0, 1), there exists an n ∈ O (cid:16) |S| ϵ2 t + 1 ϵ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt such that if each state-action pair (s, a) is observed nt(s, a) ≥ n times until time t, then for each state s, the following inequality holds: (cid:16) P ∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵt (cid:17) ≥ 1 − δt, (5.10) where V ˆπR t (s) is the value of state s under the robust policy ˆπR value function V R t is defined in (5.3) with ρt = ϵt 2 (cid:16) 2 + Rmaxγ (1−γ)2 t = argmaxa∈A V R (cid:17)−1 and t . The robust RU t = (cid:110) RU (s, a) : |RU (s, a) − ˆRt(s, a)| ≤ ρt, ∀(s, a) (cid:111) , (cid:110) P U t = PU (s, a) : (cid:13) (cid:13) PU (s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) ≤ ρt, ∀(s, a) (cid:111) . (5.11) We prove Theorem 4 using the following Lemmas 6-9. Lemma 6 (Concentration of rewards). Suppose until time step t, the state-action pair (s, a) is observed nt(s, a) times and bounded immediate rewards ri(s, a), i ∈ {1, . . . , nt(s, a)}, are obtained at these instances. Then the following inequality holds: P (cid:12) (cid:16)(cid:12) (cid:12)R(s, a) − ˆRt(s, a) (cid:12) ≤ ϵR (cid:12) (cid:12) t (cid:17) ≥ 1 − δR t , (5.12) where ˆRt(s, a) is the empirical mean reward defined in (5.1) and ϵR t = (cid:113) (Rmax)2 log(2/δR t ) 2nt(s,a) . 64 Proof. For brevity of notation, let ϵR and δR denote ϵR t and δR t , respectively. For bounded random variables ri(s, a), using the Hoeffding bounds [100], we have P (cid:12) (cid:16)(cid:12) (cid:12)R(s, a) − ˆRt(s, a) (cid:12) (cid:12) (cid:12) ≤ ϵR (cid:17) ≥ 1 − 2e− 2nt(s,a)ϵ2 R (Rmax)2 . (5.13) Choosing δR = 2e− 2nt(s,a)ϵ2 R (Rmax)2 , we get the desired result. Lemma 7 (Concentration of transition probabilities). Suppose until time step t, the state-action pair (s, a) is observed nt(s, a) times and let P(s, a) ∈ ∆|S| be the true transition probability distribution for (s, a). Then for any (s, a), the following inequality holds: P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ϵP t ≥ 1 − δP t , (5.14) where ∥ · ∥1 is the L1 norm of a vector and ˆPt(s, a) is the empirical transition probability vector with components ˆPt(s′|s, a) defined in (5.2), and ϵP (cid:113) 2[log(2|S|−2)−log(δP t )] . t = nt(s,a) Proof. For brevity of notation, let ϵP and δP denote ϵP t and δP t , respectively. Using [101, Theorem 2.1], we have: P (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)1 (cid:17) ≤ ϵP ≥ 1 − (2|S| − 2)e− nt(s,a)ϵ2 P 2 . (5.15) Setting δP = (2|S| − 2)e− nt(s,a)ϵ2 P 2 , yields the desired result. Lemmas 6 and 7 provide concentration bounds on the reward and transition probability based on how often a state-action pair is visited. Lemma 8. (Concentration of reward and transition probability functions) Let δR t = t = δt δP 2|S||A| . 
Then, for any ρt > 0, there exists an n ∈ O (cid:16) |S| ρ2 t + 1 ρ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt such that when each state-action pair (s, a) is observed nt(s, a) ≥ n times, then the following statements hold for any (s, a): (cid:16) (i) P |R(s, a) − ˆRt(s, a)| ≤ ρt (cid:17) ≥ 1 − δt 2|S||A| , (ii) P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ρt ≥ 1 − δt 2|S||A| . 65 (cid:13) (cid:13) P(s, a) − ˆPt(s, a) Proof. Using Lemmas 6 and 7, we know that |R(s, a)− ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) (cid:13) (cid:13)1 (cid:114) (Rmax)2 log( 4|S||A| 2nt(s,a) ρt holds for any (s, a) with at least probability 1 − δt for ρt ≥ 2|S||A| and δt ) ≤ (cid:114) ρt ≥ 2[log(2|S|−2)−log( δt 2|S||A|)] nt(s,a) , respectively. Hence, we have that nt(s, a) ≥ max (cid:40)(Rmax)2 log( 4|S||A| 2ρ2 t δt ) , µ ρ2 t (cid:41) , (5.16) where µ = 2 (cid:104) log(2|S| − 2) + log (cid:17)(cid:105) (cid:16) 2|S||A| δt , is sufficient to guarantee that |R(s, a)− ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) Hence, we can choose n ∈ O ≤ ρt holds for any (s, a) with at least probability 1 − δt (cid:16) |S| ρ2 t such that (5.16) holds, and hence, the (cid:16) |S||A| δt + 1 ρ2 t 2|S||A| log (cid:17)(cid:17) . lemma follows. Remark 6 (Concentration inequalities). The concentration inequalities in Lemmas 6, 7, and 8 use a deterministic value of nt(s, a). However, these bound also apply if nt(s, a) is a realization of a random process that is independent of ˆRt(s, a) and ˆPt(s, a), which would be the case in this paper. Remark 7 (Uncertainty set). Using union bounds over all (s, a) and Lemma 8, it follows that uncertainty sets RU t stated in Theorem 4 are ρt-level uncertainty sets for R(s, a) and P(s, a), respectively, for each (s, a) ∈ S × A with at least probability 1 − δt. Thus, the t and P U policy obtained at time t in an exploitation epoch is robust to estimation uncertainties with at least probability 1 − δt. Lemma 9 (Loss in robust value function). Suppose |R(s, a) − ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) ≤ ρt, for any (s, a), at time t. Then Lt(s) = V ∗(s) − V ˆπR t (s) satis- fies: (cid:18) Lt(s) ≤ 2ρt 2 + Rmaxγ (1 − γ)2 (cid:19) , (5.17) where V ˆπR t (s) is the value of state s under the robust policy ˆπR t at time t. Proof [Sketch]: The upper bound on the loss is obtained by following a similar analysis as in [102] that provides an upper bound on the loss by considering the uncertainty in P only. 66 The analysis in [102] can be extended by considering the uncertainty in R as well. The details of the proof can be found in [103]. Lemma 9 provides the bounds on the loss in robust value function w.r.t. the optimal value function using the concentration bounds on the rewards and transition probabilities. Proof of Theorem 1: Using Lemma 8, we know that when each state-action pair (s, a) is sampled n ∈ O (cid:16) |S| ρ2 t + 1 ρ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt (s, a): times, then the following inequalities holds for any (cid:16) P |R(s, a) − ˆRt(s, a)| ≤ ρt (cid:17) ≥ 1 − P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ρt ≥ 1 − , δt 2|S||A| δt 2|S||A| (5.18) (5.19) . Hence, using Lemma 9 and applying union bounds, we obtain that the following holds with at least probability 1 − (δR t + δP t )|S||A| V ∗(s) − V ˆπR t (s) ≤ 2ρt (cid:18) 2 + Rmaxγ (1 − γ)2 (cid:19) . 
Setting ρt = ϵt 2 (cid:16) 2 + Rmaxγ (1−γ)2 (cid:17)−1 and δR t = δP t = δt 2|S||A| , (cid:16) P V ∗(s) − V ˆπR t (s) ≤ ϵt =⇒ P (cid:16) ∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵt (cid:17) (cid:17) ≥ 1 − δt, ∀s ∈ S ≥ 1 − δt. Additionally, the order of n in terms of ϵt becomes n ∈ O (cid:16) |S| ϵ2 t + 1 ϵ2 t log (cid:16) |S||A| δt (cid:17)(cid:17) . (5.20) (5.21) ■ In Theorem 4, we obtain the number of times n each state-action pair needs to be visited to reduce the estimation uncertainty in rewards and transition probabilities to obtain an ϵt-optimal policy with probability at least 1 − δt. Now we estimate the total number of exploration steps that are needed to ensure that each state-action pair is visited at least n times. Lemma 10 (Adapted from [104, Theorem 3] ). For an ergodic Markov chain with state space S and stationary distribution ϕss, let τ = τ (σ) be the σ-mixing time1 with σ ≤ 1 8. Let 1σ-mixing time for an ergodic Markov chain in the minimal time until the distribution of Markov chain is σ-close in total variation distance to its steady state distribution [105]. 67 ϕ0 be the initial distribution on S and let ∥ϕ0∥ϕss = (cid:113)(cid:80) s∈S ϕ0(s)2 ϕss(s) . Let nvis(si, N ) be the number of times state si ∈ S is visited until time N . Then, for any 0 ≤ κ ≤ 1, there exists a constant c > 0 (independent of σ and κ) such that: P (nvis(si, N ) ≥ (1 − κ)N ϕss(si)) ≥ 1 − c∥ϕ0∥ϕsse− κ2N ϕss(si) 72τ . (5.22) Proof. See [104, Theorem 3] for the proof. In the exploration epoch, at any state si ∈ S, we choose actions uniformly randomly. Consider the Markov chain on S that is associated with the uniform action selection policy. Let {ϕ0(s)}s∈S and {ϕss(s)}s∈S, respectively, be the associated initial and stationary distri- bution. We can also consider an equivalent lifted Markov chain on S × A with states (si, aj) such that si ∈ S and aj ∈ A. The lifted Markov chain has the initial and stationary distri- bution, ϕ0(s, a) = ϕ0(si) |A| and ϕss(s, a) = ϕss(si) |A| , respectively. Hence, we can apply Lemma 10 to obtain the probability of visiting a state-action pair (s, a) at least (1 − κ)N ϕss(s, a) times after N time steps under the uniform action selection policy. We now design a sequence of exploration and exploitation epochs. Let αi and βi be the lengths of the i-th exploration and exploitation epoch, respectively. Let ϕmin ss := min(s,a)∈S×A ϕss(s, a) ¯Ni (1−κ)ϕmin and Ni = δαi := c∥ϕ0∥ϕsse− κ2Niϕmin obtained by tuning N and κ in (5.22). 72τ ss ss , where ¯Ni is the upper bound in (5.16) associated with (ϵi, δi). Let . Note that the desired values of (1 − κ)N ϕss(si, aj) and δαi can be Theorem 5 (Regret bound for DSEE algorithm). Let the length of exploitation epochs in DSEE be exponentially increasing, i.e. βi = ηi, η > 1. Let ϵi = η− i 3 and δi = η− i 3 such that P(∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵi) ≥ 1 − δi after exploration epoch i. For any δ ∈ (0, 1), set δαi = 6δ |S||A|π2i2 . Then, the cumulative regret for the DSEE algorithm RT ∈ O((T ) 2 3 log(T )) grows sub-linearly with time T with probability at least 1 − δ. Proof. We note that the system state at the start of the i-th exploration epoch might be different from the final state at the end of the (i − 1)-th exploration epoch. Therefore, we 68 remember the final state of the previous exploration epoch and wait for the same state to restart the new exploration epoch. For the ergodic MDP under the uniform action selection policy (assumption A2), we know that the expected hitting time is finite [106]. 
Let U ∈ R>0 be a constant upper bound on the expected hitting time to reach the final state in the previous exploration epoch from an arbitrary initial state in the current exploration epoch. Hence, the cumulative regret during the i-th exploration epoch of length αi is upper-bounded by (U + αi)Rmax, where Rmax = Rmax 1−γ is the maximum instantaneous regret. Since at start of the exploitation epoch i of length βi, P(∥V ∗(s)−V ˆπR t (s)∥∞ ≤ ϵi) ≥ 1−δi, the expected cumulative regret during the exploitation epoch is (1 − δi)βiϵi + δiβiRmax. Therefore, the total cumulative regret after k sequences of exploration and exploitation each is upper bounded by: RTk ≤ ≤ k (cid:88) i=1 k (cid:88) i=1 ((αi + U )Rmax + (1 − δi)βiϵi + δiβiRmax) ((αi + U )Rmax + βiϵi + δiβiRmax) . (5.23) j=1(αj +U )+(cid:80)k Let Ti be the time at the end of the i-th exploitation epoch. Then, (cid:80)k (cid:80)k j=1 βj < Tk ≤ j=1 βj. We design the length of the exploitation epochs to be exponentially j=1 ηj) = O(ηk). Let ϵi = η−di and δi = η−gi, where d ∈ (0, 1) and g ∈ (0, 1) are constants that we design later. Thus, (5.23) increasing, i.e., βi = ηi, for η > 1. Thus, Tk ∈ O((cid:80)k can be written as: RTk ≤ Rmax (cid:33) αi + kU + (cid:32) k (cid:88) i=1 k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g). (5.24) For a state-action pair (s, a), where s ∈ S and a ∈ A, let δαi (s,a) := c∥ϕ0∥ϕsse− κ2Niϕss(s,a) 72τ . Therefore, using Lemma 10, at the end of the i-th epoch, P (nvis(s, a, Ni) ≥ (1 − κ)Niϕss(s, a)) ≥ 1 − δαi (s,a). (5.25) Recall δαi := c∥ϕ0∥ϕsse− κ2Niϕmin 72τ ss , where ϕmin ss := min(s,a) ϕss(s, a). Substituting Ni = 69 ¯Ni (1−κ)ϕmin ss in (5.25), P (cid:0)nvis(s, a, Ni) ≥ ¯Ni (cid:1) ≥ 1 − |S||A| (cid:88) m=1 δαi, (5.26) for each state-action pair (s, a). Therefore, in Ni time steps of the lifted Markov chain, each (s, a) is visited at least ¯Ni times with high probability. Thus, Ni is an upper bound on (cid:80)i j=1 αj with probability in (5.26). Therefore, using union bounds, with high probability 1 − (cid:80)k j=1 (cid:80)|S||A| m=1 δαj , (cid:80)k j=1 αj ≤ Nk = ¯Nk (1−κ)ϕmin ss , and hence, RTk ≤ Rmax ¯Nk (1 − κ)ϕmin ss + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g). (5.27) Using Theorem 4, ¯Nk ∈ O (cid:16) |S| ϵ2 k + 1 ϵ2 k log (cid:17)(cid:17) (cid:16) |S||A| δk . Therefore, RTk ≤ Rmaxλ (1 − κ)ϕmin ss (cid:18) |S| ϵ2 k + 1 ϵ2 k log (cid:18) |S||A| δk (cid:19)(cid:19) + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g) ≤ Rmaxλ (1 − κ)ϕmin ss (cid:0)η2dk|S| + η2dk log (cid:0)ηgk|S||A|(cid:1)(cid:1) + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g), (5.28) for some constant λ. Recall that Tk ∈ O(ηk), which implies k ∈ O(log(Tk)). Let Z be the right-hand side of (5.28). Then, we have: Z ∈ O (cid:0)(Tk)2d + (Tk)2d log(Tk) + (Tk)(1−d) + (Tk)(1−g)(cid:1) ∈ O((Tk) 2 3 log(Tk)), (5.29) by choosing d = g = 1 3 linearly with time Tk with probability at least 1 − (cid:80)k . Hence, the cumulative regret RTk ∈ O((Tk) 2 (cid:80)|S||A| m=1 δαi. i=1 3 log(Tk)) grows sub- Setting δαi = 6δ |S||A|π2i2 , we have (cid:80)k i=1 (cid:80)|S||A| m=1 δαi ≤ δ. 5.4 Conclusions We proposed a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aims to simultaneously learn the system model, i.e., an MDP, 70 and the associated optimal policy. During exploration, we uniformly sample the action in each state and update the estimates of the mean rewards and transition probabilities. 
These estimates are used in the exploitation epoch to obtain a robust policy with high probability. We designed the length of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time. 71 CHAPTER 6 FOSTERING HUMAN LEARNING IN SEQUENTIAL DECISION-MAKING: UNDERSTANDING THE ROLE OF EVALUATIVE FEEDBACK In this chapter, we investigate the role of feedback in fostering human learning in sequential decision-making tasks. Cognitive rehabilitation, STEM skill acquisition, and coaching games such as chess often require tutoring decision-making strategies. The advancement of AI- driven tutoring systems for facilitating human learning requires an understanding of the impact of evaluative feedback on human decision-making and skill development. To this end, we conduct human experiments using Amazon Mechanical Turk to study the influence of evaluative feedback on human decision-making in sequential tasks. In these experiments, participants solve the Tower of Hanoi puzzle and receive AI-generated feedback while solving it. We examine how this feedback affects their learning and skill transfer to related tasks. We also explore various computational models to understand how people incorporate evaluative feedback into their decision-making processes. 6.1 Background and Problem Formulation We investigate the influence of evaluative feedback on human performance in a sequential decision-making task through experimental evaluations and computational modeling. To this end, we conducted experiments, where the participants were asked to solve the ToH puzzle. ToH is a puzzle in which disks with a priority order are placed on three pegs. The priority order determines which disk can be placed on top of another disk and each instance of admissible disk placement is referred to as a configuration. Thus, for a four-disk and a five- disk ToH, there are 34 = 81 and 35 = 243 possible configurations, respectively. The goal is to move one disk at a time and reach the desired configuration while maintaining the priority order at each time. Consider the ToH puzzle with n disks, where the disks are numbered {0, 1, . . . , n − 1} in ascending order of size, and the three pegs are numbered {0, 1, 2} from left to right. The state of the n-disk ToH can be represented as Sn = (s0s1 . . . sn−1), where si ∈ {0, 1, 2} denotes the 72 Figure 6.1 State space of a 4-disk ToH with 81 states. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The task is to reach the configuration associated with a randomly selected target state (for example 2201 in this figure). Warmer colors are associated with the higher value function (see Sec. 6.1.1 for discussion). peg on which disk i is placed, for 0 ≤ i ≤ n − 1. Each state in an n-disk ToH has either two or three possible state transitions as can be seen by the state space of a 4-disk ToH shown in Fig. 6.1. 6.1.1 Evaluative Feedback We train an RL agent that is capable of optimally solving the ToH puzzle. The ToH is a finite state space and finite action puzzle, and thus, an optimal policy can be derived using tabular RL methods as described in [31, 107]. In Figure 6.1, we demonstrate the optimal value function for the standard 4-disk ToH. 
To obtain the optimal value function for a given target state, we utilize the value iteration algorithm [31], where the reward function r(s) : S → R is designed as follows:   r(s) = 1, if s is the target state, (6.1)  0, otherwise. Using the reward function in (6.1) results in an optimal value function that is proportional to the length of the shortest path for each state to the target state. The obtained optimal value 73 Figure 6.2 State space of a 4-disk ToH with 81 states. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The state space can be visualized as comprising three triangular structures. The states that connect different triangular structures are critical states to transition between triangles. function is utilized to provide evaluative feedback to the human player based on the change in the value at states before and after the move. We deploy several feedback mechanisms as detailed in Section 6.2.1 and systematically explore how human decision-making is influenced by different feedback mechanisms. The state space of the ToH problem exhibits a recursive structure. Specifically, the state space of a ToH puzzle with n disks can be effectively illustrated using three interlocking triangles. Each of these triangles symbolizes the state space of a ToH puzzle with n − 1 disks. To illustrate this concept, let’s examine the state space of a 4-disk ToH in Figure 6.2, which is highlighted in red. In the same figure, the blue and green squares are employed to represent the state spaces of 3-disk and 2-disk ToH puzzles, respectively. Hence, the state space for the ToH with n − 1 disks can be simply achieved by removing the last digit from each state in the upper triangle of the n-disk ToH. This digit corresponds to the position of the largest disk. As illustrated in Fig. 6.2, the state space of the 4-disk ToH puzzle can be decomposed into three triangles labeled as T1, T2, and T3. Throughout the remainder of the manuscript, 74 we will consistently refer to the regions of the state space as follows: the top triangle will be denoted as T1, the lower left triangle as T2, and the lower right triangle as T3. These triangles are interconnected at their vertices through single edges. These vertex states are critical states, transitioning from one triangle to another necessitates passing through these states. For instance, starting from an initial state in T1, the optimal path to reach a desired state in T2 or T3 must involve the state transitions 1110 −→ 1112 and 2220 −→ 2221, respectively. Indeed, to master the art of solving the ToH puzzle effectively, one must grasp its inherent recursive structure. Success in solving the puzzle relies on systematically working towards reaching the crucial critical states within the state space. 6.1.2 Human Rewards using Maximum Entropy Inverse Reinforcement Learn- ing In the context of human participants solving the ToH puzzle, we can perceive them as noisy optimal agents striving to optimize an implicit reward function. Utilizing their demonstrations, we can leverage Inverse Reinforcement Learning (IRL) techniques to deduce a reward function [108]. This reward function is designed to align the optimal policy with the observed human demonstrations. The maximum entropy IRL [109, 110] assumes human demonstrations are not perfect and allows us to learn from sub-optimal demonstrations by incorporating a probabilistic model that captures the variability in human behavior. 
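Before turning to the details of the IRL formulation, the value iteration of Sec. 6.1.1 that produces the value function used for the evaluative feedback can be sketched as follows. This is a minimal illustration, assuming the state encoding of Sec. 6.1 (a string s_0s_1s_2s_3 giving the peg of each disk), the reward (6.1) with the target treated as absorbing, and the example target 2201 from Fig. 6.1; it is not the dissertation's implementation.

from itertools import product

# Value iteration for the 4-disk Tower of Hanoi with the sparse reward (6.1).
n_disks, gamma = 4, 0.9
states = [''.join(map(str, cfg)) for cfg in product(range(3), repeat=n_disks)]  # 81 states

def neighbors(state):
    # legal single-disk moves: only the smallest disk on a peg can move, and it
    # may only be placed on an empty peg or on a larger disk
    pegs = {p: [d for d in range(n_disks) if int(state[d]) == p] for p in range(3)}
    moves = []
    for p_from in range(3):
        if not pegs[p_from]:
            continue
        top = min(pegs[p_from])
        for p_to in range(3):
            if p_to != p_from and (not pegs[p_to] or top < min(pegs[p_to])):
                moves.append(state[:top] + str(p_to) + state[top + 1:])
    return moves

target = '2201'                      # example target state from Fig. 6.1
V = {s: 0.0 for s in states}
V[target] = 1.0                      # reward (6.1): r = 1 only at the target
for _ in range(100):                 # value-iteration sweeps
    for s in states:
        if s != target:
            V[s] = gamma * max(V[s2] for s2 in neighbors(s))
print(round(V['0000'], 3))           # value of the all-disks-on-peg-0 start state

In this sketch the converged value of a state decays geometrically with its shortest-path distance to the target (a monotone transformation of the path length), so the sign of the change in value across a move indicates whether the move shortened or lengthened the optimal path, which is what makes it a natural per-move evaluative signal.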
Maximum entropy IRL has gained significant traction in the literature as a means to effectively learn from human demonstrations [111]. Consider a Markov decision process [112] M = {S, A, T, γ, r}, where S is the state space, A is the action space, T^{sa}_{s′} is the probability of transitioning from state s ∈ S to state s′ ∈ S under action a ∈ A, γ ∈ [0, 1) is the discount factor, and r : S × A → R is the reward function. In the case of the n-disk ToH, each edge originating from a given state s in the state transition graph can be considered a unique action. Under these actions, the transition probability T^{sa}_{s′} equals 1 for the transition from state s to s′ if they are connected through an edge. Let D = {ζ_1, . . . , ζ_N} be the set of N demonstrations, where each demonstration ζ_i is a path ζ_i = {(s_{i,0}, a_{i,0}), . . . , (s_{i,T}, a_{i,T})}. The unknown reward function r is expressed as a linear combination of a set of predefined features f : S → R, and the weights associated with these features are learned through the maximum entropy IRL algorithm.

In the maximum entropy IRL framework, the probability of following a particular trajectory ζ is proportional to the exponential of the accumulated reward along that path. This leads to a stochastic behavior model in which the probability of taking a specific action a in a given state s is determined by the exponential of the expected total reward following that action, i.e., P(a|s) ∝ exp(Q^r_{sa}), where Q^r_{sa} is computed as Q^r_{sa} = r + γ T V^r_s. The value function V^r_s is computed using a "soft" variant of the familiar Bellman operator: V^r_s = log Σ_a exp(Q^r_{sa}). Consequently, the probability of action a in state s is normalized by exp(V^r_s), yielding P(a|s) = exp(Q^r_{sa} − V^r_s). The complete log-likelihood of the observed data under the reward function r can be expressed as:

log P(D|r) = Σ_i Σ_t log P(a_{i,t}|s_{i,t}) = Σ_i Σ_t ( Q^r_{s_{i,t} a_{i,t}} − V^r_{s_{i,t}} ).  (6.2)

Interested readers are referred to [111] for detailed derivations. We employ maximum entropy IRL to infer the reward functions associated with human behavior. Detailed results are presented in Section 6.3.2.

6.1.3 Modeling human sequential decision-making under feedback

A central and challenging aspect of designing efficient tutoring systems lies in understanding the impact of evaluative feedback from AI on human decision-making. Specifically, the question of how humans incorporate feedback into their decision-making processes is of paramount importance. In modeling this process, a foundational challenge is understanding how they interpret the feedback, including whether it is seen as an immediate reward or as an evaluation of long-term impacts. Further, it is important to explore whether feedback relates only to the current action or spans the sequence of actions. Additionally, it is vital to understand whether evaluative feedback affects the assessment of value functions over time or momentarily influences action choices. To tackle these questions, we develop candidate models that embody different mechanisms for incorporating feedback into human decision-making processes. Our models are inspired by the Training an Agent Manually via Evaluative Reinforcement (TAMER) framework [113–116], developed to incorporate human feedback into the policy of an artificial RL agent. Let Ĥ(s, a) and |f| denote the evaluative feedback and the number of predefined features f employed to define human rewards, respectively.
We study four different models detailed as follows: • Model 1 - Ignore feedback: This baseline model operates under the assumption that evaluative feedback isn’t directly integrated into human decision-making processes. Instead, individuals are postulated to focus on maximizing the long-term value derived from their personal reward functions. In this framework, evaluative feedback plays an indirect role by shaping and refining these reward functions. This is the default model studied in Sec. 6.1.2. The model encompasses |f | learned parameters. • Model 2 - Update Q(s, a): In this model, we postulate that humans interpret evaluative feedback as an indicator of the long-term effectiveness of their strategic actions, serving as an approximation of Qr sa . The model integrates this feedback to update the Q- estimate as follows: Q′(s, a) = Q(s, a) + k ˆH(s, a), (6.3) where k is a parameter to be learned. Consequently, the policy gets updated as P (a|s) = exp (Qsa ′ − Vs ′), where V ′ s = log (cid:80) a exp (Q′ sa) denotes the newly adjusted value function. The model encompasses |f | + 1 learned parameters. • Model 3 - Update r(s, a): In this model, we postulate that humans perceive the eval- uative feedback as a measure of the myopic effectiveness of the strategy, serving as an 77 approximation of r(s, a). The model updates the human rewards as follows: r′(s, a) = r(s, a) + k ˆH(s, a), (6.4) where k is a parameter to be learned. The updated reward function is used to estimate the Q-values and the policy. The model encompasses |f | + 1 learned parameters. • Model 4 - Feedback as a measure of Q(s, a): In this model, we assume that humans ignore learning by interaction and treat evaluative feedback as a fixed measure of Q(s, a). Therefore, Q′(s, a) = k ˆH(s, a), (6.5) where k is the only parameter to be learned. Remark 8. It’s worth noting that the log-likelihood of the maximum entropy IRL depends solely on Q(s, a) through P (a|s). Consequently, Model 2 can be considered equivalent to another model where humans utilize feedback solely to influence their action selection. In this alternate model, humans do not incorporate evaluative feedback into their estimation of Q(s, a); rather, they use it exclusively to bias their action selection, i.e., P (a|s) ∝ exp(Q(s, a) + k ˆH(s, a)). In the context of maximum entropy IRL, this model is tanta- mount to Model 2, where humans employ evaluative feedback to update their Q-estimates as Q′(s, a) = Q(s, a) + k ˆH(s, a). We investigate these models in Sec 6.3.3 to understand how humans incorporate evalua- tive feedback in their decision-making. 6.2 Human Experiments In this section, we discuss the human experiments conducted using AMT. 6.2.1 Experiment Design We examine the effect of evaluative feedback on sequential decision-making using ToH task. To achieve this, we designed five separate experiments, each featuring a different type of feedback. Participants for each experiment were recruited randomly through AMT. Each 78 participant first solved a 4-disk ToH task ten times (training task) and then a 5-disk ToH task five times (transfer task) to evaluate their skill transfer to a more challenging task. The initial state of each puzzle was standardized with all the disks located on the first peg. Considering the state-space of the puzzle as comprising of three interlocking triangles, the target state was randomly selected from the states within the triangles that did not include the initial state, i.e., triangles T2 and T3. 
The participants were given a maximum number of moves mallowed to solve the puzzle, calculated as : mallowed = ⌈1.5 × mmin⌉, where mmin represents the minimum number of moves from the initial configuration to the final configuration, as determined by the minimum path length in the state graph (Fig. 6.1). The only difference among the experiments was the feedback provided during the 4-disk ToH training task. No feedback was provided during the 5-disk ToH transfer task in any of the experiments. In each experiment, participants were asked to try their best to get the highest scores. The feedback and scoring metrics used during the training task for the five experiments were: (i. Experiment 1 - No feedback: In this experiment, the participants solved the 4-disk ToH puzzle without any feedback. The scoring metric for these tasks was selected as: S = 10(mallowed − mused + 1), (6.6) where mused is the total moves used to solve the puzzle. The participant receives a score of 0 if the puzzle remains unsolved after exhausting the allowed number of moves. The same scoring metric was used in the 5-disk transfer tasks in all the experiments. (ii. Experiment 2 - Numeric feedback: The participants in this experiment received visual feedback on each move they made while solving the 4-disk ToH. The feedback was in the form of text that reads “good move +2" or “bad move −2", indicating whether the 79 move increased or decreased the value of the state, respectively. The scoring metric for the training tasks in this experiment was selected as: S = 10(mallowed − mused + 1) + 2(mgood − mbad), (6.7) where mgood ∈ N and mbad ∈ N denote the number of good and bad moves, respectively. (iii. Experiment 3 - Optional feedback: In this experiment, the participants did not receive visual feedback automatically but had the option to request it by pressing a button, which came at the cost of a small penalty. If the participant requested feedback, they would receive the same visual feedback as in Experiment 2, which evaluated their last move. The scoring metric for the training tasks in this experiment was as follows: S = 10(mallowed − mused + 1) − foptional, (6.8) where foptional ∈ N denotes the number of times the participant requests feedback. (iv. Experiment 4 - Sub-goal: In the state graph of the 4-disk ToH task (as illustrated in Fig. 6.1), states 1110 and 2220 are critical in reaching the target states efficiently in triangles T2 and T3, respectively. In this experiment, based on the target configuration, the participants were presented with an intermediate sub-goal configuration (1110 or 2220) in addition to the target configuration. The participant was instructed to try to reach the intermediate sub-goal first. The scoring metric for the training tasks in this experiment was as follows: S = 10(mallowed − mused + 1) + 5ssubgoal, (6.9) where ssubgoal ∈ {0, 1} was set to 1 if the participant successfully reaches intermediate sub-goal configuration, and 0 otherwise. (v. Experiment 5 - Sub-goal with numeric feedback: In this experiment, the participants received both the visual feedback as in Experiment 2 and the intermediate sub-goal 80 configuration as in Experiment 4. The scoring metric for the training tasks in this experiment was calculated as follows: S = 10(mallowed − mused + 1) + 2(mgood − mbad) + 5ssubgoal. (6.10) This experiment provided the participants with the maximum amount of evaluative feedback. 
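The five scoring rules (6.6)-(6.10) can be summarized in a single function. The sketch below is for illustration only: the experiment labels and argument names are hypothetical, and the zero-score rule for unsolved puzzles is assumed to apply across all experiments.

import math

def score(experiment, m_allowed, m_used, solved,
          m_good=0, m_bad=0, f_optional=0, reached_subgoal=False):
    if not solved:
        return 0                                   # unsolved puzzles score 0
    base = 10 * (m_allowed - m_used + 1)           # common term in (6.6)-(6.10)
    if experiment == "no_feedback":                # (6.6)
        return base
    if experiment == "numeric_feedback":           # (6.7)
        return base + 2 * (m_good - m_bad)
    if experiment == "optional_feedback":          # (6.8)
        return base - f_optional
    if experiment == "subgoal":                    # (6.9)
        return base + 5 * int(reached_subgoal)
    if experiment == "subgoal_numeric":            # (6.10)
        return base + 2 * (m_good - m_bad) + 5 * int(reached_subgoal)
    raise ValueError(experiment)

# example: a numeric-feedback trial with shortest path of 15 moves, solved in 20 moves
m_min = 15
m_allowed = math.ceil(1.5 * m_min)                 # allowed moves, as defined above
print(score("numeric_feedback", m_allowed, m_used=20, solved=True,
            m_good=15, m_bad=5))                   # 10*(23-20+1) + 2*(15-5) = 60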
6.2.2 Methods After receiving the IRB consent (MSU IRB #8421) from Michigan State University’s IRB office, we recruited 238 participants using AMT for the study. Inclusion criteria were established as having completed a minimum of 500 prior studies and maintaining a 98% approval rate on the platform. Participants were compensated with a base payment of $6 and had the opportunity to earn additional performance-based bonuses ranging from $0 − $4. Of the recruited participants, 78 participants were excluded due to self-reported prior experience with the ToH task. 6.3 Results In this section, we discuss the results of the experiments conducted on AMT. 6.3.1 Performance under Evaluative Feedback First, we collect the data of 20 participants each for the 5 set of experiments detailed in Section 6.2.1. In every experiment, we assess participants’ performance by calculating their percentage scores for both the training and transfer tasks as follows: 100 × (cid:18) mallowed − mused + 1 mallowed − mmin + 1 (cid:19) . In Fig. 6.3a, we present box plots illustrating the percentage scores achieved in the training tasks (4-disk ToH). Notably, participants who underwent training with evaluative feedback in Experiment 2 (numeric feedback) and Experiment 5 (sub-goal with numeric 81 (a) Training. (b) Transfer. Figure 6.3 Box plots displaying percentage scores for both training (a) and transfer (b) tasks. Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th percentiles, respectively. Whiskers extend to encompass the most extreme data points that are not classified as outliers, and individual outliers are plotted using the symbol ‘+’. feedback) exhibited significantly improved performance during these training tasks compared to participants in Experiment 1 (no feedback), who received no evaluative feedback. In Experiment 3 (optional feedback), participants seldom requested feedback to avoid the feedback penalty, resulting in performance levels akin to those observed in Experiment 1. Experiment 4 (sub-goal) introduced a unique approach, where participants were exclusively exposed to sub-goal configuration (1110 or 2220) crucial for reaching the desired target state. In the absence of evaluative feedback, this method resembled the conditions of Experiment 1, where the sub-goal can be effectively thought as a target state until the sub-goal state is reached. We hypothesize that supplying solely sub-goal configurations without evaluative feedback may induce confusion, as participants may now consider two target states simul- taneously—the sub-goal and the target state. Consequently, participants in Experiment 4 exhibited a marginal decrease in performance compared to those in Experiment 1. In Fig. 6.3b, we present box plots illustrating the percentage scores achieved in the trans- fer tasks involving the 5-disk ToH. It’s important to note that solving the 5-disk ToH, with its 243 states, presents a significantly greater challenge compared to the training task, which involved the 4-disk ToH with 81 states. Furthermore, participants had no prior experience with the 5-disk ToH and relied solely on their training with the 4-disk ToH. Consequently, the transfer tasks yielded relatively lower scores, with many trials failing to solve the puzzle 82 within the allotted number of moves, which can make it challenging to interpret the box plots in Fig. 6.3b. 
To focus on successful outcomes, we filtered for positive percentage scores in each experiment, representing the trials in which participants successfully solved the ToH puzzle. Table 6.1 provides an overview of the percentage of successful trials for each experiment in both the training and transfer tasks. Notably, Experiments 2 and 5 demonstrated a substantial improvement in successful trials, showing increases of 33.5% and 36%, respectively, over Experiment 1 in the training tasks. In the transfer tasks, Experiments 2 and 5 also showed notable improvements, with success rates increasing by 13% and 26%, respectively, compared to Experiment 1.

To assess the statistical significance of these findings, we conducted two-sample t-tests comparing the results of Experiments 2 and 5 with the data from Experiment 1. Remarkably, the p-values for Experiment 2 (in comparison to Experiment 1) and Experiment 5 (relative to Experiment 1) are 1.59 × 10^{−12} and 1.71 × 10^{−17}, respectively, in the training tasks, indicating highly significant differences. In the transfer tasks, the p-values are 3.9 × 10^{−2} and 7.17 × 10^{−4} for Experiments 2 and 5 compared to Experiment 1, respectively. Consistent with the commonly accepted significance level of 0.05, a p-value below this threshold leads us to reject the null hypothesis, indicating that the data from the two experiments do not arise from the same distribution at the 5% significance level. These results underscore the substantial impact of evaluative feedback on performance in both the training and transfer tasks.

                                   Training   Transfer
No feedback                        58%        28%
Numeric feedback                   91.5%      41%
Optional feedback                  62%        35%
Sub-goal                           52.5%      22%
Sub-goal with numeric feedback     94%        54%

Table 6.1 Percentage of successful trials in the training and transfer tasks.

Fig. 6.4a and 6.4b display box plots representing successful trials after filtering for positive scores. Notably, the medians of these box plots closely align with each other, suggesting that participants' performances in the experiments can be effectively compared solely through the percentage of successful trials. Once participants have successfully learned to solve the ToH puzzle, their scores exhibit relatively little variation across experiments during successful trials. This observation highlights the stability and consistency of participants' performance once they have mastered the task.

Figure 6.4 Box plots displaying positive percentage scores for both training (a) and transfer (b) tasks.

Figure 6.5 Bar plots displaying the mean percentage scores for different trials for both training (a) and transfer (b) tasks.

Recall that each participant completed 10 trials of the training task and 5 trials of the transfer task. In Figures 6.5a and 6.5b, bar plots represent the mean percentage scores for different trials in the training and transfer tasks, respectively. It is evident that participants who received no feedback exhibited relatively low scores compared to those who received either numeric feedback or sub-goals with numeric feedback. Furthermore, while there is no consistent improvement over the trials for participants who did not receive feedback, participants who received evaluative feedback demonstrated performance enhancement with increasing scores across trials.
Similar trends are observable in the transfer tasks, indicating that participants who received evaluative feedback found it easier to transfer their skills to related tasks and showed improvement across trials. The results in Table 6.1 underscore significant improvements in human decision-making attributed to evaluative feedback during training tasks, along with effective skill transfer to related tasks. We employ maximum entropy IRL [110] to investigate the pivotal role of evaluative feedback in shaping human decision-making, as detailed in Sections 6.3.2 and 6.3.3. To enable this analysis, we conducted additional data collection sessions with 20 participants each, encompassing experiments devoid of feedback (Experiment 1) and those involving evaluative feedback (Experiments 2 and 5). 6.3.2 Human Rewards under Evaluative Feedback In this section, we treat humans solving the ToH puzzle as noisy optimal agents striving for optimal play with some implicit reward structure. We examine participants from three sets of experiments: (a) No feedback (Experiment 1), (b) Numeric feedback (Experiment 2), and (c) Sub-goal with numeric feedback (Experiment 5). To gain insights into human learning under these varying feedback conditions, we employ maximum entropy IRL analysis to uncover the underlying human reward structures. Visualizing these human rewards can offer valuable insights into the learning process with and without evaluative feedback. Recall that for each experiment, the initial state is standardized with the starting state represented by the top vertex of triangle T1, and the target state is randomly selected from either triangles T2 or T3 (see Fig. 6.2). For each experiment, we partition the experimental data into two sets, one with target states in T2 and the other with target states in T3. In each of these sets, we learn the human rewards expressed as a linear combination of predefined features. By modifying these predefined features, we consider two different settings where the human rewards are learned for all the states and for a subset of 8 states. To estimate the rewards, we maximize the log-likelihood as defined in (6.2), while applying an L1 penalty to promote sparse rewards. To determine the coefficient of the L1 penalty λ ∈ R≥0, we consider 85 (a) Learned rewards in the training tasks for all states using all trajectories. (b) Learned rewards in the training tasks for all states using only successful trajectories. Figure 6.6 IRL plots displaying learned human rewards in the training tasks for all states, using trajectory datasets (from trials 6-10 for each participant) from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. λ ∈ {0, 0.1, . . . , 2} and perform 5-fold cross-validation on the data and select the coefficient that yields the maximum mean log-likelihood across the 5-fold validation sets. Fig. 6.6a and 6.6b display the learned IRL rewards in the training tasks for all states. While IRL typically assumes expert demonstrations, it’s important to note that participants may still be learning the task during the initial trials. Since the performance does not vary 86 significantly in the latter half of the trials (see Fig. 6.5a), we assume that the human rewards are relatively stationary from trials 6 to 10 and, therefore, exclusively utilize these trials for our IRL analysis. 
From these latter trajectories, we derive IRL rewards, considering both (a) all available trajectories and (b) only the successful ones, where success is defined by reaching the target state. Each of these plots is organized into a grid with 2 rows and 3 columns. The top row represents trajectories with the target state in triangle T2, while the bottom row represents trajectories with the target state in triangle T3. The columns correspond to the three sets of experiments: no feedback, numerical feedback, and sub-goal with numerical feedback arranged from left to right. In Fig. 6.6a, it becomes apparent that participants’ rewards in the experiment with no feedback (first column) exhibit a distribution across all states, encompassing both T2 and T3, despite the target state’s placement in T2 for the first row and in T3 for the second row. The occurrence of high rewards in T3 (respectively T2) when the target state resides in T2 (respectively T3) primarily stems from the unsuccessful attempts to solve the ToH puzzle in each experiment. Consequently, we observe that as participants’ performance improves across experiments from left to right, rewards increase within the triangle containing the target state while decreasing in the opposing triangle. Another noteworthy observation is the presence of high rewards at the critical states (vertices of the target triangle), which serve as pivotal entry points to the target triangle. These rewards become more pronounced as performance enhances from left to right. Fig. 6.6b depicts the learned IRL rewards derived exclusively from successful trajectories in each experiment. Due to the absence of failed trajectories in each experiment, the dis- parities in IRL rewards across experiments, from left to right, become less pronounced. In each experiment, states within the target triangle and critical states exhibit higher rewards compared to the opposing triangles. In Experiments 1 and 2, the elevated rewards along the edge in the opposite triangle, which is closer to the target triangle, suggest that participants 87 in these experiments occasionally complete the puzzle by opting for suboptimal routes. In contrast, participants in Experiment 5 predominantly solve the puzzle utilizing the optimal trajectory. (a) Learned rewards in the transfer tasks for all states using all trajectories. (b) Learned rewards in the transfer tasks for all states using only successful trajectories. Figure 6.7 IRL plots displaying learned human rewards in the transfer tasks for all states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. Fig. 6.7a and 6.7b present the learned IRL rewards for all states within the transfer tasks, utilizing trajectory datasets that encompass (a) all available trajectories and (b) only suc- 88 cessful trajectories. It is important to note that the transfer tasks pose significant challenges, with none of the participants receiving any feedback. Consequently, the trajectories for the transfer tasks in each experiment comprise numerous failed trajectories. However, a noticeable trend emerges: participants from Experiment 5, who were trained using sub-goals with numeric feedback, exhibit faster learning in solving the transfer tasks compared to participants from Experiments 1 and 2, who received no feedback and only numerical feedback, respectively. 
This is evident from the higher rewards within the target triangle and lower rewards in the opposite triangle for Experiment 5. When considering only successful trajectories to derive the IRL rewards in Fig. 6.7b, the differences across experi- ments become less pronounced due to the exclusion of failed trajectories in all experiments. The results presented in Fig. 6.6 and 6.7 offer valuable insights into how humans acquire puzzle-solving skills under various evaluative feedback strategies. However, it’s worth noting that the learned rewards appear less sparse due to the predefined features, which permit non-zero rewards in all states. Consequently, while these learned IRL rewards for all states offer insights into critical states, they can complicate the comparison between experiments. Furthermore, most of the RL rewards are often sparse. To this end, we modify the predefined features to encourage sparser rewards, allowing non-zero rewards in only 8 states for both the training and transfer tasks. These 8 states were thoughtfully selected as the vertices of the smaller triangles within the state space. In Fig. 6.2, these states correspond to 2200, 1100, 1110, 2220, 0012, 2212, 1121, 0021. Fig. 6.8a and 6.8b illustrate the learned IRL rewards for a specific subset of 8 states during the training tasks. These rewards are derived from trajectory datasets obtained from the latter half of the trials (trials 6 to 10) for each participant. We consider two scenarios: (a) using all available trajectories and (b) using only the trajectories that resulted in successful task completion. It is evident that participants from Experiment 5 demonstrate non-zero rewards exclusively within the target triangle and the corresponding critical states. As we progress from left to right, the non-zero rewards in the opposite triangle diminish due to 89 (a) Learned rewards in the training tasks for a subset of 8 states using all trajectories. (b) Learned rewards in the training tasks for a subset of 8 states using only successful trajectories. Figure 6.8 IRL plots displaying learned human rewards in the training tasks for a subset of 8 states, using trajectory datasets (from trials 6-10 for each participant) from each experi- ment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. fewer instances of failure. These differences become less pronounced when we solely consider successful trajectories in Fig. 6.8b. Fig. 6.9a and 6.9b depict the learned IRL rewards for a selected subset of 8 states within the transfer tasks, using trajectory datasets that encompass (a) all available trajectories and 90 (a) Learned rewards in the transfer tasks for a subset of 8 states using all trajectories. (b) Learned rewards in the transfer tasks for a subset of 8 states using only successful trajectories. Figure 6.9 IRL plots displaying learned human rewards in the transfer tasks for a subset of 8 states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. (b) only successful trajectories. While the distinctions are somewhat less pronounced due to the presence of numerous failure attempts in all experiments, the lower rewards in the opposite triangle indicate swifter learning when participants are trained with feedback, in contrast to participants who receive no feedback. 
These differences become less noticeable 91 when we exclusively consider successful trajectories in Fig. 6.9b, effectively eliminating most of the non-zero rewards in the opposite triangle. The results of the max entropy IRL analysis underscore the significance of critical states and demonstrate how human learning in sequential decision-making tasks can be organized more effectively when evaluative feedback is provided, in contrast to participants solely learning through exploration without any feedback. The results further indicate that the participants trained with evaluative feedback exhibit an ability to transfer their learning to newer, related, and more demanding tasks at a significantly accelerated pace compared to those who learn without feedback. 6.3.3 Modeling Human Decision-Making under evaluative feedback In Sec. 6.3.1 and 6.3.2, we have demonstrated the pivotal role of evaluative feedback in enhancing learning and performance within the context of the ToH puzzle. In this section, we delve into exploring models that aim to elucidate the mechanisms through which humans integrate evaluative feedback into their decision-making processes. In our analysis, we explore four distinct models for incorporating feedback into human decision-making, as detailed in Sec. 6.1.3. For each of these models, we calculate both the Akaike information criterion (AIC) [92] and Bayesian information criterion (BIC) [93] to identify the most suitable model. For a given model, the AIC and BIC are defined as: AIC = 2p − 2 log( ˆL), BIC = p log(o) − 2 log( ˆL), (6.11) where p, o, and ˆL denote the number of learned parameters, the number of observations, i.e., the sample size, and the maximized value of the likelihood function of the model, respectively. The model with the lowest AIC (or BIC) is deemed the optimal choice according to AIC (or BIC) criteria. For this analysis, we leverage the experimental data gathered during the training tasks of Experiment 2, where participants received numeric feedback. It is important to note that this numeric feedback is determined based on the change in state value before and after the state transition. Consequently, it is intrinsically tied to the 92 target state, given that the value function is contingent upon the target state. Since the target state is subject to randomization in triangles T2 and T3, we further segment these triangles into three sub-triangles each. This subdivision allows us to categorize the experimental data into six distinct groups, based on the location of the target state within these six sub-triangles. Within each group, we select the top vertex of the sub-triangle as the designated target state and truncate the trajectories to the point at which they initially enter the target sub-triangle. Upon completing this partitioning process for the 200 trajectories obtained from the training tasks, we arrived at six groups, each containing a respective number of trajecto- ries: 41, 34, 27, 31, 32, 35. Within each group, we subject all four models to testing, making appropriate modifications to either the Q-function or the reward function as discussed in Sec. 6.1.3. To estimate the unknown parameters for each model, we employ maximum en- tropy IRL, optimizing the log-likelihood as defined in (6.2) while applying an L1 penalty. To determine the coefficient for the L1 penalty in each model, we consider λ ∈ {0, 0.2, . . . , 1} and perform 5-fold cross-validation. 
This allows us to select the coefficient that results in the highest mean log-likelihood across the five validation sets.

Groups 1-6 contain 41, 34, 27, 31, 32, and 35 observations, respectively. Model 1 has 81 learned parameters, Models 2 and 3 have 82, and Model 4 has 1; AIC and BIC are reported for each model:
23.65 27.76 30.84 21.65 27.69 30.42 26.45 23.02 27.03 31.40 22.70 26.38 34.73 26.94 30.87 25.40 20.68 24.47 31.40 23.49 27.25 34.02 26.56 30.21 23.70 27.82 30.52 21.72 27.76 30.47 27.12 31.50 34.46 25.51 31.51 34.12 22.93 22.89 24.70 24.66 28.54 28.49 22.20 22.16 24.30 24.34 27.62 27.66
Table 6.2 AIC and BIC values (normalized by the number of observations) for different models, allowing non-zero rewards for all 81 states.

Similar to Sec. 6.3.2, we investigate two settings for the predefined features: one that allows non-zero rewards in all states and another that restricts rewards to a subset of 8 states. Table 6.2 presents the AIC and BIC values (normalized by the number of observations) for different models within each group when non-zero rewards are permitted for all 81 states. It is worth noting that, while Model 2, where Q′(s, a) = Q(s, a) + k Ĥ(s, a), emerges as the best fit according to AIC for the majority of the groups, Model 4, where Q′(s, a) = k Ĥ(s, a), is selected as the best fit under the BIC criterion. This preference for Model 4 under BIC is attributed to the significant difference in the number of learned parameters between the two models, with BIC favoring the model with fewer learned parameters. Indeed, allowing non-zero rewards for all 81 states leads to a preference for the model with just a single learned parameter.

Groups 1-6 contain 41, 34, 27, 31, 32, and 35 observations, respectively. Model 1 has 8 learned parameters, Models 2 and 3 have 9, and Model 4 has 1; AIC and BIC are reported for each model:
31.84 41.72 44.04 30.49 42.78 43.53 32.18 22.54 22.92 31.89 42.08 24.03 24.43 41.77 44.43 27.74 28.17 44.12 30.86 21.26 21.68 30.56 43.15 23.62 24.03 42.84 43.59 43.89 27.25 27.65 32.27 42.18 44.55 30.97 43.26 43.99 22.96 22.92 24.80 24.76 28.48 28.43 22.25 22.20 24.45 24.41 27.57 27.62
Table 6.3 AIC and BIC values (normalized by the number of observations) for different models, allowing non-zero rewards for a subset of 8 states.

Table 6.3 presents the AIC and BIC values (normalized by the number of observations) for different models within each group when non-zero rewards are allowed for only a subset of 8 states. This setting represents a more realistic scenario with sparse rewards. Notably, in this context, Model 2 consistently emerges as the best fit according to both the AIC and BIC criteria. This suggests that humans tend to interpret evaluative feedback as a strong indicator of the long-term effectiveness of their strategic actions.

Remark 9. Even though Model 2 stands out as the preferred model according to both AIC and BIC criteria, there is some evidence of support for Model 4 as well (in the case of non-sparse rewards). This suggests that there might be instances where some individuals do not primarily learn through interaction but instead focus on maximizing their evaluative feedback directly. Such individuals could potentially encounter challenges in transfer tasks where evaluative feedback is not available.

6.4 Conclusions

In this work, we study the influence of AI-generated evaluative feedback on human decision-making, with a specific focus on sequential decision-making tasks exemplified by the Tower of Hanoi.
Our study demonstrates that individuals who receive training with evaluative feedback not only experience significant improvements in their decision-making abilities but also excel in transferring these enhanced skills to similar tasks. Through an analysis utilizing the maximum entropy inverse reinforcement learning framework, we show that human learning exhibits a more structured and organized implicit reward pattern when evaluative feedback is provided during the training process. This highlights the critical role played by AI-generated feedback in improving the cognitive and strategic abilities of individuals. Furthermore, our investigation explores various models to better comprehend how humans integrate feedback into their decision-making processes. Our findings provide substantial evidence suggesting that individuals tend to interpret evaluative feedback as a valuable indicator of the long-term effectiveness of their strategic actions. This insight can be leveraged to design intelligent IoT devices capable of enriching human learning experiences and shaping human decision-making.

CHAPTER 7
INCENTIVIZING COLLABORATION IN HETEROGENEOUS TEAMS VIA COMMON-POOL RESOURCE GAMES

In this chapter, we address the challenge of achieving efficient collaboration in a team of heterogeneous agents with different skill-sets and expertise. We consider a team of heterogeneous agents that is collectively responsible for servicing, and subsequently reviewing, a stream of homogeneous tasks. Each agent has an associated mean service time and a mean review time for servicing and reviewing the tasks, respectively. Agents receive a reward based on their service and review admission rates. The team objective is to collaboratively maximize the number of "serviced and reviewed" tasks. We formulate a Common-Pool Resource (CPR) game and design utility functions to incentivize collaboration among heterogeneous agents in a decentralized manner. We show the existence of a unique Pure Nash Equilibrium (PNE), and establish convergence of best response dynamics to this unique PNE. Finally, we establish an analytic upper bound on three inefficiency measures of the PNE, namely the price of anarchy (PoA), the ratio of the total review admission rate (TRI), and the ratio of latency (LI).

7.1 Background and Problem Formulation

In this section, we describe the problem setup and formulate the problem using a game-theoretic framework. We also present some definitions that will be used in the paper.

7.1.1 Problem Description

We consider a heterogeneous team of N ∈ ℕ agents tasked with servicing a stream of homogeneous tasks. These agents could be autonomous systems or human operators. Each task, after getting serviced by a team-member, gets stored in a common review pool for a second review. This second review is a feedback process in which any team-member can re-examine the serviced task from the common review pool for performance monitoring and quality assurance purposes. Each agent i ∈ N = {1, . . . , N} may choose to spend a portion of her time reviewing tasks from the common pool while spending her remaining time servicing the incoming tasks.

Figure 7.1 Player i devotes her time to servicing homogeneous tasks (at a constant service admission rate λ^S_i) while reviewing serviced tasks from the common review pool (at a constant review admission rate λ^R_i). The maximum admission rates of player i for servicing and reviewing the tasks are given by µ^S_i and µ^R_i, respectively.
We consider heterogeneity among the operators due to the difference in their level of expertise and skill-sets in servicing and reviewing the tasks. This heterogeneity is captured by the average service time (µ^S_i)^{-1} ∈ ℝ_{>0} and average review time (µ^R_i)^{-1} ∈ ℝ_{>0} spent by operator i ∈ N on servicing and reviewing a task, respectively. Let λ^S_i ∈ [0, µ^S_i] and λ^R_i ∈ [0, µ^R_i] be the deterministic service and review admission rates, i.e., the rates at which agent i chooses to admit tasks for servicing and reviewing, respectively. Each agent i can choose their service and review admission rates independently of the other agents. The ranges of λ^S_i and λ^R_i have been chosen to satisfy the stability conditions (including marginal stability) for the service and review queues for operator i ∈ N [117, Chapter 8]. Suppose agent i selects λ^S_i and λ^R_i as their service and review admission rates; then

λ^S_i / µ^S_i + λ^R_i / µ^R_i ≤ 1,

where λ^S_i / µ^S_i (respectively, λ^R_i / µ^R_i) is the average time the agent spends on servicing (respectively, reviewing) the tasks within a unit time. Thus, if the agent has selected a review admission rate λ^R_i, then the service admission rate satisfies

λ^S_i ≤ µ^S_i − h_i λ^R_i,     (7.1)

where h_i := µ^S_i / µ^R_i is the heterogeneity measure for player i. We consider self-interested agents that receive a utility based on their service and review admission rates. Hence, we will assume that agents operate at their maximum capacity, and equality holds in (7.1). Fig. 7.1 shows the schematic of our problem setup. Note that only serviced tasks are available for review, and therefore,

Σ_{i=1}^{N} λ^R_i ≤ Σ_{i=1}^{N} λ^S_i.     (7.2)

By substituting (7.1) in (7.2), we obtain

Σ_{i=1}^{N} a_i λ^R_i ≤ Σ_{i=1}^{N} µ^S_i,     (7.3)

where a_i := (1 + h_i). Eq. (7.3) represents the system constraint on the review admission rates chosen by the agents. We are interested in incentivizing collaboration among the agents for better team performance. Towards this end, we propose a game-theoretic setup defined below.

7.1.2 A Common-Pool Resource Game Formulation

We now formulate our problem as a Common-Pool Resource (CPR) game. Henceforth, we refer to each agent as a player. A maximum service admission rate µ^S_i and a maximum review admission rate µ^R_i are associated with each player i, based on her skill-set and level of expertise. Without loss of generality, let the players be labeled in increasing order of their heterogeneity measures, i.e., h_1 ≤ · · · ≤ h_N. Let S_i := [0, µ^R_i] be the strategy set for each player i, from which the player chooses her review admission rate for reviewing the tasks from the common review pool. Since we have assumed that (7.1) holds with equality, once player i decides her review admission rate λ^R_i ∈ S_i, her service admission rate λ^S_i for servicing the tasks is given by the right-hand side of (7.1). Let S = Π_{i∈N} S_i be the joint strategy space of all the players, where Π denotes the Cartesian product. Furthermore, we define S_{−i} = Π_{j∈N, j≠i} S_j as the joint strategy space of all the players except player i. For brevity of notation, we denote the total service admission rate and the total review admission rate by λ^S_T = Σ_{i=1}^{N} λ^S_i and λ^R_T = Σ_{i=1}^{N} λ^R_i, respectively. Similarly, µ^S_T = Σ_{i=1}^{N} µ^S_i and µ^R_T = Σ_{i=1}^{N} µ^R_i denote the aggregated sums of the maximum service admission rates and maximum review admission rates of all the players, respectively.
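As a quick illustration of these quantities, the following is a minimal sketch, with purely hypothetical numbers, of how the heterogeneity measure h_i, the coefficient a_i = 1 + h_i, the induced service rate from (7.1) with equality, and the system constraint (7.3) can be encoded.

```python
from dataclasses import dataclass

@dataclass
class Player:
    mu_s: float  # maximum service admission rate (reciprocal of mean service time)
    mu_r: float  # maximum review admission rate (reciprocal of mean review time)

    @property
    def h(self) -> float:
        # heterogeneity measure h_i = mu^S_i / mu^R_i
        return self.mu_s / self.mu_r

    @property
    def a(self) -> float:
        # coefficient a_i = 1 + h_i appearing in the system constraint (7.3)
        return 1.0 + self.h

    def service_rate(self, lam_r: float) -> float:
        # with equality in (7.1): lambda^S_i = mu^S_i - h_i * lambda^R_i
        assert 0.0 <= lam_r <= self.mu_r, "review admission rate outside the strategy set"
        return self.mu_s - self.h * lam_r

def pool_constraint_satisfied(players, lam_r) -> bool:
    # system constraint (7.3): sum_i a_i * lambda^R_i <= sum_i mu^S_i
    return sum(p.a * lr for p, lr in zip(players, lam_r)) <= sum(p.mu_s for p in players)

# Hypothetical two-player team: player 1 is relatively better at reviewing (smaller h_i).
team = [Player(mu_s=1.0, mu_r=4.0), Player(mu_s=1.2, mu_r=2.0)]
rates = [0.5, 0.3]  # candidate review admission rates
print([p.service_rate(lr) for p, lr in zip(team, rates)], pool_constraint_satisfied(team, rates))
```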
Let x ∈ ℝ, defined by

x = λ^S_T − λ^R_T = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i,     (7.4)

be the slackness parameter for the system constraint (7.3). The constraint (7.3) is violated for negative values of x, i.e., when the total review rate exceeds the total service rate. In such an event, some players commit to reviewing more tasks than are available in the common review pool. The slackness parameter characterizes the gap between the total service admission rate and the total review admission rate for all the players. In order to maximize high-quality team throughput, i.e., the number of tasks that are both serviced and reviewed, we seek to incentivize the team to operate close to x = 0. Each player i receives a constant reward r^S ∈ ℝ_{>0} for servicing each task. Hence, the service utility u^S_i : S_i → ℝ_{>0} for player i servicing the tasks at the service admission rate λ^S_i is given by

u^S_i = λ^S_i r^S.     (7.5)

To incentivize collaboration among the agents, we design the review utility u^R_i : S → ℝ_{>0} received for reviewing the tasks from the common review pool using two functions: a rate of return r^R : S → ℝ_{>0} for each reviewed task, and a constraint probability p : S → [0, 1] of the common review pool. The constraint probability p is a soft penalty on the violation of the system constraint (7.3). We model the rate of return r^R and the constraint probability p in terms of the strategy of all the players through the slackness parameter x. Furthermore, we assume that r^R is strictly decreasing in x. Therefore, for each x ∈ [0, µ^S_T], the system constraint (7.3) is satisfied, and the rate of return is maximized at x = 0. The rate of return can be interpreted as the perks that the employer provides to all the employees for high-quality service. For example, an employer generates higher revenue based on the high-quality throughput of her company, i.e., based on the number of "serviced and reviewed" tasks, which she redistributes among her employees as perks as per their contribution to the review process. The highest-quality throughput is achieved by the company when the team efficiently reviews all the serviced tasks, i.e., when x = 0. We introduce the constraint probability p as a soft penalty on the violation of the system constraint (7.3), and therefore, we let p = 1 if the constraint gets violated, i.e., when x < 0. We assume that the constraint probability p is non-increasing in x and approaches 1 as x approaches 0. The class of sharply decreasing exponential functions, p(λ^R_i, λ^R_{−i}) = exp(−Ax), where x ∈ [0, µ^S_T] and A ∈ ℝ_{>0}, can be a good choice to effectively model the system constraint. While the system constraint (7.3) ensures stability in the asymptotic limit, the deviation of the empirical mean service/review times from the true mean service/review times during the transient may lead to large queue lengths. Hence, the constraint probability p can be interpreted as the deviation probability of the empirical mean service/review times from the true mean service/review times used in (7.3). If the constraint is violated with probability p, then u^R_i = 0 for each player i. Therefore, we define the utility u^R_i by

u^R_i(λ^R_i, λ^R_{−i}) = { 0, with probability p(λ^R_i, λ^R_{−i});   λ^R_i r^R(λ^R_i, λ^R_{−i}), otherwise }.     (7.6)

Let u_i(λ^R_i, λ^R_{−i}) = u^S_i + u^R_i be the total utility of player i ∈ N.
Each player i tries to maximize her expected utility ũ_i : S → ℝ defined by

ũ_i(λ^R_i, λ^R_{−i}) = E[u^S_i(λ^R_i, λ^R_{−i}) + u^R_i(λ^R_i, λ^R_{−i})] = λ^S_i r^S + λ^R_i r^R(λ^R_i, λ^R_{−i})(1 − p(λ^R_i, λ^R_{−i})),     (7.7)

where the expectation is computed over the constraint probability p. Since r^R and p depend on the review admission rates of all the players only through the slackness parameter x, with a slight abuse of notation, we express r^R(λ^R_i, λ^R_{−i}) and p(λ^R_i, λ^R_{−i}) by r^R(x) and p(x), respectively. Substituting (7.1) in (7.7), we get

ũ_i = µ^S_i r^S + λ^R_i [r^R(x)(1 − p(x)) − h_i r^S] =: µ^S_i r^S + λ^R_i f_i(x),     (7.8)

where f_i : S → ℝ is defined by

f_i(λ^R_i, λ^R_{−i}) = f_i(x) = r^R(x)(1 − p(x)) − h_i r^S.     (7.9)

The function f_i is the incentive for player i to review the tasks. Note that player i will choose a non-zero λ^R_i if and only if she has a positive incentive to review the tasks, i.e., f_i(x) > 0. Otherwise, player i drops out without reviewing any task (λ^R_i = 0) and focuses solely on servicing of tasks (λ^S_i = µ^S_i), thereby maximizing her expected utility, given by ũ_i = µ^S_i r^S. In the following, we will refer to the above CPR game by Γ = (N, {S_i}_{i∈N}, {ũ_i}_{i∈N}). In this paper, we are interested in equilibrium strategies for the players that constitute a PNE, defined below.

Definition 2 (Pure Nash Equilibrium). A PNE is a strategy profile λ^{R*} = {λ^{R*}_i}_{i∈N} ∈ S such that, for each player i ∈ N,

ũ_i(λ^{R*}_i, λ^{R*}_{−i}) ≥ ũ_i(λ^R_i, λ^{R*}_{−i}),  for any λ^R_i ∈ S_i.

Let b_i : S_{−i} → S_i, defined by

b_i(λ^R_{−i}) ∈ argmax_{λ^R_i ∈ S_i} ũ_i(λ^R_i, λ^R_{−i}),

be a best response of player i to the review admission rates of the other players λ^R_{−i}. A PNE exists if and only if there exists an invariant strategy profile λ^{R*} = {λ^{R*}_i}_{i∈N} ∈ S such that λ^{R*}_i = b_i(λ^{R*}_{−i}), for each i ∈ N.

7.2 Existence and Uniqueness of PNE

In this section, we study the existence and uniqueness of the PNE for the CPR game Γ under the system constraint (7.3). Each player i ∈ N chooses a review admission rate from her strategy set S_i = [0, µ^R_i] and receives an expected utility ũ_i given by (7.8). For any given λ^R_{−i} ∈ S_{−i}, we obtain an upper bound λ̄^R_i : S_{−i} → S_i on λ^R_i defined by

λ̄^R_i = { 0, if Λ_i < 0;   Λ_i, if 0 ≤ Λ_i ≤ µ^R_i;   µ^R_i, if Λ_i > µ^R_i },

where Λ_i := (µ^S_T − Σ_{j∈N, j≠i} a_j λ^R_j) / a_i, such that for λ^R_i ∈ [0, λ̄^R_i) ⊂ S_i, constraint (7.3) is automatically satisfied, and for λ^R_i ∈ (λ̄^R_i, µ^R_i] ⊂ S_i, constraint (7.3) is violated. For λ^R_i = λ̄^R_i, constraint (7.3) is satisfied if λ̄^R_i ∈ (0, µ^R_i], and is violated if λ̄^R_i = 0.

Figure 7.2 Constraint probability of player i as λ^R_i varies from 0 to µ^R_i. a) For λ̄^R_i = 0, p_i(λ^R_i, ·) = 1 for all λ^R_i ∈ S_i; b) for λ̄^R_i ∈ (0, µ^R_i), p_i(λ^R_i, ·) < 1 and convex for λ^R_i ∈ [0, λ̄^R_i), with p_i(λ^R_i, ·) → 1 as λ^R_i → λ̄^R_i and p_i(λ^R_i, ·) = 1 for λ^R_i ∈ [λ̄^R_i, µ^R_i]; and c) for λ̄^R_i = µ^R_i, p_i(λ^R_i, ·) is convex in λ^R_i ∈ S_i.

We study the properties of game Γ under the following assumptions. Recall that x = λ^S_T − λ^R_T = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i.

(A1) For a given λ^R_{−i} ∈ S_{−i}, i ∈ N, we assume that the rate of return r^R(λ^R_i, ·) for reviewing the tasks is continuously differentiable, strictly increasing, and strictly concave for λ^R_i ∈ S_i, with r^R(0, 0) = 0. Equivalently, x ↦ r^R(x) is continuously differentiable, strictly decreasing, and strictly concave for x ∈ [0, µ^S_T], with r^R(µ^S_T) = 0.
(A2) For a given λ^R_{−i} ∈ S_{−i}, i ∈ N, we assume that the constraint probability p(λ^R_i, ·) (i) is continuous on S_i; (ii) is continuously differentiable, non-decreasing, and convex for λ^R_i ∈ [0, λ̄^R_i) ⊂ S_i; and (iii) is equal to 1 for λ^R_i ∈ (λ̄^R_i, µ^R_i]. See Fig. 7.2 for an illustration. Equivalently, p(x) is continuously differentiable, non-increasing, and convex for x ∈ (0, µ^S_T], and p(x) → 1 as x → 0. Furthermore, p = 1 for every x < 0.

(A3) We assume f_i(µ^R_i, 0) = r^R(µ^R_i, 0)(1 − p(µ^R_i, 0)) − h_i r^S > 0, for each i ∈ N, i.e., if no other player reviews any task, then each player i has a positive incentive to review tasks with maximum admission rate µ^R_i.

Remark 10. The rate of return r^R and the constraint probability p can be easily designed to accommodate (A1)-(A3). Under Assumptions (A1) and (A2), the incentive function f_i(λ^R_i, ·) is strictly concave in λ^R_i, which means that, for a fixed λ^R_{−i}, player i has a diminishing marginal incentive to review tasks. We make Assumption (A3) to provide positive incentives for players to review tasks at their maximum review admission rate µ^R_i if no other player chooses to review any task. We can design game Γ to satisfy Assumption (A3) by ensuring that the following conditions hold: (i) r^R(µ^R_i, 0) > r^S and µ^S_i ≤ µ^R_i, for each i ∈ N, and (ii) µ^R_i ≪ µ^S_T / a_i, or equivalently Σ_{j∈N, j≠i} µ^S_j ≫ µ^R_i, for each i ∈ N. If the latter condition holds, then x is large, and consequently, the constraint probability p(µ^R_i, 0) ≈ 0, for each i ∈ N. For most practical purposes, servicing a task requires more time than reviewing it, i.e., µ^S_i ≤ µ^R_i. Therefore, condition (i) can be easily satisfied by designing rewards such that r^R(µ^R_i, 0) > r^S, for each i ∈ N. If the total service admission rate of all the players except player i is much higher than the maximum review admission rate of player i, i.e., Σ_{j∈N, j≠i} µ^S_j ≫ µ^R_i, for each i ∈ N, then condition (ii) holds. Notice that for a large team of agents, where a single agent does not have much impact on the overall service rate, condition (ii) is true. We refer the reader to Section 7.5 for an example. □

Theorem 6 (Existence of PNE). The CPR game Γ, under Assumptions (A1)-(A3), admits a PNE.

Proof. See Appendix C: Chapter 7 for the proof.

Let f′_i(λ^R_i, λ^R_{−i}) be the first partial derivative of f_i(λ^R_i, λ^R_{−i}) with respect to λ^R_i. We now provide a corollary that characterizes a PNE of the CPR game Γ.

Corollary 1 (PNE). For the CPR game Γ, under Assumptions (A1)-(A3), the following statements hold for a PNE λ^{R*} = [λ^{R*}_1, . . . , λ^{R*}_N] with x* = µ^S_T − Σ_{i=1}^{N} a_i λ^{R*}_i:

(i) f′_i(λ^{R*}_i, λ^{R*}_{−i}) < 0 (or df_i/dx(x*) > 0) for every player;

(ii) λ^{R*}_i = 0 if and only if f_i(λ^{R*}_i, λ^{R*}_{−i}) ≤ 0; and

(iii) λ^{R*}_i is non-zero and satisfies the following implicit equation if and only if f_i(λ^{R*}_i, λ^{R*}_{−i}) = f_i(x*) > 0 at the PNE:

λ^{R*}_i = min{λ̃^{R*}_i, µ^R_i},     (7.10)

with λ̃^{R*}_i = −f_i(λ^{R*}_i, λ^{R*}_{−i}) / f′_i(λ^{R*}_i, λ^{R*}_{−i}) = f_i(x*) / (a_i df_i/dx(x*)).

Proof. See Appendix C: Chapter 7 for the proof.

Proposition 1 (Structure of PNE). For the CPR game Γ with players ordered in increasing order of h_i, let λ^{R*} = [λ^{R*}_1, λ^{R*}_2, . . . , λ^{R*}_N] be a PNE. Then, the following statements hold:

(i) If, for any player k_1, λ^{R*}_{k_1} < µ^R_{k_1}, then a_{k_1} λ^{R*}_{k_1} ≥ a_{k_2} λ^{R*}_{k_2} and λ^{R*}_{k_1} ≥ λ^{R*}_{k_2}, for each k_2 > k_1; and

(ii) if λ^{R*}_l = 0 for any l ∈ N, then λ^R_i = 0 for each i ∈ {j ∈ N | j ≥ l}.

Proof. See Appendix C: Chapter 7 for the proof.
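To see how the utility maximization underlying Corollary 1 plays out numerically, the following is a minimal sketch of a single player's best response: it maximizes the expected utility (7.8) over the strategy set [0, µ^R_i] by a simple grid search, using exponential forms for r^R and p of the kind adopted later in Section 7.5. The constants A and B, the reward r^S, and the grid size are arbitrary placeholders.

```python
import numpy as np

def expected_utility(i, lam_i, lam, mu_s, mu_r, r_s=1.0, A=5.0, B=0.5):
    """Expected utility (7.8) of player i, with assumed exponential forms
    r^R(x) = A(1 - exp(B(x - mu^S_T))) and p(x) = exp(-Bx) (p = 1 for x <= 0)."""
    h = mu_s / mu_r                       # heterogeneity measures h_j
    a = 1.0 + h                           # coefficients a_j
    mu_s_total = mu_s.sum()
    lam_all = lam.copy()
    lam_all[i] = lam_i
    x = mu_s_total - np.dot(a, lam_all)   # slackness parameter (7.4)
    r_ret = A * (1.0 - np.exp(B * (x - mu_s_total)))
    p = 1.0 if x <= 0 else np.exp(-B * x)
    f_i = r_ret * (1.0 - p) - h[i] * r_s  # incentive (7.9)
    return mu_s[i] * r_s + lam_i * f_i

def best_response(i, lam, mu_s, mu_r, grid=2001, **kwargs):
    """Best response of player i over S_i = [0, mu^R_i], found by grid search."""
    candidates = np.linspace(0.0, mu_r[i], grid)
    utilities = [expected_utility(i, c, lam, mu_s, mu_r, **kwargs) for c in candidates]
    return candidates[int(np.argmax(utilities))]
```

Iterating best_response sequentially over the players gives the sequential best response dynamics discussed in Section 7.3.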
It follows from Proposition 1 that the review admission rate of a player at a PNE is monotonically decreasing with the ratio h_i. Therefore, at a PNE, as the heterogeneity in terms of h_i among the players becomes very large, players with small (respectively, large) h_i review tasks with a high (respectively, zero) review admission rate. We will show in Lemma 12 that the PNE shares these characteristics with the social welfare solution, which we define in Section 7.4. We illustrate this further in Section 7.5.

Theorem 7 (Uniqueness of PNE). The PNE admitted by the CPR game Γ, under Assumptions (A1)-(A3), is unique.

Proof. See Appendix C: Chapter 7 for the proof.

7.3 Convergence to the Nash Equilibrium

We now show that the proposed CPR game Γ under Assumptions (A1)-(A3) belongs to the class of quasi-aggregative games [118], as defined below.

Definition 3 (Quasi-aggregative game). Consider a set of players N, where each player i ∈ N has a strategy set S_i and a utility function u_i. Let S = Π_{i∈N} S_i be the joint strategy space of all the players, and S_{−i} = Π_{j∈N, j≠i} S_j be the joint strategy space of all the players except player i. A game Γ = (N, {S_i}_{i∈N}, {u_i}_{i∈N}) is a quasi-aggregative game with aggregator g : S → ℝ if there exist continuous functions F_i : ℝ × S_i → ℝ (the shift functions) and σ_i : S_{−i} → X_{−i} ⊆ ℝ, i ∈ N (the interaction functions), such that the utility function u_i of each player i ∈ N can be written as

u_i(s) = ũ_i(σ_i(s_{−i}), s_i),     (7.11)

where ũ_i : X_{−i} × S_i → ℝ, and

g(s) = F_i(σ_i(s_{−i}), s_i),  for all s ∈ S and i ∈ N.     (7.12)

An alternative, but less general, way of defining a quasi-aggregative game replaces (7.11) in the definition with

u_i(s) = ū_i(g(s), s_i),     (7.13)

where ū_i : X × S_i → ℝ, and X = {g(s) | s ∈ S} ⊆ ℝ.

For the CPR game Γ, let σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j and g(λ^R) = F_i(σ_i(λ^R_{−i}), λ^R_i) = Σ_{j=1, j≠i}^{N} a_j λ^R_j + a_i λ^R_i be the interaction functions and shift functions, respectively. The expected utility ũ_i, which is defined in (7.8), can be re-written in the form

ũ_i(λ^R_i, λ^R_{−i}) = ũ_i(σ_i(λ^R_{−i}), λ^R_i).     (7.14)

Hence, the CPR game Γ is a quasi-aggregative game. Specializing [118, Theorem 1] to the CPR game Γ, we obtain that if the best response of every player is non-increasing in the interaction function σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j, then the CPR game Γ is a best response pseudo-potential game [119], as defined below.

Definition 4 (Best response (pseudo-)potential game). A game Γ = (N, {S_i}_{i∈N}, {ũ_i}_{i∈N}) is a best response pseudo-potential game if there exists a continuous function ϕ : S → ℝ such that, for every i ∈ N,

b_i(λ^R_{−i}) ⊇ argmax_{λ^R_i ∈ S_i} ϕ(λ^R_i, λ^R_{−i}),

where b_i(λ^R_{−i}) is the best response of player i to the review admission rates of the other players λ^R_{−i}. Furthermore, if

b_i(λ^R_{−i}) = argmax_{λ^R_i ∈ S_i} ϕ(λ^R_i, λ^R_{−i}),

then the game Γ is a best response potential game.

We now establish that the best response for each player is non-increasing in σ_i.

Lemma 11 (Non-increasing best response). For the CPR game Γ, under Assumptions (A1)-(A2), the best response mapping b_i(λ^R_{−i}) is non-increasing in σ_i(λ^R_{−i}), for each i ∈ N, where σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j.

Proof. See Appendix C: Chapter 7 for the proof.

Furthermore, Remark 1 in [78] states that a best response pseudo-potential game with a unique best response is an instance of a best response potential game [77].
Therefore, the CPR game Γ, with its unique (Lemma 16) and non-increasing best response b_i in σ_i(λ^R_{−i}) (Lemma 11), is a best response potential game. Hence, simple best response dynamics, such as sequential best response dynamics [78] and simultaneous best response dynamics [79], converge to the unique PNE.

7.4 Social Welfare and Inefficiency of PNE

In this section, we characterize the social welfare solution and provide analytic upper bounds on inefficiency measures for the PNE.

7.4.1 Social Welfare

Social welfare corresponds to the optimal (centralized) allocation by players with respect to a social welfare function. To characterize the effect of self-interested optimization of each agent, we compare the decentralized solution (PNE of the CPR game) with the centralized optimal solution (social welfare). We choose a typical social welfare function Ψ(λ^R) : S → ℝ defined by the sum of the expected utilities of all players, i.e.,

Ψ = Σ_{i=1}^{N} ũ_i = Σ_{i=1}^{N} [µ^S_i r^S + λ^R_i f_i(x)] = µ^S_T r^S + λ^R_T r^R(x)(1 − p(x)) − r^S Σ_{i=1}^{N} h_i λ^R_i = (λ^R_T + x) r^S + λ^R_T r^R(x)(1 − p(x)).     (7.15)

A social welfare solution is an optimal allocation that maximizes the social welfare function.

Lemma 12 (Social welfare solution). For the CPR game Γ with constraint Σ_{i=1}^{N} a_i λ^R_i = c, for any given c ∈ ℝ_{≥0}, and players ordered in increasing order of h_i, the associated social welfare solution λ̄^R ∈ S is given by

λ̄^R = [µ^R_1, µ^R_2, . . . , µ^R_{k−1}, (1/a_k)(c − Σ_{i=1}^{k−1} a_i µ^R_i), 0, . . . , 0],

where k is the smallest index such that Σ_{i=1}^{k−1} a_i µ^R_i ≤ c < Σ_{i=1}^{k} a_i µ^R_i. Furthermore, since Σ_{i=1}^{N} a_i λ^R_i ∈ [0, µ^S_T + µ^R_T], a bisection algorithm can be employed to compute the optimal c and, hence, the optimal social welfare solution.

Proof. Under the constraint Σ_{i=1}^{N} a_i λ^R_i = c (equivalently, x = µ^S_T − c), Ψ is a strictly increasing function of λ^R_T. Therefore, for a fixed Σ_{i=1}^{N} a_i λ^R_i = c, λ^R_T is maximized by selecting the k − 1 players with the smallest a_i's (equivalently, h_i) to operate at their highest review admission rates, where the value of k is selected such that Σ_{i=1}^{k−1} a_i µ^R_i ≤ c < Σ_{i=1}^{k} a_i µ^R_i. Finally, the k-th player in the ordered sequence is selected to operate at a review admission rate such that the constraint Σ_{i=1}^{N} a_i λ^R_i = Σ_{i=1}^{k} a_i λ^R_i = c is satisfied. Therefore, the social welfare solution is of the form

λ̄^R = [µ^R_1, µ^R_2, . . . , µ^R_{k−1}, (1/a_k)(c − Σ_{i=1}^{k−1} a_i µ^R_i), 0, . . . , 0].

Furthermore, for functions r^R(x) and p(x) satisfying Assumptions (A1)-(A2), Ψ is strictly concave in x, i.e., ∂²Ψ/∂x² = λ^R_T d²f_i/dx² < 0 (Lemma 15). With the known form of the social welfare solution, the value of c, which corresponds to the unique maximizer x of Ψ, can be computed efficiently by employing a bisection algorithm [120].

7.4.2 Inefficiency of the PNE

We consider three measures of inefficiency for the PNE: a) the price of anarchy (PoA), b) the ratio of total review admission rates (η_TRI), and c) the ratio of latencies (η_LI), which are defined by

PoA = (Ψ)_SW / (Ψ)_PNE,   η_TRI = (λ^R_T)_SW / (λ^R_T)_PNE,   η_LI = (Σ_{i=1}^{N} a_i λ^R_i)_PNE / (Σ_{i=1}^{N} a_i λ^R_i)_SW,

respectively. While PoA is a widely used measure of inefficiency, η_TRI and η_LI capture the inefficiency of the PNE based on the total review admission rate and the latency (inverse of throughput), respectively.
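The allocation structure in Lemma 12 and the three inefficiency ratios above are straightforward to compute. The following is a minimal sketch under the assumption that the player parameters are already sorted by h_i; all inputs are hypothetical, and this is not the code used for the results in Section 7.5.

```python
import numpy as np

def social_welfare_allocation(c, mu_r, a):
    """Greedy allocation of Lemma 12 for a fixed budget c = sum_i a_i * lambda^R_i.

    Players are assumed ordered in increasing h_i (hence increasing a_i = 1 + h_i):
    fill the smallest-a_i players at their maximum review rate, give the k-th player
    a partial allocation, and set the remaining entries to zero.
    """
    lam = np.zeros(len(mu_r))
    remaining = float(c)
    for i, (mr, ai) in enumerate(zip(mu_r, a)):
        if remaining >= ai * mr:
            lam[i] = mr
            remaining -= ai * mr
        else:
            lam[i] = remaining / ai
            break
    return lam

def inefficiency_ratios(welfare_sw, welfare_pne, lam_sw, lam_pne, a):
    """PoA, eta_TRI, and eta_LI as defined in Section 7.4.2."""
    poa = welfare_sw / welfare_pne
    eta_tri = lam_sw.sum() / lam_pne.sum()
    eta_li = np.dot(a, lam_pne) / np.dot(a, lam_sw)
    return poa, eta_tri, eta_li
```

The outer search over c ∈ [0, µ^S_T + µ^R_T] that maximizes Ψ can then be carried out by the bisection argument referenced in the proof of Lemma 12, or by any scalar optimizer.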
Since incentivizing team collaboration is of interest, all three measures capture the inefficiency of the PNE well. We now provide an analytic upper bound for each of these measures of inefficiency for the PNE. To this end, we assume that min_i{µ^S_i} > µ^S_T h_N / (N(1 + h_N)). For scenarios wherein servicing a task requires much more time than reviewing it, i.e., µ^S_N ≪ µ^R_N (h_N → 0), the assumption reduces to min_i{µ^S_i} > 0.

Theorem 8 (Analytic bounds on PNE inefficiency). For the CPR game Γ, under Assumptions (A1)-(A3), and min_i{µ^S_i} > µ^S_T h_N / (N(1 + h_N)), the inefficiency metrics for the PNE are upper bounded by

PoA < µ^S_T a_N / (µ^S_T − x),   η_TRI < µ^S_T a_N / ((µ^S_T − x) a_1),   η_LI < µ^S_T / (µ^S_T − x),     (7.16)

where x is the unique maximizer of f_i, i.e., df_i/dx(x) = 0.

Proof. See Appendix C: Chapter 7 for the proof.

Figure 7.3 Social welfare solution (SW) and pure Nash equilibrium for a) low and b) high heterogeneity among players, respectively. Red circles show the maximum review admission rate (µ^R_i) for player i.

Example 1: We show analytic upper bounds on inefficiency measures for the PNE for a specific class of exponential functions r^R(x) = A[1 − exp{B(x − µ^S_T)}] and p(x) = exp(−Bx), where A and B are positive constants, and x ∈ [0, µ^S_T]. Setting df_i/dx(x) = 0, we obtain x = µ^S_T / 2. Using Theorem 8, we get PoA < 2a_N, η_TRI < 2a_N / a_1, and η_LI < 2. For µ^S_N ≪ µ^R_N, PoA < 2a_N → 2, and η_TRI < 2a_N / a_1 < 2a_N → 2.

7.5 Numerical Illustrations

In this section, we present numerical examples illustrating the uniqueness of the PNE and the variation of inefficiency with increasing heterogeneity among the players. In our numerical illustrations, we obtain the PNE by simulating the sequential best response dynamics of the players with randomized initialization of their strategies. We verify the uniqueness of the PNE for different choices of functions r^R(x) and p(x) satisfying Assumptions (A1)-(A2), and by following sequential best response dynamics with multiple random initializations of each player's strategy. Furthermore, in our numerical simulations, we relax Assumption (A3) and still obtain a unique PNE. An example illustration is shown in Fig. 7.3, where we show the social welfare solution (obtained using fmincon in MATLAB) and the PNE for low and high heterogeneity, in terms of variation in h_i among the players, respectively. For our numerical illustrations, we choose the number of players N = 6, and choose the functions r^R(x) and p(x), satisfying Assumptions (A1)-(A2), as follows:

r^R(λ^R_i, λ^R_{−i}) = r^R(x) = 5[1 − exp{0.5(x − µ^S_T)}],

p^R_i(λ^R_i, λ^R_{−i}) = p(x) = { 1, if x ≤ 0;   exp(−0.5x), otherwise },

where x = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i is the slackness parameter.

Figure 7.4 Empirical a) PoA, b) η_TRI, and c) η_LI, along with analytic upper bounds for d) PoA, e) η_TRI, and f) η_LI, with increasing heterogeneity (ρ) among the agents.

To characterize the heterogeneity among the players, we sample each player's maximum service admission rate µ^S_i and maximum review admission rate µ^R_i at random from normal distributions with fixed means M_{µ^S} ∈ ℝ_{>0} and M_{µ^R} ∈ ℝ_{>0}, and identical standard deviation ρ ∈ ℝ_{>0}. We only consider realizations that satisfy µ^S_i ≤ µ^R_i for all the players, and hence h_i ≤ 1. For most practical purposes, where servicing a task requires much more time than reviewing it, the assumption µ^S_i ≤ µ^R_i holds true. Any non-positive realizations were discarded.
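A minimal simulation sketch of this setup is given below, assuming the exponential r^R and p above and a simple grid-search best response; the means, spread, tolerance, and random seed are placeholder values, not those used to generate Fig. 7.3 and 7.4.

```python
import numpy as np

def pne_by_sequential_best_response(mu_s, mu_r, r_s=1.0, A=5.0, B=0.5,
                                    grid=1000, tol=1e-9, max_sweeps=500, rng=None):
    """Sequential best response dynamics for the CPR game with exponential r^R and p."""
    rng = np.random.default_rng() if rng is None else rng
    h = mu_s / mu_r
    a = 1.0 + h
    mu_s_total = mu_s.sum()
    lam = rng.uniform(0.0, mu_r)                       # randomized initial strategies

    def utility(i, lam_i):
        lam_tmp = lam.copy()
        lam_tmp[i] = lam_i
        x = mu_s_total - np.dot(a, lam_tmp)            # slackness parameter (7.4)
        p = 1.0 if x <= 0 else np.exp(-B * x)
        r_ret = A * (1.0 - np.exp(B * (x - mu_s_total)))
        return mu_s[i] * r_s + lam_i * (r_ret * (1.0 - p) - h[i] * r_s)

    for _ in range(max_sweeps):
        shift = 0.0
        for i in range(len(mu_s)):
            cand = np.linspace(0.0, mu_r[i], grid)
            best = cand[int(np.argmax([utility(i, c) for c in cand]))]
            shift = max(shift, abs(best - lam[i]))
            lam[i] = best
        if shift < tol:                                # no player changes her strategy
            break
    return lam

# Hypothetical heterogeneous team of N = 6 players (means and spread are placeholders);
# draws with mu_s > mu_r would be rejected in the setup described above, omitted here.
rng = np.random.default_rng(1)
mu_s = np.abs(rng.normal(1.0, 0.3, size=6))
mu_r = np.abs(rng.normal(2.0, 0.3, size=6))
print(pne_by_sequential_best_response(mu_s, mu_r, rng=rng))
```

Running the sweep from several random initializations and checking that the returned profiles coincide mirrors the uniqueness verification described above.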
We consider the standard deviation of the distributions as the measure of heterogeneity among the players. Fig. 7.3 shows that, in the social welfare solution, players with a low ratio h_i review the tasks at the maximum review admission rate and players with a high ratio h_i drop out of the game. At the PNE, the strategy profile of the players follows the characteristics described by Proposition 1. Lastly, with the increase in heterogeneity among the players, the PNE starts to approach the social welfare solution. Figs. 7.4a-7.4c and Figs. 7.4d-7.4f show the variation of the different measures of inefficiency for the PNE, and their corresponding analytic upper bounds (see Theorem 8), with increasing heterogeneity among the players. Fig. 7.4a shows the plot of PoA with increasing heterogeneity. In the case of homogeneous players, i.e., ρ = 0, we obtain PoA = 1, which we establish in Lemma 19. As we initially increase the heterogeneity among the players, the PNE starts to deviate from the social welfare solution, resulting in an increase in the PoA. We note that PoA ≤ 1.15, suggesting that the unique PNE is close to the optimal centralized social welfare solution. Figs. 7.4b and 7.4c show η_TRI and η_LI, which are other relevant measures of inefficiency for our problem. It is evident from Fig. 7.4 that all three measures of inefficiency are close to 1, suggesting a near-optimal PNE solution.

7.6 Conclusions and Future Directions

We studied incentive design mechanisms to facilitate collaboration in a team of heterogeneous agents that is collectively responsible for servicing and subsequently reviewing a stream of homogeneous tasks. The heterogeneity among the agents is based on their skill-sets and is characterized by their mean service time and mean review time. To incentivize collaboration in the heterogeneous team, we designed a Common-Pool Resource (CPR) game with appropriate utilities and showed the existence of a unique PNE. We showed that the proposed CPR game is an instance of a best response potential game, and that by playing the sequential best response against each other, the players converge to the unique PNE. We characterized the structure of the PNE and showed that, at the PNE, the review admission rate of the players decreases with the increasing ratio h_i = µ^S_i / µ^R_i, i.e., the review admission rate is higher for the players that are "better" at reviewing the tasks than servicing them (as characterized by their average service and review times). Furthermore, we considered three different inefficiency metrics for the PNE, including the Price of Anarchy (PoA), and provided an analytic upper bound for each metric. Additionally, we provided numerical evidence of their proximity to unity, i.e., the unique PNE is close to the optimal centralized social welfare solution. There are several possible avenues of future research. It is of interest to extend the results to a broader class of games with less restrictive choices of utility functions, i.e., games that are not quasi-aggregative or commonly used games of weak strategic substitutes (WSTS) [78] or complements (WSTC) [78]. An interesting open problem is to consider a team of agents processing a stream of heterogeneous tasks. In such a setting, incentivizing team collaboration based on the task-dependent skill-sets of the agents is also of interest.
CHAPTER 8
CONCLUSIONS

In this dissertation, we focused on human-in-the-loop systems that suffer from inherent variability of human performance, which depends on various factors such as cognitive state, task learning, expertise, and individual differences. We explored optimal and game-theoretic feedback mechanisms for improving human performance in human-supervised autonomy. To this end, we first studied the problem of optimal fidelity selection for effective management of human cognitive resources. We assumed known models for the human service time distribution and formulated the problem as a semi-Markov decision process (SMDP). We solved the SMDP to obtain the optimal fidelity selection policy and studied the structural properties of the optimal policy. Next, we conducted human experiments on optimal fidelity selection to study the effect of the optimal policy on human performance. We treated the human cognitive state as a hidden state and modeled the cognitive dynamics using an Input-Output Hidden Markov Model (IOHMM). We utilized the trained IOHMM model to formulate a Partially Observable Markov Decision Process (POMDP). We solved the POMDP to obtain the optimal fidelity selection policy and showed that the optimal policy significantly improves human performance. Next, we extended the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. We designed a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while a robust policy optimizes worst-case performance and consequently avoids these regions. To address this trade-off, we introduced an efficient Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm for model-based reinforcement learning (RL). DSEE interleaves exploration and exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. Next, we focused on enhancing human performance through task learning and skill development. In this context, we studied the impact of evaluative feedback on human learning in sequential decision-making tasks. We conducted experiments on Amazon Mechanical Turk, where participants engaged with the Tower of Hanoi puzzle and received AI-generated feedback during their problem-solving. We examined how this feedback influenced their learning and skill transfer to related tasks. Additionally, we explored computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expanded our focus from a single human operator to a team of heterogeneous agents, each with diverse skill sets and expertise. Within this context, we delved into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leveraged a game-theoretic framework, where we designed utility functions to incentivize decentralized collaboration among these agents. We showed the existence of a unique Pure Nash Equilibrium (PNE) and established the convergence of the best response dynamics to this unique PNE.
Additionally, we established an analytical upper bound on measures of PNE inefficiency, shedding light on the effectiveness of our collaborative strategies. 114 BIBLIOGRAPHY [1] [2] P. Gupta, D. Isele, D. Lee, and S. Bae, “Interaction-aware trajectory planning for autonomous vehicles with analytic integration of neural networks into model predictive control,” in IEEE International Conference on Robotics and Automation, pp. 7794– 7800, 2023. I. R. Nourbakhsh, K. Sycara, M. Koes, M. Yong, M. Lewis, and S. Burion, “Human- robot teaming for search and rescue,” IEEE Pervasive Computing, vol. 4, no. 1, pp. 72– 79, 2005. [3] M. A. Goodrich, J. L. Cooper, J. A. Adams, C. Humphrey, R. Zeeman, and B. G. Buss, “Using a mini-UAV to support wilderness search and rescue: Practices for human-robot teaming,” in IEEE International Workshop on Safety, Security and Rescue Robotics, pp. 1–6, IEEE, 2007. [4] [5] [6] [7] [8] [9] A. Rosenfeld, N. Agmon, O. Maksimov, and S. Kraus, “Intelligent agent supporting human–multi-robot team collaboration,” Artificial Intelligence, vol. 252, pp. 211–231, 2017. S. A. Seshia, D. Sadigh, and S. S. Sastry, “Formal methods for semi-autonomous driv- ing,” in 52nd Design Automation Conference, pp. 1–5, IEEE, June 2015. P. Gupta, D. Isele, D. Lee, and S. Bae, “Interaction-aware trajectory planning for autonomous vehicles with analytic integration of neural networks into model predictive control,” arXiv preprint arXiv:2301.05393, 2023. A. M. Okamura, “Methods for haptic feedback in teleoperated robot-assisted surgery,” Industrial Robot: An International Journal, vol. 31, no. 6, pp. 499–508, 2004. J. Heard, C. E. Harriott, and J. A. Adams, “A survey of workload assessment algo- rithms,” IEEE Transactions on Human-Machine Systems, vol. 48, no. 5, pp. 434–451, 2018. J. Peters, V. Srivastava, G. Taylor, A. Surana, M. P. Eckstein, and F. Bullo, “Human supervisory control of robotic teams: Integrating cognitive modeling with engineering design,” IEEE Control System Magazine, vol. 35, no. 6, pp. 57–80, 2015. [10] J. R. Peters, A. Surana, and F. Bullo, “Robust scheduling and routing for collaborative human/unmanned aerial vehicle surveillance missions,” Journal of Aerospace Informa- tion Systems, pp. 1–19, 2018. [11] K. Savla and E. Frazzoli, “A dynamical queue approach to intelligent task management for human operators,” Proceedings of the IEEE, vol. 100, no. 3, pp. 672–686, 2012. [12] V. Srivastava, R. Carli, C. Langbort, and F. Bullo, “Attention allocation for decision making queues,” Automatica, vol. 50, no. 2, pp. 378–388, 2014. 115 [13] Q. Hu and W. Yue, Markov Decision Processes with their Applications, vol. 14. Springer Science & Business Media, 2007. [14] M. Lin, R. J. La, and N. C. Martins, “Stabilizing a queue subject to activity-dependent server performance,” IEEE Transactions on Control of Network Systems, vol. 8, no. 4, pp. 1579–1591, 2021. [15] M. Lin, N. C. Martins, and R. J. La, “Queueing subject to action-dependent server performance: Utilization rate reduction,” arXiv preprint arXiv:2002.08514, 2020. [16] R. M. Yerkes and J. D. Dodson, “The relation of strength of stimulus to rapidity of habit-formation,” Journal of Comparative Neurology and Psychology, vol. 18, no. 5, pp. 459–482, 1908. [17] P. Gupta and V. Srivastava, “On robust and adaptive fidelity selection for human-in- the-loop queues,” in European Control Conference, pp. 872–877, 2021. [18] S. Stidham Jr and R. R. 
Weber, “Monotonic and insensitive optimal policies for control of queues with undiscounted costs,” Operations Research, vol. 37, no. 4, pp. 611–625, 1989. [19] L. I. Sennott, “Average cost semi-Markov decision processes and the control of queueing systems,” Probability in the Engineering and Informational Sciences, vol. 3, no. 2, pp. 247–272, 1989. [20] M. Agarwal, V. S. Borkar, and A. Karandikar, “Structural properties of optimal trans- mission policies over a randomly varying channel,” IEEE Transactions on Automatic Control, vol. 53, no. 6, pp. 1476–1491, 2008. [21] M. L. Puterman, “Markov decision processes,” Handbooks in Operations Research and Management Science, vol. 2, pp. 331–434, 1990. [22] R. Yang, S. Bhulai, and R. van der Mei, “Structural properties of the optimal resource allocation policy for single-queue systems,” Annals of Operations Research, vol. 202, no. 1, pp. 211–233, 2013. [23] C. C. White III and H. K. Eldeib, “Markov decision processes with imprecise transition probabilities,” Operations Research, vol. 42, no. 4, pp. 739–749, 1994. [24] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, “Bias and variance approximation in value function estimates,” Management Science, vol. 53, no. 2, pp. 308–322, 2007. [25] V. Borkar and R. Jain, “Risk-constrained Markov decision processes,” IEEE Transac- tions on Automatic Control, vol. 59, no. 9, pp. 2574–2579, 2014. [26] R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of Risk, vol. 2, pp. 21–42, 2000. [27] E. Delage and S. Mannor, “Percentile optimization for Markov decision processes with parameter uncertainty,” Operations Research, vol. 58, no. 1, pp. 203–213, 2010. 116 [28] H. N. Nguyen, Chance-Constrained Optimization: Applications in Game Theory and Markov Decision Processes. PhD thesis, Université Paris-Saclay, 2023. [29] W. Wiesemann, D. Kuhn, and B. Rustem, “Robust Markov decision processes,” Math- ematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013. [30] L. F. Bertuccelli, A. Wu, and J. P. How, “Robust adaptive Markov decision processes: Planning with model uncertainty,” IEEE Control Systems Magazine, vol. 32, no. 5, pp. 96–109, 2012. [31] R. S. Sutton and A. G. Barto, Reinforcement Learning, Second Edition: An Introduc- tion. MIT Press, Nov. 2018. [32] G. N. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005. [33] A. Nilim and L. El Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005. [34] V. Gullapalli and A. G. Barto, “Convergence of indirect adaptive asynchronous value iteration algorithms,” in Advances in Neural Information Processing Systems, pp. 695– 702, 1994. [35] J. Xin, H. Zhao, D. Liu, and M. Li, “Application of deep reinforcement learning in mobile robot path planning,” in 2017 Chinese Automation Congress, pp. 7112–7116, IEEE, 2017. [36] P. Gupta, D. Coleman, and J. E. Siegel, “Towards safer self-driving through great PAIN (Physically Adversarial Intelligent Networks),” arXiv preprint arXiv:2003.10662, 2020. [37] A. G. Barto, P. S. Thomas, and R. S. Sutton, “Some recent applications of rein- forcement learning,” in Proceedings of the Eighteenth Yale Workshop on Adaptive and Learning Systems, 2017. [38] S. Ferretti, S. Mirri, C. Prandi, and P. Salomoni, “Automatic web content personal- ization through reinforcement learning,” Journal of Systems and Software, vol. 121, pp. 157–169, 2016. [39] P. 
Gupta and V. Srivastava, “Structural properties of optimal fidelity selection policies for human-in-the-loop queues,” Automatica, vol. 159, p. 111388, 2024. Extended version available at: arXiv preprint arXiv: 2201.09990. [40] X. Zhou, H. Yue, T. Chai, and B. Fang, “Supervisory control for rotary kiln temperature based on reinforcement learning,” in Intelligent Control and Automation, pp. 428–437, Springer, 2006. [41] P. Gupta and V. Srivastava, “Optimal fidelity selection for human-in-the-loop queues using semi-Markov decision processes,” in American Control Conference, pp. 5266– 5271, 2019. 117 [42] L. F. Bertuccelli, Robust Decision-Making with Model Uncertainty in Aerospace Sys- tems. PhD thesis, Massachusetts Institute of Technology, 2008. [43] A. Nilim and L. El Ghaoui, Robust Markov Decision Processes with Uncertain Transi- tion Matrices. PhD thesis, University of California, Berkeley, 2004. [44] C. J. C. H. Watkins, Learning from Delayed Rewards. King’s College, Cambridge United Kingdom, 1989. [45] P. Gupta, D. Coleman, and J. E. Siegel, “Towards Physically Adversarial Intelli- gent Networks (PAINs) for safer self-driving,” IEEE Control Systems Letters, vol. 7, pp. 1063–1068, 2022. [46] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015. [47] A. L. Strehl and M. L. Littman, “An analysis of model-based interval estimation for Markov decision processes,” Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309–1331, 2008. [48] S. M. Kakade, On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003. [49] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. 51, pp. 1563–1600, 2010. [50] T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge University Press, 2020. [51] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploita- tion for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013. [52] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013. [53] L. Wei and V. Srivastava, “On abruptly-changing and slowly-varying multiarmed ban- dit problems,” in American Control Conference, pp. 6291–6296, June 2018. [54] N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decentralized multiplayer multiarmed bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2016. [55] L. Wei, A. McDonald, and V. Srivastava, “Multi-robot Gaussian process estimation and coverage: Deterministic sequencing algorithm and regret analysis,” in IEEE In- ternational Conference on Robotics and Automation, pp. 9080–9085, 2021. 118 [56] C. Keser and R. Gardner, “Strategic behavior of experienced subjects in a common pool resource game,” International Journal of Game Theory, vol. 28, no. 2, pp. 241– 252, 1999. [57] A. R. Hota, S. Garg, and S. Sundaram, “Fragility of the commons under prospect- theoretic risk attitudes,” Games and Economic Behavior, vol. 98, pp. 135–164, 2016. [58] A. R. Hota and S. Sundaram, “Controlling human utilization of failure-prone systems via taxes,” IEEE Transactions on Automatic Control, vol. 66, no. 12, pp. 5772–5787, 2020. [59] E. Ostrom, R. 
Gardner, J. Walker, and J. Walker, Rules, Games, and Common-Pool Resources. University of Michigan Press, 1994. [60] P. Le Gall, “The theory of networks of single server queues and the tandem queue model,” International Journal of Stochastic Analysis, vol. 10, no. 4, pp. 363–381, 1997. [61] N. T. Thomopoulos, Fundamentals of Queuing Systems: Statistical Methods for Ana- lyzing Queuing Models. Springer Science & Business Media, 2012. [62] E. Altman, “Applications of dynamic games in queues,” in Advances in Dynamic Games, pp. 309–342, Springer, 2005. [63] L. Xia, “Service rate control of closed Jackson networks from game theoretic perspec- tive,” European Journal of Operational Research, vol. 237, no. 2, pp. 546–554, 2014. [64] J. R. Marden and J. S. Shamma, “Game theory and control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 105–134, 2018. [65] G. Arslan, J. R. Marden, and J. S. Shamma, “Autonomous vehicle-target assign- ment: A game-theoretical formulation,” Journal of Dynamic Systems Measurement and Control-Transactions of the ASME, vol. 129, 09 2007. [66] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, vol. 23. SIAM, 1999. [67] T. Roughgarden, “Intrinsic robustness of the price of anarchy,” in Annual ACM Sym- posium on Theory of Computing, pp. 513–522, 2009. [68] J. R. Marden and T. Roughgarden, “Generalized efficiency bounds in distributed re- source allocation,” IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 571– 584, 2014. [69] L. Deori, K. Margellos, and M. Prandini, “Price of anarchy in electric vehicle charging control games: When Nash equilibria achieve social welfare,” Automatica, vol. 96, pp. 150–158, 2018. [70] D. Paccagnan, R. Chandan, and J. R. Marden, “Utility design for distributed resource allocation - Part I: Characterizing and optimizing the exact price of anarchy,” IEEE Transactions on Automatic Control, pp. 1–1, 2019. 119 [71] P. Gupta and V. Srivastava, “Optimal fidelity selection for improved performance in human-in-the-loop queues for underwater search,” arXiv preprint arXiv:2311.06381, 2023. [72] S. C. Kramer and H. W. Sorenson, “Bayesian parameter estimation,” IEEE Transac- tions on Automatic Control, vol. 33, no. 2, pp. 217–222, 1988. [73] P. Gupta and V. Srivastava, “Deterministic sequencing of exploration and exploitation for reinforcement learning,” in 61st Conference on Decision and Control, pp. 2313– 2318, IEEE, 2022. [74] P. Gupta, S. Biswas, and V. Srivastava, “Fostering human learning in sequential decision-making: Understanding the role of evaluative feedback,” arXiv preprint arXiv:2311.03486, 2023. [75] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Achieving efficient collaboration in decentralized heterogeneous teams using common-pool resource games,” in 58th Con- ference on Decision and Control, pp. 6924–6929, IEEE, 2019. [76] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Incentivizing collaboration in hetero- geneous teams via common-pool resource games,” IEEE Transactions on Automatic Control, vol. 68, no. 3, pp. 1902–1909, 2022. [77] M. Voorneveld, “Best-response potential games,” Economics Letters, vol. 66, no. 3, pp. 289–295, 2000. [78] P. Dubey, O. Haimanko, and A. Zapechelnyuk, “Strategic complements and substitutes, and potential games,” Games and Economic Behavior, vol. 54, no. 1, pp. 77–94, 2006. [79] M. K. Jensen, “Stability of pure strategy Nash equilibrium in best-reply potential games,” University of Birmingham, Tech. Rep, 2009. [80] R. P. 
Rao, Brain-Computer Interfacing: An Introduction. Cambridge University Press, 2013. [81] O. Palinko, A. L. Kun, A. Shyrokov, and P. Heeman, “Estimating cognitive load using remote eye tracking in a driving simulator,” in Symposium on Eye-tracking Research & Applications, pp. 141–144, 2010. [82] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, “Learning policies for partially observable environments: Scaling up,” in Machine Learning Proceedings 1995, pp. 362– 370, Elsevier, 1995. [83] M. T. Spaan, “Partially observable Markov decision processes,” in Reinforcement Learning, pp. 387–414, Springer, 2012. [84] A. Diederich and J. R. Busemeyer, “Simple matrix methods for analyzing diffusion models of choice probability, choice response time, and simple response time,” Journal of Mathematical Psychology, vol. 47, no. 3, pp. 304–322, 2003. 120 [85] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learn- ing,” Discrete Event Dynamic Systems, vol. 13, no. 1-2, pp. 41–77, 2003. [86] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. [87] P. Gupta and V. Srivastava, “Structural properties of optimal fidelity selection policies for human-in-the-loop queues,” arXiv preprint arXiv:2201.09990, 2022. [88] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Incentivizing collaboration in hetero- geneous teams via common-pool resource games,” arXiv preprint arXiv: 1908.03938, Aug. 2019. [89] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in International Conference on Intelligent Robots and Systems, vol. 3, pp. 2149–2154, IEEE, 2004. [90] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: An open-source robot operating system,” in ICRA Workshop on Open Source Software, vol. 3, p. 5, Kobe, Japan, 2009. [91] Y. Bengio and P. Frasconi, “Input-output HMMs for sequence processing,” IEEE Trans- actions on Neural Networks, vol. 7, no. 5, pp. 1231–1249, 1996. [92] Y. Sakamoto, M. Ishiguro, and G. Kitagawa, “Akaike information criterion statistics,” Dordrecht, The Netherlands: D. Reidel, vol. 81, no. 10.5555, p. 26853, 1986. [93] A. A. Neath and J. E. Cavanaugh, “The Bayesian information criterion: Background, derivation, and applications,” Wiley Interdisciplinary Reviews: Computational Statis- tics, vol. 4, no. 2, pp. 199–203, 2012. [94] M. M. M. Manhães, S. A. Scherer, M. Voss, L. R. Douat, and T. Rauschenbach, “UUV simulator: A Gazebo-based package for underwater intervention and multi- robot simulation,” in OCEANS Monterey, pp. 1–8, 2016. [95] L. F. Bertuccelli and M. L. Cummings, “Operator choice modeling for collaborative uav visual search tasks,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 42, no. 5, pp. 1088–1099, 2012. [96] V. Srivastava, P. Holmes, and P. Simen, “Explicit moments of decision times for single- and double-threshold drift-diffusion processes,” Journal of Mathematical Psychology, vol. 75, no. 2016, pp. 96–109, 2016. Special Issue in Honor of R. Duncan Luce. [97] K. P. Murphy, “Conjugate Bayesian analysis of the Gaussian distribution,” tech. rep., University of British Columbia, BC, Oct 2007. [98] L. Li and M. Littman, “Prioritized sweeping converges to the optimal value function,” Tech. Rep. DCS-TR-631, Rutgers University, NJ, June 2008. 121 [99] S. 
Chib, “Markov chain Monte Carlo methods: computation and inference,” Handbook of Econometrics, vol. 5, pp. 3569–3649, 2001. [100] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge University Press, 2019. [101] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger, “Inequalities for the L1 deviation of the empirical distribution,” Hewlett-Packard Labs, Tech. Rep, 2003. [102] A. Mastin and P. Jaillet, “Loss bounds for uncertain transition probabilities in Markov decision processes,” in 51st IEEE Conference on Decision and Control, pp. 6708–6715, IEEE, 2012. [103] P. Gupta and V. Srivastava, “Deterministic sequencing of exploration and exploitation for reinforcement learning,” arXiv preprint arXiv: 2209.05408, Sept. 2022. [104] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher, “Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified,” arXiv preprint arXiv:1201.0559, 2012. [105] D. Aldous, L. Lovász, and P. Winkler, “Mixing times for uniformly ergodic Markov chains,” Stochastic Processes and their Applications, vol. 71, no. 2, pp. 165–185, 1997. [106] H. Chen and F. Zhang, “The expected hitting times for finite Markov chains,” Linear Algebra and its Applications, vol. 428, no. 11-12, pp. 2730–2749, 2008. [107] C. Szepesvári, Algorithms for Reinforcement Learning. Springer Nature, 2022. [108] M. Lopes, F. Melo, and L. Montesano, “Active learning for reward estimation in in- verse reinforcement learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31–46, Springer, 2009. [109] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.,” in AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008. [110] S. Levine, Z. Popovic, and V. Koltun, “Nonlinear inverse reinforcement learning with Gaussian processes,” Advances in Neural Information Processing Systems, vol. 24, 2011. [111] B. D. Ziebart, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010. [112] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Program- ming. John Wiley & Sons, 1994. [113] W. B. Knox and P. Stone, “Interactively shaping agents via human reinforcement: The TAMER framework,” in International Conference on Knowledge Capture, pp. 9– 16, 2009. 122 [114] W. B. Knox and P. Stone, “Augmenting reinforcement learning with human feedback,” in ICML Workshop on New Developments in Imitation Learning, vol. 855, p. 3, 2011. [115] W. B. Knox and P. Stone, “Reinforcement learning from simultaneous human and MDP reward.,” in AAMAS, vol. 1004, pp. 475–482, Valencia, 2012. [116] W. B. Knox and P. Stone, “Learning non-myopically from human-generated reward,” in International Conference on Intelligent User Interfaces, pp. 191–202, 2013. [117] C. G. Cassandras and S. Lafortune, Introduction to Discrete Event Systems. Springer Science & Business Media, 2009. [118] M. K. Jensen, “Aggregative games and best-reply potentials,” Economic Theory, vol. 43, no. 1, pp. 45–66, 2010. [119] B. Schipper, “Pseudo-potential games,” tech. rep., University of Bonn, Germany, 2004. Working paper. [120] R. Burden and J. Faires, “The bisection method,” Numerical Analysis, pp. 48–56, 2011. [121] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester, A Modern Intro- duction to Probability and Statistics: Understanding Why and How. Springer Science & Business Media, 2005. [122] A. 
[122] A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Springer, 2015.
[123] D. G. Luenberger and Y. Ye, Linear and Nonlinear Programming, vol. 2. Springer, 1984.
[124] C. Berge, Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces, and Convexity. Courier Corporation, 1997.

APPENDIX A: CHAPTER 2

Proof of Lemma 3

Let $w_k$ be the number of tasks that arrive during stage $k \in \{0, \dots, n-1\}$ with sojourn time $\tau_k$, in which the state transitions from $s_k = (q_k, cog_k)$ to $s_{k+1} = (q_{k+1}, cog_{k+1})$ and action $a_k$ is selected. Let $a_k$ be an optimal action at state $s_k$ and $\pi$ be the corresponding optimal policy such that $a_k = \pi(s_k)$. The optimal policy $\pi$, when applied from an initial state $s_0$, induces a sequence of states $\langle s_k \rangle$ and sojourn times $\langle \tau_k \rangle$, or equivalently $\zeta_{k+1}$, where $\zeta_{k+1} = \sum_{j=0}^{k} \tau_j$ and $\zeta_0 = 0$.

Similarly, let $\tilde{s}_0 = (\tilde{q}_0, cog_0)$ be another initial state with the same initial cognitive state, and $\tilde{q}_0 \geq q_0$. Apply a policy $\tilde{\pi}$ from the initial state $\tilde{s}_0$ such that $\tilde{\pi}(q, cog) = \pi(q + q_0 - \tilde{q}_0, cog)$ for any $(q, cog)$. Note that $\tilde{a}_0 = a_0$. The policy $\tilde{\pi}$, when applied from the initial state $\tilde{s}_0 = (\tilde{q}_0, cog_0)$, induces a sequence of realizations $\langle \tilde{s}_k \rangle$ and $\langle \tilde{\tau}_k \rangle$. Since the cognitive state and the sojourn time are independent of the current queue length, for the same action sequence applied from the initial states $s_0$ and $\tilde{s}_0$, the random process associated with the evolution of the cognitive state and sojourn time is almost surely the same except for the offset in the queue length. Hence, the probability of observing a sequence of realizations $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$, $\langle \tilde{a}_k \rangle$, and $\langle \tilde{\tau}_k \rangle$ when policy $\tilde{\pi}$ is applied from $\tilde{s}_0$ is equal to the probability of observing a sequence of realizations $\langle s_k = (q_k, cog_k) \rangle$, $\langle a_k \rangle$, and $\langle \tau_k \rangle$ when policy $\pi$ is applied from $s_0$, where $\tilde{q}_k - \tilde{q}_0 = q_k - q_0$, $\tilde{a}_k = a_k$, and $\tilde{\tau}_k = \tau_k$. Therefore, it is easy to show that

$$\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k] = \tilde{q}_0 - q_0. \quad (8.1)$$

Note that the realization of the sequence of actions $\langle a_k = \pi(s_k) \rangle$, which is optimal for $\langle s_k = (q_k, cog_k) \rangle$, might be sub-optimal for $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$. Recall that $\mathbb{E}_{\pi}[\cdot]$ and $\mathbb{E}_{\tilde{\pi}}[\cdot]$ represent $\mathbb{E}[\cdot \mid s_0, \pi]$ and $\mathbb{E}[\cdot \mid \tilde{s}_0, \tilde{\pi}]$, respectively. Let $\Delta q := \tilde{q}_0 - q_0$ and $Z := V^*_n(q_0, cog_0) - V^*_n(\tilde{q}_0, cog_0)$. We first show the upper bound on $Z$:

$$Z \leq V^*_n(q_0, cog_0) - J_{n,\tilde{\pi}}(\tilde{q}_0, cog_0) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k} R(s_k, a_k) - \gamma^{\zeta_n} C q_n\Big] - \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k} R(\tilde{s}_k, \tilde{a}_k) - \gamma^{\zeta_n} C \tilde{q}_n\Big]$$
$$= \mathbb{E}_{\pi}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k}\big\{r(a_k) - c\, \mathbb{E}[\tau_k \mid cog_k, a_k]\, q_k - \tfrac{c\lambda}{2} \mathbb{E}[\tau_k^2 \mid cog_k, a_k]\big\} - \gamma^{\zeta_n} C q_n\Big] - \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k}\big\{r(\tilde{a}_k) - c\, \mathbb{E}[\tau_k \mid cog_k, \tilde{a}_k]\, \tilde{q}_k - \tfrac{c\lambda}{2} \mathbb{E}[\tau_k^2 \mid cog_k, \tilde{a}_k]\big\} - \gamma^{\zeta_n} C \tilde{q}_n\Big]. \quad (8.2)$$
Using the statements of Lemma 2, the right-hand side of (8.2) is given by

$$\sum_{k=0}^{n-1} \mathbb{E}_{\pi}[\gamma^{\zeta_k} r(a_k)] - c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k\, \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k]\big] - \tfrac{c\lambda}{2} \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k^2\big] - C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n} \mathbb{E}_{\pi}[q_n \mid s_0, \zeta_n]\big] - \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}[\gamma^{\zeta_k} r(\tilde{a}_k)] + c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_k} \tau_k\, \mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k]\big] + \tfrac{c\lambda}{2} \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_k} \tau_k^2\big] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_n} \mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \zeta_n]\big]$$
$$\stackrel{(4^*)}{=} c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k]\big\}\big] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n} \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \zeta_n] - \mathbb{E}_{\pi}[q_n \mid s_0, \zeta_n]\big\}\big], \quad (8.3)$$

where $(4^*)$ follows by recalling that the probability of observing a sequence of realizations $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$, $\langle \tilde{a}_k \rangle$, and $\langle \tilde{\tau}_k \rangle$ when policy $\tilde{\pi}$ is applied from $\tilde{s}_0$ is equal to the probability of observing a sequence of realizations $\langle s_k = (q_k, cog_k) \rangle$, $\langle a_k \rangle$, and $\langle \tau_k \rangle$ when policy $\pi$ is applied from $s_0$, where $\tilde{q}_k - \tilde{q}_0 = q_k - q_0$, $\tilde{a}_k = a_k$, and $\tilde{\tau}_k = \tau_k$. Substituting (8.1) in (8.3), we get

$$Z \leq \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k\big] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n}\big]\Big\} \Delta q \stackrel{(5^*)}{=} \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k}\big]\, \mathbb{E}_{\pi}[\tau_k] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n}\big]\Big\} \Delta q \leq \Big\{c\, t_{\max} \sum_{k=0}^{n-1} \rho^k + C \rho^n\Big\} \Delta q, \quad (8.4)$$

where $(5^*)$ follows due to the independence of $\zeta_k = \sum_{i=0}^{k-1} \tau_i$ and $\tau_k$. Taking the infinite time limit in (8.4), we get $V^*(q_0, cog_0) - V^*(\tilde{q}_0, cog_0) \leq \frac{c\, t_{\max} \Delta q}{1 - \rho}$, for $\tilde{q}_0 \geq q_0$.

We now show the lower bound on $Z$. Let $\tilde{a}_k$ be optimal for $\tilde{s}_k = (\tilde{q}_k, cog_k)$, and choose $\langle a_k \rangle = \langle \tilde{a}_k \rangle$ for the sequence $\langle s_k = (q_k, cog_k) \rangle$, where $\tilde{s}_0 = (\tilde{q}_0, cog_0)$ and $s_0 = (q_0, cog_0)$. Note that (8.1) still holds. Analogous to (8.3), $Z$ is lower-bounded by

$$Z \geq J_{n,\pi}(q_0, cog_0) - V^*_n(\tilde{q}_0, cog_0) = c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_k} \tilde{\tau}_k \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \tilde{\zeta}_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \tilde{\zeta}_k]\big\}\big] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_n} \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \tilde{\zeta}_n] - \mathbb{E}_{\pi}[q_n \mid s_0, \tilde{\zeta}_n]\big\}\big]. \quad (8.5)$$

Substituting (8.1) in (8.5), we get

$$Z \geq \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_k}\big]\, \mathbb{E}_{\tilde{\pi}}[\tilde{\tau}_k] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_n}\big]\Big\} \Delta q \stackrel{(6^*)}{\geq} \Big\{c\, t_s \sum_{k=0}^{n-1} \gamma^{\mathbb{E}_{\tilde{\pi}}[\tilde{\zeta}_k]} + C \gamma^{\mathbb{E}_{\tilde{\pi}}[\tilde{\zeta}_n]}\Big\} \Delta q \geq \Big\{\frac{(1 - \gamma^{n t_{\max}})\, c\, t_s}{1 - \gamma^{t_{\max}}} + \gamma^{n t_{\max}} C\Big\} \Delta q, \quad (8.6)$$

where $(6^*)$ follows by applying Jensen's inequality [121] ($\mathbb{E}[g(x)] \geq g(\mathbb{E}[x])$) to the convex function $g(x) = \gamma^x$. Taking the infinite time limit in (8.6), we get $0 \leq \frac{c\, t_s \Delta q}{1 - \gamma^{t_{\max}}} \leq V^*(q_0, cog_0) - V^*(\tilde{q}_0, cog_0)$, for $\tilde{q}_0 \geq q_0$. $\square$

Proof of Lemma 4

Proof. We start by proving the first statement. In the following, we find conditions under which, if action N is the optimal choice at queue length $q_1$ for a given cognitive state $cog \leq cog^*$, then for all $q_2 > q_1$, N dominates H. Let N be the optimal action in state $s_1 = (q_1, cog)$. Let $F(s, a)$ denote the expected future reward received in state $s$ for taking action $a$ (the second term in the Bellman equation (2.3)). Then, we have

$$R(s_1, N) - R(s_1, H) + F(s_1, N) - F(s_1, H) > 0,$$
$$\implies M + \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V^*(cog', q_1 + q - 1)\big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big) > 0, \quad (8.7)$$

where $M := c\big(\mathbb{E}[\tau \mid cog, H] - \mathbb{E}[\tau \mid cog, N]\big) q_1 + r_N - r_H + \tfrac{c\lambda}{2}\big(\mathbb{E}[\tau^2 \mid cog, H] - \mathbb{E}[\tau^2 \mid cog, N]\big)$, and $\mathbb{P}(q_1 + q - 1 \mid q_1, \tau)$ is replaced by $\mathrm{Pois}(q \mid \tau)$, which is the Poisson probability of $q$ arrivals during service time $\tau$. Now, for the state $s_2 = (q_2, cog)$, with $q_2 > q_1$ and identical $cog$, under assumption (A1) we show that

$$R(s_2, N) - R(s_2, H) + F(s_2, N) - F(s_2, H) > 0. \quad (8.8)$$
The left-hand side of (8.8) is given by

$$X + M + \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V^*(cog', q_2 + q - 1)\big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big), \quad (8.9)$$

where $X := c\big(\mathbb{E}[\tau \mid cog, H] - \mathbb{E}[\tau \mid cog, N]\big)(q_2 - q_1)$. To show (8.8), we prove that the difference between the left-hand sides of (8.8) and (8.7) is positive. Subtracting the left-hand side of (8.7) from (8.9), we get

$$X - \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V_D\, \big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big), \quad (8.10)$$

where $V_D := V^*(cog', q_1 + q - 1) - V^*(cog', q_2 + q - 1)$. From Lemma 3, we know that

$$0 \leq \beta := \frac{c\, t_s (q_2 - q_1)}{1 - \gamma^{t_{\max}}} \leq V_D \leq \frac{c\, t_{\max} (q_2 - q_1)}{1 - \rho} =: \alpha.$$

Therefore, (8.10) is lower bounded by

$$X + \beta \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, \mathbb{P}(cog', \tau \mid cog, H) - \alpha \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, \mathbb{P}(cog', \tau \mid cog, N) \geq X + \beta\, \gamma^{\mathbb{E}[\tau \mid cog, H]} - \alpha\, \mathbb{E}[\gamma^{\tau} \mid cog, N], \quad (8.11)$$

where we utilized Jensen's inequality on the convex function $\gamma^{\tau}$ to obtain $\mathbb{E}[\gamma^{\tau} \mid cog, H] \geq \gamma^{\mathbb{E}[\tau \mid cog, H]}$. (8.11) is nonnegative when the condition in the first statement holds.

Now we prove the second statement. Using a similar analysis, it can be shown that if action S is the optimal choice at queue length $q_1$ for a given cognitive state $cog \leq cog^*$, then for every $q_2 > q_1$, S dominates H and N, respectively, under the following conditions:

$$\mathbb{E}[\tau \mid cog, H] - t_s + \frac{t_s\, \gamma^{\mathbb{E}[\tau \mid cog, H]}}{1 - \gamma^{t_{\max}}} \geq \frac{\gamma^{t_s}\, t_{\max}}{1 - \rho}, \quad (8.12)$$
$$\mathbb{E}[\tau \mid cog, N] - t_s + \frac{t_s\, \gamma^{\mathbb{E}[\tau \mid cog, N]}}{1 - \gamma^{t_{\max}}} \geq \frac{\gamma^{t_s}\, t_{\max}}{1 - \rho}, \quad (8.13)$$

respectively, where we have used $\mathbb{E}[\tau \mid cog, S] = t_s$ and $\mathbb{E}[\gamma^{\tau} \mid cog, S] = \gamma^{t_s}$ due to the constant time of skip. Since $\mathbb{E}[\tau \mid cog, H] > \mathbb{E}[\tau \mid cog, N]$, (8.12)-(8.13) can be combined to obtain the condition in the second statement under which action S dominates both H and N. $\square$

APPENDIX B: CHAPTER 4

Proof of Theorem 2

A key challenge we address is the time-dependence of the asynchronous adaptive Bellman updates, which adapt to the latest estimates of the service-time distribution. Using Lemmas 13 and 14, we upper-bound the difference between the optimal value functions for the intermediate SMDPs at subsequent time steps and the optimal value function for the uncertainty-free SMDP; these bounds are then used to establish the convergence result.

Theorem 9 (adapted from [122, Chapter 10]). $\mathcal{T}$ is a contraction mapping and therefore there exists a unique fixed point satisfying $\mathcal{T}(V^*) = V^*$, where $V^*$ is the optimal value function for the uncertainty-free SMDP $\Gamma$.

Let $V^*_t$ be the optimal value function for the SMDP $\hat{\Gamma}_t$ defined by the estimates $\hat{P}_t$ and $\hat{R}_t$. Therefore, $V^*_t = \hat{\mathcal{T}}_t(V^*_t)$. Let $\|\cdot\|$ be the max-norm given by $\|v\| = \max\{|v_1|, |v_2|, \dots, |v_n|\}$, for any vector $v = (v_1, v_2, \dots, v_n)$. Let $\tau_{\min} \geq 1$ be the minimum number of time steps spent in any state $s \in \mathcal{S}$, for any action $a \in \mathcal{A}_S$.

Lemma 13. For any state $s \in \mathcal{S}$, the following statements hold:
(i) $|\hat{\mathcal{T}}_t(V_1(s)) - \hat{\mathcal{T}}_t(V_2(s))| \leq \gamma^{\tau_{\min}} \|V_1 - V_2\|$ for any $s \in \mathcal{S}$, where the Bellman operator $\hat{\mathcal{T}}_t$ at any time $t$ is applied to value function estimates $V_1$ and $V_2$;
(ii) $|V_{t+1}(s) - V^*_t(s)| \leq \gamma^{\tau_{\min}} \|V_t - V^*_t\|$, if $s \in B_t$.

Proof. We prove the first statement. For any $s \in \mathcal{S}$, let $V := |\hat{\mathcal{T}}_t(V_1(s)) - \hat{\mathcal{T}}_t(V_2(s))|$. Therefore,

$$V \leq \max_{a \in \mathcal{A}_S} \Big| \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a)\big(V_1(s') - V_2(s')\big) \Big| \leq \|V_1 - V_2\| \sum_{\tau} \gamma^{\tau} \hat{P}_t(\tau \mid s, a) \leq \gamma^{\tau_{\min}} \|V_1 - V_2\|.$$

The second statement follows from the first by noting that $|V_{t+1}(s) - V^*_t(s)| = |\hat{\mathcal{T}}_t(V_t) - \hat{\mathcal{T}}_t(V^*_t)|$ if $s \in B_t$.
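The contraction in statement (i) of Lemma 13 is easy to check numerically. The following minimal Python sketch builds a small random SMDP kernel and verifies that the induced Bellman operator contracts by at least $\gamma^{\tau_{\min}}$ in the max-norm on random pairs of value vectors; the state and action counts, sojourn-time support, discount factor, rewards, and random kernel are illustrative assumptions only, not the model identified in the experiments.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy SMDP (illustrative sizes only).
n_states, n_actions = 6, 3
tau_support = np.array([2, 3, 4])          # sojourn times; tau_min = 2
gamma = 0.9

# Random joint kernel P(s', tau | s, a), normalized over (s', tau), and random rewards.
P = rng.random((n_states, n_actions, n_states, tau_support.size))
P /= P.sum(axis=(2, 3), keepdims=True)
R = rng.random((n_states, n_actions))

def bellman(V):
    """One application of the SMDP Bellman operator: max_a R(s,a) + E[gamma^tau V(s')]."""
    disc = gamma ** tau_support                       # gamma^tau for each sojourn time
    Q = R + np.einsum('sant,n,t->sa', P, V, disc)
    return Q.max(axis=1)

# Check max_s |T V1(s) - T V2(s)| <= gamma^{tau_min} ||V1 - V2||_inf on random instances.
tau_min = tau_support.min()
for _ in range(1000):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.abs(bellman(V1) - bellman(V2)).max()
    assert lhs <= gamma ** tau_min * np.abs(V1 - V2).max() + 1e-12
print("gamma^tau_min contraction verified on random instances")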
Lemma 14. Under Assumptions (A1)-(A5), for any given $\epsilon > 0$, there exists a time $\tilde{t}$ such that for any $t \geq \tilde{t}$, (i) $\|V^*_t - V^*\| \leq \epsilon$, and (ii) $\|V^*_{t+1} - V^*_t\| \leq 2\epsilon$, with probability 1.

Proof. Since the estimates $\hat{P}_t$ and $\hat{R}_t$ are assumed to be bounded at any time $t$ (Assumption (A2)), the value function estimate $V_t$ at any time $t$ also remains bounded. Furthermore, using Assumption (A3), $\hat{P}_t$ and $\hat{R}_t$ converge to their true values $P$ and $R$, respectively, with probability 1, i.e., for any $\epsilon_1, \epsilon_2 > 0$, there exists a time $\tilde{t}_0$ such that, for any $t \geq \tilde{t}_0$, $|\hat{R}_t - R| \leq \epsilon_1$ and $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$, where $\hat{p}^{ij}_t$ and $p^{ij}$ are the elements of $\hat{P}_t$ and $P$, respectively.

In asynchronous VI, only the values of the states $s \in B_t \subseteq \mathcal{S}$ are updated at any time $t$; however, each state is assumed to be updated infinitely often. Therefore, the sequence (4.6) converges with probability 1, i.e., for any $\epsilon_3 > 0$, there exists a time $\tilde{t}_1 \geq \tilde{t}_0$ such that $\|V_{t+1} - V_t\| \leq \epsilon_3$, for $t \geq \tilde{t}_1$.

Consider $s \in B_t$ such that $V_{t+1}(s) = \hat{\mathcal{T}}_t(V_t(s))$. Let $\mathcal{V}^*_t := \|V^*_t - V^*\|$, where we only consider the states $s \in B_t$ in the vectors $V^*_t$ and $V^*$. Therefore,

$$\mathcal{V}^*_t \leq \|V^*_t - V_{t+1}\| + \|V_{t+1} - V^*\| =: Z_1 + Z_2. \quad (8.14)$$

$Z_1 = \|V^*_t - V_{t+1}\|$ is upper bounded by

$$Z_1 \leq \|V^*_t - \hat{\mathcal{T}}_t(V_{t+1})\| + \|\hat{\mathcal{T}}_t(V_{t+1}) - \hat{\mathcal{T}}_t(V_t)\| =: Z_1^1 + Z_1^2. \quad (8.15)$$

Since $Z_1^1 = \|V^*_t - \hat{\mathcal{T}}_t(V_{t+1})\| = \|\hat{\mathcal{T}}_t(V^*_t) - \hat{\mathcal{T}}_t(V_{t+1})\|$, from statement (i) of Lemma 13, we get

$$Z_1^1 \leq \gamma^{\tau_{\min}} \|V^*_t - V_{t+1}\|, \quad \text{and} \quad Z_1^2 \leq \gamma^{\tau_{\min}} \|V_{t+1} - V_t\|. \quad (8.16)$$

Substituting (8.16) in (8.15), we get

$$Z_1 \leq \frac{\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\|. \quad (8.17)$$

$Z_2 = \|V^* - V_{t+1}\|$ is upper bounded by

$$Z_2 \leq \|V^* - \mathcal{T}(V_{t+1})\| + \|\mathcal{T}(V_{t+1}) - V_{t+1}\| =: Z_2^1 + Z_2^2. \quad (8.18)$$

Again using statement (i) of Lemma 13, we have

$$Z_2^1 \leq \gamma^{\tau_{\min}} \|V^* - V_{t+1}\| = \gamma^{\tau_{\min}} Z_2. \quad (8.19)$$

Furthermore, $Z_2^2 = \|\mathcal{T}(V_{t+1}) - V_{t+1}\| = \|\mathcal{T}(V_{t+1}) - \hat{\mathcal{T}}_t(V_t)\|$ is upper bounded by

$$Z_2^2 \leq \max_{a \in \mathcal{A}_S} \|R - \hat{R}_t\| + \max_{a \in \mathcal{A}_S} \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} P(s', \tau \mid s, a) V_{t+1}(s') - \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a) V_t(s') \Big\|. \quad (8.20)$$

Recall that for any $t \geq \tilde{t}_0$, $|\hat{R}_t - R| \leq \epsilon_1$ and $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$. Therefore, for $t \geq \tilde{t}_0$, (8.20) is upper bounded by

$$Z_2^2 \leq \epsilon_1 + \max_{a \in \mathcal{A}_S} \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \big|P(s', \tau \mid s, a) - \hat{P}_t(s', \tau \mid s, a)\big| V_{t+1}(s') \Big\| + \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a)\big(V_{t+1}(s') - V_t(s')\big) \Big\|$$
$$\stackrel{(1^*)}{\leq} \epsilon_1 + \|V_{t+1}\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \epsilon_2 + \gamma^{\tau_{\min}} \|V_{t+1} - V_t\| = \epsilon_1 + \frac{\epsilon_2 N \gamma^{\tau_{\min}} \|V_{t+1}\|}{1 - \gamma} + \gamma^{\tau_{\min}} \|V_{t+1} - V_t\|, \quad (8.21)$$

where $N$ is the size of the finite state space $\mathcal{S}$, and $(1^*)$ follows from $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$ and statement (i) of Lemma 13. Substituting (8.19) and (8.21) in (8.18), we get

$$Z_2 \leq \frac{\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\| + f(\|V_{t+1}\|), \quad (8.22)$$

where $f(\|V_{t+1}\|) := \frac{1}{1 - \gamma^{\tau_{\min}}}\Big(\epsilon_1 + \frac{\epsilon_2 N \gamma^{\tau_{\min}} \|V_{t+1}\|}{1 - \gamma}\Big)$ is bounded for bounded $\|V_{t+1}\|$, and $f(\|V_{t+1}\|) \to 0$ as $\epsilon_1, \epsilon_2 \to 0$. Substituting (8.17) and (8.22) in (8.14), we get

$$\mathcal{V}^*_t \leq \frac{2\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\| + f(\|V_{t+1}\|). \quad (8.23)$$

Recall that for any $\epsilon_3 > 0$, there exists $\tilde{t}_1 \geq \tilde{t}_0$ such that $\|V_{t+1} - V_t\| \leq \epsilon_3$, for $t \geq \tilde{t}_1$. Choosing $\epsilon_1, \epsilon_2 \to 0$ and $\epsilon_3 = \frac{\epsilon(1 - \gamma^{\tau_{\min}})}{2\gamma^{\tau_{\min}}}$, we get that there exists $\tilde{t}_1$ such that $\|V_{t+1} - V_t\| \leq \frac{\epsilon(1 - \gamma^{\tau_{\min}})}{2\gamma^{\tau_{\min}}}$ and $\mathcal{V}^*_t = \|V^*_t - V^*\| < \epsilon$, for any $t \geq \tilde{t}_1$.

Recall that $\mathcal{V}^*_t$ only considers the states $s \in B_t$. However, since each state is updated infinitely often, there exists $\tilde{t}$ such that $\|V^*_t - V^*\| < \epsilon$, for any $t \geq \tilde{t}$. Furthermore, for $t \geq \tilde{t}$, $\|V^*_{t+1} - V^*_t\| \leq \|V^*_{t+1} - V^*\| + \|V^*_t - V^*\| \leq 2\epsilon$.
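Lemma 14 and the proof of Theorem 2 below concern asynchronous value iteration in which, at each time $t$, only a subset $B_t$ of states is backed up using the current estimates $\hat{P}_t$ and $\hat{R}_t$. A minimal Python sketch of such an update loop on a toy SMDP is given below; the assumed $1/t$ decay of the estimation error, the random choice of $B_t$, and all sizes and parameters are illustrative assumptions, not the estimator developed in Chapter 4.

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 6, 3
tau_support = np.array([2, 3, 4])
gamma = 0.9
disc = gamma ** tau_support

P_true = rng.random((n_states, n_actions, n_states, tau_support.size))
P_true /= P_true.sum(axis=(2, 3), keepdims=True)
R_true = rng.random((n_states, n_actions))

def backup(V, P, R, states):
    """Asynchronous Bellman backup: update only the states in B_t."""
    V = V.copy()
    Q = R + np.einsum('sant,n,t->sa', P, V, disc)
    V[states] = Q.max(axis=1)[states]
    return V

# Optimal value function of the uncertainty-free SMDP (synchronous VI to convergence).
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = backup(V_star, P_true, R_true, np.arange(n_states))

# Adaptive asynchronous VI: estimates improve over time (assumed 1/t noise) and only
# a random nonempty subset B_t of states is backed up at each step.
V = np.zeros(n_states)
for t in range(1, 3001):
    noise = 1.0 / t
    P_hat = P_true + noise * rng.random(P_true.shape)
    P_hat /= P_hat.sum(axis=(2, 3), keepdims=True)
    R_hat = R_true + noise * rng.normal(size=R_true.shape)
    B_t = rng.choice(n_states, size=rng.integers(1, n_states + 1), replace=False)
    V = backup(V, P_hat, R_hat, B_t)

print("final error ||V_t - V*||_inf =", np.abs(V - V_star).max())

Because each state is selected with positive probability at every step, every state is updated infinitely often with probability 1, which is the condition invoked in Lemma 14 and in the proof of Theorem 2.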
Proof of Theorem 2: For any state $s \in \mathcal{S}$, define a sequence $\{t^s_i\}_{i=1}^{\infty}$ of times at which state $s$ is updated by the asynchronous VI, and consider the updates after time $\tilde{t}$, i.e., consider the sequence $\{t^s_i\}_{i=k}^{\infty}$ such that $t^s_k \geq \tilde{t}$. Let $\mathcal{V}_t(s) := |V_{t+1}(s) - V^*_t(s)|$. Therefore, using statement (ii) of Lemma 13, $\mathcal{V}_{t^s_{i+1}}(s) = |V_{t^s_{i+1}+1}(s) - V^*_{t^s_{i+1}}(s)| \leq \gamma^{\tau_{\min}} \|V_{t^s_{i+1}} - V^*_{t^s_{i+1}}\|$, and is therefore upper bounded by

$$\mathcal{V}_{t^s_{i+1}}(s) \leq \gamma^{\tau_{\min}}\big(\|V_{t^s_{i+1}} - V^*_{t^s_i}\| + \|V^*_{t^s_i} - V^*_{t^s_{i+1}}\|\big) \stackrel{(1^*)}{\leq} \gamma^{\tau_{\min}}\big(\|\mathcal{V}_{t^s_i}\| + 2\epsilon\big), \quad (8.24)$$

for $i \geq k$, where $(1^*)$ follows from statement (ii) of Lemma 14. From (8.24), we get the following recursion:

$$\|\mathcal{V}_{t^s_{i+1}}\| \leq \gamma^{\tau_{\min}}\big(\|\mathcal{V}_{t^s_i}\| + 2\epsilon\big), \quad (8.25)$$

for $i \geq k$. Recursively applying (8.25) to obtain upper bounds on $\|\mathcal{V}_{t^s_j}\|$, for $j = k, \dots, i$, and substituting in (8.24), we get

$$\mathcal{V}_{t^s_{i+1}}(s) \leq \gamma^{(i+1)\tau_{\min}} \|\mathcal{V}_{t^s_k}\| + \frac{2\gamma^{\tau_{\min}}(1 - \gamma^{(i+1)\tau_{\min}})}{1 - \gamma^{\tau_{\min}}} \epsilon.$$

In the limit $i \to \infty$, $\mathcal{V}_{t^s_{i+1}} = |V_{t^s_{i+1}+1}(s) - V^*_{t^s_{i+1}}(s)| \leq \epsilon_4$, where $\epsilon_4 := \frac{2\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}}\epsilon$, and $\epsilon_4 \to 0$ as $\epsilon \to 0$. Since for each $s \in \mathcal{S}$, $V^*_{t^s_{i+1}}(s)$ converges to $V^*(s)$ (Lemma 14), and $\epsilon$ is arbitrary, $V_t(s)$ converges to $V^*(s)$ for any $s$. ■

Proof of Theorem 3

We prove Theorem 3 using the following Theorem 10.

Theorem 10. $\mathcal{T}_r$ is a contraction mapping, and hence, there exists a unique fixed point satisfying $\mathcal{T}_r(V) = V$.

Proof. The proof follows similarly to the case of robust MDPs [32].

Proof of Theorem 3: Since $\mathcal{T}_r$ is a contraction mapping (Theorem 10), and each state is updated infinitely often, the robust adaptive asynchronous VI converges to a fixed point $\mathcal{T}_r(V) = V$. Furthermore, bounded $\mathcal{P}^a$ implies that the value function at any time $t$ remains bounded. Once $\mathcal{P}^a$ converges to the singleton estimate $P$, the robust adaptive asynchronous VI reduces to the adaptive asynchronous VI. Hence, the proof follows using Theorem 2. ■

APPENDIX C: CHAPTER 7

Proof of Theorem 1 [Existence of PNE]

We prove Theorem 6 using Brouwer's fixed point theorem [66, Appendix C] applied to the best response mapping, with the help of the following lemmas (Lemmas 15-17). Recall that $b_i(\lambda^R_{-i})$ is the best response of player $i$ to the review admission rates $\lambda^R_{-i}$ of the other players. For brevity of notation, we will represent $r_R(\lambda^R_i, \lambda^R_{-i})$, $p(\lambda^R_i, \lambda^R_{-i})$, $f_i(\lambda^R_i, \lambda^R_{-i})$, and $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i})$ by $r_R$, $p$, $f_i$, and $\tilde{u}_i$, respectively. Furthermore, let $q'$ and $q''$, respectively, represent the first and second partial derivatives of a generic function $q$ with respect to $\lambda^R_i$.

Lemma 15 (Strict concavity of incentive). For the CPR game $\Gamma$, under Assumptions (A1)-(A2), the incentive function $f_i : \mathcal{S} \mapsto \mathbb{R}$ is strictly concave in $\lambda^R_i$, for $\lambda^R_i \in [0, \bar{\lambda}^R_i]$ and any fixed $\lambda^R_{-i}$. Equivalently, $f_i(x)$ is strictly concave in $x$ for $x \in [0, \mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \lambda^R_j]$.

Proof. Recall from (7.9) that $f_i(\lambda^R_i, \lambda^R_{-i}) = f_i(x) = r_R(x)(1 - p(x)) - h_i r_S$. The first and second partial derivatives of the incentive function $f_i$ with respect to $\lambda^R_i$ in the interval $\lambda^R_i \in [0, \bar{\lambda}^R_i]$ are given by

$$f_i' = (r_R)'(1 - p) - r_R\, p' = -a_i \frac{d f_i}{d x}, \quad (8.26a)$$
$$f_i'' = (r_R)''(1 - p) - 2 (r_R)' p' - r_R\, p'' = a_i^2 \frac{d^2 f_i}{d x^2}. \quad (8.26b)$$

From Assumptions (A1) and (A2), we have $f_i'' < 0$ and $\frac{d^2 f_i}{d x^2} < 0$ in the interval where the derivative of $f_i$ exists, thereby proving the strict concavity of $f_i$ in $\lambda^R_i$ and in $x$.

Lemma 16 (Best response mapping).
For the CPR game $\Gamma$, under Assumptions (A1)-(A2), the best response mapping $b_i(\lambda^R_{-i})$ is unique for any $\lambda^R_{-i} \in \mathcal{S}_{-i}$ and is given by

$$b_i(\lambda^R_{-i}) = \begin{cases} 0, & \text{if } f_i(\lambda^R_i, \cdot) \leq 0, \ \forall \lambda^R_i \in \mathcal{S}_i, \\ \alpha_i, & \text{if } \exists\, \alpha_i \in \mathcal{S}_i \text{ s.t. } \frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(\alpha_i) = 0 \text{ and } f_i(\alpha_i, \cdot) > 0, \\ \mu^R_i, & \text{otherwise}. \end{cases}$$

Proof. We establish the uniqueness of the best response mapping through the following three cases.

Case 1: $f_i(\lambda^R_i, \cdot) \leq 0$, for every $\lambda^R_i \in \mathcal{S}_i$. If, for a given $\lambda^R_{-i} \in \mathcal{S}_{-i}$, $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then from (7.8), $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i})$ admits a unique maximum at $\lambda^R_i = 0$, and therefore, $b_i(\lambda^R_{-i}) = 0$ is the unique best response.

Case 2: There exists a non-empty interval $\bar{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. For any given $\lambda^R_{-i} \in \mathcal{S}_{-i}$, recall that the system constraint (7.3) is violated for every $\lambda^R_i \in (\bar{\lambda}^R_i, \mu^R_i] \subset \mathcal{S}_i$, and $p(\lambda^R_i, \lambda^R_{-i}) = 1$. Therefore, for every $\lambda^R_i \in (\bar{\lambda}^R_i, \mu^R_i]$, we have

$$f_i = -h_i r_S < 0. \quad (8.27)$$

Therefore, $b_i(\lambda^R_{-i}) \in [0, \bar{\lambda}^R_i] \subset \mathcal{S}_i$, for any given $\lambda^R_{-i} \in \mathcal{S}_{-i}$. Furthermore, since $p$ is continuously differentiable with respect to $\lambda^R_i$ for each $\lambda^R_i \in (0, \bar{\lambda}^R_i)$, $\tilde{u}_i$ is a smooth function on the set $[0, \bar{\lambda}^R_i] \times \mathcal{S}_{-i}$. Hence, the best response, which is a global maximizer of $\tilde{u}_i$ on the interval $\lambda^R_i \in \mathcal{S}_i$ for a fixed $\lambda^R_{-i}$, either occurs at the boundary of $\mathcal{S}_i$ or satisfies the first order condition $\frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(b_i) = 0$ (see [123]). Let there exist $\alpha_i \in \mathcal{S}_i$ such that

$$f_i(\alpha_i, \cdot) > 0, \quad (8.28a)$$
$$\frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(\alpha_i) = \alpha_i f_i'(\alpha_i, \cdot) + f_i(\alpha_i, \cdot) = 0. \quad (8.28b)$$

Since $f_i(\alpha_i, \cdot) > 0$ and $\alpha_i > 0$, (8.28b) has a solution only if $f_i'(\alpha_i, \cdot) < 0$. Furthermore, $f_i(\alpha_i, \cdot) > 0$ implies $\alpha_i \in [0, \bar{\lambda}^R_i]$ (see (8.27)). Therefore, the existence of $\alpha_i$ satisfying (8.28) implies that there exists a non-empty set $\bar{\mathcal{S}}_i \subset [0, \bar{\lambda}^R_i] \subset \mathcal{S}_i$ such that for each $\alpha_i \in \bar{\mathcal{S}}_i$, $f_i(\alpha_i, \cdot) > 0$ and $f_i'(\alpha_i, \cdot) < 0$. For any $\lambda^R_i \in \bar{\mathcal{S}}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, using Lemma 15, we get

$$\frac{\partial^2 \tilde{u}_i}{\partial {\lambda^R_i}^2} = \lambda^R_i f_i'' + 2 f_i' < 0. \quad (8.29)$$

Hence, for $\lambda^R_i \in \bar{\mathcal{S}}_i$, the expected utility $\tilde{u}_i$ is strictly concave with a unique global maximizer $\alpha_i \in \bar{\mathcal{S}}_i$ that satisfies $\alpha_i = \min\{-\frac{f_i(b_i, \cdot)}{f_i'(b_i, \cdot)}, \mu^R_i\}$ (see (8.28b)).

Case 3: There exists a non-empty interval $\tilde{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$, and $f_i'(\lambda^R_i, \cdot) \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$. Finally, consider the case that $f_i'(\lambda^R_i, \cdot) \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$, and there exists an interval $\tilde{\mathcal{S}}_i \subset \mathcal{S}_i$ where $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$. Since $f_i'(\lambda^R_i, \cdot) \geq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, $f_i(\lambda^R_i, \cdot)$ is increasing in $\lambda^R_i$, and therefore, $f_i(\lambda^R_i, \cdot)$ is maximized at $\lambda^R_i = \mu^R_i$. Since there exists a non-empty interval $\tilde{\mathcal{S}}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$, and $f_i(\lambda^R_i, \cdot)$ is monotonically increasing, it follows that $f_i(\mu^R_i, \cdot) > 0$. Therefore, (8.28b) has no solution and the expected utility of player $i$ is strictly increasing in $\lambda^R_i$, i.e., $\frac{\partial \tilde{u}_i}{\partial \lambda^R_i} > 0$, for every $\lambda^R_i \in \mathcal{S}_i$. Therefore, the best response is the unique maximum of $\tilde{u}_i$, which occurs at the boundary $\mu^R_i$.

We state some important intermediate results from the three cases of Lemma 16 as a corollary for later discussions.

Corollary 2 (Best response and incentive).
For the CPR game $\Gamma$, under Assumptions (A1)-(A3), the following statements hold:
(i) $b_i = 0$ if and only if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$; furthermore, $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$ implies $f_i'(\lambda^R_i, \cdot) < 0$ for every $\lambda^R_i \in \mathcal{S}_i$;
(ii) if there exists an interval $\bar{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, $\forall \lambda^R_i \in \bar{\mathcal{S}}_i$, then the unique best response for player $i$ satisfies the implicit equation $b_i = \min\{-\frac{f_i(b_i, \cdot)}{f_i'(b_i, \cdot)}, \mu^R_i\} \in \bar{\mathcal{S}}_i$; and
(iii) if $f_i'(\lambda^R_i, \cdot) \geq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then $b_i = \mu^R_i$.

Proof. We only establish the first statement of the corollary; the other statements are established in the proof of Lemma 16. We have already established in Lemma 16 that if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then the expected utility $\tilde{u}_i$ is maximized by $b_i = 0$. We now establish the "only if" part. Recall from (7.8) that $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) = \mu^S_i r_S + \lambda^R_i f_i(\lambda^R_i, \lambda^R_{-i})$. Let $b_i = 0$ be the best response for player $i$ for a fixed $\lambda^R_{-i}$. If there exists $b \in \mathcal{S}_i$ such that $f_i(b, \cdot) > 0$, then $\tilde{u}_i(b, \cdot) > \tilde{u}_i(b_i, \cdot)$, and $b_i = 0$ cannot be a best response. Hence, $b_i = 0$ is the best response for player $i$ if and only if $f_i(\lambda^R_i, \cdot) \leq 0$, for every $\lambda^R_i \in \mathcal{S}_i$.

We now show that if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then $f_i'(\lambda^R_i, \cdot) < 0$ for every $\lambda^R_i \in \mathcal{S}_i$. Since $f_i$ is strictly concave in $x$ (from Lemma 15) and $f_i(\mu^R_i, 0) = f_i(\mu^S_T - a_i \mu^R_i) > 0$ by Assumption (A3), there exist $\gamma_1, \gamma_2 \in \mathbb{R}$ such that $\gamma_1 < \mu^S_T - a_i \mu^R_i < \gamma_2$ and $f_i(x) > 0$ if and only if $x \in (\gamma_1, \gamma_2)$. If $f_i(\lambda^R_i, \lambda^R_{-i}) = f_i(x) \leq 0$ for each $\lambda^R_i \in \mathcal{S}_i$ and for a given $\lambda^R_{-i}$, then for each $\lambda^R_i \in \mathcal{S}_i$, either $x \leq \gamma_1$ or $x \geq \gamma_2$. Suppose $x \geq \gamma_2$ for each $\lambda^R_i \in \mathcal{S}_i$. However, for $\lambda^R_i = \mu^R_i$, $x = \mu^S_T - a_i \mu^R_i - \sum_{j \neq i} a_j \lambda^R_j \leq \mu^S_T - a_i \mu^R_i < \gamma_2$, which is a contradiction. Hence, $x \leq \gamma_1$, for each $\lambda^R_i \in \mathcal{S}_i$. Finally, from the strict concavity of $f_i$, $f_i$ is increasing in $x$ for $x \leq \gamma_1$. Equivalently, $f_i$ is decreasing in $\lambda^R_i$, i.e., $f_i'(\lambda^R_i, \cdot) < 0$, for every $\lambda^R_i \in \mathcal{S}_i$.

Theorem 11 (Berge Maximum Theorem, adapted from [124]). Let $\tilde{u}_i : \mathcal{S}_i \times \mathcal{S}_{-i} \mapsto \mathbb{R}$ be a continuous function on $\mathcal{S}_i \times \mathcal{S}_{-i}$, and let $C : \mathcal{S}_{-i} \mapsto \mathcal{S}_i$ be a compact-valued correspondence such that $C(\lambda^R_{-i}) \neq \emptyset$ for all $\lambda^R_{-i} \in \mathcal{S}_{-i}$. Define $\tilde{u}^*_i : \mathcal{S}_{-i} \mapsto \mathbb{R}$ by $\tilde{u}^*_i(\lambda^R_{-i}) = \max\{\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) \mid \lambda^R_i \in C(\lambda^R_{-i})\}$, and $b_i : \mathcal{S}_{-i} \mapsto \mathcal{S}_i$ by $b_i(\lambda^R_{-i}) = \operatorname{argmax}\{\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) \mid \lambda^R_i \in C(\lambda^R_{-i})\}$. If $C$ is continuous at $\lambda^R_{-i}$, then $\tilde{u}^*_i$ is continuous and $b_i$ is upper hemicontinuous with nonempty and compact values. Furthermore, if $\tilde{u}_i$ is strictly quasiconcave in $\lambda^R_i \in \mathcal{S}_i$ for each $\lambda^R_{-i}$ and $C$ is convex-valued, then $b_i(\lambda^R_{-i})$ is single-valued, and thus is a continuous function.

Lemma 17 (Continuity of best response mapping). For the CPR game $\Gamma$, under Assumptions (A1)-(A3), the best response mapping $b_i(\lambda^R_{-i})$ is continuous for each $\lambda^R_{-i} \in \mathcal{S}_{-i}$.

Proof. Let $z : \mathcal{S}_{-i} \mapsto \big[\frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \mu^R_j}{a_i}, \frac{\mu^S_T}{a_i}\big]$ be defined by

$$z(\lambda^R_{-i}) := \frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \lambda^R_j}{a_i}. \quad (8.30)$$

The mapping $z(\lambda^R_{-i})$ represents an upper bound on the value of $\lambda^R_i$ above which the system constraint (7.3) is violated. Therefore, for each $\lambda^R_i \in [z(\lambda^R_{-i}), \infty) \cap \mathcal{S}_i$, from (7.9), we get

$$f_i = -h_i r_S < 0, \quad \text{and} \quad f_i' = -r_R\, p' \leq 0, \quad (8.31)$$

where the latter follows from the monotonicity of $p$ (Assumption (A2)). The mapping $z(\lambda^R_{-i})$ defined in (8.30) is continuous on $\mathcal{S}_{-i}$ and linearly decreasing in $\lambda^R_j$, for every $j \in \mathcal{N} \setminus \{i\}$.
Therefore, to establish the continuity of the best response mapping $b_i(\lambda^R_{-i})$ on $\mathcal{S}_{-i}$, it is sufficient to show that $b_i$ is unique and varies continuously with $z(\lambda^R_{-i})$. To this end, we show that for each fixed value of $z(\lambda^R_{-i})$, $b_i(\lambda^R_{-i}) = \phi(z(\lambda^R_{-i}))$, for some continuous function $\phi : \big[\frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \mu^R_j}{a_i}, \frac{\mu^S_T}{a_i}\big] \mapsto [0, \mu^R_i]$.

Let $\hat{\lambda}^+ : \mathcal{S}_{-i} \mapsto [0, \mu^R_i]$ be defined by

$$\hat{\lambda}^+(\lambda^R_{-i}) = \begin{cases} 0, & \text{if } f_i(\lambda^R_i, \lambda^R_{-i}) \leq 0, \ \forall \lambda^R_i \in \mathcal{S}_i, \\ \sup\{\lambda^R_i \in \mathcal{S}_i \mid f_i > 0\}, & \text{otherwise}. \end{cases} \quad (8.32)$$

The mapping $\hat{\lambda}^+(\lambda^R_{-i})$, when non-zero, represents the maximal admissible review admission rate for player $i$ that yields her a positive incentive to review the tasks. Fig. C.1 shows the best response of player $i \in \mathcal{N}$ for the three possible cases of $\hat{\lambda}^+$.

[Figure C.1: Best response of player $i$ with varying $\hat{\lambda}^+$. The red curve shows different possibilities for the strictly concave incentive function $f_i(\lambda^R_i, \cdot)$ on $\mathcal{S}_i = [0, \mu^R_i]$, based on the value of $z(\lambda^R_{-i})$: (a) $\hat{\lambda}^+ = 0$, where $f_i < 0$ and $f_i' < 0$ for all $\lambda^R_i \in \mathcal{S}_i$ and $b_i(\lambda^R_{-i}) = 0$; (b) $\hat{\lambda}^+ \in (0, \mu^R_i)$, where there exists a subset of $\mathcal{S}_i$ (shown in blue) on which $f_i > 0$ and $f_i' < 0$, and $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i$; (c) $\hat{\lambda}^+ = \mu^R_i$, where $f_i' \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$ and $b_i(\lambda^R_{-i}) = \mu^R_i$.]

Case 1: $z(\lambda^R_{-i}) \leq 0$. From (8.31) and statement (i) of Corollary 2, $b_i(\lambda^R_{-i}) = 0$ is the unique (continuous) best response for player $i$.

Case 2: $z(\lambda^R_{-i}) > 0$. From (8.31), $f_i < 0$ and $f_i' < 0$ for any $\lambda^R_i \geq z(\lambda^R_{-i})$. Hence, $\hat{\lambda}^+ < z(\lambda^R_{-i})$. We now consider three cases based on the value of $\hat{\lambda}^+$.

Case 2.1: $\hat{\lambda}^+ = 0$. In this case, $f_i \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, and therefore, from statement (i) of Corollary 2, $b_i = 0$ is the unique best response and continuity holds trivially.

Case 2.2: $\hat{\lambda}^+ \in (0, \mu^R_i)$. In this case, for any $\lambda^R_{-i} \in \mathcal{S}_{-i}$, there exists an interval $\bar{\mathcal{S}}_i \subset [0, \hat{\lambda}^+]$ such that $f_i > 0$ and $f_i' < 0$ for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. Here, $f_i' < 0$ follows from the fact that the supremum in (8.32) corresponds to the decreasing segment of $f_i$. From statement (ii) of Corollary 2, there exists a unique $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i$ that maximizes $\tilde{u}_i$. Application of the Berge maximum theorem [124] yields the continuity of the unique maximizer. Since $b_i < \hat{\lambda}^+$, it follows that if $\hat{\lambda}^+ \to 0^+$, then $b_i \to 0^+$. Hence, continuity holds at $\hat{\lambda}^+ = 0$.

Case 2.3: $\hat{\lambda}^+ = \mu^R_i$. Since $\hat{\lambda}^+ < z(\lambda^R_{-i})$, $z(\lambda^R_{-i}) \in (\mu^R_i, \frac{\mu^S_T}{a_i}]$. If $f_i'(\mu^R_i, \lambda^R_{-i}) < 0$, then the continuity follows analogously to Case 2.2. Now consider the case $f_i'(\mu^R_i, \lambda^R_{-i}) = -\delta$, for $\delta > 0$. Since $f_i$ is concave in $\lambda^R_i$ and its derivative is decreasing, there exists $\epsilon > 0$ such that $f_i' < 0$ for $\lambda^R_i \in (\mu^R_i - \epsilon, \mu^R_i]$. Since $f_i(\lambda^R_i, \cdot)$ is strictly concave in $\lambda^R_i$ (Lemma 15), there exists at most one point $\lambda^R_i$ such that $f_i'(\lambda^R_i, \cdot) = 0$. Therefore, in the limit $\delta \to 0^+$, $\epsilon \to 0^+$. Hence, in this limiting case $\bar{\mathcal{S}}_i = (\mu^R_i - \epsilon, \mu^R_i]$, where $\epsilon \to 0^+$, and the best response $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i = (\mu^R_i - \epsilon, \mu^R_i]$ converges to $\mu^R_i$. If $f_i'(\mu^R_i, \lambda^R_{-i}) \geq 0$, then it follows from the strict concavity of $f_i$ that $f_i'(\lambda^R_i, \lambda^R_{-i}) \geq 0$, for every $\lambda^R_i \in \mathcal{S}_i$. Using statement (iii) of Corollary 2, $b_i(\lambda^R_{-i}) = \mu^R_i$ is the unique (continuous) best response. Note that when $z(\lambda^R_{-i}) = \frac{\mu^S_T}{a_i}$, i.e., when no other player reviews any task ($\lambda^R_{-i} = 0$), from Assumption (A3), $f_i(\mu^R_i, 0) > 0$ and therefore $b_i(\lambda^R_{-i}) = \mu^R_i$. Hence, $b_i(\lambda^R_{-i})$ is continuous for every $z(\lambda^R_{-i})$, and therefore, is continuous for $\lambda^R_{-i} \in \mathcal{S}_{-i}$.
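Lemmas 15-17 also suggest a simple numerical procedure for approximating a fixed point of the best response mapping whose existence Theorem 6 establishes below. The following Python sketch approximates $b_i(\lambda^R_{-i})$ by a grid search over $\mathcal{S}_i$ and iterates sequential best responses; all functional forms ($r_R$, $p$), rates, and parameters are assumptions chosen only for illustration, and the iteration is a heuristic whose convergence is not claimed here.

import numpy as np

# Illustrative CPR game (assumed parameters, not the values used in Chapter 7):
# incentive f_i(x) = rR(x)(1 - p(x)) - h_i*rS, slack x = muS_T - sum_j a_j*lam_j,
# expected utility u_i = muS_i*rS + lam_i*f_i(x), as in (7.8)-(7.9).
rS = 0.2
muS = np.array([1.0, 1.2, 1.5])          # service admission rates mu^S_i
muR = np.array([0.8, 0.7, 0.5])          # maximum review admission rates mu^R_i
h = muS / muR                            # heterogeneity measures h_i
a = 1.0 + h                              # coefficients a_i appearing in the slack
muS_T = muS.sum()

def utility(lam_i, i, lam):
    lam = lam.copy(); lam[i] = lam_i
    x = muS_T - a @ lam
    if x <= 0:                           # system constraint (7.3) violated: p = 1
        return muS[i] * rS - lam_i * h[i] * rS
    f_i = (2.0 - 0.5 * x) * (1.0 - np.exp(-x)) - h[i] * rS   # assumed rR and p
    return muS[i] * rS + lam_i * f_i

def best_response(i, lam, grid=4001):
    """Approximate b_i(lam_{-i}) by maximizing u_i over a fine grid on S_i = [0, muR_i]."""
    cand = np.linspace(0.0, muR[i], grid)
    return cand[int(np.argmax([utility(l, i, lam) for l in cand]))]

# Sequential best-response dynamics: a heuristic search for a fixed point of the
# mapping M in (8.33); existence is guaranteed by Theorem 6, convergence of this
# particular iteration is not claimed.
lam = np.zeros_like(muR)
for _ in range(100):
    for i in range(len(muR)):
        lam[i] = best_response(i, lam)
print("approximate PNE review admission rates:", np.round(lam, 3))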
Proof of Theorem 6: To prove the existence of a PNE, define a mapping $M : \mathcal{S} \mapsto \mathcal{S}$ as follows:

$$M(\lambda^R_1, \lambda^R_2, \dots, \lambda^R_N) = \big(b_1(\lambda^R_{-1}), b_2(\lambda^R_{-2}), \dots, b_N(\lambda^R_{-N})\big). \quad (8.33)$$

The mapping $M$ is unique (Lemma 16) and continuous (Lemma 17), and maps the compact convex set $\mathcal{S}$ ($\mathcal{S}_i$ is convex and compact, $\forall i \in \mathcal{N}$) to itself. Hence, application of Brouwer's fixed point theorem [66, Appendix C] yields that there exists a strategy profile $\lambda^{R*} = \{\lambda^{R*}_i\}_{i \in \mathcal{N}} \in \mathcal{S}$ which is invariant under the best response mapping and is therefore a PNE of the game. $\square$

Proof of Corollary 1 [PNE]: Since a PNE is a best response which remains invariant under the best-response mapping $M$ given by (8.33), Corollary 1 is a direct consequence of Corollary 2, with the simplification that $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ at a PNE. Therefore, to prove Corollary 1, it is sufficient to show statement (i), i.e., $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ for any player $i \in \mathcal{N}$ at a PNE, which we prove by contradiction. Let there exist a player $j$ such that $f_j'(\lambda^{R*}_j, \lambda^{R*}_{-j}) \geq 0$ at a PNE. From (8.26a), it can be seen that the sign of $f_i'$ remains the same for all players at a PNE. Therefore, $f_j' \geq 0$ implies $f_i' \geq 0$ for all $i \in \mathcal{N}$ at that PNE. In such a case, (8.28b) implies that the expected utility of each player $i$ with $\lambda^{R*}_i > 0$ (and therefore $f_i > 0$) is increasing in $\lambda^R_i$ at that PNE, and therefore, each of these players can improve their expected utility by unilaterally increasing their review admission rate. Therefore, $\lambda^{R*}$ cannot be a PNE, which is a contradiction. Hence, $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ for any player $i \in \mathcal{N}$ at a PNE, and the corollary follows. $\square$

Proof of Proposition 1 [Structure of PNE]

Let $\lambda^{R*}_{k_1}$ and $\lambda^{R*}_{k_2}$ be the review admission rates at a PNE for players $k_1$ and $k_2$, respectively, with $h_{k_1} \leq h_{k_2}$ and $a_{k_1} \leq a_{k_2}$. By proving $a_{k_1} \lambda^{R*}_{k_1} \geq a_{k_2} \lambda^{R*}_{k_2}$, the relation $\lambda^{R*}_{k_1} \geq \lambda^{R*}_{k_2}$ is established trivially since $a_{k_1} \leq a_{k_2}$. We assume $a_{k_1} \lambda^{R*}_{k_1} < a_{k_2} \lambda^{R*}_{k_2}$ and prove the first statement by establishing a contradiction argument using the two cases discussed below. Furthermore, the proof of the second statement is contained within Case 1 below.

Case 1: $\lambda^{R*}_{k_1} = 0$. From statement (ii) of Corollary 1, $f_{k_1}(\lambda^{R*}_{k_1}, \lambda^{R*}_{-k_1}) \leq 0$. From (7.9), the incentives $f_{k_1}$ and $f_{k_2}$ for players $k_1$ and $k_2$ at a PNE satisfy

$$f_{k_2} = f_{k_1} + (h_{k_1} - h_{k_2}) r_S \leq 0.$$

Therefore, utilizing statement (ii) of Corollary 1 again implies $\lambda^{R*}_{k_2} = 0$, which is a contradiction. This case also proves the second statement.

Case 2: $\lambda^{R*}_{k_1} > 0$. By assumption, $a_{k_1} \lambda^{R*}_{k_1} < a_{k_2} \lambda^{R*}_{k_2}$; from statement (iii) of Corollary 1, $\lambda^{R*}_i$, where $i \in \{k_1, k_2\}$, satisfies the implicit equation

$$\lambda^{R*}_i = \min\Big\{-\frac{f_i(\lambda^{R*}_i, \lambda^{R*}_{-i})}{f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i})}, \mu^R_i\Big\}.$$

We assume that $\lambda^{R*}_{k_1} < \mu^R_{k_1}$, and therefore, $\lambda^{R*}_{k_1} = -\frac{f_{k_1}}{f_{k_1}'}$. Using (7.9) and (8.26a), we get

$$a_{k_2} \lambda^{R*}_{k_2} = \min\Big\{-\frac{a_{k_2} f_{k_2}}{f_{k_2}'}, a_{k_2} \mu^R_{k_2}\Big\} \leq -\frac{a_{k_2} f_{k_2}}{f_{k_2}'} = -\frac{a_{k_1}\big(f_{k_1} + (h_{k_1} - h_{k_2}) r_S\big)}{f_{k_1}'} \leq a_{k_1} \lambda^{R*}_{k_1},$$

which is a contradiction. Hence, if $\lambda^{R*}_{k_1} < \mu^R_{k_1}$, then $a_{k_1} \lambda^{R*}_{k_1} \geq a_{k_2} \lambda^{R*}_{k_2}$ and $\lambda^{R*}_{k_1} \geq \lambda^{R*}_{k_2}$, for each $k_2 > k_1$. $\square$

Proof of Theorem 2 [Uniqueness of PNE]

Suppose that the CPR game $\Gamma$ has multiple PNEs. We define the support of a PNE as the total number of players with non-zero review admission rate. Let $\mathrm{PNE}_1 = \lambda^1 = [\lambda^1_1, \lambda^1_2, \dots, \lambda^1_N]$ and $\mathrm{PNE}_2 = \lambda^2 = [\lambda^2_1, \lambda^2_2, \dots, \lambda^2_N]$ be two different PNEs with distinct supports $m_1$ and $m_2$, respectively. For brevity of notation, we have removed the superscript $R$ from the two PNEs and replaced it by their unique identifier. Without loss of generality, let $m_2 > m_1$.
Let $x_1 = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^1_i$ and $x_2 = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^2_i$ be the slackness parameters at $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$, respectively. We prove the uniqueness of the PNE using a six step process.

Step 1: We first show that if there exist two different PNEs with distinct supports $m_1$ and $m_2$ ($m_1 < m_2$), then $x_1 < x_2$. If $m_1$ and $m_2$ are the supports of $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$, respectively, then $\lambda^1_i = 0$ and $\lambda^2_j = 0$ for each $i > m_1$ and $j > m_2$, respectively (Proposition 1). Additionally, $\lambda^1_i > 0$ and $\lambda^2_j > 0$, for each $i \leq m_1$ and $j \leq m_2$. Hence, $m_2 > m_1$ implies $\lambda^1_{m_2} = 0$, while $\lambda^2_{m_2} > 0$. From statement (i) of Corollary 2, $b_i = 0$ if and only if $f_i \leq 0$ and $f_i' < 0$ (equivalently, $\frac{d f_i}{d x} > 0$) for all $\lambda^R_i \in \mathcal{S}_i$. Therefore, $\lambda^1_{m_2} = 0$ implies $f^1_{m_2} := f_{m_2}(\lambda^1) \leq 0$, while $\lambda^2_{m_2} > 0$ implies $f^2_{m_2} := f_{m_2}(\lambda^2) > 0$. Since $f^2_{m_2} > 0 \geq f^1_{m_2}$ and $\frac{d f_{m_2}}{d x} > 0$ everywhere, it follows that $x_1 < x_2$.

Step 2: We now show that $x_1 \geq x_2$ using Steps 2-5, which is a contradiction to the result of Step 1, and consequently $m_1 = m_2$. From statement (iii) of Corollary 1, the review admission rate of any player $i$, $i \leq m_1$, at $\mathrm{PNE}_k$, $k \in \{1, 2\}$, satisfies

$$\lambda^k_i = \min\Big\{-\frac{f^k_i}{f^{k\prime}_i}, \mu^R_i\Big\}. \quad (8.34)$$

Step 3: We show that $f^2_i > f^1_i$ for any player $i$, $i \leq m_1$. From (7.9), the incentives $f_i$ and $f_j$ for any two distinct players $i$ and $j$ with $j > i$ at a $\mathrm{PNE}_k$, $k \in \{1, 2\}$, satisfy

$$f^k_i - f^k_j = (h_j - h_i) r_S > 0, \quad \forall j > i.$$

Notice that the right-hand side of the above equation is independent of $\lambda^R_i$ and is therefore a constant for both PNEs. Hence, for every $i < m_2$,

$$f^1_i - f^1_{m_2} = f^2_i - f^2_{m_2}.$$

Therefore, $f^2_{m_2} > f^1_{m_2}$ implies $f^2_i > f^1_i$, for every $i \leq m_1 < m_2$.

Step 4: We show that $f^{1\prime}_i < f^{2\prime}_i$, for every player $i$, $i \leq m_1$. Recall that $f_i$ is strictly concave in $x$ (Lemma 15). Therefore, $x_1 < x_2$ (Step 1) implies $\frac{d f^1_i}{d x} > \frac{d f^2_i}{d x}$. Therefore, from (8.26a), $f^{1\prime}_i < f^{2\prime}_i$, for any player $i$, $i \leq m_1$.

Step 5: We now show that $x_1 \geq x_2$, which is a contradiction to the result of Step 1, and consequently $m_1 = m_2$. Since for all players $i$, $i \leq m_1$, $f^2_i > f^1_i$ (Step 3) and $-f^{1\prime}_i > -f^{2\prime}_i$ (Step 4), (8.34) implies $\lambda^2_i \geq \lambda^1_i$, for each $i \leq m_1$. Therefore, $\sum_{i=1}^{N} a_i \lambda^2_i \geq \sum_{i=1}^{m_1} a_i \lambda^2_i \geq \sum_{i=1}^{m_1} a_i \lambda^1_i = \sum_{i=1}^{N} a_i \lambda^1_i$, which implies $x_1 \geq x_2$, a contradiction to the result of Step 1. Hence, $m_1 = m_2$.

Step 6: We now show that the value of the slackness parameter $x$ at any PNE is unique. Steps 1 to 5 show that, at a PNE, the number of players with non-zero review admission rate is unique. Therefore, let $m$ be the identical support of $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$. Without loss of generality, let $x_1 > x_2$. Let $g_i : \mathbb{R} \mapsto \mathbb{R}$, for $i \leq m$, be defined by $g_i(x) = -\frac{f_i(x)}{f_i'(x)}$. Differentiating $g_i(x)$ with respect to $x$, we get

$$\frac{d g_i(x)}{d x} = \frac{\big(\frac{d f_i(x)}{d x}\big)^2 - f_i(x) \frac{d^2 f_i(x)}{d x^2}}{a_i \big(\frac{d f_i(x)}{d x}\big)^2}.$$

Recall from statement (iii) of Corollary 1 that players have a non-zero review admission rate at a PNE if and only if $f_i > 0$ at the PNE. Strict concavity of $f_i$ (Lemma 15) implies $\frac{d g_i(x)}{d x} > 0$. Consequently, at a PNE, the review admission rate of any player $i$, $i \leq m$, is increasing in $x$. Therefore, the assumption $x_1 > x_2$ implies $\lambda^1_i \geq \lambda^2_i$, for each player $i \leq m$. Consequently, $x_1 = \mu^S_T - \sum_{i=1}^{m} a_i \lambda^1_i \leq \mu^S_T - \sum_{i=1}^{m} a_i \lambda^2_i = x_2$, which is a contradiction. Therefore, $x_1 = x_2$.

We now show the uniqueness of the PNE. Steps 1 to 6 show that, at a PNE, the number of players with non-zero review admission rate and the slackness parameter $x$ are unique. Therefore, the first order conditions (8.34) give the unique review admission rate for each player $i$ for the unique slack parameter $x$, thereby implying the uniqueness of the PNE. $\square$
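The six-step argument above also suggests a direct one-dimensional computation of the PNE through its slackness parameter: for reviewing players the first order conditions (8.34) give $a_i \lambda^R_i = \min\{f_i(x)/\frac{d f_i}{d x}(x),\, a_i \mu^R_i\}$, each such term is increasing in $x$, and the balance $x + \sum_i a_i \lambda^R_i(x) = \mu^S_T$ can therefore be solved by bisection [120]. The Python sketch below illustrates this under the same assumed functional forms used earlier; the bracketing interval and the handling of corner cases are simplifications for illustration, not a general-purpose implementation.

import numpy as np

rS = 0.2
muS = np.array([1.0, 1.2, 1.5]); muR = np.array([0.8, 0.7, 0.5])
h = muS / muR; a = 1.0 + h; muS_T = muS.sum()

def f(x, i):                       # assumed incentive f_i(x) = rR(x)(1 - p(x)) - h_i*rS
    return (2.0 - 0.5 * x) * (1.0 - np.exp(-x)) - h[i] * rS

def dfdx(x, eps=1e-6):             # df_i/dx is the same for every i
    return (f(x + eps, 0) - f(x - eps, 0)) / (2 * eps)

def bisect(g, lo, hi, iters=200):  # root of g on [lo, hi] with a sign change
    s = np.sign(g(hi))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if np.sign(g(mid)) == s else (mid, hi)
    return 0.5 * (lo + hi)

x_bar = bisect(lambda x: -dfdx(x), 1e-6, muS_T)     # maximizer of f_i (df/dx = 0)

def load(x):                       # sum_i a_i*lam_i(x) from the first order conditions (8.34)
    return sum(min(f(x, i) / dfdx(x), a[i] * muR[i])
               for i in range(len(muS)) if f(x, i) > 0)

x_star = bisect(lambda x: x + load(x) - muS_T, 1e-6, x_bar - 1e-6)
lam_star = [min(f(x_star, i) / dfdx(x_star), a[i] * muR[i]) / a[i] if f(x_star, i) > 0 else 0.0
            for i in range(len(muS))]
print("PNE slack x* ~", round(x_star, 4), "; review rates ~", np.round(lam_star, 4))

With the assumed parameters, the slack found here matches the profile returned by the best-response iteration sketched earlier, as the uniqueness result predicts.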
Proof of Lemma 11 [Non-increasing best response]

We prove this lemma by considering the three cases of the best response mapping in Lemma 16 (Appendix C):

Case 1: $b_i = 0$. Recall that $x = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^R_i$. In this case, from statement (i) of Corollary 2, $f_i \leq 0$ and $f_i' < 0$ (equivalently, $\frac{d f_i}{d x} > 0$), for all $\lambda^R_i \in \mathcal{S}_i$. Since $x$ can be re-written as $x = \mu^S_T - a_i \lambda^R_i - \sigma_i(\lambda^R_{-i})$, $\frac{d f_i}{d x} > 0$ implies $\frac{\partial f_i}{\partial \sigma_i} < 0$. Hence, with an increase in $\sigma_i(\lambda^R_{-i})$, $b_i = 0$ remains the best response.

Case 2: $b_i = -\frac{f_i(b_i, \sigma_i(\lambda^R_{-i}))}{f_i'(b_i, \sigma_i(\lambda^R_{-i}))}$. In this case, from statement (ii) of Corollary 2, $b_i \in \bar{\mathcal{S}}_i$ such that $f_i > 0$ and $f_i' < 0$, for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. Thus,

$$\frac{d b_i}{d \sigma_i} = \frac{-f_i'^2 + f_i'' f_i}{a_i f_i'^2} < 0. \quad (8.35)$$

Hence, $b_i$ is strictly decreasing in $\sigma_i(\lambda^R_{-i})$.

Case 3: $b_i = \mu^R_i$. Since $b_i \in \mathcal{S}_i = [0, \mu^R_i]$, $b_i$ either decreases or remains constant with an increase in $\sigma_i(\lambda^R_{-i})$. $\square$

Proof of Theorem 8 [Analytic bounds on PNE Inefficiency]

We first establish the analytic upper bound on the PoA, followed by upper bounds on $\eta_{TRI}$ and $\eta_{LI}$. Let $\mathcal{G}$ be the family of CPR games parameterized by the ratios of the maximum service and review admission rates of each player $i \in \mathcal{N}$. Therefore, the CPR game $\Gamma \in \mathcal{G}$, with the corresponding ratio for player $i$ given by $h_i$. Define a set of homogeneous CPR games $\mathcal{G}^H \subset \mathcal{G}$, in which each player has a constant ratio $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} =: h$, and $\min_i \{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$. The superscript $H$ is used to distinguish the maximum service and review admission rates of the homogeneous game $\Gamma^H$ from those of the CPR game $\Gamma$. We obtain the bounds on the PoA by comparing the social utility of the heterogeneous CPR game with that of a corresponding homogeneous game in which all players have the largest heterogeneity measure, i.e., $h_i = h_N$, for every $i \in \mathcal{N}$. For any CPR game in $\mathcal{G}$, the PoA is given by

$$PoA = \frac{(\Psi)_{SW}}{(\Psi)_{PNE}} = \frac{\big[\sum_{i=1}^{N} \mu^S_i r_S + \sum_{i=1}^{N} \lambda^R_i f_i(x)\big]_{SW}}{\big[\sum_{i=1}^{N} \mu^S_i r_S + \sum_{i=1}^{N} \lambda^R_i f_i(x)\big]_{PNE}}. \quad (8.36)$$

We now provide an analytic upper bound on the PoA for the CPR game $\Gamma$ using the following lemmas.

Lemma 18 (PNE solution for homogeneous CPR game). For any homogeneous CPR game $\Gamma^H \in \mathcal{G}^H$, such that for each player $i \in \mathcal{N}$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h$ and $\min_i \{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$, each player participates in the review process with equal review admission rate $\lambda^H_i = \lambda^H$ at the PNE. Let $\lambda^H_T$ be the total review admission rate at the PNE for $\Gamma^H$. The unique PNE solution is given by $\lambda^H = \frac{\lambda^H_T}{N}$, where $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$ and $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T$.

Proof. For the homogeneous CPR game $\Gamma^H$, each player has an equal incentive $f(x)$ to review the tasks. If $f(x) \leq 0$ at the PNE, all players have $\lambda^H_i = 0$ (statement (ii) of Corollary 1), which contradicts Assumption (A3). Hence, at the PNE, each player has $\lambda^H_i > 0$. Recall that $\min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$ for $\Gamma^H$, and that at the PNE, $x > 0$. Let $\mathcal{D} \subseteq \mathcal{N}$ be a non-empty set of player indices such that for any $i \in \mathcal{D}$, $\{\mu^R_i\}^H \leq -\frac{f(x)}{f'(x)}$. At the PNE,

$$\lambda^H_T = \sum_{i \in \mathcal{D}} \{\mu^R_i\}^H + \sum_{i \in \mathcal{N}\setminus\mathcal{D}} -\frac{f(x)}{f'(x)} \geq N \min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{1+h}.$$

Therefore, at the PNE, $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T \leq \{\mu^S_T\}^H - (1+h) N \min_i\{\{\mu^R_i\}^H\} \leq 0$, which is a contradiction. Hence, $\mathcal{D}$ is an empty set and each player has an equal review admission rate at the PNE, given by $\lambda^H_i = \lambda^H = \frac{\lambda^H_T}{N}$.
Hence, each player, being a maximizer of her expected utility, maximizes

$$\tilde{u}_i = \{\mu^S_i\}^H r_S + \frac{\lambda^H_T}{N} f(x), \quad (8.37)$$

where $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T$. Setting $\frac{\partial \tilde{u}_i}{\partial \lambda^H_T} = 0$, we get $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$.

Lemma 19 (PoA = 1 for homogeneous CPR game). For any homogeneous CPR game $\Gamma^H \in \mathcal{G}^H$, such that for each player $i \in \mathcal{N}$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h$ and $\min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$, PoA = 1.

Proof. For the homogeneous CPR game $\Gamma^H$, the social welfare function $\Psi^H$ in (7.15) depends only on $\lambda^R_T$ and is given by

$$\Psi^H = \{\mu^S_T\}^H r_S + \lambda^R_T f(x), \quad (8.38)$$

where $x = \{\mu^S_T\}^H - (1+h)\lambda^R_T$, and $f(x)$ is the uniform incentive function for each player. Note that $\frac{d \Psi^H}{d \lambda^R_T} > 0$ when $\frac{d f}{d x} \leq 0$, and $\frac{d^2 \Psi^H}{d (\lambda^R_T)^2} < 0$ in the interval where $\frac{d f}{d x} > 0$. It is easy to show that $\Psi^H$ is maximized by any $\lambda^R_T$ satisfying $\frac{d \Psi^H}{d \lambda^R_T} = 0$, obtained by setting $\lambda^R_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$, with $\frac{d f}{d x} > 0$ at the maximizer. Let $\lambda^H_T$ be the total review admission rate at the PNE for $\Gamma^H$. The unique PNE satisfies $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$ (Lemma 18), and hence maximizes the social welfare utility, resulting in PoA = 1.

Corresponding to the CPR game $\Gamma$, construct a homogeneous game $\Gamma^H_N \in \mathcal{G}^H$ with $\{\mu^S_i\}^H = \mu^S_i$ and $\{\mu^R_i\}^H = \frac{\mu^S_i}{h_N}$, for each player $i \in \mathcal{N}$. Note that for the homogeneous players in $\Gamma^H_N$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h = h_N$, and $\sum_{i=1}^{N} \{\mu^S_i\}^H = \mu^S_T$. Furthermore, the assumption $\min_i\{\mu^S_i\} > \frac{\mu^S_T h_N}{N(1+h_N)}$ implies $\min_i\{\{\mu^R_i\}^H\} > \frac{\{\mu^S_T\}^H}{N(1+h)}$. Hence, $PoA = 1$ for $\Gamma^H_N$ (Lemma 19).

To obtain analytic bounds on the PoA, we now compute a lower bound on the social utility obtained at the unique PNE, $\Psi^\Gamma_{PNE}$, for the CPR game $\Gamma$. In Lemma 20, we show that the social utility obtained at the PNE of the homogeneous game $\Gamma^H_N$, $\Psi^H_{PNE}$, lower bounds $\Psi^\Gamma_{PNE}$. For any $x \in [0, \mu^S_T]$, homogeneous players with ratio $h_N$ in $\Gamma^H_N$ have a lower cumulative incentive ($\sum f$) to review tasks than the players in $\Gamma$, and therefore a lower social utility at the PNE, i.e., $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE}$. We further lower bound $\Psi^\Gamma_{PNE}$ by computing a lower bound on $\Psi^H_{PNE}$.

Lemma 20 (Lower bound on social welfare at PNE). Let $\Gamma^H_N$ be a homogeneous game corresponding to the CPR game $\Gamma$, with each player $i \in \mathcal{N}$ having $\{\mu^S_i\}^H = \mu^S_i$ and $\{\mu^R_i\}^H = \frac{\mu^S_i}{h_N}$. Let $\Psi^\Gamma_{PNE}$ and $\Psi^H_{PNE}$ be the social welfare functions for $\Gamma$ and $\Gamma^H_N$, respectively, evaluated at their unique PNEs. Then $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x})$, where $\bar{x}$ is the unique maximizer of $f_i$, i.e., $\frac{d f_i}{d x}(\bar{x}) = 0$.

Proof. Let $\lambda^* = [\lambda^*_1, \dots, \lambda^*_N]$ (Corollary 1) and $\lambda^H = [\lambda^H, \dots, \lambda^H]$ (Lemma 18) be the unique PNEs of the CPR games $\Gamma$ and $\Gamma^H_N$, respectively. Let $x^* = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^*_i$ and $x^H = \mu^S_T - N a_N \lambda^H$ be their slackness parameters at the respective PNEs.

Step 1: We show that $x^H \geq x^*$ using contradiction. Let $x^* > x^H$. Recall that at a PNE, $\frac{d f_i}{d x} > 0$ (Corollary 1). Using the strict concavity of $f_i$ (Lemma 15), we have $\frac{d f_i}{d x}(x^H) > \frac{d f_i}{d x}(x^*) > 0$. Therefore, $\frac{f_N(x^*)}{\frac{d f_N}{d x}(x^*)} > \frac{f_N(x^H)}{\frac{d f_N}{d x}(x^H)} = a_N \lambda^H_N$. Recall that $f_1(x) \geq \dots \geq f_N(x)$ for any $x$, and that $\frac{d f_i}{d x} = \frac{d r_R(x)}{d x}(1 - p(x)) - r_R(x) \frac{d p(x)}{d x}$ is independent of $i$. Hence, $\frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)} \geq \frac{f_N(x^*)}{\frac{d f_N}{d x}(x^*)} > a_N \lambda^H_N$, for any $i$. Using $h_N \geq h_i$, we get $a_i \mu^R_i = \mu^S_i + \frac{\mu^S_i}{h_i} \geq \mu^S_i + \frac{\mu^S_i}{h_N} = a_N \{\mu^R_i\}^H > a_N \lambda^H_N$. Therefore, $a_i \lambda^*_i = \min\big\{\frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)}, a_i \mu^R_i\big\} > a_N \lambda^H_N$, for any $i$. Hence, $\sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$ and $x^* < x^H$, which is a contradiction. Therefore, $x^H \geq x^*$ (equivalently, $\sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$) and $\frac{d f_i}{d x}(x^*) \geq \frac{d f_i}{d x}(x^H) > 0$.
Step 2: We show that $\sum_i f_i(x^*) \geq N f_N(x^H)$ and $\sum_{i=1}^{N} \lambda^*_i \geq N \lambda^H_N$. Let $d \leq N$ be the support of $\lambda^*$. Therefore, $\lambda^*_i = 0$ for any $i > d$, and $\lambda^*_i = \min\big\{\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)}, \mu^R_i\big\}$ for every $i \leq d$. Therefore,

$$\sum_{i=1}^{d} \frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)} \geq \sum_{i=1}^{d} a_i \lambda^*_i = \sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N = \frac{N f_N(x^H)}{\frac{d f_N}{d x}(x^H)}.$$

Using $\frac{d f_i}{d x}(x^*) \geq \frac{d f_N}{d x}(x^H) > 0$ (and that $\frac{d f_i}{d x}$ is independent of $i$), we get $\sum_{i=1}^{d} f_i(x^*) \geq N f_N(x^H)$. Additionally, since $a_i \leq a_N$, $a_N \sum_{i=1}^{N} \lambda^*_i \geq \sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$ implies $\sum_{i=1}^{d} \lambda^*_i \geq N \lambda^H_N$.

Step 3: We show that $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE}$. Using $h_N \geq h_i$, we have $\mu^R_i = \frac{\mu^S_i}{h_i} \geq \frac{\mu^S_i}{h_N} = \{\mu^R_i\}^H > \lambda^H_N$. Let $d_1 \leq d$ be the largest player index satisfying $\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)} > \lambda^H_N$. Since $\frac{f_N(x^*)}{a_N \frac{d f_N}{d x}(x^*)} < \lambda^H_N$, $d_1 < N$. Therefore, $\lambda^*_i$ satisfies

$$\lambda^*_i = \begin{cases} \min\big\{\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)}, \mu^R_i\big\} > \lambda^H_N, & \text{for } i \leq d_1, \\ \frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)} \leq \lambda^H_N, & \text{for } d_1 + 1 \leq i \leq d. \end{cases} \quad (8.39)$$

Hence,

$$\Psi^\Gamma_{PNE} = \mu^S_T r_S + \sum_{i=1}^{d} \lambda^*_i f_i(x^*) \stackrel{(1^*)}{=} \mu^S_T r_S + \lambda^H_N \sum_{i=1}^{d} f_i(x^*) + \sum_{i=1}^{d_1} (\lambda^*_i - \lambda^H_N) f_i(x^*) + \sum_{i=d_1+1}^{d} (\lambda^*_i - \lambda^H_N) f_i(x^*)$$
$$\stackrel{(2^*)}{\geq} \mu^S_T r_S + \lambda^H_N N f_N(x^H) + \sum_{i=1}^{d_1} (\lambda^*_i - \lambda^H_N) f_{d_1+1}(x^*) + \sum_{i=d_1+1}^{d} (\lambda^*_i - \lambda^H_N) f_{d_1+1}(x^*)$$
$$= \mu^S_T r_S + \lambda^H_N N f_N(x^H) + f_{d_1+1}(x^*) \sum_{i=1}^{d} (\lambda^*_i - \lambda^H_N) \stackrel{(3^*)}{\geq} \mu^S_T r_S + \lambda^H_N N f_N(x^H) = \Psi^H_{PNE},$$

where $(1^*)$ follows by adding and subtracting $\lambda^H_N \sum_{i=1}^{d} f_i(x^*)$; $(2^*)$ follows from $\sum_{i=1}^{d} f_i(x^*) \geq N f_N(x^H)$ (Step 2), (8.39), and the fact that $f_1(x^*) \geq \dots \geq f_N(x^*)$; and $(3^*)$ follows by recalling that $\sum_{i=1}^{d} \lambda^*_i \geq N \lambda^H_N$ (Step 2).

Step 4: We show that $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x})$. Let $\bar{x}$ be the unique maximizer of $f_i$. From Lemma 19, $\Psi^H_{PNE} = \Psi^H_{SW} = \max\{\Psi^H\} \geq \mu^S_T r_S + \lambda^R_T f_N(\mu^S_T - a_N \lambda^R_T)$ for any $\lambda^R_T$. Choosing $\lambda^R_T = \frac{\mu^S_T - \bar{x}}{a_N}$, we obtain the desired bound.

Proof of Theorem 8: The global optimum of the social welfare function is upper bounded by

$$\Psi^\Gamma_{SW} = \mu^S_T r_S + \max_{\lambda^R_i} \Big\{\sum_{i=1}^{N} \lambda^R_i f_i(x)\Big\} \leq \mu^S_T r_S + \max_{\lambda^R_i} \big\{\lambda^R_T \max_x\{f_i(x)\}\big\} \stackrel{(1^*)}{\leq} \mu^S_T r_S + \mu^S_T \max_x \{f_i(x)\} \leq \mu^S_T \big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big), \quad (8.40)$$

where $(1^*)$ is obtained using the system constraint $x > 0$, which implies $\mu^S_T \geq \sum_{i=1}^{N} a_i \lambda^R_i \geq \lambda^R_T$. From Lemma 20, $\Psi^\Gamma_{PNE}$ is lower bounded by

$$\Psi^\Gamma_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x}) = \frac{1}{a_N}\big(\bar{x}\, a_N r_S + (\mu^S_T - \bar{x})\big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big)\big) \geq \frac{1}{a_N}(\mu^S_T - \bar{x})\big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big). \quad (8.41)$$

Using (8.40) and (8.41), we get

$$PoA = \frac{\Psi^\Gamma_{SW}}{\Psi^\Gamma_{PNE}} \leq \frac{\mu^S_T\, a_N}{\mu^S_T - \bar{x}}.$$

Now we establish the bounds on $\eta_{TRI}$ and $\eta_{LI}$. Let $x_{PNE}$ and $x_{SW}$ be the slackness parameters corresponding to the PNE and the social welfare optimum, respectively. Recall that $\frac{d f_i}{d x} > 0$ (Corollary 1) for $x \in \{x_{PNE}, x_{SW}\}$. Hence, using the strict concavity of $f_i$, we have $x_{PNE}, x_{SW} \in (0, \bar{x})$. Therefore, $\mu^S_T - \bar{x} < \sum_{i=1}^{N} a_i \{\lambda^R_i\}_{PNE} < \mu^S_T$ and $\mu^S_T - \bar{x} < \sum_{i=1}^{N} a_i \{\lambda^R_i\}_{SW} < \mu^S_T$. Hence, $\eta_{TRI}$ and $\eta_{LI}$ are upper bounded by

$$\eta_{TRI} = \frac{(\lambda^R_T)_{SW}}{(\lambda^R_T)_{PNE}} < \frac{\mu^S_T\, a_N}{(\mu^S_T - \bar{x})\, a_1}, \quad \text{and} \quad \eta_{LI} = \frac{\big(\sum_{i=1}^{N} a_i \lambda^R_i\big)_{PNE}}{\big(\sum_{i=1}^{N} a_i \lambda^R_i\big)_{SW}} < \frac{\mu^S_T}{\mu^S_T - \bar{x}}. \quad \square$$
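The analytic bounds of Theorem 8 are straightforward to evaluate once the maximizer $\bar{x}$ of the common term $r_R(x)(1 - p(x))$ is known. The short Python sketch below evaluates the three bounds for assumed illustrative choices of $r_R$, $p$, and the player rates; these choices are not the ones used in Chapter 7.

import numpy as np

# Evaluate the PoA, eta_TRI, and eta_LI upper bounds of Theorem 8 under the same
# assumed functional forms and rates used in the earlier sketches.
rS = 0.2
muS = np.array([1.0, 1.2, 1.5]); muR = np.array([0.8, 0.7, 0.5])
h = muS / muR; a = 1.0 + h; muS_T = muS.sum()
order = np.argsort(h)                       # index players so that h_1 <= ... <= h_N
a1, aN = a[order[0]], a[order[-1]]

g = lambda x: (2.0 - 0.5 * x) * (1.0 - np.exp(-x))   # assumed rR(x)*(1 - p(x))
xs = np.linspace(1e-4, muS_T, 200001)
x_bar = xs[np.argmax(g(xs))]                # maximizer x_bar of f_i (independent of i)

poa_bound = aN * muS_T / (muS_T - x_bar)
eta_tri_bound = (muS_T * aN) / ((muS_T - x_bar) * a1)
eta_li_bound = muS_T / (muS_T - x_bar)
print("x_bar ~ %.3f, PoA <= %.3f, eta_TRI < %.3f, eta_LI < %.3f"
      % (x_bar, poa_bound, eta_tri_bound, eta_li_bound))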