OPTIMAL & GAME THEORETIC FEEDBACK DESIGN FOR EFFICIENT HUMAN PERFORMANCE IN HUMAN-SUPERVISED AUTONOMY By Piyush Gupta A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy 2023 ABSTRACT Human-in-the-loop systems play a pivotal role in numerous safety-critical applications, ensur- ing both safety and efficiency in complex operational environments. However, these systems face a significant challenge stemming from the inherent variability in human performance, influenced by factors such as workload, fatigue, task learning, expertise, and individual dif- ferences. Therefore, effective management of human cognitive resources is paramount in designing efficient human-in-the-loop systems. To address this challenge, it is critical to design robust and adaptive systems capable of continuously adapting models of human performance, and subsequently providing tailored feedback to enhance it. Effective feedback mechanisms play a pivotal role in improving the overall system performance by optimizing human workload, fostering skill development, and facilitating efficient collaboration among individuals within diverse human teams, each with their unique skill sets and expertise. In this dissertation, the primary focus lies in exploring optimal and game-theoretic ap- proaches for feedback design to enhance system performance, particularly in scenarios where humans are integral components. We begin by studying the problem of optimal fidelity selec- tion for a human operator servicing a stream of homogeneous tasks, where fidelity refers to the degree of exactness and precision while servicing the task. Initially, we assume a known human service time distribution model, later relaxing this assumption. We design a human decision support system that recommends optimal fidelity levels based on the operator’s cognitive state and queue length. We evaluate our methods through human experiments involving participants searching for underwater mines. We extend the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. This extension involves the development of a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimization seeks to avoid these regions due to poor worst-case performance. To address this trade-off, we introduce an efficient Determin- istic Sequencing of Exploration and Exploitation (DSEE) algorithm for model-based rein- forcement learning. DSEE interleaves exploration and exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. In addition to cognitive resource management, enhancing human performance can also be achieved through tutoring for skill development. In this context, we study the impact of evaluative feedback on human learning in sequential decision-making tasks. We conduct experiments on Amazon Mechanical Turk, where participants engage with the Tower of Hanoi puzzle and receive AI-generated feedback during their problem-solving. We examine how this feedback influences their learning and skill transfer to related tasks. 
Additionally, we explore computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expand our focus from a single human operator to a team of heterogeneous agents, each with diverse skill sets and expertise. Within this context, we delve into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leverages a game theoretic framework, where we design utility functions to incentivize decentralized collaboration among these agents. Copyright by PIYUSH GUPTA 2023 This dissertation is dedicated to my beloved family. v ACKNOWLEDGEMENTS This dissertation would not have been possible without the support and encouragement of several people. First and foremost, my advisor, Dr. Vaibhav Srivastava, deserves special recognition. He has been there for me every step of the way, guiding me and supporting me in my academic journey. He has helped me improve my writing and analytical skills significantly. I am thankful for his unwavering support and for always making time for me during my PhD. I would also like to extend my gratitude to my graduate committee members, Dr. Xiaobo Tan, Dr. Hayder Radha, and Dr. Vishnu Boddeti, for their constant encouragement and guidance throughout my time in graduate school. They have played a significant role in my academic development. I extend my gratitude to Dr. Shaunak D. Bopardikar for his unwavering support during my PhD journey and invaluable insights into common resource pool games and analysis. Additionally, I would like to thank Dr. Joshua E. Siegel for our engaging discussions on various aspects of autonomous vehicles. These discussions inspired me to pursue internship opportunities at the Honda Research Institute (HRI), US, in 2022-2023. During my three internships at HRI, I had the privilege of working on problems related to interaction-aware planning and applying the skills and knowledge I gained during my PhD in the field of autonomous driving. I am deeply appreciative of my mentors, Sangjae Bae and David Isele from HRI, for their unwavering support throughout these internships. Their guidance not only broadened my research horizons but also played a significant role in refining my coding and writing abilities. I would like to express my gratitude to my friends and lab mates, Ankur, Dong Hae, Pradyumna Reddy, and Abhishek, for the enduring sense of camaraderie in the lab and the close friendships we’ve cultivated. Additionally, I want to thank my friends in California, Faizan Tariq, Sophia Newton, Aakriti Kumar, Snehal Dikhale, Faizan Siddiqui, and Avinash Singh with whom I formed a special bond during my internships at HRI. Our weekend hiking adventures, shared laughter, and the wisdom I gained from their experiences have continually vi inspired me to aim higher in life. No amount of words can adequately convey my gratitude for the unwavering love and wholehearted support of my family. I want to express my heartfelt thanks to my parents for instilling in me the values that have guided my path. Each day, I gain a deeper understanding of the sacrifices they made while raising me. My elder sister has been a constant source of guidance and support throughout the years, and from my younger brother, I have learned the importance of taking breaks and finding positivity during tough times. I am immensely grateful for their unwavering presence during challenging moments. 
My family has been a guiding light in my life, inspiring me on my journey. I can never thank them enough for everything they’ve done for me. I am deeply grateful to both my high school mathematics teacher, Mr. Sanjay Singh, and my IIT Delhi professor, Dr. Supreet Singh Bahga, for their early influence that ignited my appreciation for mathematics. They have collectively been unwavering sources of support and inspiration throughout my academic journey. I want to express my heartfelt thanks to both of them for consistently being there to provide guidance whenever I needed it. The work reported in this dissertation was supported in parts by the NSF awards IIS- 1734272, CMMI-1940950, ECCS-2024649, and ONR award N00014-22-1-2813. vii TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Organization and Contribution . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 OPTIMAL FIDELITY SELECTION FOR HUMAN-IN-THE- LOOP QUEUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 2.2 Numerical Illustration of Optimal Fidelity Selection . . . . . . . . . . . . 2.3 Structural Properties of the Optimal Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Conclusions and Future Directions CHAPTER 3 OPTIMAL FIDELITY SELECTION FOR IMPROVED PERFORMANCE IN HUMAN-IN-THE-LOOP QUEUES FOR UNDERWATER SEARCH . . . . . . . . . . . . . . . . . . . . . . 3.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 3.2 Human Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 4 ROBUST AND ADAPTIVE FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES . . . . . . . . . . . . . . . . . 4.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 4.2 Convergence of Robust Adaptive SMDP . . . . . . . . . . . . . . . . . . 4.3 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 5 DETERMINISTIC SEQUENCING OF EXPLORATION AND EXPLOITATION FOR REINFORCEMENT LEARNING . . . . . 5.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 5.2 DSEE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Analysis of DSEE algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 6 FOSTERING HUMAN LEARNING IN SEQUENTIAL DECISION-MAKING: UNDERSTANDING THE ROLE OF EVALUATIVE FEEDBACK . . . . . . . . . . . . . . . . . . . 6.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Human Experiments 6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 7 INCENTIVIZING COLLABORATION IN HETEROGENEOUS TEAMS VIA COMMON-POOL RESOURCE GAMES . . . . . . 7.1 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . 
1 4 10 16 16 23 24 30 31 31 36 42 47 48 48 53 55 58 59 59 61 63 70 72 72 78 81 95 96 96 viii 7.2 Existence and Uniqueness of PNE . . . . . . . . . . . . . . . . . . . . . . 101 7.3 Convergence to the Nash Equilibrium . . . . . . . . . . . . . . . . . . . . 105 7.4 Social Welfare and Inefficiency of PNE . . . . . . . . . . . . . . . . . . . 106 7.5 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 . . . . . . . . . . . . . . . . . . . . . 111 7.6 Conclusions and Future Directions CHAPTER 8 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 APPENDIX A: CHAPTER 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 APPENDIX B: CHAPTER 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 APPENDIX C: CHAPTER 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 ix CHAPTER 1 INTRODUCTION Modern-day computers are getting faster and more efficient; allowing automation to perform complex tasks [1]. Nevertheless, in numerous safety-critical domains, human involvement remains essential to guarantee both safety and efficiency. These human-in-the-loop systems have become ubiquitous, finding application in fields such as search and rescue [2–4], semi- autonomous vehicle systems [5, 6], robot-assisted surgery [7], and flight control [2]. The role of humans within these systems is notably application-dependent, ranging from mere supervision to direct teleoperation of robots. For instance, in a search and rescue scenario [2], human operators might be tasked with remotely controlling robots to conduct search operations, aid in victim rescue efforts, and verify the findings of autonomous agents. Nonetheless, these systems encounter a noteworthy challenge rooted in the inherent vari- ability of human performance, which can be influenced by factors including workload, fa- tigue, task learning, expertise, and individual differences. Furthermore, the prevailing goal of maximizing the ratio of robots to human operators, often driven by cost considerations, frequently results in increased workload conditions for human operators [8]. As a conse- quence, the effective management of human cognitive resources becomes paramount in the design of efficient human-in-the-loop systems. To address this challenge, it is imperative to develop robust and adaptive systems that can continuously learn and adapt their models of human performance. These systems should utilize these learned models to actively provide customized feedback to humans, thus en- hancing their performance. Effective feedback mechanisms can elevate the overall system performance by optimizing the human workload, fostering the development of skills, and facilitating efficient collaboration among individuals within diverse human teams, each with their unique skill sets and expertise. This dissertation primarily focuses on the exploration of optimal and game-theoretic approaches for feedback design, with a specific emphasis on enhancing system performance, 1 particularly in scenarios where humans are integral components. The key objectives include: (i) Feedback design for managing human cognitive resources: A central aspect of this re- search focuses on designing feedback strategies that enable human operators to manage their cognitive resources efficiently. 
Specifically, these strategies offer optimal recom- mendations, based on the system state and cognitive state of the human operator, that optimize the utilization of human cognitive resources for improved system performance. (ii) Adaptive Algorithms for Robust Feedback under Model Uncertainty: Human perfor- mance is inherently agent-specific, influenced by a variety of factors such as workload, fatigue, and trust. Therefore, we delve into the design of robust and adaptable al- gorithms. These algorithms possess the ability to continuously learn and refine their model of human behavior, adapting and changing feedback based on updated human models. Importantly, the proposed adaptive recommendations optimize worst-case per- formance, and consequently, remain robust in the face of uncertainties in the human model. (iii) Role of feedback in fostering skill-development: Enhancing system performance can be achieved through effective tutoring and skill development in individuals performing the task. Therefore, we explore the role of feedback, extending beyond its immediate impact on current performance, to nurture skill development among human operators. Our objective is to use feedback as a means of providing effective tutoring, thereby fostering continuous growth and improvement over time. (iv) Feedback for Collaboration Across Skill sets: In scenarios where multiple humans with diverse skill sets are involved, we explore how feedback incentives can be harnessed to improve system performance. This involves achieving efficient team collaboration, ensuring that each individual’s unique expertise contributes to the collective success of the system. 2 To achieve these objectives, we begin by studying the problem of optimal fidelity selection for a human operator servicing a stream of homogeneous tasks, where fidelity refers to the degree of exactness and precision while servicing the task. Initially, we assume a known human service time distribution model, later relaxing this assumption. We design a human decision support system that recommends optimal fidelity levels based on the operator’s cognitive state and queue length. We evaluate our methods through human experiments involving participants searching for underwater mines. We extend the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. This extension involves the development of a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimizes worst-case performance and consequently avoids these regions. To address this trade-off, we introduce an efficient Deterministic Se- quencing of Exploration and Exploitation (DSEE) algorithm for model-based reinforcement learning (RL). DSEE interleaves exploration and robust exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. In addition to cognitive resource management, enhancing human performance can also be achieved through tutoring for skill development. In this context, we study the impact of evaluative feedback on human learning in sequential decision-making tasks. 
We conduct experiments on Amazon Mechanical Turk, where participants engage with the Tower of Hanoi puzzle and receive AI-generated feedback during their problem-solving. We examine how this feedback influences their learning and skill transfer to related tasks. Additionally, we explore computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expand our focus from a single human operator to a team of heterogeneous 3 agents, each with diverse skill sets and expertise. Within this context, we delve into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leverages a game-theoretic framework, where we design utility functions to incentivize decentralized collaboration among these agents. We show the existence of a unique Pure Nash Equilibrium (PNE) and establish the convergence of the best response dynamics to this unique PNE. Additionally, we establish an analytical upper bound on measures of PNE inefficiency, shedding light on the effectiveness of our collaborative strategies. In summary, this dissertation seeks to advance feedback design techniques, particularly in the context of human-in-the-loop systems, by introducing optimal and game-theoretic approaches. By focusing on enhancing human performance, fostering skill development, and optimizing collaboration, this research aims to make significant contributions to the design and implementation of more efficient and effective human-machine systems. 1.1 Literature review In this section, we review the literature in areas relevant to this dissertation. We organize the literature according to the broad topics of interest in this dissertation. 1.1.1 Control of Queues with State-dependent Servers We start by studying the problem of optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. We model it as a control of a queue problem, where the human acts as a server with its own cognitive dynamics. Recent years have seen significant efforts in integrating human knowledge and perception skills with autonomy [9]. A key research theme within this area concerns the systematic allocation of human cognitive resources for efficient performance. Therein, some of the fun- damental questions studied include optimal scheduling of the tasks to be serviced by the operator [10], enabling shorter operator reaction times by controlling the task release [11], and determining optimal operator attention allocation [12]. In contrast to the aforementioned works, we consider a semi-Markov decision process (SMDP) [13] formulation to deal with 4 general (non-memoryless) service time distributions of the human operator. Furthermore, while the above works propose heuristic algorithms, we focus on establishing the structural properties of the optimal policy. Some interesting recent studies with state-dependent queues are considered in [14, 15]. In these works, authors design scheduling policies that stabilize a queueing system and de- crease the utilization rate of a non-preemptive server that measures the proportion of time the server is working. The performance of the server degrades with the increase in server utilization and improves when the server is allowed to rest. In contrast to monotonic server performance with the utilization rate in [14, 15], we model the service time of the human operator as a unimodal function of its cognitive state. 
Our model for service time is inspired by experimental psychology literature [16] and incorporates the influence of cognitive state and fidelity level on service time. The optimal control of queueing systems [17] is a classical problem in queueing theory. Of particular interest are the works [18, 19], where authors study the optimal policies for an M/G/1 queue using an SMDP formulation and describe its qualitative features. In contrast to a standard control of queues problem, the server in our problem is a human operator with its cognitive dynamics that must be incorporated into the problem formulation. Our mathematical techniques to establish the structural properties of the optimal pol- icy are similar to [20]. In [20], the authors establish structural properties of an optimal transmission policy for transmitting packets over a point-to-point channel in communication networks. The optimal policy of their Markov decision process (MDP) [21] depends on the queue length, the number of packet arrivals, and the channel fading state. In [22], authors study structural properties of the optimal resource allocation policy for a single-queue sys- tem in which a central decision-maker assigns servers to each job. In contrast to [20, 22], a major challenge in our problem arises due to SMDP formulation for non-memoryless service time distribution and its unimodal dependence on the cognitive state. 5 1.1.2 Robust Control Policies with Uncertain Dynamics In this dissertation, we also study the optimal fidelity selection problem in the presence of uncertainty in human models. Specifically, we assume that the service time distribu- tion of the human operator is unknown a priori and therefore, formulate a robust adaptive SMDP. An SMDP accounts for the system uncertainty through probabilistic state transi- tions. However, the obtained policy is sensitive to errors in the stochastic models [23, 24]. The large uncertainty in the service time models, especially in the initial stage with limited observation data may lead to sub-optimal policies. Existing methods formulate such con- trol problems with model uncertainty as an MDP with an uncertain state transition model. In [25], authors propose a constrained MDP framework with risk constraints. They opti- mize Conditional Value-at-Risk (CVaR) [26] and propose an iterative offline algorithm to find the risk-constrained optimal control policy. In [27], a chance-constrained MDP [28] is proposed that provides a probabilistic framework for handling uncertainty in the transition probabilities. Robust MDPs are studied in [29, 30] that solve for policies that optimize a min-max criterion when the unknown stochastic model is assumed to lie within an uncer- tainty set. The policies obtained from solving robust MDPs can be overly conservative when implemented on the nominal system [27]. In our work, we consider a human agent with non-memoryless service time distribution and thus formulate a robust adaptive SMDP to deal with general service time distributions [18, 19] and learn a policy that is robust in the transient learning phase and converges to the optimal policy asymptotically. In this dissertation, we show that the solution of the synchronous and asynchronous value iteration (VI) methods [31] for robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. While there exists a convergence analysis for the robust [32, 33] and adaptive [34] MDPs, such an analysis is missing for robust adaptive SMDPs to the best of our knowledge. 
A key challenge that we address in comparison to MDPs is the time dependence of the robust adaptive Bellman operator for SMDPs that requires a careful comparison between optimal value functions for intermediate SMDPs at different time steps, 6 and the optimal value function for the uncertainty-free SMDP. 1.1.3 Efficient Algorithms for Model-Based Reinforcement Learning In this dissertation, we introduce an efficient DSEE algorithm for model-based RL. RL is used in solving complex sequential decision-making tasks in uncertain environments such as motion planning for robots [35, 36], personalized web services [37, 38], and the design of decision-support systems for human-supervisory control [39–41]. MDPs [31] provide a nat- ural framework for optimal decision-making under uncertainty and are used to model and solve numerous model-based RL problems. The objective of these problems is to simultane- ously learn the system model and the optimal policy. While MDP formulation accounts for environment uncertainty by using stochastic models, MDP policies are known to be sensitive to errors in these stochastic models [17, 42]. In many safety-critical systems, robust MDPs [29, 43] are used to mitigate performance degradation due to uncertainty in the learned MDP. However, to reduce the system uncer- tainty, the agent must explore the environment and visit parts of the state space associated with high estimation uncertainty. Most often, RL algorithms use simple randomized meth- ods to explore the environment, e.g. applying ϵ-greedy policies [44, 45] or adding random noise to continuous actions [46]. The objective of the robust MDPs conflicts with the ex- ploration objective. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while robust policy optimizes worst-case performance and consequently avoids these regions. Therefore, to balance the trade-off between learning the MDP and de- signing a robust policy, we design a DSEE algorithm, in which exploration and exploitation epochs of increasing lengths are interleaved. There exist efficient algorithms for solving RL problems with provable bounds on the sample complexity [47, Definition 1]. In [47], authors analyze the Model-based Interval Es- timation (MBIE) algorithm that applies confidence bounds to compute an optimistic policy and show that the algorithm is PAC-optimal [47, Definition 2]. They provide an upper bound on the algorithm’s sample complexity given by O (cid:16) |S|2|A| (cid:17) (1−γ)6ϵ3 log(δ−1) which is the maximum 7 number of time steps until when the MBIE policy is not ϵ−optimal with at least probability 1 − δ, where |S|, |A|, are the cardinality of the state space and action space, respectively, γ is the discount factor, and ϵ, δ ∈ (0, 1) are pre-defined constants. A similar bound on the sample complexity is obtained for the R-max algorithm [48] which distinguishes the “known" and “unknown" states based on how often they have been visited. It explores by acting to maximize rewards under the assumption that unknown states deliver the maximum reward. UCRL2 algorithm [49] relies on optimistic bounds on the reward functions and probability density functions and enjoys near-optimal regret bounds. A review of model-based RL al- gorithms with provable finite time guarantees can be found in [50, Chapter 38]. A major drawback of these algorithms is that they consider optimism in the face of uncertainty and hence, are not robust to the estimation uncertainties. 
Furthermore, these algorithms with random exploration might lead to a bad user experience in applications in which the RL agent seeks to learn human preferences for system optimization. To address these shortcomings, we propose a DSEE algorithm for model-based RL in which we design a deterministic sequence of exploration and exploitation epochs. The DSEE approach has been used in multi-arm bandit problems [51–54] and multi-robot coordination problems [55]. It allows for differentiation between exploration and exploitation epochs. The announced exploration may lead to a better user experience for the agents (especially for human agents) than random exploration at any time. For example, many personalized web services calibrate their recommendations intermittently by announced exploration, i.e., through surveys and user selection. Another advantage of the DSEE algorithm is that it allows for efficient exploration of the environment in multi-agent systems. Specifically, in multi-agent systems, exploration can be well-planned to cover all regions of the state-space through agent coordination which can be easily arranged due to the deterministic structure of exploration and exploitation. 8 1.1.4 Resource Sharing Games Another key focus of this dissertation is to incentivize collaboration in a team of het- erogeneous agents. We connect the class of problems involving human-team-supervised au- tonomy [3] with the common-pool resource (CPR) games [56, 57], and design utilities that yield the desired behavior. CPR games [56, 57] is a class of resource-sharing games in which players jointly manage a common pool of resource and make strategic decisions to maximize their utilities. Our CPR formulation has features similar to the CPR game studied in [57,58]. In these works, authors utilize prospect theory to capture the risk aversion behavior of the players investing in a fragile CPR [59] that fails if there is excessive investment in the CPR. In the case of CPR failure, no player receives any return from the CPR. While our design of the common review pool is similar to the fragile CPR, our failure model incorporates the constraint that only serviced tasks can be reviewed. In contrast to the agent heterogeneity due to prospect-theoretic risk preferences in [57], heterogeneity in our model arises due to differences in the agents’ mean service and review times. While we use human-team-supervised autonomy as a motivating example, our problem formulation can be applied to a broad range of problems involving tandem queues [60], where servicing and reviewing of tasks can be considered as the subsequent stages of the queueing network. Tandem queues are utilized to study problems such as resource allocation, inven- tory management, process optimization, and quality control [61]. Existing game theoretic approaches [62, 63] to service rate control in tandem queues assume that a single server is present at each stage of the tandem queue and each server has its independent resources. In contrast, in our setup, multiple heterogeneous agents allocate their time at different stations based on their skill sets and maximize the system throughput. Additionally, our mathe- matical techniques are applicable to many problems involving the dual-screening process. For example, in human-in-the-loop systems which are pervasive in areas such as search-and- rescue, semi-autonomous vehicle systems, surveillance, etc., humans often supervise (review) the actions (service) performed by the autonomous agents. 
In such settings, our framework 9 incentivizes collaboration among heterogeneous agents. Game-theoretic approaches have been utilized for problems in distributed control [64], wherein the overall system is driven to an efficient equilibrium strategy that is close to the social optimum through an appropriate design of utility functions [65]. Price of Anarchy (PoA) [66] is often used to characterize efficiency of the equilibrium strategies in a game. As- sociated analysis techniques utilize smoothness property of the utility functions [67], leverage submodularity of the welfare function [68], or solve an auxiliary optimization problem [69,70]. These approaches do not immediately apply to our setup. Instead, we follow a new line of analysis to obtain bounds on PoA by constructing a homogeneous CPR game, for which we show that the equilibrium strategy is also the social optimum (PoA=1), and relating its utility to the original game. 1.2 Organization and Contribution In this section, we present the organization of the remainder of the thesis and the con- tributions of the work in each chapter. Chapter 2: In this chapter, we study optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. The agent can service a task with a normal or high fidelity level, where fidelity refers to the degree of exactness and precision while servicing the task. Therefore, high-fidelity servicing results in higher-quality service but leads to larger service times and increased operator tiredness. We treat the cognitive state of the human operator as a lumped parameter that captures psychological factors such as workload, and fatigue. The service time distribution of the human operator depends on her cognitive dy- namics and the level of fidelity selected for servicing the task. Her cognitive dynamics evolve as a Markov chain in which the cognitive state increases with high probability whenever she is busy, and decreases while resting. The tasks arrive according to a Poisson process and each task waiting in the queue loses its value at a fixed rate. We address the trade-off be- tween high-quality service of the task and consequent loss in value of subsequent tasks using an SMDP framework. We numerically determine an optimal policy and the corresponding 10 optimal value function. Finally, we establish the structural properties of an optimal fidelity policy and provide conditions under which the optimal policy is a threshold-based policy. The material in this chapter is from [41] and [39]. The major contributions of this work are threefold. First, we pose the fidelity selection problem in an SMDP framework and compute an optimal policy. We formulate a control of queue problem, where in contrast to a standard queue, the server is a human operator with her own cognitive dynamics. Our model for service time distribution of the human operator incorporates the influence of cognitive state and fidelity level on the service time. Second, we numerically show the influence of cognitive dynamics on the optimal policy. In particular, we show that servicing the tasks with high fidelity is not always optimal due to larger service times and increased tiredness of the human operator. In fact, we determine the optimal policy as a function of the queue length as well as the cognitive state of the human operator. Our results provide insight into the efficient design of human decision support systems. 
Third, we establish structural properties of the optimal fidelity selection policy and provide sufficient conditions under which, for each cognitive state, there exist thresholds on queue lengths at which optimal policy switches fidelity levels. Chapter 3: In this chapter, we study the problem of optimal fidelity selection for a human operator performing an underwater visual search task. Human performance depends on various cognitive factors such as workload and fatigue. We perform human experiments in which participants perform two tasks simultaneously: a primary task, which was subject to evaluation, and a secondary task to estimate their workload. The primary task requires participants to search for underwater mines in videos, while the secondary task involves a simple visual test where they respond when a green light displayed on the side of their screens turns red. Videos arrive as a Poisson process and are stacked in a queue to be serviced by the human operator. The operator can choose to watch the video with either normal or high fidelity, with normal fidelity videos playing at three times the speed of high fidelity ones. Participants receive rewards for their accuracy in mine detection for each primary task and 11 penalties based on the number of videos waiting in the queue. We consider the workload of the operator as a hidden state and model the workload dynamics as an Input-Output Hidden Markov Model (IOHMM). We use a Partially Observable Markov Decision Process (POMDP) to learn an optimal fidelity selection policy, where the objective is to maximize total rewards. Our results demonstrate improved performance when videos were serviced based on the optimal fidelity selection policy compared to a baseline where humans chose the fidelity level themselves. The material in this chapter is from [71]. The major contributions of this work are threefold. First, we address the optimal fidelity selection challenge by framing it as a control of queue problem, incorporating a hidden workload state. We employ IOHMM and POMDP to derive the optimal fidelity selection policy. Second, we compare the human fidelity selection policy with the optimal policy and draw valuable insights into human behavioral patterns. Third, we illustrate that by recommending the optimal policy, autonomous systems can effectively aid human decision- making, leading to a substantial improvement in system performance. Chapter 4: In this chapter, we relax the assumption of the known human service time model in the optimal fidelity selection problem. We assume the parameters of the human’s service time distribution depend on the selected fidelity level and her cognitive state and are assumed to be unknown a priori. These parameters are learned online through Bayesian parameter estimation. We formulate a robust adaptive SMDP to solve our optimal fidelity selection problem. We extend the results on the convergence of robust-adaptive MDP to robust-adaptive SMDPs and show that the solution of the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Furthermore, we numerically illus- trate the convergence of the synchronous and asynchronous robust adaptive policy to the uncertainty-free optimal policy. The material in this chapter is from [17]. The major contributions of this work are fourfold. First, we pose the optimal fidelity se- lection problem with uncertain human service time distribution in a robust adaptive SMDP framework. 
Second, we continuously improve the service time distribution estimates using 12 Bayesian parametric estimation [72] and utilize it to obtain a robust policy. Third, we for- mally show that the solution of both synchronous and asynchronous value iteration methods for the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Fourth, we provide numerical illustrations that show the convergence of the robust adaptive SMDP solution to the uncertainty-free SMDP. Chapter 5: In this chapter, we propose a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., an MDP, and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates for expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub- linear function of time. The material in this chapter is from [73]. The major contributions of this work are twofold: (i) we propose a DSEE algorithm for model-based RL problems and (ii) we design the lengths of the exploration and exploitation epochs such that the cumulative regret for the DSEE algorithm grows as a sub-linear function of time. Chapter 6: In this chapter, we investigate the role of feedback in fostering human learning in sequential decision-making tasks. Cognitive rehabilitation, STEM skill acquisi- tion, and coaching games such as chess often require tutoring decision-making strategies. The advancement of AI-driven tutoring systems for facilitating human learning requires an understanding of the impact of evaluative feedback on human decision-making and skill de- velopment. To this end, we conduct human experiments using Amazon Mechanical Turk to study the influence of evaluative feedback on human decision-making in sequential tasks. In these experiments, participants solve the Tower of Hanoi puzzle and receive AI-generated feedback while solving it. We examine how this feedback affects their learning and skill transfer to related tasks. We also explore various computational models to understand how 13 people incorporate evaluative feedback into their decision-making processes. The material in this chapter is from [74]. There are three major contributions of this work. (i) We investigate the impact of different evaluative feedback strategies on the performance of individuals learning to solve ToH, a widely studied sequential decision-making task. Furthermore, we explore how individuals trained with different feedback strategies transfer their skills to a more challenging task. (ii) Treating humans as noisy optimal agents, we study how various evaluative feedback strategies affect their reward functions. Our research highlights the influence of different forms of evaluative feedback on the implicit reward structure that explains human decisions. (iii) We create a set of candidate computational models that may explain how humans integrate evaluative feedback into their sequential decision-making processes. Our goal is to identify the model that best explains human decision-making under evaluative feedback conditions. 
Chapter 7: In this chapter, we consider a team of heterogeneous agents that is collec- tively responsible for servicing, and subsequently reviewing, a stream of homogeneous tasks. Each agent has an associated mean service time and a mean review time for servicing and reviewing the tasks, respectively. Agents receive a reward based on their service and review admission rates. The team objective is to collaboratively maximize the number of “serviced and reviewed" tasks. We formulate a Common-Pool Resource (CPR) game and design utility functions to incentivize collaboration among heterogeneous agents in a decentralized manner. We show the existence of a unique Pure Nash Equilibrium (PNE), and establish convergence of best response dynamics to this unique PNE. Finally, we establish an analytic upper bound on three inefficiency measures of the PNE, namely the price of anarchy (PoA), the ratio of the total review admission rate (TRI), and the ratio of latency (LI). The material in this chapter is from [75] and [76]. The major contributions of this work are fivefold. First, we present a novel formula- tion of team backup behavior and design incentives, within the CPR game formalism, to 14 facilitate such behavior. Second, we show that there exists a unique PNE for the proposed game. Third, we show that the proposed game is a best response potential game as defined in [77], for which both sequential best response dynamics [78] and simultaneous best reply dynamics [79] converge to the PNE. Thus, the best response of self-interested agents in a decentralized team converge to the PNE. Fourth, we provide the structure of the social wel- fare solution and numerically quantify different measures of the inefficiency for the PNE, namely the PoA, the ratio of the total review admission rate (TRI), and the ratio of latency (LI), as a function of a measure of heterogeneity. While PoA is a widely used inefficiency metric, we define TRI and LI as other relevant measures for our setup based on the total review admission rate and latency (inverse of throughput), respectively. Finally, we provide an analytic upper bound for all three measures of the inefficiency. 15 CHAPTER 2 OPTIMAL FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES In this chapter, we study optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. The agent can service a task with a normal or high fidelity level, where fidelity refers to the degree of exactness and precision while servicing the task. Therefore, high-fidelity servicing results in higher-quality service but leads to larger service times and increased operator tiredness. We treat the cognitive state of the human operator as a lumped parameter that captures psychological factors such as workload, and fatigue. The service time distribution of the human operator depends on her cognitive dynamics and the level of fidelity selected for servicing the task. Her cognitive dynamics evolve as a Markov chain in which the cognitive state increases with high probability whenever she is busy, and decreases while resting. The tasks arrive according to a Poisson process and each task waiting in the queue loses its value at a fixed rate. We address the trade-off between high-quality service of the task and consequent loss in value of subsequent tasks using an SMDP framework. We numerically determine an optimal policy and the corresponding optimal value function. 
Finally, we establish the structural properties of an optimal fidelity policy and provide conditions under which the optimal policy is a threshold-based policy. 2.1 Background and Problem Formulation We now discuss our problem setup, formulate it as an SMDP, and solve it to obtain an optimal fidelity selection policy. 2.1.1 Problem Setup We consider a human supervisory control system in which a human operator is servicing a stream of homogeneous tasks. The human operator may service these tasks with different levels of fidelity. The servicing time of the operator depends on the fidelity level with which she services the task as well as her cognitive state. We assume that the mean service time of the operator increases with the selected fidelity level. For example, when the operator 16 Figure 2.1 Overall schematic of the problem setup. The incoming tasks arrive as a Poisson process with rate λ. The tasks are serviced by the human operator based on the recommended fidelity level by the decision support system. Each task loses its value at a fixed rate while waiting in the queue. services the task with high fidelity, she may look into deeper details of the task, and conse- quently take a longer time to service. In addition to the fidelity level, the human service time may depend on their cognitive state. We treat the cognitive state as a lumped parameter that can capture various physiological measures. It can be a function of stress, workload, arousal rate, operator utilization ratio, etc. Such lumped representation can be obtained by classifying these psychological measure- ments into different service time distribution parameters. Inspired by the Yerkes-Dodson law, for a fixed level of fidelity, we model the service time as a unimodal function of the human cognitive state. Specifically, the mean service time is minimal corresponding to an intermediate optimal cognitive state (later referred to as the optimal cognitive state cog∗) as shown in Fig. 2.2c. We are interested in the optimal fidelity selection policy for the human operator. To this end, we formulate a control of queue problem, where in contrast to a standard queue, the server is a human operator with her cognitive dynamics. The incoming tasks arrive according to a Poisson process at a given rate λ ∈ R>0 and are serviced by the operator based on the fidelity level recommended by a decision support system (Fig. 2.1). We consider a dynamic queue of homogeneous tasks with a maximum capacity L ∈ N. The operator is penalized for each task waiting in the queue at a constant rate c ∈ R>0 per unit delay in its servicing. The set of possible actions available for the operator corresponds to (i) Waiting (W), when the queue is empty, (ii) Resting (R), which allows the operator to rest and reach the optimal 17 cognitive state, (iii) Skipping (S), which allows the operator to skip a task to reduce the queue length and thereby focus on newer tasks, (iv) Normal Fidelity (N) for servicing the task with normal fidelity, and (v) High Fidelity (H) for servicing the task more carefully with high precision. The skipping action ensures the stability of the queue by allowing the operator to reduce the queue length by skipping some tasks. Ideally, through appropriate control of the arrival rate, the system designer should ensure that skipping is not an optimal action. Let s ∈ S be the state of the system and As be the set of admissible actions in state s, which we define formally in Section 2.1.2. 
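To make the unimodal, Yerkes–Dodson-inspired service-time model described above concrete, the following sketch generates a hypothetical mean service time that is minimized at an intermediate cognitive state cog* and inflated for high-fidelity servicing. The quadratic form and all numerical constants are illustrative assumptions, not the hypergeometric parameterization used later in the numerical results.

```python
import numpy as np

def mean_service_time(cog, fidelity, cog_star=0.6):
    """Hypothetical unimodal mean service time (arbitrary time units).

    cog      : cognitive state in [0, 1].
    fidelity : 'N' (normal) or 'H' (high); high fidelity takes longer.
    cog_star : cognitive state at which servicing is fastest.
    """
    base = 4.0 if fidelity == 'H' else 2.0              # high fidelity is slower
    return base * (1.0 + 3.0 * (cog - cog_star) ** 2)   # minimum at cog_star

for cog in np.linspace(0.0, 1.0, 6):
    print(f"cog={cog:.1f}  normal={mean_service_time(cog, 'N'):.2f}  "
          f"high={mean_service_time(cog, 'H'):.2f}")
```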
The human receives a reward $r : S \times A_s \to \mathbb{R}_{\geq 0}$ defined by
$$
r(s, a) =
\begin{cases}
r_H, & \text{if } a = H, \\
r_N, & \text{if } a = N, \\
0, & \text{if } a \in \{W, R, S\},
\end{cases}
\qquad (2.1)
$$
where $r_H, r_N \in \mathbb{R}_{\geq 0}$ and $r_H > r_N$. We intend to design a decision support system that assists the operator by recommending the optimal fidelity level to service each task¹. The recommendation is based on the queue length and the operator's cognitive state, to which we assume real-time access using, e.g., Electroencephalogram (EEG) measurements (see [80] for measures of cognitive load from EEG data) or eye-tracking and pupillometry [81]. We assume that the noisy data from these devices can be clustered into a finite number of bins to estimate the cognitive state. We study the optimal policy under perfect knowledge of the cognitive state².

¹We assume compliance of the operator with the recommendations. To account for non-compliance, we can introduce p as the probability of compliance and 1 − p as the probability that the operator will deviate and follow a different behavioral policy. This deviation can be incorporated by using a mixed service time distribution with probabilities p and 1 − p for the recommended and behavioral actions, respectively.
²If the cognitive state is not perfectly known, then our policy can be used within algorithms such as QMDP [82] to derive approximate solutions to the associated partially observable Markov decision process [83].

Figure 2.2 Service time distribution of the human operator with (a) varying cognitive state and high fidelity, and (b) varying action and fixed cognitive state, cog = 0.9. (c) The mean and variance of the service time distribution are unimodal functions of the cognitive state. (d) The mean sojourn time takes on different forms based on the selected action.

Action   Forward probability^a (λ_f δt)   Backward probability^b (λ_b δt)   Stay probability^c (1 − λ_f δt − λ_b δt)
W        λ_f = 0.02 (noise)               λ_b = 0.5                         1 − 0.52 δt
R        λ_f = 0.02 (noise)               λ_b = 0.5                         1 − 0.52 δt
N        λ_f = 0.6                        λ_b = 0.02 (noise)                1 − 0.62 δt
H        λ_f = 1.1                        λ_b = 0.02 (noise)                1 − 1.12 δt
S        λ_f = 0                          λ_b = 0                           1
^a The forward probability does not exist for cog = 1 (reflective boundary). ^b The backward probability does not exist for cog = 0 (reflective boundary). ^c The stay probability is 1 − λ_f δt for cog = 0 and 1 − λ_b δt for cog = 1.
Table 2.1 Cognitive dynamics modeled as a Markov chain.

2.1.2 Mathematical Modeling

We formulate the control of queue problem as a discrete-time SMDP Γ defined by the following six components:

(i) A finite state space $S := \{(q, cog) \mid q \in \{0, 1, \dots, L\},\ cog \in C := \{i/N\}_{i \in \{0, \dots, N\}}\}$, for some $N \in \mathbb{N}$, where q is the queue length and cog represents the lumped cognitive state, which increases (decreases) when the operator is busy (idle).

(ii) A set of admissible actions $A_s$ for each state $s \in S$, given by: (a) $A_s := \{W\}$ when the queue is empty (q = 0); (b) $A_s := \{R, S, N, H\}$ when the queue is non-empty and cog > cog∗, where cog∗ ∈ C is the optimal cognitive state associated with the minimum mean service time; and (c) $A_s := \{S, N, H\}$ when the queue is non-empty and cog ≤ cog∗.

(iii) A state transition distribution $P(s' \mid \tau, s, a)$ from state s to s′ for each action $a \in A_s$, conditioned on the discrete sojourn time $\tau \in \mathbb{R}_{>0}$ (the time spent in state s before transitioning into the next state s′). The state transition from s = (q, cog) to s′ = (q′, cog′) consists of two independent transition processes: (a) a Poisson process for the transition from q to q′, and (b) the human cognitive dynamics for the transition from cog to cog′. We model the cognitive dynamics of the human operator as a Markov chain in which, while servicing the task, the probability of an increase in the cognitive state in a small time $\delta t \in \mathbb{R}_{>0}$ is greater than the probability of a decrease; furthermore, the probability of an increase grows with the level of fidelity selected for servicing the task. Similarly, while waiting or resting, the probability of a decrease in the cognitive state in a small time δt is higher than the probability of an increase. Sample parameters of the model used in our numerical simulations are shown in Table 2.1. This model of cognitive state dynamics is a stochastic equivalent of the deterministic models of the utilization ratio considered in [11]. It is assumed that the cognitive state remains unchanged when the human operator chooses to skip the task. A simulation sketch of these dynamics is given after this list.
(iv) A sojourn time distribution $P(\tau \mid s, a)$ of the (discrete) time $\tau \in \mathbb{R}_{>0}$ spent in state s until the next action is chosen, which takes on different forms depending on the selected action (Fig. 2.2d). The sojourn time is the service time while servicing the task (normal or high fidelity), the resting time while resting, a constant skipping time $t_s \in \mathbb{R}_{>0}$ while skipping, and the time until the next task arrival while waiting in the case of an empty queue. We model the rest time as the time required to reach the optimal cognitive state cog∗ from the current cognitive state. In our numerical illustrations, we model the service time distribution while servicing the task using a hypergeometric distribution (Figs. 2.2a and 2.2b), where the parameters of the distribution are chosen such that the mean service time has the desired characteristics, i.e., it increases with the fidelity level (Fig. 2.2d) and is a unimodal function of the cognitive state (Fig. 2.2c). While resting, the sojourn time distribution is the first passage time (FPT) distribution for transitioning from the current cognitive state cog to cog∗. We determine this distribution using matrix methods [84] applied to the Markov chain used to model the cognitive dynamics. Finally, to ensure the stability of the queue, we assume that the constant skip time is less than $1/\lambda$, i.e., the queue length decreases on average while skipping tasks.

(v) For selecting action a in state s, the human receives the bounded reward r(s, a) defined in (2.1). Additionally, the human incurs a penalty at a constant cost rate c for each task waiting in the queue, and consequently, the cumulative expected cost for choosing action a in state s = (q, cog) is
$$
\sum_{\tau} P(\tau \mid s, a)\, c\, \tau\, \mathbb{E}\!\left[\frac{q + q'}{2} \,\middle|\, \tau, s, a\right]
= \sum_{\tau} P(\tau \mid s, a)\, c\, \tau \left(\frac{2q + \lambda \tau}{2}\right),
$$
which is obtained by using $\mathbb{E}[q \mid \tau, s, a] = q$ and $\mathbb{E}[q' \mid \tau, s, a] = q + \lambda \tau$. The expected net immediate reward received by the operator for selecting action a in state s is
$$
R(s, a) = r(s, a) - \sum_{\tau} P(\tau \mid s, a)\, c \left(\frac{2q + \lambda \tau}{2}\right) \tau
= r(s, a) - c\, \mathbb{E}[\tau \mid s, a]\, q - \frac{c \lambda}{2}\, \mathbb{E}[\tau^2 \mid s, a], \qquad (2.2)
$$
where $\mathbb{E}[\tau \mid s, a]$ and $\mathbb{E}[\tau^2 \mid s, a]$ are the first and second conditional moments of the sojourn time distribution, respectively. A numerical sketch of (2.2), together with the cognitive dynamics of Table 2.1, is given after this list.

(vi) A discount factor γ ∈ [0, 1), which we choose as 0.96 in our numerical illustration.
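Two pieces of the specification above lend themselves to a compact sketch: the cognitive-state Markov chain of item (iii) with the rates in Table 2.1, and the expected net immediate reward (2.2) of item (v), which only requires the first two moments of the sojourn time. The snippet below is a minimal illustration; the time step, bin count, reward values, penalty rate, and the discrete sojourn-time distribution used in the demo are hypothetical placeholders, not the hypergeometric model used in the numerical results.

```python
import random

# Forward/backward rates (per unit time) for each action, from Table 2.1.
RATES = {'W': (0.02, 0.5), 'R': (0.02, 0.5),
         'N': (0.6, 0.02), 'H': (1.1, 0.02), 'S': (0.0, 0.0)}

def step_cognitive_state(cog, action, dt=0.1, n_bins=10):
    """One delta-t step of the cognitive-state Markov chain on {0, 1/N, ..., 1}."""
    lam_f, lam_b = RATES[action]
    p_up = lam_f * dt if cog < 1.0 else 0.0     # reflective boundary at cog = 1
    p_down = lam_b * dt if cog > 0.0 else 0.0   # reflective boundary at cog = 0
    u = random.random()
    if u < p_up:
        return round(cog + 1.0 / n_bins, 10)
    if u < p_up + p_down:
        return round(cog - 1.0 / n_bins, 10)
    return cog                                  # stay in the current bin

def expected_net_reward(r_sa, q, lam, c, tau_pmf):
    """R(s,a) = r(s,a) - c E[tau] q - (c lam / 2) E[tau^2], cf. (2.2)."""
    m1 = sum(p * t for t, p in tau_pmf.items())        # E[tau | s, a]
    m2 = sum(p * t ** 2 for t, p in tau_pmf.items())   # E[tau^2 | s, a]
    return r_sa - c * m1 * q - 0.5 * c * lam * m2

# Demo: high-fidelity servicing drives the cognitive state up ...
cog = 0.5
for _ in range(100):
    cog = step_cognitive_state(cog, 'H')
# ... and the net reward for that choice at queue length q = 3 (hypothetical pmf).
tau_pmf = {2: 0.25, 3: 0.5, 4: 0.25}
print(round(cog, 1), expected_net_reward(r_sa=2.0, q=3, lam=0.5, c=0.1, tau_pmf=tau_pmf))
```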
Remark 1. Although we assume a finite skip time, an alternative approach is to incorporate a penalty for the skip action. Note that, unlike a fixed penalty, a finite skip time results in a penalty that increases with the queue length (see (2.2)). Consequently, the current approach is less inclined to skip tasks as the queue length increases compared with a model that uses a constant penalty.

Remark 2. The reward R(s, a) formulation can be interpreted as the unconstrained SMDP corresponding to a constrained SMDP that maximizes r(s, a) subject to a constraint on the average queue length for the stability of the queue. Therefore, the penalty rate c acts as the Lagrange multiplier for the unconstrained problem and, hence, can be obtained by primal-dual methods that use dual ascent to find the Lagrange multiplier [20].

2.1.3 Solving the SMDP for an Optimal Policy

For SMDP Γ, the optimal value function $V^* : S \to \mathbb{R}$ satisfies the following Bellman equation [85]:
$$
V^*(s) = \max_{a \in A_s} \left[ R(s, a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a)\, V^*(s') \right], \qquad (2.3)
$$
where $P(s', \tau \mid s, a)$, the joint probability that a transition from state s to state s′ occurs after time τ when action a is selected, can be rewritten as
$$
P(s', \tau \mid s, a) = P(s' \mid \tau, s, a)\, P(\tau \mid s, a), \qquad (2.4)
$$
where $P(s' \mid \tau, s, a)$ and $P(\tau \mid s, a)$ are given by the state transition probability distribution and the sojourn time probability distribution, respectively. An optimal policy $\pi^* : S \to A_s$ selects, at each state s, an action that achieves the maximum in (2.3). We utilize the value iteration algorithm [31] to compute an optimal policy.

2.2 Numerical Illustration of Optimal Fidelity Selection

We now numerically illustrate the optimal value function and an optimal policy for SMDP Γ.

Figure 2.3 (a) Optimal policy π∗ and (b) optimal value function V ∗ for SMDP Γ when the time required to skip a task is not too small compared with the mean service time required to process it.

Figure 2.4 Optimal policy π∗ for different values of the arrival rate λ. In cases (a) and (b), the action S in the optimal policy does not have a unique threshold for some cognitive states; similarly, in case (c), actions S and N do not have unique thresholds.

Figs. 2.3a and 2.3b show an optimal policy π∗ and the optimal value function V ∗, respectively, for the case in which the skip time is not too small compared with the mean service time. If the skip time is too small, the action S is optimal almost everywhere in order to reduce the queue length. For a sufficiently high arrival rate λ, such that there is always a task in the queue after servicing the current task, we observe that, for any given cog, V ∗ is monotonically decreasing in q. Additionally, for a given q, V ∗ is a unimodal function of cog, with its maximum at the optimal cognitive state (cog∗ = 0.6 in the numerical illustrations). We observe that π∗ selects the high fidelity level around cog∗ for low queue lengths and transitions to the normal fidelity level for higher queue lengths. In low cognitive states, the optimal policy is to keep skipping tasks until the queue length becomes small, and then to start servicing tasks. In higher cognitive states, resting is the optimal action at smaller queue lengths, while skipping tasks is optimal at larger queue lengths. We also observe the effect of cog on π∗: in particular, π∗ switches from H to N, N to R, and R to S at certain thresholds on q, and these thresholds appear to be a unimodal function of cog. This behavior can be attributed to the mean service time being unimodal with respect to cog. Fig. 2.4 shows examples of π∗ for certain parameters. We observe that, for some cognitive states, π∗ does not have a unique threshold and the same action reappears after switching to another action; for example, in Fig. 2.4a, the action S is observed between actions H and N, as well as after action N. In the following section, we provide sufficient conditions under which π∗ has unique transition thresholds at which actions switch, and the previous action does not reappear for the same cog.
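Before turning to the structural analysis, the following sketch indicates how an optimal policy such as the one in Figs. 2.3 and 2.4 can be computed by value iteration on the Bellman equation (2.3), once the joint kernel P(s′, τ | s, a) and the net reward R(s, a) have been tabulated. The tiny two-state example is purely illustrative and is not the model used to generate the figures.

```python
def smdp_value_iteration(R, P, gamma=0.96, tol=1e-8, max_iter=10_000):
    """Value iteration for an SMDP (Bellman equation (2.3)).

    R[s][a]          : expected net immediate reward.
    P[s][a][(s2, t)] : joint probability of next state s2 and sojourn time t.
    Returns the optimal value function V and a greedy policy.
    """
    states = list(R.keys())
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new, policy = {}, {}
        for s in states:
            # Q(s, a) = R(s, a) + sum over (s', tau) of gamma^tau * P * V(s')
            q_sa = {a: R[s][a] + sum(p * gamma ** t * V[s2]
                                     for (s2, t), p in P[s][a].items())
                    for a in R[s]}
            policy[s] = max(q_sa, key=q_sa.get)
            V_new[s] = q_sa[policy[s]]
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new, policy
        V = V_new
    return V, policy

# Tiny synthetic example with two states and two fidelity actions.
R = {0: {'N': 1.0, 'H': 1.5}, 1: {'N': 0.5, 'H': 1.2}}
P = {0: {'N': {(0, 1): 0.7, (1, 1): 0.3}, 'H': {(0, 2): 0.4, (1, 2): 0.6}},
     1: {'N': {(0, 1): 0.5, (1, 1): 0.5}, 'H': {(0, 3): 0.8, (1, 3): 0.2}}}
V, pi = smdp_value_iteration(R, P)
print(V, pi)
```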
The unimodal dependence of these thresholds on cog can be attributed to the mean service time being unimodal w.r.t. cog. Fig. 2.4 shows some examples of π∗ for certain parameters. We observe that for some cognitive states, π∗ does not have a unique threshold and the same action reappears after switching to another action. For example, in Fig. 2.4a, action S is observed between actions H and N, as well as after action N. In the following section, we provide sufficient conditions under which π∗ has unique transition thresholds at which actions switch, and the previous action does not re-appear for the same cog.

2.3 Structural Properties of the Optimal Policy

We establish the structural properties of the optimal infinite-horizon value function by considering the finite-horizon case and then extending the results to the infinite horizon by taking the infinite-step limit. Let V∗_n(s0), n ≥ 0, be the discounted n-step optimal expected reward when the initial state is s0, where V∗_0(q, cog) = −Cq is the terminal cost for the finite-horizon case for a non-negative constant C. Each step k ∈ {0, . . . , n − 1} is based on the sojourn time τk spent in state sk when action ak is selected. Let J_{n,π}(s0) denote the discounted n-step expected reward with initial state s0 under a given policy π. Henceforth, for brevity of notation, we denote the conditional expectation E[·|s0, π] by Eπ[·]. J_{n,π}(s0) is given by:

J_{n,π}(s0) = Eπ[ Σ_{i=0}^{n−1} γ^{ζi} R(si, ai) − γ^{ζn} C qn ], (2.5)

where ζi := Σ_{j=0}^{i−1} τj for i > 0 and ζ0 := 0. The discounted n-step optimal expected reward V∗_n(s0) is given by:

V∗_n(s0) = J_{n,π∗}(s0), (2.6)

where π∗ is the optimal policy that maximizes J_{n,π}(s0) at each s0. Let µ1 : S × As ↦ R>0 and µ2 : S × As ↦ R>0 be functions defined by µ1(s, a) = E[τ|s, a] and µ2(s, a) = E[τ²|s, a], where τ is the sojourn time. We study the structural properties of the optimal policy for a large queue capacity, i.e., in the limit L → +∞, and under the following assumptions:

(A1) The task arrival rate λ is sufficiently high so that the queue is never empty with high probability³.

(A2) For any state s = (q, cog)⁴:

µ1(s, S) < µ1(s, R) < µ1(s, N) < µ1(s, H), and µ2(s, S) < µ2(s, R) < µ2(s, N) < µ2(s, H). (2.7)

(A3) We assume that Eπ[γ^τ] ≤ f(Eπ[τ], Varπ(τ)) < 1, where Varπ(τ) = Var(τ|s0, a = π(s0)) is the variance of τ in any initial state s0 under a given policy π, and f is a monotonic function such that f(·, Varπ(τ)) is monotonically decreasing and f(Eπ[τ], ·) is monotonically increasing.

³Given the service time distributions and the Poisson arrival rate, we can precisely determine the distribution of the number of arrivals that occur between each state transition. Therefore, Chernoff bounds [86] can be utilized to characterize the high probability that the number of arrivals between state transitions while servicing a task exceeds one.

⁴The action R is only available for states with cog > cog∗.

We make assumption (A1) for convenience. Indeed, if the queue is allowed to be empty, then we would need to deal with an extra “waiting" action. Also, high arrival rates are the most interesting setting in which to study optimal fidelity selection. Assumption (A2) is true for a broad range of interesting parameters that define the sojourn time distributions; a quick numerical check of this ordering is sketched below.
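A minimal sketch of such a check, assuming the per-action sojourn-time distributions at a fixed state are given as dictionaries mapping sojourn times to probabilities (placeholder data structures, not those of the numerical study):

import numpy as np

def satisfies_A2(pmfs):
    """Check the moment ordering (2.7) at one state.

    pmfs : dict mapping each action in ['S', 'R', 'N', 'H'] to a dict {tau: prob}.
    Returns True if both the first and second moments are strictly increasing
    in the order S < R < N < H.
    """
    m1, m2 = [], []
    for action in ['S', 'R', 'N', 'H']:
        taus = np.array(list(pmfs[action].keys()), dtype=float)
        probs = np.array(list(pmfs[action].values()), dtype=float)
        m1.append(np.sum(probs * taus))
        m2.append(np.sum(probs * taus ** 2))
    return bool(np.all(np.diff(m1) > 0) and np.all(np.diff(m2) > 0))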
Assumption (A3) holds for a class of light-tail distributions with non-negative support for τ , for example, when the moment generating function (MGF) of τ is upper bounded by the MGF of Gamma distribution, i.e., Eπ[etτ ] ≤ (cid:18) 1 − Varπ(τ )t Eπ[τ ] (cid:19) − Eπ [τ ]2 Varπ (τ ) , for all t < Eπ[τ ] Varπ(τ ) . In this scenario, substituting t = ln(γ) < 0 < Eπ[τ ] Varπ(τ ) , we get Eπ[γτ ] ≤ (cid:18) 1 − Varπ(τ ) ln(γ) Eπ[τ ] (cid:19) − Eπ [τ ]2 Varπ (τ ) =: f (Eπ[τ ], Varπ(τ )). Let ρ := maxcog,a f (E[τ |cog, a], Var(τ |cog, a)). Therefore, Eπ[γτ ] ≤ ρ. For the class of distributions of τ satisfying assumption (A3), and any initial state s0 and policy π, we have Eπ[γζk] (1)∗ = k−1 (cid:89) i=0 Eπ[γτi] ≤ k−1 (cid:89) i=0 f (Eπ[τi], Varπ(τi)) ≤ ρk, where (1)∗ follows from the independence of τi and τj, for i ̸= j. Therefore, we have lim n→∞ n−1 (cid:88) k=0 Eπ[γζk] ≤ lim n→∞ n−1 (cid:88) k=0 ρk = 1 1 − ρ . We will now establish that the optimal policy for SMDP Γ is a threshold-based policy if the following condition holds for each cognitive state cog: min{E[τ |cog, H] − E[τ |cog, N ], E[τ |cog, N ] − E[τ |cog, R], E[τ |cog, R] − ts}+ tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ max a∈As E[γτ |cog, a], (2.8) where tmax = E[τ |cog = 1, a = H] is the maximum expected sojourn time (assuming the largest mean service time in the highest cognitive state), and ts is the constant time for the skip. 26 Remark 3. For tasks with large differences in expected sojourn times, i.e., 0 ≪ ts ≪ E[τ |cog, R] ≪ E[τ |cog, N ] ≪ E[τ |cog, H], maxa∈As E[γτ |cog, a] → 0, and (2.8) always holds. We introduce the following notation. Let q∗ j : C (cid:55)→ Z≥0 ∪{+∞}, for j ∈ {1, 2, 3} be some functions of cog. Theorem 1 (Structure of optimal policy ). For SMDP Γ under assumptions (A1-A3) and an associated optimal policy π∗, if the difference in the expected sojourn times is sufficiently large such that (2.8) holds for any cognitive state cog, then the following statements hold: (i) there exists unique threshold functions q∗ 1(cog), q∗ 2(cog), and q∗ 3(cog) such that for each cog > cog∗: π∗(s = (q, cog)) =    H, N, R, S, q ≤ q∗ 1(cog), 1(cog) < q ≤ q∗ q∗ 2(cog), 2(cog) < q ≤ q∗ q∗ 3(cog), q > q∗ 3(cog); (ii) there exists unique threshold functions q∗ 1(cog) and q∗ 2(cog) such that for any cog ≤ cog∗: π∗(s = (q, cog)) =    H, N, S, q ≤ q∗ 1(cog), 1(cog) < q ≤ q∗ q∗ 2(cog), q > q∗ 2(cog). We prove Theorem 1 using the following lemmas. Lemma 1. (Immediate Reward): For SMDP Γ, the immediate expected reward R(s, a), for each a ∈ As (i) is linearly decreasing with queue length q for any fixed cognitive state cog; 27 (ii) is a unimodal function5 of the cognitive state cog for any fixed queue length q with its maximum value achieved at the optimal cognitive state cog∗. Proof. The proof follows by noting that (2.2) is linearly decreasing in q and the coefficients E [τ | s, a] and E [τ 2|s, a] are unimodal w.r.t cog. Interested readers can refer to [87] for detailed proof. We now provide important mathematical results in Lemma 2 which we use to establish Lemma 3. Lemma 2. For the SMDP Γ, the following equations hold for any initial state s0 and policy π: (i) Eπ (cid:2)γζk E[τ 2 k |cogk, ak](cid:3) = Eπ (cid:2)γζkτ 2 k (cid:3); (ii) Eπ (cid:2)γζk E[τk|cogk, ak]qk (cid:3) = Eπ (cid:2)γζkτk Eπ [qk|s0, ζk](cid:3) Proof. The proof utilizes the properties of the expectation operator, and independence of the transition processes for qk and cogk proof. . 
Interested readers can refer to [87] for detailed Lemma 3. (Value function bounds): For SMDP Γ under assumptions (A1-A3), for any ˜q0 ≥ q0, 0 ≤ cts∆q 1−γtmax ≤ V ∗(q0, cog0) − V ∗(˜q0, cog0) ≤ ctmax∆q 1−ρ , where ∆q = ˜q0 − q0, ρ is an upper bound on Eπ[γτ ], tmax = E[τ |cog = 1, a = H] is the maximum expected sojourn time, and ts is the constant time for skip. Proof. See Appendix A: Chapter 2 for the proof. Remark 4. It follows from Lemma 3, that for SMDP Γ under assumptions (A1-A3), the optimal value function V ∗(q, ·) is monotonically decreasing with queue length q. 5The expected immediate reward under action S is a constant, which we treat as a unimodal function. 28 Lemma 4. (Thresholds for low cognitive states): For the SMDP Γ under assumptions (A1-A3), and an associated optimal policy π∗, the following statements hold for each cog ≤ cog∗: (i) there exists a threshold function q∗ 1(cog), such that N strictly dominates H, for each q > q∗ 1(cog) if E[τ |cog, H] − E[τ |cog, N ] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, N ]; (ii) there exists a threshold function q∗ 2(cog), such that for each q > q∗ 2(cog), action S is optimal if E[τ |cog, N ] − ts + tsγE[τ |cog,H] 1 − γtmax ≥ γts tmax 1 − ρ . Proof. See Appendix A: Chapter 2 for the proof. Lemma 5. (Thresholds for high cognitive states): For the SMDP Γ under assumptions (A1-A3), and an associated optimal policy π∗, the following statements hold for each cog > cog∗: (i) there exists a threshold function q∗ 1(cog), such that N strictly dominates H, for each q > q∗ 1(cog) if E[τ |cog, H] − E[τ |cog, N ] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, N ]; (ii) there exists a threshold function q∗ 2(cog), such that R strictly dominates H & N, for each q > q∗ 2(cog) if E[τ |cog, N ] − E[τ |cog, R] + tsγE[τ |cog,H] 1 − γtmax ≥ tmax 1 − ρ E[γτ |cog, R]. (iii) there exists a threshold function q∗ 3(cog), such that for each q > q∗ 3(cog), action S is optimal if E[τ |cog, R] − ts + tsγE[τ |cog,H] 1 − γtmax ≥ γts tmax 1 − ρ . 29 Proof. Recall that As := {{R, S, N, H }| s ∈ S, q ̸= 0} when queue is non-empty and cog > cog∗. The proof of Lemma 5 follows analogously to the proof of Lemma 4. Proof of Theorem 1: The proof follows by finding the intersection of the sufficient condi- tions from Lemmas 4 and 5 to obtain condition (2.8) for a threshold-based π∗. 2.4 Conclusions and Future Directions We studied optimal fidelity selection for a human operator servicing a stream of homo- geneous tasks using an SMDP framework. In particular, we studied the influence of human cognitive dynamics on an optimal fidelity selection policy. We presented numerical illus- trations of the optimal policy and established its structural properties. These structural properties can be leveraged to tune the design parameters, deal with the model uncertainty, or determine a minimally parameterized policy for specific individuals and tasks. There are several possible avenues for future research. An interesting direction is to conduct experiments with human subjects, measure EEG signals to assess their cognitive state and test the benefits of recommending optimal fidelity levels. It is of interest to extend this work to a team of human operators servicing a stream of heterogeneous tasks. A preliminary setup is considered in [75, 88], where authors study a game-theoretic approach to incentivize col- laboration in a team of heterogeneous agents. In such a setting, finding the optimal routing and scheduling strategies for these heterogeneous tasks is also of interest. 
30 CHAPTER 3 OPTIMAL FIDELITY SELECTION FOR IMPROVED PERFORMANCE IN HUMAN-IN-THE-LOOP QUEUES FOR UNDERWATER SEARCH In this chapter, we study the problem of optimal fidelity selection for a human operator per- forming an underwater visual search task. Human performance depends on various cognitive factors such as workload and fatigue. We perform human experiments in which participants perform two tasks simultaneously: a primary task, which was subject to evaluation, and a secondary task to estimate their workload. The primary task requires participants to search for underwater mines in videos, while the secondary task involves a simple visual test where they respond when a green light displayed on the side of their screens turns red. Videos arrive as a Poisson process and are stacked in a queue to be serviced by the human operator. The operator can choose to watch the video with either normal or high fidelity, with normal fidelity videos playing at three times the speed of high fidelity ones. Participants receive rewards for their accuracy in mine detection for each primary task and penalties based on the number of videos waiting in the queue. We consider the workload of the operator as a hidden state and model the workload dynamics as an Input-Output Hidden Markov Model (IOHMM). We use a Partially Observable Markov Decision Process (POMDP) to learn an optimal fidelity selection policy, where the objective is to maximize total rewards. Our results demonstrate improved performance when videos were serviced based on the optimal fidelity selection policy compared to a baseline where humans chose the fidelity level themselves. 3.1 Background and Problem Formulation We now discuss our problem setup, formulate it as POMDP, and solve it to obtain the optimal fidelity selection policy. 3.1.1 Problem Setup We study the problem of optimal fidelity selection for a human operator performing a visual search task. The human operator performs two tasks simultaneously, a primary task and a secondary task. The primary task involves searching for underwater mines in videos 31 Figure 3.1 Human experiment interface. The participants press the spacebar key whenever a new mine is detected in the primary task video. Additionally, the green light (secondary task) randomly turns red once for each primary task and the participant responds by pressing the Enter key as early as possible. The queue length (tasks waiting in the queue) is displayed on top of the primary task. generated from an underwater simulation designed using Gazebo [89] and ROS [90]. The operator watches videos and responds by pressing a key whenever a mine is spotted. On the other hand, the secondary task involves a simple visual exercise, where participants press a key as early as possible when a green light located at the side of the screen changes to red. During each primary task, the green light undergoes this transition randomly, occurring between the 25% and 75% mark of the video. We record the participants’ reaction time in the secondary tasks, and in a rare event when a participant misses the red light, the reaction time is set to the total time the light stays red until the end of the primary task. Fig. 3.1 shows the experiment interface. The videos for the primary task arrive as a Poisson process with an arrival rate of λ ∈ R>0 and get stacked in a queue awaiting service by the human operator. The operator has the option to select either high or normal fidelity levels for servicing each video. 
In normal fidelity, the video is presented at a speed of three times faster compared to high-fidelity processing. Additionally, operators can choose to delegate a task for autonomous processing, even though the accuracy of the autonomous system may be lower. This delegation or “skip" action serves as a means to maintain queue stability, especially in situations with large queue lengths. 32 Figure 3.2 Input-output hidden Markov model. The input a represents the fidelity level, the hidden state w signifies the workload, and the three observations o1, o2, and o3 correspond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the secondary task, respectively. The performance of a human operator relies on their workload, making it a crucial factor in our problem formulation. To address this, we formulate the problem as a POMDP, with the workload of the operator treated as a latent or hidden variable. We estimate this hidden workload through a combination of reaction time measurements from the secondary task and performance metrics obtained from the primary task. Specifically, we model the workload dynamics of the human operator using an IOHMM (see Fig. 3.2). In this model, the fidelity level a serves as the input, the workload w operates as the hidden state, and we observe three distinct output measures o1, o2, o3. These output measures correspond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the secondary task, respectively. This modeling approach helps us understand how fidelity, workload, and task performance are interconnected in one unified framework. We utilize the extended Baum-Welch algorithm [91] to train the IOHMM model, which provides the transition probabilities p(w′|w, a), observation probabilities p(o|w, a), and the initial state distribution p(w0) through expectation maximization. These probabilities are utilized in the POMDP formulation to obtain the optimal fidelity selection policy. To determine the most suitable number of hidden workload states, we use the Akaike information criterion (AIC) [92] and Bayesian information criterion (BIC) [93] for model 33 selection. For a given number of hidden states, the AIC and BIC are defined as: AIC = 2p − 2 log( ˆL), BIC = p log(no) − 2 log( ˆL), (3.1) where p represents the number of learned parameters, no stands for the number of observation trajectories, and ˆL signifies the maximized value of the log-likelihood function derived from the trained model. The model with the lowest AIC (or BIC) value is considered the optimal choice, as determined by the AIC (or BIC) criteria. Based on these criteria (presented in Section 3.2.3), we train an IOHMM model with two hidden states, which we refer to as normal and high workload states. 3.1.2 Mathematical Modeling We formulate our problem as a POMDP P = {S, A, Ω, T , O, r, γ}, where S is the state space, A in the action space, Ω is the set of observations, T is the set of conditional transition probabilities between states, O is the set of conditional observation probabilities, r : S ×A → R is the reward function, and γ ∈ [0, 1) is the discount factor. The state space of the system is defined using a tuple S = (q, w), where q ∈ {0, 1, . . . 
, L} and w ∈ W = {0, 1} denotes the number of tasks waiting in the queue (with maximum queue length L) and the hidden discrete workload of the human operator, respectively. We define w = 0 as the normal workload state and w = 1 as the high workload state. The action space is defined as A = {N, H, D}, with N and H representing normal and high-fidelity task processing, respectively. The action D corresponds to task delegation, allowing the task to be handled by autonomous systems. This delegation action is particularly useful for maintaining queue stability when dealing with a large queue length. To discourage excessive use of the delegation action, we assume that the autonomy’s accuracy in mine detection is significantly lower which results in a lower immediate reward. The observation space is defined by a tuple Ω = (o1, o2, o3), where o1, o2, and o3 cor- respond to the fraction of correctly detected mines in the primary task, the count of false alarms in the primary task, and the reaction time recorded during the execution of the sec- 34 ondary task, respectively. The state transition probability p(s′|s, a) ∈ T is derived from the queue dynamics p(q′|q, a) given by the Poisson distribution with arrival rate λt, where t is the duration of the task, and workload dynamics p(w′|w, a) obtained by training the IOHMM. The observation probabilities p(o|s′, a) = p(o1, o2, o3|w′, a) ∈ O are also obtained from the trained IOHMM model. The observation probabilities are assumed to be indepen- dent of the queue length, and therefore, only depend on the workload of the operator, i.e., p(o|s′, a) = p(o|w′, a). We define the reward function as r(s, a) = α1o1 − α2o2 − α3q, where αi for i ∈ {1, 2, 3} are positive constants, which rewards high accuracy in the primary task and penalizes for the number of tasks waiting in the queue. We convert the POMDP to a belief MDP defined by M = {B, A, τ, r, γ}. Here, B := {(q, bH)| q ∈ {0, 1, . . . , L}, bH ∈ ∆D} is the new state space, where q is the original queue length, bH is the discrete belief probability for being in the high workload state, and ∆D is a discretization of the interval [0, 1]. Therefore, the belief probability for being in the normal workload state is given by 1 − bH. For our experiments, we discretize [0, 1] with a step size of 0.1, i.e., ∆D = {0, 0.1, . . . , 1}. Note that the discretization of bH results in a finite state space B. Let b : W → ∆D denote the belief vector, where b(0) = 1 − bH, and b(1) = bH. From a current belief b(w), taking an action a and observing o, the updated belief b′(w′) is given by: b′(w′) = ηp(o|w′, a) (cid:88) p(w′|w, a)b(w), where η = 1 p(o|b,a) w∈W is the normalizing constant with p(o|b, a) = (cid:88) w′∈W p(o|w′, a)) (cid:88) w∈W p(w′|w, a)b(w). (3.2) (3.3) The updated belief probabilities in b′(w′), where b′(0) = 1 − b′ H and b′(1) = b′ H obtained from (3.2) are mapped to the closest discrete states such that b′ H ∈ ∆D. The action set A is the original action space. The transition probabilities τ (q′, b′|q, b, a) = p(q′|q, a)p(b′|b, a) is composed of the Poisson process for queue dynamics p(q′|q, a), and the workload dynamics 35 p(b′|b, a), which is given by: where p(b′|b, a) = (cid:88) o∈O p(b′|b, a, o)p(o|b, a), p(b′|b, a, o) =    1, if belief update in (3.2) returns b′, 0, otherwise, and p(o|b, a) is defined in (3.3). 
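A minimal sketch of the belief update in (3.2)-(3.3), assuming the trained IOHMM probabilities are available as dense arrays; the variable names, array shapes, and the discretization helper are illustrative assumptions.

import numpy as np

def belief_update(b, a, o, p_trans, p_obs):
    """Workload-belief update, cf. (3.2)-(3.3).

    b       : array (W,); current belief over the hidden workload states
    a       : action index
    o       : index of the discretized observation (o1, o2, o3)
    p_trans : array (W, A, W); p_trans[w, a, w2] = p(w2 | w, a) from the IOHMM
    p_obs   : array (W, A, O); p_obs[w2, a, o] = p(o | w2, a) from the IOHMM
    Returns the updated belief b'.
    """
    predicted = b @ p_trans[:, a, :]            # sum_w p(w' | w, a) b(w)
    unnormalized = p_obs[:, a, o] * predicted   # multiply by p(o | w', a)
    return unnormalized / unnormalized.sum()    # eta = 1 / p(o | b, a)

def snap_to_grid(b_high, step=0.1):
    """Map the updated high-workload belief to the closest point of Delta_D."""
    return round(b_high / step) * step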
The reward function r : B × A → R is given by: r(q, bH, a) = (cid:88) s∈S|w=0 r(s, a)(1 − bH) + (cid:88) s∈S|w=1 r(s, a)bH, (3.4) (3.5) (3.6) where r(s, a) is the original reward function for the POMDP. The discount factor γ is the original discount factor of the POMDP. For the belief MDP M, the expected value for policy π starting from an initial state (q0, bH,0) is defined as: V π(q0, bH,0) = = ∞ (cid:88) t=0 ∞ (cid:88) t=0 γtr(qt, bH,t, at) γtE [r(st, at)|q0, bH,0, π] , (3.7) where the expectation is computed over (qt, bH,t, wt). The optimal fidelity selection policy maximizes the value in each belief state, i.e., We utilize the value iteration algorithm to solve the belief MDP and obtain the optimal π∗ = arg max π V π(q0, bH,0). fidelity selection policy. 3.2 Human Experiments In this section, we discuss the design of our human experiments conducted using Prolific (www.prolific.com). 3.2.1 Experimental Setup We developed an underwater mine search experiment within the Robot Operating System (ROS) framework using Gazebo models. For our simulation, we employed the “UUV simula- tor" [94], a comprehensive package encompassing Gazebo plugins and ROS nodes specifically 36 (a) (b) (c) (d) Figure 3.3 Example image frames containing mines recorded from ROV. The mines’ positions are highlighted within red bounding boxes in the frames. tailored for simulating unmanned underwater vehicles, including remotely operated vehicles (ROVs) and autonomous underwater vehicles (AUVs). In our experimental setup, we designed an underwater scenario where we placed under- water mines randomly throughout the environment following a uniform distribution within a predefined area. To increase the complexity of the task, we also randomly introduced a significant amount of underwater vegetation, which added an extra layer of difficulty in detecting the mines. We captured underwater videos by deploying an ROV into the environment. The ROV followed a predetermined circular trajectory as it moved through the underwater space. Equipped with a downward-facing camera, the ROV recorded images. To ensure optimal recording conditions, we maintained complete darkness in the environment. The sole source of illumination was a single light attached to the ROV, directing its beam vertically down- ward. Consequently, only the area directly beneath the ROV was within the recording scope, resulting in a focused and well-illuminated view. We recorded a total of 5600 images which are used as 56 videos, with each video consisting of 100 frames. Fig. 3.3 shows example image frames containing underwater mines recorded from ROV. The mines’ positions are highlighted within red bounding boxes in the frames. We performed 5 set of experiments, with each experiment group consisting of 20 partici- pants. In each experiment, the participants first performed 8 practice tasks followed by 48 main tasks, where each task consists of a primary and a secondary task. Following is the list 37 of experiments. • Experiment 1: The first experiment was the base experiment where we recorded data to train the IOHMM model and learn the optimal fidelity selection policy by solving the POMDP. In this experiment, the 48 videos were evenly divided, with half of them being displayed at normal fidelity and the other half at high fidelity. To maintain con- sistency and balance, the tasks were presented in alternating blocks of four tasks each. Specifically, the sequence of 48 tasks was organized as {N, N, N, N, H, H, H, H, . . 
.} for half of the participants and {H, H, H, H, N, N, N, N, . . .} for another half of the participants. This reversal was thereby used to prevent ordering bias in our estimates. • Experiment 2: In this experiment, the participants were allowed to choose a fidelity level for servicing each task. Before releasing each task, the participant decided on their desired fidelity level by pressing a key. This served as a baseline, where the human operator received no decision support. In this experiment, we do not allow task delegation, and the participants were only allowed to choose to service each task with either a normal or high fidelity level. • Experiment 3: In this experiment, the decision support system determined the appro- priate fidelity level for each task by considering both the current operator performance and the queue length. Specifically, the decision support system monitored the opera- tor’s performance in each task, updated its belief using (3.2), and selected the optimal action in accordance with the optimal fidelity selection policy. For this experiment, we utilized an optimal policy that comprises only two available actions: normal and high fidelity. • Experiment 4: This experiment was similar to Experiment 2, with the modification that the participants were provided an additional action of delegating the task to the autonomous system. 38 • Experiment 5: This experiment was similar to Experiment 3, with the modification that the optimal policy consisting of three actions: normal fidelity, high fidelity, and task delegation was used to choose the fidelity level for each task. 3.2.2 Methods After receiving the IRB consent (MSU IRB #9452) from Michigan State University’s IRB office, we recruited 100 participants using Prolific for the study. Inclusion criteria were established as having completed a minimum of 500 prior studies and maintaining a 99% approval rate on the platform. Participants were compensated with a base payment of $6 and had the opportunity to earn additional performance-based bonuses ranging from $0−$4. 3.2.3 IOHMM results We use the data from Experiment 1 to train IOHMM models with different numbers of hidden states. Table 3.1 illustrates the AIC and BIC values (normalized with the number of observation trajectories) for the trained IOHMM models with 2, 3, and 4 hidden workload states, respectively. Based on the AIC and BIC criterion, we utilize a trained IOHMM model with two hidden states, which we refer to as normal and high workload states. Hidden States AIC BIC 2 3 735.29 735.91 736.39 737.85 4 736.28 739.26 Table 3.1 AIC and BIC values for model selection. Fig. 3.4a and 3.4b show the workload transition diagram for the trained IOHMM model under normal and high fidelity, respectively. Under high fidelity, due to the slower speed of videos, there is a higher probability of transitioning from the high workload state to a lower workload state as compared to under normal fidelity. Similarly, the probability of transitioning into a high workload state from the normal workload state is higher under normal fidelity as compared to the high fidelity servicing. In the case of task delegation action, the task is instantaneously removed from the queue and hence, we assume that the workload remains the same under task delegation. 39 (a) Normal fidelity. Figure 3.4 Workload transition diagram under (a) normal and (b) high fidelity servicing of tasks. (b) High fidelity. (a) Normal fidelity. (b) High fidelity. 
Figure 3.5 Reaction time diagram under (a) normal and (b) high fidelity servicing of tasks. For observations o1, o2, and o3, we learn a normal distribution with unknown mean and standard deviation for each state-action pair. Fig. 3.5a and 3.5b present the reaction time distributions for normal and high fidelity, respectively. We identify state 0 as a normal workload state, while state 1 represents the high workload state. The mean and the variance of the reaction time in the secondary task are larger in the high workload state than in the normal workload state. Fig. 3.6a and 3.6b depict the distributions for the fraction of detected mines in the primary task for normal and high fidelity, respectively. Notably, we observe a substantial difference in the means of the fraction of detected mines between normal and high workload states under normal fidelity. In contrast, the means are relatively similar between these 40 (a) Normal fidelity. (b) High fidelity. Figure 3.6 Fraction of mines detected in primary task under (a) normal and (b) high fidelity servicing of tasks. states under high fidelity. This observation suggests that when we do not take into account the penalty associated with the queue length, choosing high fidelity (slower videos) could be a suitable action during high workload conditions. (a) Normal fidelity. (b) High fidelity. Figure 3.7 Number of false alarms in primary task under (a) normal and (b) high fidelity servicing of tasks. Fig. 3.7a and 3.7b illustrate the distributions for the number of false alarms in the primary task for normal and high fidelity, respectively. Notably, in the normal workload state, no false alarms were recorded. Therefore, in the normal workload state, we replace the normal distribution with a Dirac delta function that yields a probability of 1 at 0 false alarms and 0 everywhere else. Finally, the initial distribution for the workload was determined as [0.662, 0.338], where 0.662 is the probability of starting in a normal workload state. To solve the POMDP, we discretize the distributions of the reaction time, fraction of mines detected, and the false alarms, with a step size of 25 ms, 0.05, and 0.5, respectively. 41 3.2.4 Optimal Policy (a) Task delegation unavailable. (b) Task delegation available. Figure 3.8 Optimal fidelity selection policy where the action space (a) does not include task delegation and (b) includes task delegation. Using the distributions from trained IOHMM, we convert the POMDP into a belief MDP as detailed in Sec. 3.1.2. We utilize the following reward function: r(s, a) =    100o1 − 30o2 − 2q, for a ∈ {N, H}, 30 − 2(q − 1), for a = D, (3.8) where we assumed the accuracy of autonomous servicing (task delegation) to be just 30%. We employed the value iteration algorithm to derive an optimal fidelity selection policy. We developed two optimal policies: one with only two available actions and another with three available actions, including task delegation. Fig. 3.8a and 3.8b illustrate the optimal policy with two and three available actions, respectively. In situations characterized by low queue lengths and a higher level of belief that workload is high, opting for high-fidelity servicing emerges as the optimal action. For other regions of the state space, normal fidelity servicing is the optimal action. When task delegation is a viable option, it becomes the optimal action only when there is a near certainty of being in the high workload state, as indicated by a belief close to 1, and the queue length is substantial. 
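As an illustration of how (3.6) and (3.8) combine, the sketch below evaluates the expected reward in a belief state. The dictionary of mean observations per workload-action pair stands in for the expectations implied by the trained IOHMM; its numerical entries are placeholders, not estimates from the experiments.

def belief_mdp_reward(q, b_high, action, mean_obs):
    """Expected reward in belief state (q, b_high), combining (3.6) and (3.8).

    mean_obs[(w, action)] = (expected fraction of detected mines, expected
    number of false alarms) in workload state w under the given action.
    """
    if action == 'D':
        # Delegated tasks are removed from the queue and serviced by autonomy.
        return 30 - 2 * (q - 1)
    r = 0.0
    for w, weight in ((0, 1.0 - b_high), (1, b_high)):
        o1, o2 = mean_obs[(w, action)]
        r += weight * (100 * o1 - 30 * o2 - 2 * q)
    return r

# Placeholder usage:
mean_obs = {(0, 'N'): (0.85, 0.0), (1, 'N'): (0.55, 0.6),
            (0, 'H'): (0.90, 0.0), (1, 'H'): (0.85, 0.3)}
print(belief_mdp_reward(q=4, b_high=0.7, action='H', mean_obs=mean_obs))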
3.3 Results We now discuss the results of the experiments. 42 (a) Experiment 2 policy: Empirical human policy. (b) Learned optimal policy. (c) Experiment 3 policy: Empirical optimal policy. Figure 3.9 (a) Empirical human policy in Experiment 2, (b) Learned optimal policy by solving POMDP with two available actions, and (c) Empirical policy obtained using data from Experiment 3 that deploys optimal fidelity selection policy. Fig. 3.9 compares the policy utilized in Experiments 2 and 3 that only allowed servicing a task with either normal or high fidelity. Fig. 3.9a, 3.9b, and 3.9c show the empirical human policy obtained from Experiment 2 data, complete optimal policy learned from POMDP, and the empirical policy obtained using data from Experiment 3 that deploys optimal fi- delity selection policy, respectively. The two columns of the plots show the probability of choosing high-fidelity action and normal-fidelity action, respectively. The dark purple region in the plots represents the portions of the state space that have not been visited during the 43 Figure 3.10 Box plots for the participants’ scores in Experiment 2 (human policy) and Ex- periment 3 (optimal policy). Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th per- centiles, respectively. Whiskers extend to encompass the most extreme data points. The p value in the two-sample t-test comparing the outcomes of Experiments 2 and 3 was computed as 0.017, indicating a high level of statistical significance. experiment. By comparing the human policy with the optimal policy, we can gain important insights into human behavioral patterns. It is evident that the human policy exhibits minimal vari- ation in fidelity with respect to the queue length. This suggests that humans struggle to effectively balance the trade-off between their ongoing tasks and the pending tasks in the queue. Furthermore, human policy reveals a tendency for high-fidelity servicing in low-workload states and normal-fidelity servicing in high-workload states. This implies that humans tend to have difficulty switching between actions. Participants who prioritize high accuracy in primary tasks opt for high-fidelity servicing, resulting in normal workload conditions. Con- versely, those who prioritize speed choose normal fidelity servicing, which can lead to op- erating under high workload conditions. Consequently, humans appear to have difficulty accurately assessing their workload and performance, hampering their ability to switch ac- tions effectively. Lastly, the empirical policy obtained using data from Experiment 3 indicates that the optimal policy effectively manages the queue length, thereby keeping the region of high queue length unexplored during the experiment. 44 Fig. 3.10 presents the comparative box plots for the scores obtained in Experiment 2 (human policy) and Experiment 3 (optimal policy). Here, the score refers to the cumulative reward accrued by a participant over tasks, i.e., Score = (cid:80)48 t=1 rt, where rt denotes the reward obtained for servicing the task t given by (3.8). It can be seen that the performance under optimal policy is much better than the performance under human policy. Furthermore, under the optimal policy, we observe an improvement of 26.54% in the average total score as compared to the human policy. To assess the statistical significance of these findings, we conducted a two-sample t-test comparing the outcomes of Experiments 2 and 3. 
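A two-sample t-test of this kind can be run with standard tools; the score vectors below are synthetic placeholders (one cumulative score per participant), not the recorded data summarized in Fig. 3.10.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_exp2 = rng.normal(loc=1000.0, scale=250.0, size=20)   # human-chosen fidelity (placeholder)
scores_exp3 = rng.normal(loc=1260.0, scale=230.0, size=20)   # optimal fidelity policy (placeholder)

result = stats.ttest_ind(scores_exp2, scores_exp3)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")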
Remarkably, the p value was computed as 0.017, indicating a notably high level of significance. In line with the widely accepted significance threshold of 0.05, a p value below this threshold prompts us to reject the null hypothesis. This implies that the data from the two experiments do not stem from the same distribution at a 5% significance level. These results underscore the substantial impact of the optimal policy in enhancing human performance. Next, we allow an additional action of task delegation under which a task is instanta- neously removed from the queue to be serviced by the autonomy. Fig. 3.11 compares the policy utilized in Experiments 4 and 5 that allowed all three actions. Fig. 3.11a, 3.11b, and 3.11c show the empirical human policy obtained using data from Experiment 4, com- plete optimal policy learned from POMDP, and the empirical policy obtained using data from Experiment 3 that deploys optimal fidelity selection policy, respectively. The three columns of the plots show the probability of choosing high-fidelity action, normal-fidelity action, and task delegation, respectively. The dark purple region in the plots represents the portions of the state space that have not been visited during the experiment. Similar observations to those depicted in Fig. 3.9 can also be made for high and normal fidelity servicing in Fig. 3.11. Moreover, it is noticeable that the human policy exhibits a somewhat random utilization of task delegation. This observation suggests that humans struggle with workload management and effective task delegation. Lastly, the optimal policy 45 (a) Experiment 4 policy: Empirical human policy. (b) Learned optimal policy. (c) Experiment 5 policy: Empirical optimal policy. Figure 3.11 (a) Empirical human policy in Experiment 4, (b) Learned optimal policy by solving POMDP with three available actions, and (c) Empirical policy obtained using data from Experiment 5 that deploys optimal fidelity selection policy. in Experiment 5 keeps the queue length under check, and therefore, regions of higher queue lengths remain unvisited in the experiment. Fig. 3.12 shows the box plots for the participants’ scores in Experiment 4 (human policy) and Experiment 5 (optimal policy). It can be seen that the performance under optimal policy is much better than the performance under human policy. Furthermore, under the optimal policy, we observe an improvement of 50.3% in the average total score as compared to the human policy. Lastly, the p-value in the two-sample t-test is given by 0.001, showing that the results are statistically significant. Remark 5. The average score achieved with the optimal policy in Experiment 5 was marginally 46 Figure 3.12 Box plots for the participants’ scores in Experiment 4 (human policy) and Ex- periment 5 (optimal policy). Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th per- centiles, respectively. Whiskers extend to encompass the most extreme data points. The p value in the two-sample t-test comparing the outcomes of Experiments 4 and 5 was computed as 0.001, indicating a high level of statistical significance. less than that obtained with the optimal policy in Experiment 3. This discrepancy could be due to the relatively smaller reward obtained from task delegation, as opposed to servicing the task with normal fidelity. 
Although the optimal policy in Experiment 5 may default to task delegation under the presumption of diminished performance during periods of high workload, certain participants may demonstrate the capability to execute tasks with high accuracy even under high workload conditions in Experiment 3. 3.4 Conclusions We studied the problem of optimal fidelity selection for a human operator engaged in a visual search task. The human participants performed two tasks simultaneously: a primary mine search task and a secondary visual task used for estimating the human workload. We treated the human workload as a hidden state and modeled the workload dynamics using an Input-Output Hidden Markov Model (IOHMM). Leveraging the probability distributions derived from the IOHMM, we formulated a Partially Observable Markov Decision Process (POMDP) and solved it to derive an optimal fidelity selection policy. The results of our experiments offer valuable insights into human behavioral patterns and underscore the sub- stantial performance enhancements achievable through the application of the optimal policy. 47 CHAPTER 4 ROBUST AND ADAPTIVE FIDELITY SELECTION FOR HUMAN-IN-THE-LOOP QUEUES In this chapter, we relax the assumption of the known human service time model in the optimal fidelity selection problem. We assume the parameters of the human’s service time distribution depend on the selected fidelity level and her cognitive state and are assumed to be unknown a priori. These parameters are learned online through Bayesian parameter estimation. We formulate a robust adaptive SMDP to solve our optimal fidelity selection problem. We extend the results on the convergence of robust-adaptive MDP to robust- adaptive SMDPs and show that the solution of the robust adaptive SMDP converges to the optimal solution for the uncertainty-free SMDP. Furthermore, we numerically illustrate the convergence of the synchronous and asynchronous robust adaptive policy to the uncertainty- free optimal policy. 4.1 Background and Problem Formulation We now present our problem setup, and formulate the optimal fidelity selection problem as a robust adaptive SMDP. Figure 4.1 Schematic of our problem setup. The incoming tasks arrive at a constant arrival rate λ and gets stored in a queue. The decision support learns the human service-time distribution based on the online observations and recommends an optimal fidelity level to the human agent based on the system state (queue length and cognitive state). 48 4.1.1 Problem Setup We consider a human agent with unknown service time distribution servicing a queue of homogeneous tasks. We assume the availability of approximate model for human service time distribution from human subject experiments, which we adapt online to estimate the agent’s distribution using Bayesian parametric estimation. The homogeneous tasks arrive according to a Poisson process with a constant arrival rate λ ∈ R>0, and are stored in a dynamic queue with a maximum capacity L ∈ N, until serviced by the human agent in a first-come-first-serve discipline. These tasks continuously lose value at a constant rate c ∈ R>0 while waiting in the queue. The agent can choose to service the tasks with normal or high fidelity level. When the agent services a task with high fidelity, she meticulously looks into the details, which results in higher-quality service but leads to larger service times and increased operator tiredness. 
We treat the cognitive state of the agent as a lumped parameter that captures psychological factors such as fatigue, stress, and situational awareness. We assume that the unknown mean service time of the agent increases with the fidelity level and is a unimodal function of the cognitive state, an assumption inspired by the experimental psychology literature. For example, according to the Yerkes-Dodson law [16], excessive stress overwhelms the operator, while too little stress leads to a reduction in vigilance. Hence, human performance is optimal at some intermediate cognitive state. Fig. 4.1 shows a schematic of our problem setup. We are interested in the design of a decision support system that continuously learns the human service-time distribution and assists the agent by recommending an optimal fidelity level for servicing each task. The recommendation is based on the robust adaptive policy learned by the decision support system for a given queue length q ∈ Z≥0 and the human cognitive state. We assume real-time access to the human cognitive state using, e.g., Electroencephalogram (EEG) measurements (see [80] for measures of cognitive load from EEG data).

4.1.2 Robust Adaptive SMDP Formulation

We now model our problem as a robust adaptive SMDP ΓRA. We focus on the unknown service time distributions and refer the interested readers to [41] for more details on the modeling of the uncertainty-free distributions. We consider a finite state space S := {(q, cog) | q ∈ {0, 1, . . . , L}, cog ∈ C}, where C := {i/N}_{i∈{0,...,N}} for some N ∈ N, and cog represents the lumped cognitive state. We consider five possible actions for the agent: (i) Waiting (W), when the queue is empty; (ii) Resting (R), which provides resting time for the human operator to reach the optimal cognitive state when tired; (iii) Skipping (S), which allows the operator to skip a task to reduce the queue length and thereby focus on newer tasks; (iv) Normal Fidelity (N), for servicing the task with normal fidelity; and (v) High Fidelity (H), for servicing the task more carefully with high precision. Hence, the set of admissible actions As for each state s ∈ S is given by: (i) As := {W | s ∈ S, q = 0} when the queue is empty; (ii) As := {{R, S, N, H} | s ∈ S, q ̸= 0} when the queue is non-empty and cog > cog∗, where cog∗ ∈ C is the optimal cognitive state; and (iii) As := {{S, N, H} | s ∈ S, q ̸= 0} when the queue is non-empty and cog ≤ cog∗. Let τ be the sojourn time spent in state s. The sojourn time distribution P(τ | s, a) represents the service time while servicing the task with normal or high fidelity, the resting time, the constant time of skip, and the time until the next task arrival while waiting. We model the service time distributions in Section 4.1.3 and refer the readers to [41] for details on the resting and waiting times. We define a state transition distribution P(s′ | τ, s, a) from state s to s′ conditioned on an action a ∈ As and sojourn time τ spent in state s. This distribution involves the transition in queue length, which is given by a Poisson distribution, and the cognitive dynamics, which we model as a Markov chain such that the cognitive state cog increases with high probability when the operator is busy (a ∈ {N, H}) and decreases when the operator is idle (a ∈ {R, W}). For a detailed description of the modeling of the cognitive dynamics, we refer the interested readers to [41]. For each task, the human agent receives a high (low) immediate reward for servicing the task with high (normal) fidelity, and no reward for not servicing the task.
Furthermore, the agent incurs a penalty at a constant rate c ∈ R>0 for each task waiting in the queue. Hence, it can be shown that the expected net immediate reward received by the agent for selecting an action a in state s is given by: R(s, a) = r(s, a) − P(τ |s, a)c (cid:16) 2q + λτ 2 (cid:17) τ, (cid:88) τ (4.1) where r : S × As (cid:55)→ R≥0 is the reward defined by: (i) r(s, a) = rH, if a = H; (ii) r(s, a) = rN , if a = N ; and r(s, a) = 0, if a ∈ {W, R, S}, with rH, rN ∈ R≥0 and rH > rN , and (cid:80) τ is the expected penalty due to tasks waiting in the queue. P(τ |s, a)cτ (cid:104) q+q′ 2 (cid:105)(cid:17) E (cid:16) (cid:12) (cid:12) (cid:12) τ, s, a 4.1.3 Modeling human service time and uncertainity set There are many approximate models used for modeling service time distribution P(τ |s, a) for the human agents, most common being lognormal [95] and inverse Gaussian distribu- tion [96]. We assume that the agent’s service time distribution follows a log-normal distribu- tion Lognormal(µ, σ2) with unknown parameters µ and σ2, that are the functions of the cog- nitive state and fidelity-level. We utilize the Bayesian parameter estimation with a normal- inverse-chi-squared prior [97] to estimate the distribution parameters using online observa- tions. Using the prior distribution N Iχ2(µ0, κ0, ν0, σ2 0) = N (µ|µ0, σ2/κ0) × χ−2(σ2|ν0, σ2 0) for the parameters µ and σ2, and n ∈ N realizations from Lognormal(µ, σ2), the posterior distribution of (µ, σ2) is given by: p(µ, σ2) = N Iχ2(µn, κn, νn, σ2 n), where (4.2) κ0µ0 + nx κn (cid:32) ν0σ2 0 + µn = σ2 n = 1 νn , κn = κ0 + n, νn = ν0 + n, (cid:88) (xi − x)2 + i (cid:33) (µ0 − x)2 , nκ0 κ0 + n xi, i ∈ {1, . . . , n} are the service time samples and ¯x is the sample mean. Hence, the posterior distribution at any time t can be computed to recursively estimate the model parameters µ and σ2 by using the online observations. Let ˆPt(τ |s, a) be the estimate of the 51 service time distribution at time t. Note that the estimate ˆPt(τ |s, a) can be used to estimate P(s′, τ |s, a) and R(s, a), resulting in estimates ˆPt(s′, τ |s, a) and ˆRt(s, a) at time t. The uncertainty in the human service time models could be large, especially in the ini- tial stage with limited observation data, which may lead to suboptimal policies. This can be mitigated through the use of robust SMDP. The robust SMDP optimizes the worst- case performance to obtain robust policy when the joint distribution P(s′, τ |s, a) lies in an uncertainty set P a, i.e, P(s′, τ |s, a) ∈ P a. Note that the robust SMDP formulation is not in- herently adaptive in nature and does not explicitly use improved transition models that can be learned using online observations to obtain less conservative policy. The robust adaptive SMDP utilizes the latest improved estimates ˆPt and ˆRt for the joint probability P(s′, τ |s, a) and reward R(s, a) at time t, respectively. The choice of uncertainty set P a is critical for the performance of the robust algorithm. A poor modeling of the uncertainty set increases the computational complexity and could lead to highly conservative robust policy. Hence, a choice of uncertainty set P a is typically made such that the robust policy is not overly conservative and the optimization can be performed in a computationally tractable manner. Let D be the observation data up to time t. 
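The update in (4.2) can be implemented directly. The sketch below applies it to the logarithms of the observed service times, since the service time is modeled as Lognormal(µ, σ²); the prior hyperparameters in the example are placeholders.

import numpy as np

def nix2_posterior(mu0, kappa0, nu0, sigma0_sq, service_times):
    """Normal-inverse-chi-squared posterior update for (mu, sigma^2), cf. (4.2).

    service_times : observed service times; their logs are treated as samples
                    from N(mu, sigma^2).
    Returns the posterior hyperparameters (mu_n, kappa_n, nu_n, sigma_n_sq).
    """
    x = np.log(np.asarray(service_times, dtype=float))
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    ss = np.sum((x - xbar) ** 2)
    sigma_n_sq = (nu0 * sigma0_sq + ss
                  + (n * kappa0 / (kappa0 + n)) * (mu0 - xbar) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq

# Placeholder prior and data:
print(nix2_posterior(mu0=1.0, kappa0=1.0, nu0=1.0, sigma0_sq=0.5,
                     service_times=[2.1, 3.4, 2.8, 4.0]))

Samples of (µ, σ²) drawn from this posterior are what populate the set ∆t used to build the uncertainty set in (4.3).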
To construct an uncertainty set P a t at time t, random samples are generated from the posterior distribution of (µ, σ2) to construct a set ∆t comprised of matrices ˆPt(s′, τ |s, a) for each s ∈ S and a ∈ AS. Finally, a Ψt-confidence level subset of the transition probabilities ∆t defined by: P a t (Ψt) = {psa t ∈ ∆t : ∥psa t − psa t ∥1 ≤ Ψt, s ∈ S}, (4.3) where psa t is the nominal transition given by psa t |D], is used as a choice for the uncertainty set. We seek to find Ψt-confidence sets for state transition probability vector for t = E[psa every state-action pair at time t. We choose Ψt = 6α |S∥AS |π2t2 such that union bounds applied over each state-action pair and time yield that all state transition probabilities belong to respective confidence sets with at least probability α. In the following we choose α = 0.95. 52 Using the uncertainty set P a t constructed at time t based on the latest improved estimates ˆPt, the robust adaptive SMDP solves the following robust Bellman equation (4.4), V ∗(s) = max a∈AS min ˆPt∈P a t (cid:40) ˆRt(s, a) + (cid:88) (cid:88) τ s′ γτ ˆPt(s′, τ |s, a)V ∗(s′) , (4.4) (cid:41) where 0 < γ < 1 is the discount factor, to obtain a robust policy π∗ = argmaxa∈AS which optimizes the worst-case performance through minimization with respect to the uncer- V (s), tainty set P a t . In the next section, we show that the learned policy through robust adaptive SMDP converges to an optimal policy for the uncertainty-free formulation. 4.2 Convergence of Robust Adaptive SMDP We study the convergence properties of the robust adaptive SMDP under the following assumptions. (A1) State space S and actiona space AS are finite. (A2) ˆPt and ˆRt remains bounded for any t. (A3) ˆPt and ˆRt converges to their true values P and R, respectively, with probability 1. (A4) The uncertainty set P a t converges to a singleton estimate P with probability 1. (A5) Each admissible action is executed from every state infinitely often. Since agent’s service time distribution is independent of the queue length, assump- tion (A5) can be relaxed to executing each action at every cognitive state. Furthermore, assumptions (A3) and (A5) can be satisfied by adopting exploration strategies such as Gibbs/Boltzman distribution method for action selection [31], wherein an action is selected with probability proportional to the current estimate of the state-action value function di- vided by a temperature parameter. The temperature parameter is annealed to ensure that the action selection rule converges over time to a greedy policy with respect to the value function estimate, while ensuring each cognitive state-action pair is selected infinitely often. 53 Let T : R|S| (cid:55)→ R|S| be the Bellman operator for an uncertainty-free SMDP Γ defined by: (cid:40) T (V (s)) = max a∈AS R(s, a) + (cid:88) (cid:88) τ s′ γτ P(s′, τ |s, a)V (s′) . (4.5) (cid:41) We first show the convergence of the adaptive asynchronous VI method followed by the convergence of the robust approach. In asynchronous VI, an effective approach to deal with large state spaces, the value of only a subset of states s ∈ Bt ⊆ S are updated at any time t. The adaptive VI method adapts to the latest estimates of the model to compute the control policy, which is continuously improved through online observations. 
Hence, the adaptive asynchronous VI update at time t is given by: Vt+1(s) =    ˆTt(Vt), if s ∈ Bt, Vt(s), otherwise, (4.6) where Bt ⊆ S is the subset of states that are updated at time t, and ˆTt is the Bellman operator for the SMDP estimate ˆΓt at time t, that utilizes the estimates ˆPt and ˆRt in (4.5). The set Bt can be chosen using prioritized sweeping [98] for improved computational performance. Theorem 2. Under Assumptions A1-A5, the adaptive asynchronous VI converges to the optimal value function V ∗ for the uncertainty-free SMDP Γ with probability 1. Proof. See Appendix B: Chapter 4 for the proof. Let P a be the uncertainty set for the probability P(s′, τ |s, a) for a given action a ∈ AS. Then, the robust adaptive asynchronous VI update at time t is given by: Vt+1(s) =    maxa∈AS minP∈P a {J a(Vt)} , if s ∈ Bt, (4.7) Vt(s), otherwise, where for a given a ∈ AS, J a : R|S| (cid:55)→ R|S| is given by J a(V (s)) = R(s, a) + (cid:88) (cid:88) τ s′ 54 γτ P(s′, τ |s, a)V (s′). (4.8) Let Tr : R|S| (cid:55)→ R|S| be the robust Bellman operator for SMDP Γ defined by: Tr(V (s)) = max a∈AS min P∈P a {J a(V (s))} . (4.9) Theorem 3. Under Assumptions A1-A5, the robust adaptive asynchronous VI converges to the optimal value function V ∗ for the uncertainty-free SMDP Γ with probability 1. In addition, for Ψt = |S∥AS |π2t2 , the union bounds applied over each state-action pair and time yield that at any time, the obtained policy is robust with respect to uncertainty in service 6α time distributions with at least probability α. Proof. See Appendix B: Chapter 4 for the proof. 4.3 Numerical Illustrations Fig. 4.2a and 4.2b shows an optimal policy and the optimal value function obtained using VI algorithm for the uncertainty-free SMDP Γ with cog∗ = 0.6 as the optimal cognitive state. The optimal policy selects high fidelity around the cog∗ for small queue lengths, and then transitions to normal fidelity as the queue length increases in sub-optimal cognitive states. The optimal action is to rest when the queue length is large and cognitive state is high. Similarly, skipping of tasks is an optimal action for large queue lengths in sub-optimal cognitive states. The corresponding optimal value function is a decreasing function of q and is a uni-modal function of the cognitive state, the maximum for which occurs at cog∗ for each q; see [41] for more details. Fig. 4.2c and 4.2d shows a robust adaptive policy, and the corresponding optimal value function for the uncertain SMDP ΓRA obtained from asynchronous VI algorithm. In our nu- merical illustrations, we only estimate the parameter µ of the Lognormal(µ, σ2) distribution using Bayesian estimation. In case when σ2 is unknown, Markov chain Monte Carlo (MCMC) methods [99] can be used to sample from the posterior inverse-chi-squared-distribution to create the uncertainty set P a t at time t. An identical policy and value function is obtained in case of synchronous VI algorithm. Hence, the solution of the synchronous and asynchronous VI converges to the optimal solution of uncertainty-free SMDP. 55 (a) (c) (b) (d) Figure 4.2 (a) Optimal policy and (b) optimal value function for the uncertainty-free SMDP Γ. (c) Robust adaptive optimal policy, and corresponding (d) optimal value function for the uncertain SMDP ΓRA. Fig. 4.3 shows the policy updates for the synchronous and asynchronous robust adaptive algorithm after 4, 8, 32 and 76 iterations, respectively. 
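For reference in the comparison that follows, here is a minimal sketch of one sweep of the robust asynchronous update in (4.7). It assumes the uncertainty set is represented by a finite collection of sampled transition models (as obtained from the posterior samples above), so the inner minimization becomes a minimum over that collection; all names and shapes are illustrative.

import numpy as np

def async_robust_vi_step(V, R_hat, P_samples, gamma_pow, B_t):
    """One asynchronous robust VI sweep, cf. (4.7).

    V         : array (S,); current value function estimate
    R_hat     : array (S, A); latest reward estimates
    P_samples : array (M, S, A, S, T); M sampled joint distributions, with
                P_samples[m, s, a, s2, k] approximating P(s2, tau_k | s, a)
    gamma_pow : array (T,); gamma**tau_k for each discrete sojourn time
    B_t       : iterable of state indices to update at this step
    """
    # Q under every sampled model, then the worst case over the samples.
    Q_all = R_hat[None, :, :] + np.einsum('msajt,t,j->msa', P_samples, gamma_pow, V)
    Q_robust = Q_all.min(axis=0)
    V_new = V.copy()
    for s in B_t:
        V_new[s] = Q_robust[s].max()
    return V_new

# Synchronous update: B_t = range(S); asynchronous: a randomly chosen subset of states.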
The synchronous robust adaptive algorithm performs a VI update at each time step in (4.6) for all states, i.e., Bt = S, while in the asynchronous robust adaptive algorithm, the VI update is performed on a randomly chosen subset Bt ⊆ S at each time step. The asynchronous robust adaptive algorithm converges much faster (80 iterations) than the synchronous robust adaptive algorithm (404 iterations), while both eventually converge to the optimal policy for the uncertainty-free SMDP Γ as shown in Fig. 4.2.

Figure 4.3 Policy updates for the synchronous ((a)-(d)) and asynchronous ((e)-(h)) robust adaptive algorithm after 4, 8, 32 and 76 iterations, respectively.

4.4 Conclusions

We studied the optimal fidelity selection problem for a human agent servicing a stream of homogeneous tasks. The parameters of the agent's service time distribution are unknown a priori and are learned online through Bayesian parametric estimation. We utilize the robust adaptive SMDP approach, which adapts to the latest estimates of the distribution model while obtaining a policy that is robust with respect to worst-case performance. We formally extend the convergence results of the robust adaptive MDP to the robust adaptive SMDP and show that the solution of the robust adaptive SMDP converges to the optimal solution of the uncertainty-free SMDP. Furthermore, we numerically illustrate the convergence of the synchronous and asynchronous robust adaptive policies to the uncertainty-free optimal policy.

CHAPTER 5 DETERMINISTIC SEQUENCING OF EXPLORATION AND EXPLOITATION FOR REINFORCEMENT LEARNING

In this chapter, we propose a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., an MDP, and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates of the expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time.

5.1 Background and Problem Formulation

We focus on model-based RL problems which aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. We seek to design policies that are robust to uncertainty in the learned MDP. However, learning the MDP requires visiting parts of the state space associated with high uncertainty in the estimates, which has exactly the opposite effect of a robust policy. Therefore, to balance the trade-off between learning the MDP and designing a robust policy, we design a DSEE algorithm in which exploration and exploitation epochs of increasing lengths are interleaved. In exploration epochs, the algorithm learns the MDP, while in exploitation epochs, it uses a robust policy based on the learned MDP and the associated uncertainty. Consider an MDP (S, A, R, P, γ), where S is the state space, A is the action space, the reward R(s, a), for each (s, a) ∈ S × A, is a random variable with support [0, Rmax], P : S × A → ∆|S| is the transition distribution, and γ ∈ (0, 1) is the discount factor. Here, ∆|S| represents the probability simplex in R|S| and |·| represents the cardinality of a set. Let R̄(s, a) denote the expected value of R(s, a). We consider a finite MDP setting in which |S| and |A| are finite.
We assume that the rewards R and the state transition distribution P are unknown a priori. Hence, during exploration, we estimate R and P using online observations. Let (s, a) be any state-action pair with s ∈ S and a ∈ A. At any time t, let n_t(s, a) be the number of times the state-action pair (s, a) has been observed until time t. For each (s, a), the empirical mean estimates R̂_t(s, a) and P̂_t(s′|s, a), s′ ∈ S, are:

R̂_t(s, a) = (1 / n_t(s, a)) Σ_{i=1}^{n_t(s,a)} r_i(s, a), and  (5.1)

P̂_t(s′|s, a) = n_t(s, a, s′) / n_t(s, a),  (5.2)

respectively, where r_i(s, a) is the immediate reward obtained at (s, a) during observation i ∈ {1, . . . , n_t(s, a)} until time t, and n_t(s, a, s′) is the number of times the next state s′ is observed from (s, a) out of the n_t(s, a) visits.

Oftentimes, the uncertainty in the probability transition matrices and the mean reward function can be large, especially in the initial stages of learning due to limited observation data, which may lead to sub-optimal policies. Robust MDPs [29] mitigate the sub-optimal performance arising from this uncertainty by optimizing the worst-case performance over given uncertainty sets for the reward function and probability transition matrices to obtain a robust policy. Given uncertainty sets R^U_t and P^U_t at time t containing R and P, respectively, the robust MDP solves the following robust Bellman equation:

V^R_t(s) = max_{a ∈ A} min_{R̃_t ∈ R^U_t, P̃_t ∈ P^U_t} { R̃_t(s, a) + γ Σ_{s′} P̃_t(s′|s, a) V^R_t(s′) },  (5.3)

to obtain a robust policy π̂^R_t = argmax_{a ∈ A} V^R_t, which optimizes the worst-case performance through minimization with respect to the uncertainty sets R^U_t and P^U_t, where V^R_t is the robust value function.

The choice of these uncertainty sets is critical for the performance of the robust algorithm. A poor modeling choice can increase the computational complexity and result in a highly conservative policy [17, 30]. To avoid these issues, during the exploitation epochs of DSEE, we construct these uncertainty sets based on the estimates R̂_t and P̂_t from the previous exploration epochs and Hoeffding bounds [100] for R̂_t (Lemma 6) and P̂_t (Lemma 7). Subsequently, we utilize the robust MDP to learn a policy that is robust to the estimation uncertainties with high probability. The convergence of the robust MDP with uncertain transition matrices to the uncertainty-free MDP can be shown under the assumption that the uncertainty sets converge to singleton estimates almost surely [17, 32].

Definition 1 (Instantaneous and Cumulative Regret). For a discounted and ergodic RL problem [21], consider an algorithm A that, at the end of the (t−1)-th step, returns a policy π_t to be applied in the t-th step. For any state s ∈ S, let V*(s) and V^{π_t}(s) be the optimal value of the state and its value under the policy π_t, respectively. At any time t, the instantaneous regret R(t) of the algorithm A is given by:

R(t) = ∥V*(s) − V^{π_t}(s)∥_∞,  (5.4)

where ∥·∥_∞ denotes the L_∞-norm of a vector, and the cumulative regret R_T until time horizon T is given by:

R_T = Σ_{t=1}^{T} R(t) = Σ_{t=1}^{T} ∥V*(s) − V^{π_t}(s)∥_∞.  (5.5)

We design the exploration and exploitation epochs of the DSEE algorithm such that its cumulative regret grows as a sub-linear function of time. In the next section, we provide an overview of the DSEE algorithm.

5.2 DSEE Algorithm

We design the DSEE algorithm for model-based RL under the following assumptions:

(A1) State space S and action space A are finite sets.
(A2) The MDP is ergodic under the uniform policy π, i.e., under a policy π that, in every state s, selects an action from A uniformly at random, the MDP admits a unique stationary distribution ϕ_π, with ϕ_π(s) > 0 for all s ∈ S.

An ergodic MDP [21] (assumption (A2)) is a common assumption. It ensures that the stationary distribution is independent of the initial distribution and that all states are recurrent, i.e., each state s is visited infinitely often and ϕ_π(s) > 0. We use this assumption to estimate the number of times each state is visited in N time steps.

Algorithm 5.1 Deterministic Sequencing of Exploration and Exploitation (DSEE).

Input: set of states S, set of actions A, initial state s_0;
Set: η > 1, sequences {ϵ_j}_{j∈N} and {δ_j}_{j∈N}, s_end^0 = s_0, s = s_0;
Set: t = 0, n(s, a) = 0, n(s, a, s′) = 0, S(s, a) = 0, for all s, a, s′;
for epoch j = 1, 2, . . . do
  % Exploration phase:
  ρ_j ← ϵ_j / (4 + 2 R_max γ / (1 − γ)²);  µ ← 2 [ log(2|S| − 2) + log(2|S||A| / δ_j) ];
  U_j ← max{ (R_max)² log(4|S||A| / δ_j) / (2ρ_j²), µ / ρ_j² };
  while n(s, a) < U_j for some (s, a) do
    t ← t + 1; pick a ∼ UNIF(A) in the current state s;
    observe the reward R and the next state s′;
    if s_end^{j−1} has been visited in epoch j then
      n(s, a) ← n(s, a) + 1; n(s, a, s′) ← n(s, a, s′) + 1; S(s, a) ← S(s, a) + R;
    end if
    s ← s′;
  end while
  s_end^j ← s;
  R̂_t(s, a) = S(s, a) / n(s, a) and P̂_t(s′|s, a) = n(s, a, s′) / n(s, a), for all (s, a);
  % Exploitation phase:
  construct the uncertainty sets R^U_t and P^U_t using (5.11);
  compute V^R_t(s) and π̂^R_t using (5.3);
  implement π̂^R_t for ⌈η^j⌉ time steps; t ← t + ⌈η^j⌉;
end for

Algorithm 5.1 shows an overview of the DSEE algorithm. In the DSEE algorithm, we design a sequence of alternating exploration and exploitation epochs. Let α_i and β_i be the lengths of the i-th exploration and exploitation epoch, respectively, where i ∈ N. During an exploration epoch, we uniformly sample the action in the current state and update the estimates R̂_t and P̂_t. For given sequences {ϵ_i}_{i∈N} and {δ_i}_{i∈N}, which we design in Section 5.3, the length of the exploration epoch α_i is chosen to reduce the estimation uncertainty such that P(∥V*(s) − V^{π̂^R_t}(s)∥_∞ ≤ ϵ_i) ≥ 1 − δ_i after the epoch, where P(·) denotes the probability measure and V^{π̂^R_t}(s) is the value of state s under the robust policy π̂^R_t. In DSEE, we choose exponentially increasing lengths for the exploitation epochs β_i. During an exploitation epoch, we utilize the estimates R̂_t and P̂_t from the previous exploration epochs and construct the uncertainty sets R^U_t and P^U_t at time t. We use these uncertainty sets with the robust Bellman equation to learn a policy that is robust to the estimation uncertainties with high probability. In the next section, we analyze the DSEE algorithm and design the sequences {ϵ_i}_{i∈N} and {δ_i}_{i∈N} such that the cumulative regret (5.5) grows as a sub-linear function of time.

5.3 Analysis of DSEE algorithm

We now characterize the regret of the DSEE algorithm under assumptions (A1)-(A2) and design the exploration and exploitation epochs. The optimal value V*(s_t) of the state s_t is given by:

V*(s_t) = R(s_t, π*(s_t)) + γ E[V*(s_{t+1}) | s_t, π*(s_t)],  (5.6)

where π* is an optimal policy that satisfies:

π*(s_t) = argmax_{a_t} { R(s_t, a_t) + γ E[V*(s_{t+1}) | s_t, a_t] }.  (5.7)

We define an approximate optimal value function V̂_t that utilizes the estimates R̂_t and P̂_t at time t.
Therefore, ˆVt(st) is given by: ˆVt(st) = ˆRt(st, ˆπt(st)) + γ ˆE (cid:104) ˆVt(st+1)|st, ˆπt(st) (cid:105) , (5.8) where ˆE (cid:104) ˆVt(st+1)|st, ˆπt(st) (cid:105) is used to denote (cid:80) ˆPt(st+1|st, ˆπt(st)) ˆVt(st+1) and ˆπt is an st+1 optimal policy for the approximate optimal value function given by: ˆπt(st) = argmax at (cid:110) ˆRt(st, at) + γ ˆE (cid:104) ˆVt(st+1)|st, at (cid:105)(cid:111) . (5.9) Theorem 4 (Concentration of robust value function). Let ∥ · ∥1 denote the L1-norm of a vector. For any given ϵt, δt ∈ (0, 1), there exists an n ∈ O (cid:16) |S| ϵ2 t + 1 ϵ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt such that if each state-action pair (s, a) is observed nt(s, a) ≥ n times until time t, then for each state s, the following inequality holds: (cid:16) P ∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵt (cid:17) ≥ 1 − δt, (5.10) where V ˆπR t (s) is the value of state s under the robust policy ˆπR value function V R t is defined in (5.3) with ρt = ϵt 2 (cid:16) 2 + Rmaxγ (1−γ)2 t = argmaxa∈A V R (cid:17)−1 and t . The robust RU t = (cid:110) RU (s, a) : |RU (s, a) − ˆRt(s, a)| ≤ ρt, ∀(s, a) (cid:111) , (cid:110) P U t = PU (s, a) : (cid:13) (cid:13) PU (s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) ≤ ρt, ∀(s, a) (cid:111) . (5.11) We prove Theorem 4 using the following Lemmas 6-9. Lemma 6 (Concentration of rewards). Suppose until time step t, the state-action pair (s, a) is observed nt(s, a) times and bounded immediate rewards ri(s, a), i ∈ {1, . . . , nt(s, a)}, are obtained at these instances. Then the following inequality holds: P (cid:12) (cid:16)(cid:12) (cid:12)R(s, a) − ˆRt(s, a) (cid:12) ≤ ϵR (cid:12) (cid:12) t (cid:17) ≥ 1 − δR t , (5.12) where ˆRt(s, a) is the empirical mean reward defined in (5.1) and ϵR t = (cid:113) (Rmax)2 log(2/δR t ) 2nt(s,a) . 64 Proof. For brevity of notation, let ϵR and δR denote ϵR t and δR t , respectively. For bounded random variables ri(s, a), using the Hoeffding bounds [100], we have P (cid:12) (cid:16)(cid:12) (cid:12)R(s, a) − ˆRt(s, a) (cid:12) (cid:12) (cid:12) ≤ ϵR (cid:17) ≥ 1 − 2e− 2nt(s,a)ϵ2 R (Rmax)2 . (5.13) Choosing δR = 2e− 2nt(s,a)ϵ2 R (Rmax)2 , we get the desired result. Lemma 7 (Concentration of transition probabilities). Suppose until time step t, the state-action pair (s, a) is observed nt(s, a) times and let P(s, a) ∈ ∆|S| be the true transition probability distribution for (s, a). Then for any (s, a), the following inequality holds: P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ϵP t ≥ 1 − δP t , (5.14) where ∥ · ∥1 is the L1 norm of a vector and ˆPt(s, a) is the empirical transition probability vector with components ˆPt(s′|s, a) defined in (5.2), and ϵP (cid:113) 2[log(2|S|−2)−log(δP t )] . t = nt(s,a) Proof. For brevity of notation, let ϵP and δP denote ϵP t and δP t , respectively. Using [101, Theorem 2.1], we have: P (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)1 (cid:17) ≤ ϵP ≥ 1 − (2|S| − 2)e− nt(s,a)ϵ2 P 2 . (5.15) Setting δP = (2|S| − 2)e− nt(s,a)ϵ2 P 2 , yields the desired result. Lemmas 6 and 7 provide concentration bounds on the reward and transition probability based on how often a state-action pair is visited. Lemma 8. (Concentration of reward and transition probability functions) Let δR t = t = δt δP 2|S||A| . 
Then, for any ρt > 0, there exists an n ∈ O (cid:16) |S| ρ2 t + 1 ρ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt such that when each state-action pair (s, a) is observed nt(s, a) ≥ n times, then the following statements hold for any (s, a): (cid:16) (i) P |R(s, a) − ˆRt(s, a)| ≤ ρt (cid:17) ≥ 1 − δt 2|S||A| , (ii) P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ρt ≥ 1 − δt 2|S||A| . 65 (cid:13) (cid:13) P(s, a) − ˆPt(s, a) Proof. Using Lemmas 6 and 7, we know that |R(s, a)− ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) (cid:13) (cid:13)1 (cid:114) (Rmax)2 log( 4|S||A| 2nt(s,a) ρt holds for any (s, a) with at least probability 1 − δt for ρt ≥ 2|S||A| and δt ) ≤ (cid:114) ρt ≥ 2[log(2|S|−2)−log( δt 2|S||A|)] nt(s,a) , respectively. Hence, we have that nt(s, a) ≥ max (cid:40)(Rmax)2 log( 4|S||A| 2ρ2 t δt ) , µ ρ2 t (cid:41) , (5.16) where µ = 2 (cid:104) log(2|S| − 2) + log (cid:17)(cid:105) (cid:16) 2|S||A| δt , is sufficient to guarantee that |R(s, a)− ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) Hence, we can choose n ∈ O ≤ ρt holds for any (s, a) with at least probability 1 − δt (cid:16) |S| ρ2 t such that (5.16) holds, and hence, the (cid:16) |S||A| δt + 1 ρ2 t 2|S||A| log (cid:17)(cid:17) . lemma follows. Remark 6 (Concentration inequalities). The concentration inequalities in Lemmas 6, 7, and 8 use a deterministic value of nt(s, a). However, these bound also apply if nt(s, a) is a realization of a random process that is independent of ˆRt(s, a) and ˆPt(s, a), which would be the case in this paper. Remark 7 (Uncertainty set). Using union bounds over all (s, a) and Lemma 8, it follows that uncertainty sets RU t stated in Theorem 4 are ρt-level uncertainty sets for R(s, a) and P(s, a), respectively, for each (s, a) ∈ S × A with at least probability 1 − δt. Thus, the t and P U policy obtained at time t in an exploitation epoch is robust to estimation uncertainties with at least probability 1 − δt. Lemma 9 (Loss in robust value function). Suppose |R(s, a) − ˆRt(s, a)| ≤ ρt and (cid:13) (cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) ≤ ρt, for any (s, a), at time t. Then Lt(s) = V ∗(s) − V ˆπR t (s) satis- fies: (cid:18) Lt(s) ≤ 2ρt 2 + Rmaxγ (1 − γ)2 (cid:19) , (5.17) where V ˆπR t (s) is the value of state s under the robust policy ˆπR t at time t. Proof [Sketch]: The upper bound on the loss is obtained by following a similar analysis as in [102] that provides an upper bound on the loss by considering the uncertainty in P only. 66 The analysis in [102] can be extended by considering the uncertainty in R as well. The details of the proof can be found in [103]. Lemma 9 provides the bounds on the loss in robust value function w.r.t. the optimal value function using the concentration bounds on the rewards and transition probabilities. Proof of Theorem 1: Using Lemma 8, we know that when each state-action pair (s, a) is sampled n ∈ O (cid:16) |S| ρ2 t + 1 ρ2 t log (cid:17)(cid:17) (cid:16) |S||A| δt (s, a): times, then the following inequalities holds for any (cid:16) P |R(s, a) − ˆRt(s, a)| ≤ ρt (cid:17) ≥ 1 − P (cid:13) (cid:16)(cid:13) P(s, a) − ˆPt(s, a) (cid:13) (cid:13) (cid:13)1 (cid:13) (cid:17) ≤ ρt ≥ 1 − , δt 2|S||A| δt 2|S||A| (5.18) (5.19) . Hence, using Lemma 9 and applying union bounds, we obtain that the following holds with at least probability 1 − (δR t + δP t )|S||A| V ∗(s) − V ˆπR t (s) ≤ 2ρt (cid:18) 2 + Rmaxγ (1 − γ)2 (cid:19) . 
Setting ρt = ϵt 2 (cid:16) 2 + Rmaxγ (1−γ)2 (cid:17)−1 and δR t = δP t = δt 2|S||A| , (cid:16) P V ∗(s) − V ˆπR t (s) ≤ ϵt =⇒ P (cid:16) ∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵt (cid:17) (cid:17) ≥ 1 − δt, ∀s ∈ S ≥ 1 − δt. Additionally, the order of n in terms of ϵt becomes n ∈ O (cid:16) |S| ϵ2 t + 1 ϵ2 t log (cid:16) |S||A| δt (cid:17)(cid:17) . (5.20) (5.21) ■ In Theorem 4, we obtain the number of times n each state-action pair needs to be visited to reduce the estimation uncertainty in rewards and transition probabilities to obtain an ϵt-optimal policy with probability at least 1 − δt. Now we estimate the total number of exploration steps that are needed to ensure that each state-action pair is visited at least n times. Lemma 10 (Adapted from [104, Theorem 3] ). For an ergodic Markov chain with state space S and stationary distribution ϕss, let τ = τ (σ) be the σ-mixing time1 with σ ≤ 1 8. Let 1σ-mixing time for an ergodic Markov chain in the minimal time until the distribution of Markov chain is σ-close in total variation distance to its steady state distribution [105]. 67 ϕ0 be the initial distribution on S and let ∥ϕ0∥ϕss = (cid:113)(cid:80) s∈S ϕ0(s)2 ϕss(s) . Let nvis(si, N ) be the number of times state si ∈ S is visited until time N . Then, for any 0 ≤ κ ≤ 1, there exists a constant c > 0 (independent of σ and κ) such that: P (nvis(si, N ) ≥ (1 − κ)N ϕss(si)) ≥ 1 − c∥ϕ0∥ϕsse− κ2N ϕss(si) 72τ . (5.22) Proof. See [104, Theorem 3] for the proof. In the exploration epoch, at any state si ∈ S, we choose actions uniformly randomly. Consider the Markov chain on S that is associated with the uniform action selection policy. Let {ϕ0(s)}s∈S and {ϕss(s)}s∈S, respectively, be the associated initial and stationary distri- bution. We can also consider an equivalent lifted Markov chain on S × A with states (si, aj) such that si ∈ S and aj ∈ A. The lifted Markov chain has the initial and stationary distri- bution, ϕ0(s, a) = ϕ0(si) |A| and ϕss(s, a) = ϕss(si) |A| , respectively. Hence, we can apply Lemma 10 to obtain the probability of visiting a state-action pair (s, a) at least (1 − κ)N ϕss(s, a) times after N time steps under the uniform action selection policy. We now design a sequence of exploration and exploitation epochs. Let αi and βi be the lengths of the i-th exploration and exploitation epoch, respectively. Let ϕmin ss := min(s,a)∈S×A ϕss(s, a) ¯Ni (1−κ)ϕmin and Ni = δαi := c∥ϕ0∥ϕsse− κ2Niϕmin obtained by tuning N and κ in (5.22). 72τ ss ss , where ¯Ni is the upper bound in (5.16) associated with (ϵi, δi). Let . Note that the desired values of (1 − κ)N ϕss(si, aj) and δαi can be Theorem 5 (Regret bound for DSEE algorithm). Let the length of exploitation epochs in DSEE be exponentially increasing, i.e. βi = ηi, η > 1. Let ϵi = η− i 3 and δi = η− i 3 such that P(∥V ∗(s) − V ˆπR t (s)∥∞ ≤ ϵi) ≥ 1 − δi after exploration epoch i. For any δ ∈ (0, 1), set δαi = 6δ |S||A|π2i2 . Then, the cumulative regret for the DSEE algorithm RT ∈ O((T ) 2 3 log(T )) grows sub-linearly with time T with probability at least 1 − δ. Proof. We note that the system state at the start of the i-th exploration epoch might be different from the final state at the end of the (i − 1)-th exploration epoch. Therefore, we 68 remember the final state of the previous exploration epoch and wait for the same state to restart the new exploration epoch. For the ergodic MDP under the uniform action selection policy (assumption A2), we know that the expected hitting time is finite [106]. 
Let U ∈ R>0 be a constant upper bound on the expected hitting time to reach the final state in the previous exploration epoch from an arbitrary initial state in the current exploration epoch. Hence, the cumulative regret during the i-th exploration epoch of length αi is upper-bounded by (U + αi)Rmax, where Rmax = Rmax 1−γ is the maximum instantaneous regret. Since at start of the exploitation epoch i of length βi, P(∥V ∗(s)−V ˆπR t (s)∥∞ ≤ ϵi) ≥ 1−δi, the expected cumulative regret during the exploitation epoch is (1 − δi)βiϵi + δiβiRmax. Therefore, the total cumulative regret after k sequences of exploration and exploitation each is upper bounded by: RTk ≤ ≤ k (cid:88) i=1 k (cid:88) i=1 ((αi + U )Rmax + (1 − δi)βiϵi + δiβiRmax) ((αi + U )Rmax + βiϵi + δiβiRmax) . (5.23) j=1(αj +U )+(cid:80)k Let Ti be the time at the end of the i-th exploitation epoch. Then, (cid:80)k (cid:80)k j=1 βj < Tk ≤ j=1 βj. We design the length of the exploitation epochs to be exponentially j=1 ηj) = O(ηk). Let ϵi = η−di and δi = η−gi, where d ∈ (0, 1) and g ∈ (0, 1) are constants that we design later. Thus, (5.23) increasing, i.e., βi = ηi, for η > 1. Thus, Tk ∈ O((cid:80)k can be written as: RTk ≤ Rmax (cid:33) αi + kU + (cid:32) k (cid:88) i=1 k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g). (5.24) For a state-action pair (s, a), where s ∈ S and a ∈ A, let δαi (s,a) := c∥ϕ0∥ϕsse− κ2Niϕss(s,a) 72τ . Therefore, using Lemma 10, at the end of the i-th epoch, P (nvis(s, a, Ni) ≥ (1 − κ)Niϕss(s, a)) ≥ 1 − δαi (s,a). (5.25) Recall δαi := c∥ϕ0∥ϕsse− κ2Niϕmin 72τ ss , where ϕmin ss := min(s,a) ϕss(s, a). Substituting Ni = 69 ¯Ni (1−κ)ϕmin ss in (5.25), P (cid:0)nvis(s, a, Ni) ≥ ¯Ni (cid:1) ≥ 1 − |S||A| (cid:88) m=1 δαi, (5.26) for each state-action pair (s, a). Therefore, in Ni time steps of the lifted Markov chain, each (s, a) is visited at least ¯Ni times with high probability. Thus, Ni is an upper bound on (cid:80)i j=1 αj with probability in (5.26). Therefore, using union bounds, with high probability 1 − (cid:80)k j=1 (cid:80)|S||A| m=1 δαj , (cid:80)k j=1 αj ≤ Nk = ¯Nk (1−κ)ϕmin ss , and hence, RTk ≤ Rmax ¯Nk (1 − κ)ϕmin ss + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g). (5.27) Using Theorem 4, ¯Nk ∈ O (cid:16) |S| ϵ2 k + 1 ϵ2 k log (cid:17)(cid:17) (cid:16) |S||A| δk . Therefore, RTk ≤ Rmaxλ (1 − κ)ϕmin ss (cid:18) |S| ϵ2 k + 1 ϵ2 k log (cid:18) |S||A| δk (cid:19)(cid:19) + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g) ≤ Rmaxλ (1 − κ)ϕmin ss (cid:0)η2dk|S| + η2dk log (cid:0)ηgk|S||A|(cid:1)(cid:1) + kU Rmax + k (cid:88) i=1 ηi(1−d) + Rmax k (cid:88) i=1 ηi(1−g), (5.28) for some constant λ. Recall that Tk ∈ O(ηk), which implies k ∈ O(log(Tk)). Let Z be the right-hand side of (5.28). Then, we have: Z ∈ O (cid:0)(Tk)2d + (Tk)2d log(Tk) + (Tk)(1−d) + (Tk)(1−g)(cid:1) ∈ O((Tk) 2 3 log(Tk)), (5.29) by choosing d = g = 1 3 linearly with time Tk with probability at least 1 − (cid:80)k . Hence, the cumulative regret RTk ∈ O((Tk) 2 (cid:80)|S||A| m=1 δαi. i=1 3 log(Tk)) grows sub- Setting δαi = 6δ |S||A|π2i2 , we have (cid:80)k i=1 (cid:80)|S||A| m=1 δαi ≤ δ. 5.4 Conclusions We proposed a DSEE algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aims to simultaneously learn the system model, i.e., an MDP, 70 and the associated optimal policy. During exploration, we uniformly sample the action in each state and update the estimates of the mean rewards and transition probabilities. 
These estimates are used in the exploitation epoch to obtain a robust policy with high probability. We designed the length of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time. 71 CHAPTER 6 FOSTERING HUMAN LEARNING IN SEQUENTIAL DECISION-MAKING: UNDERSTANDING THE ROLE OF EVALUATIVE FEEDBACK In this chapter, we investigate the role of feedback in fostering human learning in sequential decision-making tasks. Cognitive rehabilitation, STEM skill acquisition, and coaching games such as chess often require tutoring decision-making strategies. The advancement of AI- driven tutoring systems for facilitating human learning requires an understanding of the impact of evaluative feedback on human decision-making and skill development. To this end, we conduct human experiments using Amazon Mechanical Turk to study the influence of evaluative feedback on human decision-making in sequential tasks. In these experiments, participants solve the Tower of Hanoi puzzle and receive AI-generated feedback while solving it. We examine how this feedback affects their learning and skill transfer to related tasks. We also explore various computational models to understand how people incorporate evaluative feedback into their decision-making processes. 6.1 Background and Problem Formulation We investigate the influence of evaluative feedback on human performance in a sequential decision-making task through experimental evaluations and computational modeling. To this end, we conducted experiments, where the participants were asked to solve the ToH puzzle. ToH is a puzzle in which disks with a priority order are placed on three pegs. The priority order determines which disk can be placed on top of another disk and each instance of admissible disk placement is referred to as a configuration. Thus, for a four-disk and a five- disk ToH, there are 34 = 81 and 35 = 243 possible configurations, respectively. The goal is to move one disk at a time and reach the desired configuration while maintaining the priority order at each time. Consider the ToH puzzle with n disks, where the disks are numbered {0, 1, . . . , n − 1} in ascending order of size, and the three pegs are numbered {0, 1, 2} from left to right. The state of the n-disk ToH can be represented as Sn = (s0s1 . . . sn−1), where si ∈ {0, 1, 2} denotes the 72 Figure 6.1 State space of a 4-disk ToH with 81 states. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The task is to reach the configuration associated with a randomly selected target state (for example 2201 in this figure). Warmer colors are associated with the higher value function (see Sec. 6.1.1 for discussion). peg on which disk i is placed, for 0 ≤ i ≤ n − 1. Each state in an n-disk ToH has either two or three possible state transitions as can be seen by the state space of a 4-disk ToH shown in Fig. 6.1. 6.1.1 Evaluative Feedback We train an RL agent that is capable of optimally solving the ToH puzzle. The ToH is a finite state space and finite action puzzle, and thus, an optimal policy can be derived using tabular RL methods as described in [31, 107]. In Figure 6.1, we demonstrate the optimal value function for the standard 4-disk ToH. 
To obtain the optimal value function for a given target state, we utilize the value iteration algorithm [31], where the reward function r(s) : S → R is designed as follows:   r(s) = 1, if s is the target state, (6.1)  0, otherwise. Using the reward function in (6.1) results in an optimal value function that is proportional to the length of the shortest path for each state to the target state. The obtained optimal value 73 Figure 6.2 State space of a 4-disk ToH with 81 states. Each state corresponds to a unique configuration of the disks on three pegs and edges encode allowed transitions between states. The state space can be visualized as comprising three triangular structures. The states that connect different triangular structures are critical states to transition between triangles. function is utilized to provide evaluative feedback to the human player based on the change in the value at states before and after the move. We deploy several feedback mechanisms as detailed in Section 6.2.1 and systematically explore how human decision-making is influenced by different feedback mechanisms. The state space of the ToH problem exhibits a recursive structure. Specifically, the state space of a ToH puzzle with n disks can be effectively illustrated using three interlocking triangles. Each of these triangles symbolizes the state space of a ToH puzzle with n − 1 disks. To illustrate this concept, let’s examine the state space of a 4-disk ToH in Figure 6.2, which is highlighted in red. In the same figure, the blue and green squares are employed to represent the state spaces of 3-disk and 2-disk ToH puzzles, respectively. Hence, the state space for the ToH with n − 1 disks can be simply achieved by removing the last digit from each state in the upper triangle of the n-disk ToH. This digit corresponds to the position of the largest disk. As illustrated in Fig. 6.2, the state space of the 4-disk ToH puzzle can be decomposed into three triangles labeled as T1, T2, and T3. Throughout the remainder of the manuscript, 74 we will consistently refer to the regions of the state space as follows: the top triangle will be denoted as T1, the lower left triangle as T2, and the lower right triangle as T3. These triangles are interconnected at their vertices through single edges. These vertex states are critical states, transitioning from one triangle to another necessitates passing through these states. For instance, starting from an initial state in T1, the optimal path to reach a desired state in T2 or T3 must involve the state transitions 1110 −→ 1112 and 2220 −→ 2221, respectively. Indeed, to master the art of solving the ToH puzzle effectively, one must grasp its inherent recursive structure. Success in solving the puzzle relies on systematically working towards reaching the crucial critical states within the state space. 6.1.2 Human Rewards using Maximum Entropy Inverse Reinforcement Learn- ing In the context of human participants solving the ToH puzzle, we can perceive them as noisy optimal agents striving to optimize an implicit reward function. Utilizing their demonstrations, we can leverage Inverse Reinforcement Learning (IRL) techniques to deduce a reward function [108]. This reward function is designed to align the optimal policy with the observed human demonstrations. The maximum entropy IRL [109, 110] assumes human demonstrations are not perfect and allows us to learn from sub-optimal demonstrations by incorporating a probabilistic model that captures the variability in human behavior. 
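Before turning to the details of the IRL formulation, the value iteration of Sec. 6.1.1 that produces the value function used for the evaluative feedback can be sketched as follows. This is a minimal illustration, assuming the state encoding of Sec. 6.1 (a string s_0s_1s_2s_3 giving the peg of each disk), the reward (6.1) with the target treated as absorbing, and the example target 2201 from Fig. 6.1; it is not the dissertation's implementation.

from itertools import product

# Value iteration for the 4-disk Tower of Hanoi with the sparse reward (6.1).
n_disks, gamma = 4, 0.9
states = [''.join(map(str, cfg)) for cfg in product(range(3), repeat=n_disks)]  # 81 states

def neighbors(state):
    # legal single-disk moves: only the smallest disk on a peg can move, and it
    # may only be placed on an empty peg or on a larger disk
    pegs = {p: [d for d in range(n_disks) if int(state[d]) == p] for p in range(3)}
    moves = []
    for p_from in range(3):
        if not pegs[p_from]:
            continue
        top = min(pegs[p_from])
        for p_to in range(3):
            if p_to != p_from and (not pegs[p_to] or top < min(pegs[p_to])):
                moves.append(state[:top] + str(p_to) + state[top + 1:])
    return moves

target = '2201'                      # example target state from Fig. 6.1
V = {s: 0.0 for s in states}
V[target] = 1.0                      # reward (6.1): r = 1 only at the target
for _ in range(100):                 # value-iteration sweeps
    for s in states:
        if s != target:
            V[s] = gamma * max(V[s2] for s2 in neighbors(s))
print(round(V['0000'], 3))           # value of the all-disks-on-peg-0 start state

In this sketch the converged value of a state decays geometrically with its shortest-path distance to the target (a monotone transformation of the path length), so the sign of the change in value across a move indicates whether the move shortened or lengthened the optimal path, which is what makes it a natural per-move evaluative signal.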
Maximum entropy IRL has gained significant traction in the literature as a means to effectively learn from human demonstrations [111]. Consider a Markov decision process [112] M = {S, A, T, γ, r}, where S is the state space, A is the action space, T^{sa}_{s′} is the probability of transitioning from state s ∈ S to state s′ ∈ S under action a ∈ A, γ ∈ [0, 1) is the discount factor, and r : S × A → R is the reward function. In the case of the n-disk ToH, each edge originating from a given state s in the state transition graph can be considered a unique action. Under these actions, the transition probability T^{sa}_{s′} equals 1 for the transition from state s to s′ if they are connected through an edge. Let D = {ζ_1, . . . , ζ_N} be the set of N demonstrations, where each demonstration ζ_i is a path ζ_i = {(s_{i,0}, a_{i,0}), . . . , (s_{i,T}, a_{i,T})}. The unknown reward function r is expressed as a linear combination of a set of predefined features f : S → R, and the weights associated with these features are learned through the maximum entropy IRL algorithm.

In the maximum entropy IRL framework, the probability of following a particular trajectory ζ is proportional to the exponential of the accumulated reward along that path. This leads to a stochastic behavior model in which the probability of taking a specific action a in a given state s is determined by the exponential of the expected total reward following that action, i.e., P(a|s) ∝ exp(Q^r_{sa}), where Q^r_{sa} is computed as Q^r_{sa} = r + γ T V^r_s. The value function V^r_s is computed using a "soft" variant of the familiar Bellman operator: V^r_s = log Σ_a exp(Q^r_{sa}). Consequently, the probability of action a in state s is normalized by exp(V^r_s), yielding P(a|s) = exp(Q^r_{sa} − V^r_s). The complete log-likelihood of the observed data under the reward function r can be expressed as:

log P(D|r) = Σ_i Σ_t log P(a_{i,t}|s_{i,t}) = Σ_i Σ_t ( Q^r_{s_{i,t} a_{i,t}} − V^r_{s_{i,t}} ).  (6.2)

Interested readers are referred to [111] for detailed derivations. We employ maximum entropy IRL to infer the reward functions associated with human behavior. Detailed results are presented in Section 6.3.2.

6.1.3 Modeling human sequential decision-making under feedback

A central and challenging aspect of designing efficient tutoring systems lies in understanding the impact of evaluative feedback from AI on human decision-making. Specifically, the question of how humans incorporate feedback into their decision-making processes is of paramount importance. In modeling this process, a foundational challenge is understanding how they interpret the feedback, including whether it is seen as an immediate reward or as an evaluation of long-term impacts. Further, it is important to explore whether feedback relates only to the current action or spans the sequence of actions. Additionally, it is vital to understand whether evaluative feedback affects the assessment of value functions over time or momentarily influences action choices. To tackle these questions, we develop candidate models that embody different mechanisms for incorporating feedback into human decision-making processes. Our models are inspired by the Training an Agent Manually via Evaluative Reinforcement (TAMER) framework [113–116], developed to incorporate human feedback into the policy of an artificial RL agent. Let Ĥ(s, a) and |f| denote the evaluative feedback and the number of predefined features f employed to define human rewards, respectively.
We study four different models detailed as follows: • Model 1 - Ignore feedback: This baseline model operates under the assumption that evaluative feedback isn’t directly integrated into human decision-making processes. Instead, individuals are postulated to focus on maximizing the long-term value derived from their personal reward functions. In this framework, evaluative feedback plays an indirect role by shaping and refining these reward functions. This is the default model studied in Sec. 6.1.2. The model encompasses |f | learned parameters. • Model 2 - Update Q(s, a): In this model, we postulate that humans interpret evaluative feedback as an indicator of the long-term effectiveness of their strategic actions, serving as an approximation of Qr sa . The model integrates this feedback to update the Q- estimate as follows: Q′(s, a) = Q(s, a) + k ˆH(s, a), (6.3) where k is a parameter to be learned. Consequently, the policy gets updated as P (a|s) = exp (Qsa ′ − Vs ′), where V ′ s = log (cid:80) a exp (Q′ sa) denotes the newly adjusted value function. The model encompasses |f | + 1 learned parameters. • Model 3 - Update r(s, a): In this model, we postulate that humans perceive the eval- uative feedback as a measure of the myopic effectiveness of the strategy, serving as an 77 approximation of r(s, a). The model updates the human rewards as follows: r′(s, a) = r(s, a) + k ˆH(s, a), (6.4) where k is a parameter to be learned. The updated reward function is used to estimate the Q-values and the policy. The model encompasses |f | + 1 learned parameters. • Model 4 - Feedback as a measure of Q(s, a): In this model, we assume that humans ignore learning by interaction and treat evaluative feedback as a fixed measure of Q(s, a). Therefore, Q′(s, a) = k ˆH(s, a), (6.5) where k is the only parameter to be learned. Remark 8. It’s worth noting that the log-likelihood of the maximum entropy IRL depends solely on Q(s, a) through P (a|s). Consequently, Model 2 can be considered equivalent to another model where humans utilize feedback solely to influence their action selection. In this alternate model, humans do not incorporate evaluative feedback into their estimation of Q(s, a); rather, they use it exclusively to bias their action selection, i.e., P (a|s) ∝ exp(Q(s, a) + k ˆH(s, a)). In the context of maximum entropy IRL, this model is tanta- mount to Model 2, where humans employ evaluative feedback to update their Q-estimates as Q′(s, a) = Q(s, a) + k ˆH(s, a). We investigate these models in Sec 6.3.3 to understand how humans incorporate evalua- tive feedback in their decision-making. 6.2 Human Experiments In this section, we discuss the human experiments conducted using AMT. 6.2.1 Experiment Design We examine the effect of evaluative feedback on sequential decision-making using ToH task. To achieve this, we designed five separate experiments, each featuring a different type of feedback. Participants for each experiment were recruited randomly through AMT. Each 78 participant first solved a 4-disk ToH task ten times (training task) and then a 5-disk ToH task five times (transfer task) to evaluate their skill transfer to a more challenging task. The initial state of each puzzle was standardized with all the disks located on the first peg. Considering the state-space of the puzzle as comprising of three interlocking triangles, the target state was randomly selected from the states within the triangles that did not include the initial state, i.e., triangles T2 and T3. 
The participants were given a maximum number of moves mallowed to solve the puzzle, calculated as : mallowed = ⌈1.5 × mmin⌉, where mmin represents the minimum number of moves from the initial configuration to the final configuration, as determined by the minimum path length in the state graph (Fig. 6.1). The only difference among the experiments was the feedback provided during the 4-disk ToH training task. No feedback was provided during the 5-disk ToH transfer task in any of the experiments. In each experiment, participants were asked to try their best to get the highest scores. The feedback and scoring metrics used during the training task for the five experiments were: (i. Experiment 1 - No feedback: In this experiment, the participants solved the 4-disk ToH puzzle without any feedback. The scoring metric for these tasks was selected as: S = 10(mallowed − mused + 1), (6.6) where mused is the total moves used to solve the puzzle. The participant receives a score of 0 if the puzzle remains unsolved after exhausting the allowed number of moves. The same scoring metric was used in the 5-disk transfer tasks in all the experiments. (ii. Experiment 2 - Numeric feedback: The participants in this experiment received visual feedback on each move they made while solving the 4-disk ToH. The feedback was in the form of text that reads “good move +2" or “bad move −2", indicating whether the 79 move increased or decreased the value of the state, respectively. The scoring metric for the training tasks in this experiment was selected as: S = 10(mallowed − mused + 1) + 2(mgood − mbad), (6.7) where mgood ∈ N and mbad ∈ N denote the number of good and bad moves, respectively. (iii. Experiment 3 - Optional feedback: In this experiment, the participants did not receive visual feedback automatically but had the option to request it by pressing a button, which came at the cost of a small penalty. If the participant requested feedback, they would receive the same visual feedback as in Experiment 2, which evaluated their last move. The scoring metric for the training tasks in this experiment was as follows: S = 10(mallowed − mused + 1) − foptional, (6.8) where foptional ∈ N denotes the number of times the participant requests feedback. (iv. Experiment 4 - Sub-goal: In the state graph of the 4-disk ToH task (as illustrated in Fig. 6.1), states 1110 and 2220 are critical in reaching the target states efficiently in triangles T2 and T3, respectively. In this experiment, based on the target configuration, the participants were presented with an intermediate sub-goal configuration (1110 or 2220) in addition to the target configuration. The participant was instructed to try to reach the intermediate sub-goal first. The scoring metric for the training tasks in this experiment was as follows: S = 10(mallowed − mused + 1) + 5ssubgoal, (6.9) where ssubgoal ∈ {0, 1} was set to 1 if the participant successfully reaches intermediate sub-goal configuration, and 0 otherwise. (v. Experiment 5 - Sub-goal with numeric feedback: In this experiment, the participants received both the visual feedback as in Experiment 2 and the intermediate sub-goal 80 configuration as in Experiment 4. The scoring metric for the training tasks in this experiment was calculated as follows: S = 10(mallowed − mused + 1) + 2(mgood − mbad) + 5ssubgoal. (6.10) This experiment provided the participants with the maximum amount of evaluative feedback. 
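The five scoring rules (6.6)-(6.10) can be summarized in a single function. The sketch below is for illustration only: the experiment labels and argument names are hypothetical, and the zero-score rule for unsolved puzzles is assumed to apply across all experiments.

import math

def score(experiment, m_allowed, m_used, solved,
          m_good=0, m_bad=0, f_optional=0, reached_subgoal=False):
    if not solved:
        return 0                                   # unsolved puzzles score 0
    base = 10 * (m_allowed - m_used + 1)           # common term in (6.6)-(6.10)
    if experiment == "no_feedback":                # (6.6)
        return base
    if experiment == "numeric_feedback":           # (6.7)
        return base + 2 * (m_good - m_bad)
    if experiment == "optional_feedback":          # (6.8)
        return base - f_optional
    if experiment == "subgoal":                    # (6.9)
        return base + 5 * int(reached_subgoal)
    if experiment == "subgoal_numeric":            # (6.10)
        return base + 2 * (m_good - m_bad) + 5 * int(reached_subgoal)
    raise ValueError(experiment)

# example: a numeric-feedback trial with shortest path of 15 moves, solved in 20 moves
m_min = 15
m_allowed = math.ceil(1.5 * m_min)                 # allowed moves, as defined above
print(score("numeric_feedback", m_allowed, m_used=20, solved=True,
            m_good=15, m_bad=5))                   # 10*(23-20+1) + 2*(15-5) = 60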
6.2.2 Methods After receiving the IRB consent (MSU IRB #8421) from Michigan State University’s IRB office, we recruited 238 participants using AMT for the study. Inclusion criteria were established as having completed a minimum of 500 prior studies and maintaining a 98% approval rate on the platform. Participants were compensated with a base payment of $6 and had the opportunity to earn additional performance-based bonuses ranging from $0 − $4. Of the recruited participants, 78 participants were excluded due to self-reported prior experience with the ToH task. 6.3 Results In this section, we discuss the results of the experiments conducted on AMT. 6.3.1 Performance under Evaluative Feedback First, we collect the data of 20 participants each for the 5 set of experiments detailed in Section 6.2.1. In every experiment, we assess participants’ performance by calculating their percentage scores for both the training and transfer tasks as follows: 100 × (cid:18) mallowed − mused + 1 mallowed − mmin + 1 (cid:19) . In Fig. 6.3a, we present box plots illustrating the percentage scores achieved in the training tasks (4-disk ToH). Notably, participants who underwent training with evaluative feedback in Experiment 2 (numeric feedback) and Experiment 5 (sub-goal with numeric 81 (a) Training. (b) Transfer. Figure 6.3 Box plots displaying percentage scores for both training (a) and transfer (b) tasks. Within each box plot, the median is represented by the red horizontal line, while the lower and upper edges of the box signify the 25th and 75th percentiles, respectively. Whiskers extend to encompass the most extreme data points that are not classified as outliers, and individual outliers are plotted using the symbol ‘+’. feedback) exhibited significantly improved performance during these training tasks compared to participants in Experiment 1 (no feedback), who received no evaluative feedback. In Experiment 3 (optional feedback), participants seldom requested feedback to avoid the feedback penalty, resulting in performance levels akin to those observed in Experiment 1. Experiment 4 (sub-goal) introduced a unique approach, where participants were exclusively exposed to sub-goal configuration (1110 or 2220) crucial for reaching the desired target state. In the absence of evaluative feedback, this method resembled the conditions of Experiment 1, where the sub-goal can be effectively thought as a target state until the sub-goal state is reached. We hypothesize that supplying solely sub-goal configurations without evaluative feedback may induce confusion, as participants may now consider two target states simul- taneously—the sub-goal and the target state. Consequently, participants in Experiment 4 exhibited a marginal decrease in performance compared to those in Experiment 1. In Fig. 6.3b, we present box plots illustrating the percentage scores achieved in the trans- fer tasks involving the 5-disk ToH. It’s important to note that solving the 5-disk ToH, with its 243 states, presents a significantly greater challenge compared to the training task, which involved the 4-disk ToH with 81 states. Furthermore, participants had no prior experience with the 5-disk ToH and relied solely on their training with the 4-disk ToH. Consequently, the transfer tasks yielded relatively lower scores, with many trials failing to solve the puzzle 82 within the allotted number of moves, which can make it challenging to interpret the box plots in Fig. 6.3b. 
To focus on successful outcomes, we filtered for positive percentage scores in each experiment, representing the trials in which participants successfully solved the ToH puzzle. Table 6.1 provides an overview of the percentage of successful trials for each experiment in both the training and transfer tasks. Notably, Experiments 2 and 5 demonstrated a substantial improvement in successful trials, showing increases of 33.5% and 36%, respectively, over Experiment 1 in the training tasks. In the transfer tasks, Experiments 2 and 5 also showed notable improvements, with success rates increasing by 13% and 26%, respectively, compared to Experiment 1.

To assess the statistical significance of these findings, we conducted two-sample t-tests comparing the results of Experiments 2 and 5 with the data from Experiment 1. Remarkably, the p-values for Experiment 2 (in comparison to Experiment 1) and Experiment 5 (relative to Experiment 1) are 1.59 × 10^{−12} and 1.71 × 10^{−17}, respectively, in the training tasks, indicating highly significant differences. In the transfer tasks, the p-values are 3.9 × 10^{−2} and 7.17 × 10^{−4} for Experiments 2 and 5 compared to Experiment 1, respectively. Consistent with the commonly accepted significance level of 0.05, a p-value below this threshold leads us to reject the null hypothesis, indicating that the data from the two experiments do not arise from the same distribution at the 5% significance level. These results underscore the substantial impact of evaluative feedback on performance in both the training and transfer tasks.

                                   Training   Transfer
No feedback                        58%        28%
Numeric feedback                   91.5%      41%
Optional feedback                  62%        35%
Sub-goal                           52.5%      22%
Sub-goal with numeric feedback     94%        54%

Table 6.1 Percentage of successful trials in the training and transfer tasks.

Fig. 6.4a and 6.4b display box plots representing successful trials after filtering for positive scores. Notably, the medians of these box plots closely align with each other, suggesting that participants' performances in the experiments can be effectively compared solely through the percentage of successful trials. Once participants have successfully learned to solve the ToH puzzle, their scores exhibit relatively little variation across experiments during successful trials. This observation highlights the stability and consistency of participants' performance once they have mastered the task.

Figure 6.4 Box plots displaying positive percentage scores for both training (a) and transfer (b) tasks.

Figure 6.5 Bar plots displaying the mean percentage scores for different trials for both training (a) and transfer (b) tasks.

Recall that each participant completed 10 trials of the training task and 5 trials of the transfer task. In Figures 6.5a and 6.5b, bar plots represent the mean percentage scores for different trials in the training and transfer tasks, respectively. It is evident that participants who received no feedback exhibited relatively low scores compared to those who received either numeric feedback or sub-goals with numeric feedback. Furthermore, while there is no consistent improvement over the trials for participants who did not receive feedback, participants who received evaluative feedback demonstrated performance enhancement with increasing scores across trials.
Similar trends are observable in the transfer tasks, indicating that participants who received evaluative feedback found it easier to transfer their skills to related tasks and showed improvement across trials. The results in Table 6.1 underscore significant improvements in human decision-making attributed to evaluative feedback during training tasks, along with effective skill transfer to related tasks. We employ maximum entropy IRL [110] to investigate the pivotal role of evaluative feedback in shaping human decision-making, as detailed in Sections 6.3.2 and 6.3.3. To enable this analysis, we conducted additional data collection sessions with 20 participants each, encompassing experiments devoid of feedback (Experiment 1) and those involving evaluative feedback (Experiments 2 and 5). 6.3.2 Human Rewards under Evaluative Feedback In this section, we treat humans solving the ToH puzzle as noisy optimal agents striving for optimal play with some implicit reward structure. We examine participants from three sets of experiments: (a) No feedback (Experiment 1), (b) Numeric feedback (Experiment 2), and (c) Sub-goal with numeric feedback (Experiment 5). To gain insights into human learning under these varying feedback conditions, we employ maximum entropy IRL analysis to uncover the underlying human reward structures. Visualizing these human rewards can offer valuable insights into the learning process with and without evaluative feedback. Recall that for each experiment, the initial state is standardized with the starting state represented by the top vertex of triangle T1, and the target state is randomly selected from either triangles T2 or T3 (see Fig. 6.2). For each experiment, we partition the experimental data into two sets, one with target states in T2 and the other with target states in T3. In each of these sets, we learn the human rewards expressed as a linear combination of predefined features. By modifying these predefined features, we consider two different settings where the human rewards are learned for all the states and for a subset of 8 states. To estimate the rewards, we maximize the log-likelihood as defined in (6.2), while applying an L1 penalty to promote sparse rewards. To determine the coefficient of the L1 penalty λ ∈ R≥0, we consider 85 (a) Learned rewards in the training tasks for all states using all trajectories. (b) Learned rewards in the training tasks for all states using only successful trajectories. Figure 6.6 IRL plots displaying learned human rewards in the training tasks for all states, using trajectory datasets (from trials 6-10 for each participant) from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. λ ∈ {0, 0.1, . . . , 2} and perform 5-fold cross-validation on the data and select the coefficient that yields the maximum mean log-likelihood across the 5-fold validation sets. Fig. 6.6a and 6.6b display the learned IRL rewards in the training tasks for all states. While IRL typically assumes expert demonstrations, it’s important to note that participants may still be learning the task during the initial trials. Since the performance does not vary 86 significantly in the latter half of the trials (see Fig. 6.5a), we assume that the human rewards are relatively stationary from trials 6 to 10 and, therefore, exclusively utilize these trials for our IRL analysis. 
From these latter trajectories, we derive IRL rewards, considering both (a) all available trajectories and (b) only the successful ones, where success is defined by reaching the target state. Each of these plots is organized into a grid with 2 rows and 3 columns. The top row represents trajectories with the target state in triangle T2, while the bottom row represents trajectories with the target state in triangle T3. The columns correspond to the three sets of experiments: no feedback, numerical feedback, and sub-goal with numerical feedback arranged from left to right. In Fig. 6.6a, it becomes apparent that participants’ rewards in the experiment with no feedback (first column) exhibit a distribution across all states, encompassing both T2 and T3, despite the target state’s placement in T2 for the first row and in T3 for the second row. The occurrence of high rewards in T3 (respectively T2) when the target state resides in T2 (respectively T3) primarily stems from the unsuccessful attempts to solve the ToH puzzle in each experiment. Consequently, we observe that as participants’ performance improves across experiments from left to right, rewards increase within the triangle containing the target state while decreasing in the opposing triangle. Another noteworthy observation is the presence of high rewards at the critical states (vertices of the target triangle), which serve as pivotal entry points to the target triangle. These rewards become more pronounced as performance enhances from left to right. Fig. 6.6b depicts the learned IRL rewards derived exclusively from successful trajectories in each experiment. Due to the absence of failed trajectories in each experiment, the dis- parities in IRL rewards across experiments, from left to right, become less pronounced. In each experiment, states within the target triangle and critical states exhibit higher rewards compared to the opposing triangles. In Experiments 1 and 2, the elevated rewards along the edge in the opposite triangle, which is closer to the target triangle, suggest that participants 87 in these experiments occasionally complete the puzzle by opting for suboptimal routes. In contrast, participants in Experiment 5 predominantly solve the puzzle utilizing the optimal trajectory. (a) Learned rewards in the transfer tasks for all states using all trajectories. (b) Learned rewards in the transfer tasks for all states using only successful trajectories. Figure 6.7 IRL plots displaying learned human rewards in the transfer tasks for all states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. Fig. 6.7a and 6.7b present the learned IRL rewards for all states within the transfer tasks, utilizing trajectory datasets that encompass (a) all available trajectories and (b) only suc- 88 cessful trajectories. It is important to note that the transfer tasks pose significant challenges, with none of the participants receiving any feedback. Consequently, the trajectories for the transfer tasks in each experiment comprise numerous failed trajectories. However, a noticeable trend emerges: participants from Experiment 5, who were trained using sub-goals with numeric feedback, exhibit faster learning in solving the transfer tasks compared to participants from Experiments 1 and 2, who received no feedback and only numerical feedback, respectively. 
This is evident from the higher rewards within the target triangle and lower rewards in the opposite triangle for Experiment 5. When considering only successful trajectories to derive the IRL rewards in Fig. 6.7b, the differences across experi- ments become less pronounced due to the exclusion of failed trajectories in all experiments. The results presented in Fig. 6.6 and 6.7 offer valuable insights into how humans acquire puzzle-solving skills under various evaluative feedback strategies. However, it’s worth noting that the learned rewards appear less sparse due to the predefined features, which permit non-zero rewards in all states. Consequently, while these learned IRL rewards for all states offer insights into critical states, they can complicate the comparison between experiments. Furthermore, most of the RL rewards are often sparse. To this end, we modify the predefined features to encourage sparser rewards, allowing non-zero rewards in only 8 states for both the training and transfer tasks. These 8 states were thoughtfully selected as the vertices of the smaller triangles within the state space. In Fig. 6.2, these states correspond to 2200, 1100, 1110, 2220, 0012, 2212, 1121, 0021. Fig. 6.8a and 6.8b illustrate the learned IRL rewards for a specific subset of 8 states during the training tasks. These rewards are derived from trajectory datasets obtained from the latter half of the trials (trials 6 to 10) for each participant. We consider two scenarios: (a) using all available trajectories and (b) using only the trajectories that resulted in successful task completion. It is evident that participants from Experiment 5 demonstrate non-zero rewards exclusively within the target triangle and the corresponding critical states. As we progress from left to right, the non-zero rewards in the opposite triangle diminish due to 89 (a) Learned rewards in the training tasks for a subset of 8 states using all trajectories. (b) Learned rewards in the training tasks for a subset of 8 states using only successful trajectories. Figure 6.8 IRL plots displaying learned human rewards in the training tasks for a subset of 8 states, using trajectory datasets (from trials 6-10 for each participant) from each experi- ment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. fewer instances of failure. These differences become less pronounced when we solely consider successful trajectories in Fig. 6.8b. Fig. 6.9a and 6.9b depict the learned IRL rewards for a selected subset of 8 states within the transfer tasks, using trajectory datasets that encompass (a) all available trajectories and 90 (a) Learned rewards in the transfer tasks for a subset of 8 states using all trajectories. (b) Learned rewards in the transfer tasks for a subset of 8 states using only successful trajectories. Figure 6.9 IRL plots displaying learned human rewards in the transfer tasks for a subset of 8 states, using trajectory datasets from each experiment that encompass (a) all available trajectories and (b) only successful trajectories, where success is defined by reaching the target state. (b) only successful trajectories. While the distinctions are somewhat less pronounced due to the presence of numerous failure attempts in all experiments, the lower rewards in the opposite triangle indicate swifter learning when participants are trained with feedback, in contrast to participants who receive no feedback. 
These differences become less noticeable 91 when we exclusively consider successful trajectories in Fig. 6.9b, effectively eliminating most of the non-zero rewards in the opposite triangle. The results of the max entropy IRL analysis underscore the significance of critical states and demonstrate how human learning in sequential decision-making tasks can be organized more effectively when evaluative feedback is provided, in contrast to participants solely learning through exploration without any feedback. The results further indicate that the participants trained with evaluative feedback exhibit an ability to transfer their learning to newer, related, and more demanding tasks at a significantly accelerated pace compared to those who learn without feedback. 6.3.3 Modeling Human Decision-Making under evaluative feedback In Sec. 6.3.1 and 6.3.2, we have demonstrated the pivotal role of evaluative feedback in enhancing learning and performance within the context of the ToH puzzle. In this section, we delve into exploring models that aim to elucidate the mechanisms through which humans integrate evaluative feedback into their decision-making processes. In our analysis, we explore four distinct models for incorporating feedback into human decision-making, as detailed in Sec. 6.1.3. For each of these models, we calculate both the Akaike information criterion (AIC) [92] and Bayesian information criterion (BIC) [93] to identify the most suitable model. For a given model, the AIC and BIC are defined as: AIC = 2p − 2 log( ˆL), BIC = p log(o) − 2 log( ˆL), (6.11) where p, o, and ˆL denote the number of learned parameters, the number of observations, i.e., the sample size, and the maximized value of the likelihood function of the model, respectively. The model with the lowest AIC (or BIC) is deemed the optimal choice according to AIC (or BIC) criteria. For this analysis, we leverage the experimental data gathered during the training tasks of Experiment 2, where participants received numeric feedback. It is important to note that this numeric feedback is determined based on the change in state value before and after the state transition. Consequently, it is intrinsically tied to the 92 target state, given that the value function is contingent upon the target state. Since the target state is subject to randomization in triangles T2 and T3, we further segment these triangles into three sub-triangles each. This subdivision allows us to categorize the experimental data into six distinct groups, based on the location of the target state within these six sub-triangles. Within each group, we select the top vertex of the sub-triangle as the designated target state and truncate the trajectories to the point at which they initially enter the target sub-triangle. Upon completing this partitioning process for the 200 trajectories obtained from the training tasks, we arrived at six groups, each containing a respective number of trajecto- ries: 41, 34, 27, 31, 32, 35. Within each group, we subject all four models to testing, making appropriate modifications to either the Q-function or the reward function as discussed in Sec. 6.1.3. To estimate the unknown parameters for each model, we employ maximum en- tropy IRL, optimizing the log-likelihood as defined in (6.2) while applying an L1 penalty. To determine the coefficient for the L1 penalty in each model, we consider λ ∈ {0, 0.2, . . . , 1} and perform 5-fold cross-validation. 
This allows us to select the coefficient that results in the highest mean log-likelihood across the five validation sets.

Groups 1-6 contain 41, 34, 27, 31, 32, and 35 observations, respectively. Model 1 has 81 learned parameters, Models 2 and 3 have 82, and Model 4 has 1; AIC and BIC are reported for each model:
23.65 27.76 30.84 21.65 27.69 30.42 26.45 23.02 27.03 31.40 22.70 26.38 34.73 26.94 30.87 25.40 20.68 24.47 31.40 23.49 27.25 34.02 26.56 30.21 23.70 27.82 30.52 21.72 27.76 30.47 27.12 31.50 34.46 25.51 31.51 34.12 22.93 22.89 24.70 24.66 28.54 28.49 22.20 22.16 24.30 24.34 27.62 27.66
Table 6.2 AIC and BIC values (normalized by the number of observations) for different models, allowing non-zero rewards for all 81 states.

Similar to Sec. 6.3.2, we investigate two settings for the predefined features: one that allows non-zero rewards in all states and another that restricts rewards to a subset of 8 states. Table 6.2 presents the AIC and BIC values (normalized by the number of observations) for different models within each group when non-zero rewards are permitted for all 81 states. It is worth noting that, while Model 2, where Q′(s, a) = Q(s, a) + k Ĥ(s, a), emerges as the best fit according to AIC for the majority of the groups, Model 4, where Q′(s, a) = k Ĥ(s, a), is selected as the best fit under the BIC criterion. This preference for Model 4 under BIC is attributed to the significant difference in the number of learned parameters between the two models, with BIC favoring the model with fewer learned parameters. Indeed, allowing non-zero rewards for all 81 states leads to a preference for the model with just a single learned parameter.

Groups 1-6 contain 41, 34, 27, 31, 32, and 35 observations, respectively. Model 1 has 8 learned parameters, Models 2 and 3 have 9, and Model 4 has 1; AIC and BIC are reported for each model:
31.84 41.72 44.04 30.49 42.78 43.53 32.18 22.54 22.92 31.89 42.08 24.03 24.43 41.77 44.43 27.74 28.17 44.12 30.86 21.26 21.68 30.56 43.15 23.62 24.03 42.84 43.59 43.89 27.25 27.65 32.27 42.18 44.55 30.97 43.26 43.99 22.96 22.92 24.80 24.76 28.48 28.43 22.25 22.20 24.45 24.41 27.57 27.62
Table 6.3 AIC and BIC values (normalized by the number of observations) for different models, allowing non-zero rewards for a subset of 8 states.

Table 6.3 presents the AIC and BIC values (normalized by the number of observations) for different models within each group when non-zero rewards are allowed for only a subset of 8 states. This setting represents a more realistic scenario with sparse rewards. Notably, in this context, Model 2 consistently emerges as the best fit according to both the AIC and BIC criteria. This suggests that humans tend to interpret evaluative feedback as a strong indicator of the long-term effectiveness of their strategic actions.

Remark 9. Even though Model 2 stands out as the preferred model according to both AIC and BIC criteria, there is some evidence of support for Model 4 as well (in the case of non-sparse rewards). This suggests that there might be instances where some individuals do not primarily learn through interaction but instead focus on maximizing their evaluative feedback directly. Such individuals could potentially encounter challenges in transfer tasks where evaluative feedback is not available.

6.4 Conclusions

In this work, we study the influence of AI-generated evaluative feedback on human decision-making, with a specific focus on sequential decision-making tasks exemplified by the Tower of Hanoi.
Our study demonstrates that individuals who receive training with evaluative feedback not only experience significant improvements in their decision-making abilities but also excel in transferring these enhanced skills to similar tasks. Through an analysis utilizing the maximum entropy inverse reinforcement learning framework, we show that human learning exhibits a more structured and organized implicit reward pattern when evaluative feedback is provided during the training process. This highlights the critical role played by AI-generated feedback in improving the cognitive and strategic abilities of individuals. Furthermore, our investigation explores various models to better comprehend how humans integrate feedback into their decision-making processes. Our findings provide substantial evidence suggesting that individuals tend to interpret evaluative feedback as a valuable indicator of the long-term effectiveness of their strategic actions. This insight can be leveraged to design intelligent IoT devices capable of enriching human learning experiences and shaping human decision-making.

CHAPTER 7
INCENTIVIZING COLLABORATION IN HETEROGENEOUS TEAMS VIA COMMON-POOL RESOURCE GAMES

In this chapter, we address the challenge of achieving efficient collaboration in a team of heterogeneous agents with different skill-sets and expertise. We consider a team of heterogeneous agents that is collectively responsible for servicing, and subsequently reviewing, a stream of homogeneous tasks. Each agent has an associated mean service time and a mean review time for servicing and reviewing the tasks, respectively. Agents receive a reward based on their service and review admission rates. The team objective is to collaboratively maximize the number of "serviced and reviewed" tasks. We formulate a Common-Pool Resource (CPR) game and design utility functions to incentivize collaboration among heterogeneous agents in a decentralized manner. We show the existence of a unique Pure Nash Equilibrium (PNE), and establish convergence of best response dynamics to this unique PNE. Finally, we establish an analytic upper bound on three inefficiency measures of the PNE, namely the price of anarchy (PoA), the ratio of the total review admission rate (TRI), and the ratio of latency (LI).

7.1 Background and Problem Formulation

In this section, we describe the problem setup and formulate the problem using a game-theoretic framework. We also present some definitions that will be used in the paper.

7.1.1 Problem Description

We consider a heterogeneous team of N ∈ ℕ agents tasked with servicing a stream of homogeneous tasks. These agents could be autonomous systems or human operators. Each task, after getting serviced by a team-member, gets stored in a common review pool for a second review. This second review is a feedback process in which any team-member can re-examine the serviced task from the common review pool for performance monitoring and quality assurance purposes. Each agent i ∈ N = {1, . . . , N} may choose to spend a portion of her time reviewing tasks from the common pool while spending her remaining time servicing the incoming tasks.

Figure 7.1 Player i devotes her time to servicing homogeneous tasks (at a constant service admission rate λ^S_i) while reviewing serviced tasks from the common review pool (at a constant review admission rate λ^R_i). The maximum admission rates of player i for servicing and reviewing the tasks are given by µ^S_i and µ^R_i, respectively.
We consider heterogeneity among the operators due to the difference in their level of expertise and skill-sets in servicing and reviewing the tasks. This heterogeneity is captured by the average service time (µ^S_i)^{-1} ∈ ℝ_{>0} and average review time (µ^R_i)^{-1} ∈ ℝ_{>0} spent by operator i ∈ N on servicing and reviewing a task, respectively. Let λ^S_i ∈ [0, µ^S_i] and λ^R_i ∈ [0, µ^R_i] be the deterministic service and review admission rates, i.e., the rates at which agent i chooses to admit tasks for servicing and reviewing, respectively. Each agent i can choose their service and review admission rates independently of the other agents. The ranges of λ^S_i and λ^R_i have been chosen to satisfy the stability conditions (including marginal stability) for the service and review queues for operator i ∈ N [117, Chapter 8]. Suppose agent i selects λ^S_i and λ^R_i as their service and review admission rates; then

λ^S_i / µ^S_i + λ^R_i / µ^R_i ≤ 1,

where λ^S_i / µ^S_i (respectively, λ^R_i / µ^R_i) is the average time the agent spends on servicing (respectively, reviewing) the tasks within a unit time. Thus, if the agent has selected a review admission rate λ^R_i, then the service admission rate satisfies

λ^S_i ≤ µ^S_i − h_i λ^R_i,     (7.1)

where h_i := µ^S_i / µ^R_i is the heterogeneity measure for player i. We consider self-interested agents that receive a utility based on their service and review admission rates. Hence, we will assume that agents operate at their maximum capacity, and equality holds in (7.1). Fig. 7.1 shows the schematic of our problem setup. Note that only serviced tasks are available for review, and therefore,

Σ_{i=1}^{N} λ^R_i ≤ Σ_{i=1}^{N} λ^S_i.     (7.2)

By substituting (7.1) in (7.2), we obtain

Σ_{i=1}^{N} a_i λ^R_i ≤ Σ_{i=1}^{N} µ^S_i,     (7.3)

where a_i := (1 + h_i). Eq. (7.3) represents the system constraint on the review admission rates chosen by the agents. We are interested in incentivizing collaboration among the agents for better team performance. Towards this end, we propose a game-theoretic setup defined below.

7.1.2 A Common-Pool Resource Game Formulation

We now formulate our problem as a Common-Pool Resource (CPR) game. Henceforth, we refer to each agent as a player. A maximum service admission rate µ^S_i and a maximum review admission rate µ^R_i are associated with each player i, based on her skill-set and level of expertise. Without loss of generality, let the players be labeled in increasing order of their heterogeneity measures, i.e., h_1 ≤ · · · ≤ h_N. Let S_i := [0, µ^R_i] be the strategy set for each player i, from which the player chooses her review admission rate for reviewing the tasks from the common review pool. Since we have assumed that (7.1) holds with equality, once player i decides her review admission rate λ^R_i ∈ S_i, her service admission rate λ^S_i for servicing the tasks is given by the right-hand side of (7.1). Let S = Π_{i∈N} S_i be the joint strategy space of all the players, where Π denotes the Cartesian product. Furthermore, we define S_{−i} = Π_{j∈N, j≠i} S_j as the joint strategy space of all the players except player i. For brevity of notation, we denote the total service admission rate and the total review admission rate by λ^S_T = Σ_{i=1}^{N} λ^S_i and λ^R_T = Σ_{i=1}^{N} λ^R_i, respectively. Similarly, µ^S_T = Σ_{i=1}^{N} µ^S_i and µ^R_T = Σ_{i=1}^{N} µ^R_i denote the aggregated sums of the maximum service admission rates and maximum review admission rates of all the players, respectively.
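As a quick illustration of these quantities, the following is a minimal sketch, with purely hypothetical numbers, of how the heterogeneity measure h_i, the coefficient a_i = 1 + h_i, the induced service rate from (7.1) with equality, and the system constraint (7.3) can be encoded.

```python
from dataclasses import dataclass

@dataclass
class Player:
    mu_s: float  # maximum service admission rate (reciprocal of mean service time)
    mu_r: float  # maximum review admission rate (reciprocal of mean review time)

    @property
    def h(self) -> float:
        # heterogeneity measure h_i = mu^S_i / mu^R_i
        return self.mu_s / self.mu_r

    @property
    def a(self) -> float:
        # coefficient a_i = 1 + h_i appearing in the system constraint (7.3)
        return 1.0 + self.h

    def service_rate(self, lam_r: float) -> float:
        # with equality in (7.1): lambda^S_i = mu^S_i - h_i * lambda^R_i
        assert 0.0 <= lam_r <= self.mu_r, "review admission rate outside the strategy set"
        return self.mu_s - self.h * lam_r

def pool_constraint_satisfied(players, lam_r) -> bool:
    # system constraint (7.3): sum_i a_i * lambda^R_i <= sum_i mu^S_i
    return sum(p.a * lr for p, lr in zip(players, lam_r)) <= sum(p.mu_s for p in players)

# Hypothetical two-player team: player 1 is relatively better at reviewing (smaller h_i).
team = [Player(mu_s=1.0, mu_r=4.0), Player(mu_s=1.2, mu_r=2.0)]
rates = [0.5, 0.3]  # candidate review admission rates
print([p.service_rate(lr) for p, lr in zip(team, rates)], pool_constraint_satisfied(team, rates))
```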
Let x ∈ ℝ, defined by

x = λ^S_T − λ^R_T = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i,     (7.4)

be the slackness parameter for the system constraint (7.3). The constraint (7.3) is violated for negative values of x, i.e., when the total review rate exceeds the total service rate. In such an event, some players commit to reviewing more tasks than are available in the common review pool. The slackness parameter characterizes the gap between the total service admission rate and the total review admission rate for all the players. In order to maximize high-quality team throughput, i.e., the number of tasks that are both serviced and reviewed, we seek to incentivize the team to operate close to x = 0. Each player i receives a constant reward r^S ∈ ℝ_{>0} for servicing each task. Hence, the service utility u^S_i : S_i → ℝ_{>0} for player i servicing the tasks at the service admission rate λ^S_i is given by

u^S_i = λ^S_i r^S.     (7.5)

To incentivize collaboration among the agents, we design the review utility u^R_i : S → ℝ_{>0} received for reviewing the tasks from the common review pool using two functions: a rate of return r^R : S → ℝ_{>0} for each reviewed task, and a constraint probability p : S → [0, 1] of the common review pool. The constraint probability p is a soft penalty on the violation of the system constraint (7.3). We model the rate of return r^R and the constraint probability p in terms of the strategy of all the players through the slackness parameter x. Furthermore, we assume that r^R is strictly decreasing in x. Therefore, for each x ∈ [0, µ^S_T], the system constraint (7.3) is satisfied, and the rate of return is maximized at x = 0. The rate of return can be interpreted as the perks that the employer provides to all the employees for high-quality service. For example, an employer generates higher revenue based on the high-quality throughput of her company, i.e., based on the number of "serviced and reviewed" tasks, which she redistributes among her employees as perks as per their contribution to the review process. The highest-quality throughput is achieved by the company when the team efficiently reviews all the serviced tasks, i.e., when x = 0. We introduce the constraint probability p as a soft penalty on the violation of the system constraint (7.3), and therefore, we let p = 1 if the constraint gets violated, i.e., when x < 0. We assume that the constraint probability p is non-increasing in x and approaches 1 as x approaches 0. The class of sharply decreasing exponential functions, p(λ^R_i, λ^R_{−i}) = exp(−Ax), where x ∈ [0, µ^S_T] and A ∈ ℝ_{>0}, can be a good choice to effectively model the system constraint. While the system constraint (7.3) ensures stability in the asymptotic limit, the deviation of the empirical mean service/review times from the true mean service/review times during the transient may lead to large queue lengths. Hence, the constraint probability p can be interpreted as the deviation probability of the empirical mean service/review times from the true mean service/review times used in (7.3). If the constraint is violated with probability p, then u^R_i = 0 for each player i. Therefore, we define the utility u^R_i by

u^R_i(λ^R_i, λ^R_{−i}) = { 0, with probability p(λ^R_i, λ^R_{−i});   λ^R_i r^R(λ^R_i, λ^R_{−i}), otherwise }.     (7.6)

Let u_i(λ^R_i, λ^R_{−i}) = u^S_i + u^R_i be the total utility of player i ∈ N.
Each player i tries to maximize her expected utility ũ_i : S → ℝ defined by

ũ_i(λ^R_i, λ^R_{−i}) = E[u^S_i(λ^R_i, λ^R_{−i}) + u^R_i(λ^R_i, λ^R_{−i})] = λ^S_i r^S + λ^R_i r^R(λ^R_i, λ^R_{−i})(1 − p(λ^R_i, λ^R_{−i})),     (7.7)

where the expectation is computed over the constraint probability p. Since r^R and p depend on the review admission rates of all the players only through the slackness parameter x, with a slight abuse of notation, we express r^R(λ^R_i, λ^R_{−i}) and p(λ^R_i, λ^R_{−i}) by r^R(x) and p(x), respectively. Substituting (7.1) in (7.7), we get

ũ_i = µ^S_i r^S + λ^R_i [r^R(x)(1 − p(x)) − h_i r^S] =: µ^S_i r^S + λ^R_i f_i(x),     (7.8)

where f_i : S → ℝ is defined by

f_i(λ^R_i, λ^R_{−i}) = f_i(x) = r^R(x)(1 − p(x)) − h_i r^S.     (7.9)

The function f_i is the incentive for player i to review the tasks. Note that player i will choose a non-zero λ^R_i if and only if she has a positive incentive to review the tasks, i.e., f_i(x) > 0. Otherwise, player i drops out without reviewing any task (λ^R_i = 0) and focuses solely on servicing of tasks (λ^S_i = µ^S_i), thereby maximizing her expected utility, given by ũ_i = µ^S_i r^S. In the following, we will refer to the above CPR game by Γ = (N, {S_i}_{i∈N}, {ũ_i}_{i∈N}). In this paper, we are interested in equilibrium strategies for the players that constitute a PNE, defined below.

Definition 2 (Pure Nash Equilibrium). A PNE is a strategy profile λ^{R*} = {λ^{R*}_i}_{i∈N} ∈ S such that, for each player i ∈ N,

ũ_i(λ^{R*}_i, λ^{R*}_{−i}) ≥ ũ_i(λ^R_i, λ^{R*}_{−i}),  for any λ^R_i ∈ S_i.

Let b_i : S_{−i} → S_i, defined by

b_i(λ^R_{−i}) ∈ argmax_{λ^R_i ∈ S_i} ũ_i(λ^R_i, λ^R_{−i}),

be a best response of player i to the review admission rates of the other players λ^R_{−i}. A PNE exists if and only if there exists an invariant strategy profile λ^{R*} = {λ^{R*}_i}_{i∈N} ∈ S such that λ^{R*}_i = b_i(λ^{R*}_{−i}), for each i ∈ N.

7.2 Existence and Uniqueness of PNE

In this section, we study the existence and uniqueness of the PNE for the CPR game Γ under the system constraint (7.3). Each player i ∈ N chooses a review admission rate from her strategy set S_i = [0, µ^R_i] and receives an expected utility ũ_i given by (7.8). For any given λ^R_{−i} ∈ S_{−i}, we obtain an upper bound λ̄^R_i : S_{−i} → S_i on λ^R_i defined by

λ̄^R_i = { 0, if Λ_i < 0;   Λ_i, if 0 ≤ Λ_i ≤ µ^R_i;   µ^R_i, if Λ_i > µ^R_i },

where Λ_i := (µ^S_T − Σ_{j∈N, j≠i} a_j λ^R_j) / a_i, such that for λ^R_i ∈ [0, λ̄^R_i) ⊂ S_i, constraint (7.3) is automatically satisfied, and for λ^R_i ∈ (λ̄^R_i, µ^R_i] ⊂ S_i, constraint (7.3) is violated. For λ^R_i = λ̄^R_i, constraint (7.3) is satisfied if λ̄^R_i ∈ (0, µ^R_i], and is violated if λ̄^R_i = 0.

Figure 7.2 Constraint probability of player i as λ^R_i varies from 0 to µ^R_i. a) For λ̄^R_i = 0, p_i(λ^R_i, ·) = 1 for all λ^R_i ∈ S_i; b) for λ̄^R_i ∈ (0, µ^R_i), p_i(λ^R_i, ·) < 1 and convex for λ^R_i ∈ [0, λ̄^R_i), with p_i(λ^R_i, ·) → 1 as λ^R_i → λ̄^R_i and p_i(λ^R_i, ·) = 1 for λ^R_i ∈ [λ̄^R_i, µ^R_i]; and c) for λ̄^R_i = µ^R_i, p_i(λ^R_i, ·) is convex in λ^R_i ∈ S_i.

We study the properties of game Γ under the following assumptions. Recall that x = λ^S_T − λ^R_T = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i.

(A1) For a given λ^R_{−i} ∈ S_{−i}, i ∈ N, we assume that the rate of return r^R(λ^R_i, ·) for reviewing the tasks is continuously differentiable, strictly increasing, and strictly concave for λ^R_i ∈ S_i, with r^R(0, 0) = 0. Equivalently, x ↦ r^R(x) is continuously differentiable, strictly decreasing, and strictly concave for x ∈ [0, µ^S_T], with r^R(µ^S_T) = 0.
(A2) For a given λ^R_{−i} ∈ S_{−i}, i ∈ N, we assume that the constraint probability p(λ^R_i, ·) (i) is continuous on S_i; (ii) is continuously differentiable, non-decreasing, and convex for λ^R_i ∈ [0, λ̄^R_i) ⊂ S_i; and (iii) is equal to 1 for λ^R_i ∈ (λ̄^R_i, µ^R_i]. See Fig. 7.2 for an illustration. Equivalently, p(x) is continuously differentiable, non-increasing, and convex for x ∈ (0, µ^S_T], and p(x) → 1 as x → 0. Furthermore, p = 1 for every x < 0.

(A3) We assume f_i(µ^R_i, 0) = r^R(µ^R_i, 0)(1 − p(µ^R_i, 0)) − h_i r^S > 0, for each i ∈ N, i.e., if no other player reviews any task, then each player i has a positive incentive to review tasks with maximum admission rate µ^R_i.

Remark 10. The rate of return r^R and the constraint probability p can be easily designed to accommodate (A1)-(A3). Under Assumptions (A1) and (A2), the incentive function f_i(λ^R_i, ·) is strictly concave in λ^R_i, which means that, for a fixed λ^R_{−i}, player i has a diminishing marginal incentive to review tasks. We make Assumption (A3) to provide positive incentives for players to review tasks at their maximum review admission rate µ^R_i if no other player chooses to review any task. We can design game Γ to satisfy Assumption (A3) by ensuring that the following conditions hold: (i) r^R(µ^R_i, 0) > r^S and µ^S_i ≤ µ^R_i, for each i ∈ N, and (ii) µ^R_i ≪ µ^S_T / a_i, or equivalently Σ_{j∈N, j≠i} µ^S_j ≫ µ^R_i, for each i ∈ N. If the latter condition holds, then x is large, and consequently, the constraint probability p(µ^R_i, 0) ≈ 0, for each i ∈ N. For most practical purposes, servicing a task requires more time than reviewing it, i.e., µ^S_i ≤ µ^R_i. Therefore, condition (i) can be easily satisfied by designing rewards such that r^R(µ^R_i, 0) > r^S, for each i ∈ N. If the total service admission rate of all the players except player i is much higher than the maximum review admission rate of player i, i.e., Σ_{j∈N, j≠i} µ^S_j ≫ µ^R_i, for each i ∈ N, then condition (ii) holds. Notice that for a large team of agents, where a single agent does not have much impact on the overall service rate, condition (ii) is true. We refer the reader to Section 7.5 for an example. □

Theorem 6 (Existence of PNE). The CPR game Γ, under Assumptions (A1)-(A3), admits a PNE.

Proof. See Appendix C: Chapter 7 for the proof.

Let f′_i(λ^R_i, λ^R_{−i}) be the first partial derivative of f_i(λ^R_i, λ^R_{−i}) with respect to λ^R_i. We now provide a corollary that characterizes a PNE of the CPR game Γ.

Corollary 1 (PNE). For the CPR game Γ, under Assumptions (A1)-(A3), the following statements hold for a PNE λ^{R*} = [λ^{R*}_1, . . . , λ^{R*}_N] with x* = µ^S_T − Σ_{i=1}^{N} a_i λ^{R*}_i:

(i) f′_i(λ^{R*}_i, λ^{R*}_{−i}) < 0 (or df_i/dx(x*) > 0) for every player;

(ii) λ^{R*}_i = 0 if and only if f_i(λ^{R*}_i, λ^{R*}_{−i}) ≤ 0; and

(iii) λ^{R*}_i is non-zero and satisfies the following implicit equation if and only if f_i(λ^{R*}_i, λ^{R*}_{−i}) = f_i(x*) > 0 at the PNE:

λ^{R*}_i = min{λ̃^{R*}_i, µ^R_i},     (7.10)

with λ̃^{R*}_i = −f_i(λ^{R*}_i, λ^{R*}_{−i}) / f′_i(λ^{R*}_i, λ^{R*}_{−i}) = f_i(x*) / (a_i df_i/dx(x*)).

Proof. See Appendix C: Chapter 7 for the proof.

Proposition 1 (Structure of PNE). For the CPR game Γ with players ordered in increasing order of h_i, let λ^{R*} = [λ^{R*}_1, λ^{R*}_2, . . . , λ^{R*}_N] be a PNE. Then, the following statements hold:

(i) If, for any player k_1, λ^{R*}_{k_1} < µ^R_{k_1}, then a_{k_1} λ^{R*}_{k_1} ≥ a_{k_2} λ^{R*}_{k_2} and λ^{R*}_{k_1} ≥ λ^{R*}_{k_2}, for each k_2 > k_1; and

(ii) if λ^{R*}_l = 0 for any l ∈ N, then λ^R_i = 0 for each i ∈ {j ∈ N | j ≥ l}.

Proof. See Appendix C: Chapter 7 for the proof.
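To see how the utility maximization underlying Corollary 1 plays out numerically, the following is a minimal sketch of a single player's best response: it maximizes the expected utility (7.8) over the strategy set [0, µ^R_i] by a simple grid search, using exponential forms for r^R and p of the kind adopted later in Section 7.5. The constants A and B, the reward r^S, and the grid size are arbitrary placeholders.

```python
import numpy as np

def expected_utility(i, lam_i, lam, mu_s, mu_r, r_s=1.0, A=5.0, B=0.5):
    """Expected utility (7.8) of player i, with assumed exponential forms
    r^R(x) = A(1 - exp(B(x - mu^S_T))) and p(x) = exp(-Bx) (p = 1 for x <= 0)."""
    h = mu_s / mu_r                       # heterogeneity measures h_j
    a = 1.0 + h                           # coefficients a_j
    mu_s_total = mu_s.sum()
    lam_all = lam.copy()
    lam_all[i] = lam_i
    x = mu_s_total - np.dot(a, lam_all)   # slackness parameter (7.4)
    r_ret = A * (1.0 - np.exp(B * (x - mu_s_total)))
    p = 1.0 if x <= 0 else np.exp(-B * x)
    f_i = r_ret * (1.0 - p) - h[i] * r_s  # incentive (7.9)
    return mu_s[i] * r_s + lam_i * f_i

def best_response(i, lam, mu_s, mu_r, grid=2001, **kwargs):
    """Best response of player i over S_i = [0, mu^R_i], found by grid search."""
    candidates = np.linspace(0.0, mu_r[i], grid)
    utilities = [expected_utility(i, c, lam, mu_s, mu_r, **kwargs) for c in candidates]
    return candidates[int(np.argmax(utilities))]
```

Iterating best_response sequentially over the players gives the sequential best response dynamics discussed in Section 7.3.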
It follows from Proposition 1 that the review admission rate of a player at a PNE is monotonically decreasing with the ratio h_i. Therefore, at a PNE, as the heterogeneity in terms of h_i among the players becomes very large, players with small (respectively, large) h_i review tasks with a high (respectively, zero) review admission rate. We will show in Lemma 12 that the PNE shares these characteristics with the social welfare solution, which we define in Section 7.4. We illustrate this further in Section 7.5.

Theorem 7 (Uniqueness of PNE). The PNE admitted by the CPR game Γ, under Assumptions (A1)-(A3), is unique.

Proof. See Appendix C: Chapter 7 for the proof.

7.3 Convergence to the Nash Equilibrium

We now show that the proposed CPR game Γ under Assumptions (A1)-(A3) belongs to the class of quasi-aggregative games [118], as defined below.

Definition 3 (Quasi-aggregative game). Consider a set of players N, where each player i ∈ N has a strategy set S_i and a utility function u_i. Let S = Π_{i∈N} S_i be the joint strategy space of all the players, and S_{−i} = Π_{j∈N, j≠i} S_j be the joint strategy space of all the players except player i. A game Γ = (N, {S_i}_{i∈N}, {u_i}_{i∈N}) is a quasi-aggregative game with aggregator g : S → ℝ if there exist continuous functions F_i : ℝ × S_i → ℝ (the shift functions) and σ_i : S_{−i} → X_{−i} ⊆ ℝ, i ∈ N (the interaction functions), such that the utility function u_i of each player i ∈ N can be written as

u_i(s) = ũ_i(σ_i(s_{−i}), s_i),     (7.11)

where ũ_i : X_{−i} × S_i → ℝ, and

g(s) = F_i(σ_i(s_{−i}), s_i),  for all s ∈ S and i ∈ N.     (7.12)

An alternative, but less general, way of defining a quasi-aggregative game replaces (7.11) in the definition with

u_i(s) = ū_i(g(s), s_i),     (7.13)

where ū_i : X × S_i → ℝ, and X = {g(s) | s ∈ S} ⊆ ℝ.

For the CPR game Γ, let σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j and g(λ^R) = F_i(σ_i(λ^R_{−i}), λ^R_i) = Σ_{j=1, j≠i}^{N} a_j λ^R_j + a_i λ^R_i be the interaction functions and shift functions, respectively. The expected utility ũ_i, which is defined in (7.8), can be re-written in the form

ũ_i(λ^R_i, λ^R_{−i}) = ũ_i(σ_i(λ^R_{−i}), λ^R_i).     (7.14)

Hence, the CPR game Γ is a quasi-aggregative game. Specializing [118, Theorem 1] to the CPR game Γ, we obtain that if the best response of every player is non-increasing in the interaction function σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j, then the CPR game Γ is a best response pseudo-potential game [119], as defined below.

Definition 4 (Best response (pseudo-)potential game). A game Γ = (N, {S_i}_{i∈N}, {ũ_i}_{i∈N}) is a best response pseudo-potential game if there exists a continuous function ϕ : S → ℝ such that, for every i ∈ N,

b_i(λ^R_{−i}) ⊇ argmax_{λ^R_i ∈ S_i} ϕ(λ^R_i, λ^R_{−i}),

where b_i(λ^R_{−i}) is the best response of player i to the review admission rates of the other players λ^R_{−i}. Furthermore, if

b_i(λ^R_{−i}) = argmax_{λ^R_i ∈ S_i} ϕ(λ^R_i, λ^R_{−i}),

then the game Γ is a best response potential game.

We now establish that the best response for each player is non-increasing in σ_i.

Lemma 11 (Non-increasing best response). For the CPR game Γ, under Assumptions (A1)-(A2), the best response mapping b_i(λ^R_{−i}) is non-increasing in σ_i(λ^R_{−i}), for each i ∈ N, where σ_i(λ^R_{−i}) = Σ_{j=1, j≠i}^{N} a_j λ^R_j.

Proof. See Appendix C: Chapter 7 for the proof.

Furthermore, Remark 1 in [78] states that a best response pseudo-potential game with a unique best response is an instance of a best response potential game [77].
Therefore, the CPR game Γ, with its unique (Lemma 16) and non-increasing best response b_i in σ_i(λ^R_{−i}) (Lemma 11), is a best response potential game. Hence, simple best response dynamics, such as sequential best response dynamics [78] and simultaneous best response dynamics [79], converge to the unique PNE.

7.4 Social Welfare and Inefficiency of PNE

In this section, we characterize the social welfare solution and provide analytic upper bounds on inefficiency measures for the PNE.

7.4.1 Social Welfare

Social welfare corresponds to the optimal (centralized) allocation by players with respect to a social welfare function. To characterize the effect of self-interested optimization of each agent, we compare the decentralized solution (PNE of the CPR game) with the centralized optimal solution (social welfare). We choose a typical social welfare function Ψ(λ^R) : S → ℝ defined by the sum of the expected utilities of all players, i.e.,

Ψ = Σ_{i=1}^{N} ũ_i = Σ_{i=1}^{N} [µ^S_i r^S + λ^R_i f_i(x)] = µ^S_T r^S + λ^R_T r^R(x)(1 − p(x)) − r^S Σ_{i=1}^{N} h_i λ^R_i = (λ^R_T + x) r^S + λ^R_T r^R(x)(1 − p(x)).     (7.15)

A social welfare solution is an optimal allocation that maximizes the social welfare function.

Lemma 12 (Social welfare solution). For the CPR game Γ with constraint Σ_{i=1}^{N} a_i λ^R_i = c, for any given c ∈ ℝ_{≥0}, and players ordered in increasing order of h_i, the associated social welfare solution λ̄^R ∈ S is given by

λ̄^R = [µ^R_1, µ^R_2, . . . , µ^R_{k−1}, (1/a_k)(c − Σ_{i=1}^{k−1} a_i µ^R_i), 0, . . . , 0],

where k is the smallest index such that Σ_{i=1}^{k−1} a_i µ^R_i ≤ c < Σ_{i=1}^{k} a_i µ^R_i. Furthermore, since Σ_{i=1}^{N} a_i λ^R_i ∈ [0, µ^S_T + µ^R_T], a bisection algorithm can be employed to compute the optimal c and, hence, the optimal social welfare solution.

Proof. Under the constraint Σ_{i=1}^{N} a_i λ^R_i = c (equivalently, x = µ^S_T − c), Ψ is a strictly increasing function of λ^R_T. Therefore, for a fixed Σ_{i=1}^{N} a_i λ^R_i = c, λ^R_T is maximized by selecting the k − 1 players with the smallest a_i's (equivalently, h_i) to operate at their highest review admission rates, where the value of k is selected such that Σ_{i=1}^{k−1} a_i µ^R_i ≤ c < Σ_{i=1}^{k} a_i µ^R_i. Finally, the k-th player in the ordered sequence is selected to operate at a review admission rate such that the constraint Σ_{i=1}^{N} a_i λ^R_i = Σ_{i=1}^{k} a_i λ^R_i = c is satisfied. Therefore, the social welfare solution is of the form

λ̄^R = [µ^R_1, µ^R_2, . . . , µ^R_{k−1}, (1/a_k)(c − Σ_{i=1}^{k−1} a_i µ^R_i), 0, . . . , 0].

Furthermore, for functions r^R(x) and p(x) satisfying Assumptions (A1)-(A2), Ψ is strictly concave in x, i.e., ∂²Ψ/∂x² = λ^R_T d²f_i/dx² < 0 (Lemma 15). With the known form of the social welfare solution, the value of c, which corresponds to the unique maximizer x of Ψ, can be computed efficiently by employing a bisection algorithm [120].

7.4.2 Inefficiency of the PNE

We consider three measures of inefficiency for the PNE: a) the price of anarchy (PoA), b) the ratio of total review admission rates (η_TRI), and c) the ratio of latencies (η_LI), which are defined by

PoA = (Ψ)_SW / (Ψ)_PNE,   η_TRI = (λ^R_T)_SW / (λ^R_T)_PNE,   η_LI = (Σ_{i=1}^{N} a_i λ^R_i)_PNE / (Σ_{i=1}^{N} a_i λ^R_i)_SW,

respectively. While PoA is a widely used measure of inefficiency, η_TRI and η_LI capture the inefficiency of the PNE based on the total review admission rate and the latency (inverse of throughput), respectively.
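The allocation structure in Lemma 12 and the three inefficiency ratios above are straightforward to compute. The following is a minimal sketch under the assumption that the player parameters are already sorted by h_i; all inputs are hypothetical, and this is not the code used for the results in Section 7.5.

```python
import numpy as np

def social_welfare_allocation(c, mu_r, a):
    """Greedy allocation of Lemma 12 for a fixed budget c = sum_i a_i * lambda^R_i.

    Players are assumed ordered in increasing h_i (hence increasing a_i = 1 + h_i):
    fill the smallest-a_i players at their maximum review rate, give the k-th player
    a partial allocation, and set the remaining entries to zero.
    """
    lam = np.zeros(len(mu_r))
    remaining = float(c)
    for i, (mr, ai) in enumerate(zip(mu_r, a)):
        if remaining >= ai * mr:
            lam[i] = mr
            remaining -= ai * mr
        else:
            lam[i] = remaining / ai
            break
    return lam

def inefficiency_ratios(welfare_sw, welfare_pne, lam_sw, lam_pne, a):
    """PoA, eta_TRI, and eta_LI as defined in Section 7.4.2."""
    poa = welfare_sw / welfare_pne
    eta_tri = lam_sw.sum() / lam_pne.sum()
    eta_li = np.dot(a, lam_pne) / np.dot(a, lam_sw)
    return poa, eta_tri, eta_li
```

The outer search over c ∈ [0, µ^S_T + µ^R_T] that maximizes Ψ can then be carried out by the bisection argument referenced in the proof of Lemma 12, or by any scalar optimizer.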
Since incentivizing team collaboration is of interest, all three measures capture the inefficiency of the PNE well. We now provide an analytic upper bound for each of these measures of inefficiency for the PNE. To this end, we assume that min_i{µ^S_i} > µ^S_T h_N / (N(1 + h_N)). For scenarios wherein servicing a task requires much more time than reviewing it, i.e., µ^S_N ≪ µ^R_N (h_N → 0), the assumption reduces to min_i{µ^S_i} > 0.

Theorem 8 (Analytic bounds on PNE inefficiency). For the CPR game Γ, under Assumptions (A1)-(A3), and min_i{µ^S_i} > µ^S_T h_N / (N(1 + h_N)), the inefficiency metrics for the PNE are upper bounded by

PoA < µ^S_T a_N / (µ^S_T − x),   η_TRI < µ^S_T a_N / ((µ^S_T − x) a_1),   η_LI < µ^S_T / (µ^S_T − x),     (7.16)

where x is the unique maximizer of f_i, i.e., df_i/dx(x) = 0.

Proof. See Appendix C: Chapter 7 for the proof.

Figure 7.3 Social welfare solution (SW) and pure Nash equilibrium for a) low and b) high heterogeneity among players, respectively. Red circles show the maximum review admission rate (µ^R_i) for player i.

Example 1: We show analytic upper bounds on inefficiency measures for the PNE for a specific class of exponential functions r^R(x) = A[1 − exp{B(x − µ^S_T)}] and p(x) = exp(−Bx), where A and B are positive constants, and x ∈ [0, µ^S_T]. Setting df_i/dx(x) = 0, we obtain x = µ^S_T / 2. Using Theorem 8, we get PoA < 2a_N, η_TRI < 2a_N / a_1, and η_LI < 2. For µ^S_N ≪ µ^R_N, PoA < 2a_N → 2, and η_TRI < 2a_N / a_1 < 2a_N → 2.

7.5 Numerical Illustrations

In this section, we present numerical examples illustrating the uniqueness of the PNE and the variation of inefficiency with increasing heterogeneity among the players. In our numerical illustrations, we obtain the PNE by simulating the sequential best response dynamics of the players with randomized initialization of their strategies. We verify the uniqueness of the PNE for different choices of functions r^R(x) and p(x) satisfying Assumptions (A1)-(A2), and by following sequential best response dynamics with multiple random initializations of each player's strategy. Furthermore, in our numerical simulations, we relax Assumption (A3) and still obtain a unique PNE. An example illustration is shown in Fig. 7.3, where we show the social welfare solution (obtained using fmincon in MATLAB) and the PNE for low and high heterogeneity, in terms of variation in h_i among the players, respectively. For our numerical illustrations, we choose the number of players N = 6, and choose the functions r^R(x) and p(x), satisfying Assumptions (A1)-(A2), as follows:

r^R(λ^R_i, λ^R_{−i}) = r^R(x) = 5[1 − exp{0.5(x − µ^S_T)}],

p^R_i(λ^R_i, λ^R_{−i}) = p(x) = { 1, if x ≤ 0;   exp(−0.5x), otherwise },

where x = µ^S_T − Σ_{i=1}^{N} a_i λ^R_i is the slackness parameter.

Figure 7.4 Empirical a) PoA, b) η_TRI, and c) η_LI, along with analytic upper bounds for d) PoA, e) η_TRI, and f) η_LI, with increasing heterogeneity (ρ) among the agents.

To characterize the heterogeneity among the players, we sample each player's maximum service admission rate µ^S_i and maximum review admission rate µ^R_i at random from normal distributions with fixed means M_{µ^S} ∈ ℝ_{>0} and M_{µ^R} ∈ ℝ_{>0}, and identical standard deviation ρ ∈ ℝ_{>0}. We only consider realizations that satisfy µ^S_i ≤ µ^R_i for all the players, and hence h_i ≤ 1. For most practical purposes, where servicing a task requires much more time than reviewing it, the assumption µ^S_i ≤ µ^R_i holds true. Any non-positive realizations were discarded.
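A minimal simulation sketch of this setup is given below, assuming the exponential r^R and p above and a simple grid-search best response; the means, spread, tolerance, and random seed are placeholder values, not those used to generate Fig. 7.3 and 7.4.

```python
import numpy as np

def pne_by_sequential_best_response(mu_s, mu_r, r_s=1.0, A=5.0, B=0.5,
                                    grid=1000, tol=1e-9, max_sweeps=500, rng=None):
    """Sequential best response dynamics for the CPR game with exponential r^R and p."""
    rng = np.random.default_rng() if rng is None else rng
    h = mu_s / mu_r
    a = 1.0 + h
    mu_s_total = mu_s.sum()
    lam = rng.uniform(0.0, mu_r)                       # randomized initial strategies

    def utility(i, lam_i):
        lam_tmp = lam.copy()
        lam_tmp[i] = lam_i
        x = mu_s_total - np.dot(a, lam_tmp)            # slackness parameter (7.4)
        p = 1.0 if x <= 0 else np.exp(-B * x)
        r_ret = A * (1.0 - np.exp(B * (x - mu_s_total)))
        return mu_s[i] * r_s + lam_i * (r_ret * (1.0 - p) - h[i] * r_s)

    for _ in range(max_sweeps):
        shift = 0.0
        for i in range(len(mu_s)):
            cand = np.linspace(0.0, mu_r[i], grid)
            best = cand[int(np.argmax([utility(i, c) for c in cand]))]
            shift = max(shift, abs(best - lam[i]))
            lam[i] = best
        if shift < tol:                                # no player changes her strategy
            break
    return lam

# Hypothetical heterogeneous team of N = 6 players (means and spread are placeholders);
# draws with mu_s > mu_r would be rejected in the setup described above, omitted here.
rng = np.random.default_rng(1)
mu_s = np.abs(rng.normal(1.0, 0.3, size=6))
mu_r = np.abs(rng.normal(2.0, 0.3, size=6))
print(pne_by_sequential_best_response(mu_s, mu_r, rng=rng))
```

Running the sweep from several random initializations and checking that the returned profiles coincide mirrors the uniqueness verification described above.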
We consider the standard deviation of the distributions as the measure of heterogeneity among the players. Fig. 7.3 shows that, in the social welfare solution, players with a low ratio h_i review the tasks at the maximum review admission rate and players with a high ratio h_i drop out of the game. At the PNE, the strategy profile of the players follows the characteristics described by Proposition 1. Lastly, with the increase in heterogeneity among the players, the PNE starts to approach the social welfare solution. Figs. 7.4a-7.4c and Figs. 7.4d-7.4f show the variation of the different measures of inefficiency for the PNE, and their corresponding analytic upper bounds (see Theorem 8), with increasing heterogeneity among the players. Fig. 7.4a shows the plot of PoA with increasing heterogeneity. In the case of homogeneous players, i.e., ρ = 0, we obtain PoA = 1, which we establish in Lemma 19. As we initially increase the heterogeneity among the players, the PNE starts to deviate from the social welfare solution, resulting in an increase in the PoA. We note that PoA ≤ 1.15, suggesting that the unique PNE is close to the optimal centralized social welfare solution. Figs. 7.4b and 7.4c show η_TRI and η_LI, which are other relevant measures of inefficiency for our problem. It is evident from Fig. 7.4 that all three measures of inefficiency are close to 1, suggesting a near-optimal PNE solution.

7.6 Conclusions and Future Directions

We studied incentive design mechanisms to facilitate collaboration in a team of heterogeneous agents that is collectively responsible for servicing and subsequently reviewing a stream of homogeneous tasks. The heterogeneity among the agents is based on their skill-sets and is characterized by their mean service time and mean review time. To incentivize collaboration in the heterogeneous team, we designed a Common-Pool Resource (CPR) game with appropriate utilities and showed the existence of a unique PNE. We showed that the proposed CPR game is an instance of a best response potential game, and that by playing the sequential best response against each other, the players converge to the unique PNE. We characterized the structure of the PNE and showed that, at the PNE, the review admission rate of the players decreases with the increasing ratio h_i = µ^S_i / µ^R_i, i.e., the review admission rate is higher for the players that are "better" at reviewing the tasks than servicing them (as characterized by their average service and review times). Furthermore, we considered three different inefficiency metrics for the PNE, including the Price of Anarchy (PoA), and provided an analytic upper bound for each metric. Additionally, we provided numerical evidence of their proximity to unity, i.e., the unique PNE is close to the optimal centralized social welfare solution. There are several possible avenues of future research. It is of interest to extend the results to a broader class of games with less restrictive choices of utility functions, i.e., games that are not quasi-aggregative or commonly used games of weak strategic substitutes (WSTS) [78] or complements (WSTC) [78]. An interesting open problem is to consider a team of agents processing a stream of heterogeneous tasks. In such a setting, incentivizing team collaboration based on the task-dependent skill-sets of the agents is also of interest.
CHAPTER 8
CONCLUSIONS

In this dissertation, we focused on human-in-the-loop systems that suffer from inherent variability of human performance, which depends on various factors such as cognitive state, task learning, expertise, and individual differences. We explored optimal and game-theoretic feedback mechanisms for improving human performance in human-supervised autonomy. To this end, we first studied the problem of optimal fidelity selection for effective management of human cognitive resources. We assumed known models for the human service time distribution and formulated the problem as a semi-Markov decision process (SMDP). We solved the SMDP to obtain the optimal fidelity selection policy and studied the structural properties of the optimal policy. Next, we conducted human experiments on optimal fidelity selection to study the effect of the optimal policy on human performance. We treated the human cognitive state as a hidden state and modeled the cognitive dynamics using an Input-Output Hidden Markov Model (IOHMM). We utilized the trained IOHMM model to formulate a Partially Observable Markov Decision Process (POMDP). We solved the POMDP to obtain the optimal fidelity selection policy and showed that the optimal policy significantly improves human performance. Next, we extended the optimal fidelity selection problem by incorporating uncertainty into the human service-time distribution. We designed a robust and adaptive framework that accurately learns the human service-time model and adapts the policy while ensuring robustness under model uncertainty. However, a major challenge in designing adaptive and robust systems arises from the conflicting objectives of exploration and robustness. To mitigate system uncertainty, an agent must explore high-uncertainty state space regions, while a robust policy optimizes worst-case performance and consequently avoids these regions. To address this trade-off, we introduced an efficient Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm for model-based reinforcement learning (RL). DSEE interleaves exploration and exploitation epochs with increasing lengths, resulting in sub-linear cumulative regret growth over time. Next, we focused on enhancing human performance through task learning and skill development. In this context, we studied the impact of evaluative feedback on human learning in sequential decision-making tasks. We conducted experiments on Amazon Mechanical Turk, where participants engaged with the Tower of Hanoi puzzle and received AI-generated feedback during their problem-solving. We examined how this feedback influenced their learning and skill transfer to related tasks. Additionally, we explored computational models to gain insights into how individuals integrate evaluative feedback into their decision-making processes. Lastly, we expanded our focus from a single human operator to a team of heterogeneous agents, each with diverse skill sets and expertise. Within this context, we delved into the challenge of achieving efficient collaboration among heterogeneous team members to enhance overall system performance. Our approach leveraged a game-theoretic framework, where we designed utility functions to incentivize decentralized collaboration among these agents. We showed the existence of a unique Pure Nash Equilibrium (PNE) and established the convergence of the best response dynamics to this unique PNE.
Additionally, we established an analytical upper bound on measures of PNE inefficiency, shedding light on the effectiveness of our collaborative strategies. 114 BIBLIOGRAPHY [1] [2] P. Gupta, D. Isele, D. Lee, and S. Bae, “Interaction-aware trajectory planning for autonomous vehicles with analytic integration of neural networks into model predictive control,” in IEEE International Conference on Robotics and Automation, pp. 7794– 7800, 2023. I. R. Nourbakhsh, K. Sycara, M. Koes, M. Yong, M. Lewis, and S. Burion, “Human- robot teaming for search and rescue,” IEEE Pervasive Computing, vol. 4, no. 1, pp. 72– 79, 2005. [3] M. A. Goodrich, J. L. Cooper, J. A. Adams, C. Humphrey, R. Zeeman, and B. G. Buss, “Using a mini-UAV to support wilderness search and rescue: Practices for human-robot teaming,” in IEEE International Workshop on Safety, Security and Rescue Robotics, pp. 1–6, IEEE, 2007. [4] [5] [6] [7] [8] [9] A. Rosenfeld, N. Agmon, O. Maksimov, and S. Kraus, “Intelligent agent supporting human–multi-robot team collaboration,” Artificial Intelligence, vol. 252, pp. 211–231, 2017. S. A. Seshia, D. Sadigh, and S. S. Sastry, “Formal methods for semi-autonomous driv- ing,” in 52nd Design Automation Conference, pp. 1–5, IEEE, June 2015. P. Gupta, D. Isele, D. Lee, and S. Bae, “Interaction-aware trajectory planning for autonomous vehicles with analytic integration of neural networks into model predictive control,” arXiv preprint arXiv:2301.05393, 2023. A. M. Okamura, “Methods for haptic feedback in teleoperated robot-assisted surgery,” Industrial Robot: An International Journal, vol. 31, no. 6, pp. 499–508, 2004. J. Heard, C. E. Harriott, and J. A. Adams, “A survey of workload assessment algo- rithms,” IEEE Transactions on Human-Machine Systems, vol. 48, no. 5, pp. 434–451, 2018. J. Peters, V. Srivastava, G. Taylor, A. Surana, M. P. Eckstein, and F. Bullo, “Human supervisory control of robotic teams: Integrating cognitive modeling with engineering design,” IEEE Control System Magazine, vol. 35, no. 6, pp. 57–80, 2015. [10] J. R. Peters, A. Surana, and F. Bullo, “Robust scheduling and routing for collaborative human/unmanned aerial vehicle surveillance missions,” Journal of Aerospace Informa- tion Systems, pp. 1–19, 2018. [11] K. Savla and E. Frazzoli, “A dynamical queue approach to intelligent task management for human operators,” Proceedings of the IEEE, vol. 100, no. 3, pp. 672–686, 2012. [12] V. Srivastava, R. Carli, C. Langbort, and F. Bullo, “Attention allocation for decision making queues,” Automatica, vol. 50, no. 2, pp. 378–388, 2014. 115 [13] Q. Hu and W. Yue, Markov Decision Processes with their Applications, vol. 14. Springer Science & Business Media, 2007. [14] M. Lin, R. J. La, and N. C. Martins, “Stabilizing a queue subject to activity-dependent server performance,” IEEE Transactions on Control of Network Systems, vol. 8, no. 4, pp. 1579–1591, 2021. [15] M. Lin, N. C. Martins, and R. J. La, “Queueing subject to action-dependent server performance: Utilization rate reduction,” arXiv preprint arXiv:2002.08514, 2020. [16] R. M. Yerkes and J. D. Dodson, “The relation of strength of stimulus to rapidity of habit-formation,” Journal of Comparative Neurology and Psychology, vol. 18, no. 5, pp. 459–482, 1908. [17] P. Gupta and V. Srivastava, “On robust and adaptive fidelity selection for human-in- the-loop queues,” in European Control Conference, pp. 872–877, 2021. [18] S. Stidham Jr and R. R. 
Weber, “Monotonic and insensitive optimal policies for control of queues with undiscounted costs,” Operations Research, vol. 37, no. 4, pp. 611–625, 1989. [19] L. I. Sennott, “Average cost semi-Markov decision processes and the control of queueing systems,” Probability in the Engineering and Informational Sciences, vol. 3, no. 2, pp. 247–272, 1989. [20] M. Agarwal, V. S. Borkar, and A. Karandikar, “Structural properties of optimal trans- mission policies over a randomly varying channel,” IEEE Transactions on Automatic Control, vol. 53, no. 6, pp. 1476–1491, 2008. [21] M. L. Puterman, “Markov decision processes,” Handbooks in Operations Research and Management Science, vol. 2, pp. 331–434, 1990. [22] R. Yang, S. Bhulai, and R. van der Mei, “Structural properties of the optimal resource allocation policy for single-queue systems,” Annals of Operations Research, vol. 202, no. 1, pp. 211–233, 2013. [23] C. C. White III and H. K. Eldeib, “Markov decision processes with imprecise transition probabilities,” Operations Research, vol. 42, no. 4, pp. 739–749, 1994. [24] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, “Bias and variance approximation in value function estimates,” Management Science, vol. 53, no. 2, pp. 308–322, 2007. [25] V. Borkar and R. Jain, “Risk-constrained Markov decision processes,” IEEE Transac- tions on Automatic Control, vol. 59, no. 9, pp. 2574–2579, 2014. [26] R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of Risk, vol. 2, pp. 21–42, 2000. [27] E. Delage and S. Mannor, “Percentile optimization for Markov decision processes with parameter uncertainty,” Operations Research, vol. 58, no. 1, pp. 203–213, 2010. 116 [28] H. N. Nguyen, Chance-Constrained Optimization: Applications in Game Theory and Markov Decision Processes. PhD thesis, Université Paris-Saclay, 2023. [29] W. Wiesemann, D. Kuhn, and B. Rustem, “Robust Markov decision processes,” Math- ematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013. [30] L. F. Bertuccelli, A. Wu, and J. P. How, “Robust adaptive Markov decision processes: Planning with model uncertainty,” IEEE Control Systems Magazine, vol. 32, no. 5, pp. 96–109, 2012. [31] R. S. Sutton and A. G. Barto, Reinforcement Learning, Second Edition: An Introduc- tion. MIT Press, Nov. 2018. [32] G. N. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005. [33] A. Nilim and L. El Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005. [34] V. Gullapalli and A. G. Barto, “Convergence of indirect adaptive asynchronous value iteration algorithms,” in Advances in Neural Information Processing Systems, pp. 695– 702, 1994. [35] J. Xin, H. Zhao, D. Liu, and M. Li, “Application of deep reinforcement learning in mobile robot path planning,” in 2017 Chinese Automation Congress, pp. 7112–7116, IEEE, 2017. [36] P. Gupta, D. Coleman, and J. E. Siegel, “Towards safer self-driving through great PAIN (Physically Adversarial Intelligent Networks),” arXiv preprint arXiv:2003.10662, 2020. [37] A. G. Barto, P. S. Thomas, and R. S. Sutton, “Some recent applications of rein- forcement learning,” in Proceedings of the Eighteenth Yale Workshop on Adaptive and Learning Systems, 2017. [38] S. Ferretti, S. Mirri, C. Prandi, and P. Salomoni, “Automatic web content personal- ization through reinforcement learning,” Journal of Systems and Software, vol. 121, pp. 157–169, 2016. [39] P. 
Gupta and V. Srivastava, “Structural properties of optimal fidelity selection policies for human-in-the-loop queues,” Automatica, vol. 159, p. 111388, 2024. Extended version available at: arXiv preprint arXiv: 2201.09990. [40] X. Zhou, H. Yue, T. Chai, and B. Fang, “Supervisory control for rotary kiln temperature based on reinforcement learning,” in Intelligent Control and Automation, pp. 428–437, Springer, 2006. [41] P. Gupta and V. Srivastava, “Optimal fidelity selection for human-in-the-loop queues using semi-Markov decision processes,” in American Control Conference, pp. 5266– 5271, 2019. 117 [42] L. F. Bertuccelli, Robust Decision-Making with Model Uncertainty in Aerospace Sys- tems. PhD thesis, Massachusetts Institute of Technology, 2008. [43] A. Nilim and L. El Ghaoui, Robust Markov Decision Processes with Uncertain Transi- tion Matrices. PhD thesis, University of California, Berkeley, 2004. [44] C. J. C. H. Watkins, Learning from Delayed Rewards. King’s College, Cambridge United Kingdom, 1989. [45] P. Gupta, D. Coleman, and J. E. Siegel, “Towards Physically Adversarial Intelli- gent Networks (PAINs) for safer self-driving,” IEEE Control Systems Letters, vol. 7, pp. 1063–1068, 2022. [46] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015. [47] A. L. Strehl and M. L. Littman, “An analysis of model-based interval estimation for Markov decision processes,” Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309–1331, 2008. [48] S. M. Kakade, On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003. [49] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. 51, pp. 1563–1600, 2010. [50] T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge University Press, 2020. [51] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploita- tion for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013. [52] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013. [53] L. Wei and V. Srivastava, “On abruptly-changing and slowly-varying multiarmed ban- dit problems,” in American Control Conference, pp. 6291–6296, June 2018. [54] N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decentralized multiplayer multiarmed bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2016. [55] L. Wei, A. McDonald, and V. Srivastava, “Multi-robot Gaussian process estimation and coverage: Deterministic sequencing algorithm and regret analysis,” in IEEE In- ternational Conference on Robotics and Automation, pp. 9080–9085, 2021. 118 [56] C. Keser and R. Gardner, “Strategic behavior of experienced subjects in a common pool resource game,” International Journal of Game Theory, vol. 28, no. 2, pp. 241– 252, 1999. [57] A. R. Hota, S. Garg, and S. Sundaram, “Fragility of the commons under prospect- theoretic risk attitudes,” Games and Economic Behavior, vol. 98, pp. 135–164, 2016. [58] A. R. Hota and S. Sundaram, “Controlling human utilization of failure-prone systems via taxes,” IEEE Transactions on Automatic Control, vol. 66, no. 12, pp. 5772–5787, 2020. [59] E. Ostrom, R. 
Gardner, J. Walker, and J. Walker, Rules, Games, and Common-Pool Resources. University of Michigan Press, 1994. [60] P. Le Gall, “The theory of networks of single server queues and the tandem queue model,” International Journal of Stochastic Analysis, vol. 10, no. 4, pp. 363–381, 1997. [61] N. T. Thomopoulos, Fundamentals of Queuing Systems: Statistical Methods for Ana- lyzing Queuing Models. Springer Science & Business Media, 2012. [62] E. Altman, “Applications of dynamic games in queues,” in Advances in Dynamic Games, pp. 309–342, Springer, 2005. [63] L. Xia, “Service rate control of closed Jackson networks from game theoretic perspec- tive,” European Journal of Operational Research, vol. 237, no. 2, pp. 546–554, 2014. [64] J. R. Marden and J. S. Shamma, “Game theory and control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 105–134, 2018. [65] G. Arslan, J. R. Marden, and J. S. Shamma, “Autonomous vehicle-target assign- ment: A game-theoretical formulation,” Journal of Dynamic Systems Measurement and Control-Transactions of the ASME, vol. 129, 09 2007. [66] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, vol. 23. SIAM, 1999. [67] T. Roughgarden, “Intrinsic robustness of the price of anarchy,” in Annual ACM Sym- posium on Theory of Computing, pp. 513–522, 2009. [68] J. R. Marden and T. Roughgarden, “Generalized efficiency bounds in distributed re- source allocation,” IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 571– 584, 2014. [69] L. Deori, K. Margellos, and M. Prandini, “Price of anarchy in electric vehicle charging control games: When Nash equilibria achieve social welfare,” Automatica, vol. 96, pp. 150–158, 2018. [70] D. Paccagnan, R. Chandan, and J. R. Marden, “Utility design for distributed resource allocation - Part I: Characterizing and optimizing the exact price of anarchy,” IEEE Transactions on Automatic Control, pp. 1–1, 2019. 119 [71] P. Gupta and V. Srivastava, “Optimal fidelity selection for improved performance in human-in-the-loop queues for underwater search,” arXiv preprint arXiv:2311.06381, 2023. [72] S. C. Kramer and H. W. Sorenson, “Bayesian parameter estimation,” IEEE Transac- tions on Automatic Control, vol. 33, no. 2, pp. 217–222, 1988. [73] P. Gupta and V. Srivastava, “Deterministic sequencing of exploration and exploitation for reinforcement learning,” in 61st Conference on Decision and Control, pp. 2313– 2318, IEEE, 2022. [74] P. Gupta, S. Biswas, and V. Srivastava, “Fostering human learning in sequential decision-making: Understanding the role of evaluative feedback,” arXiv preprint arXiv:2311.03486, 2023. [75] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Achieving efficient collaboration in decentralized heterogeneous teams using common-pool resource games,” in 58th Con- ference on Decision and Control, pp. 6924–6929, IEEE, 2019. [76] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Incentivizing collaboration in hetero- geneous teams via common-pool resource games,” IEEE Transactions on Automatic Control, vol. 68, no. 3, pp. 1902–1909, 2022. [77] M. Voorneveld, “Best-response potential games,” Economics Letters, vol. 66, no. 3, pp. 289–295, 2000. [78] P. Dubey, O. Haimanko, and A. Zapechelnyuk, “Strategic complements and substitutes, and potential games,” Games and Economic Behavior, vol. 54, no. 1, pp. 77–94, 2006. [79] M. K. Jensen, “Stability of pure strategy Nash equilibrium in best-reply potential games,” University of Birmingham, Tech. Rep, 2009. [80] R. P. 
Rao, Brain-Computer Interfacing: An Introduction. Cambridge University Press, 2013. [81] O. Palinko, A. L. Kun, A. Shyrokov, and P. Heeman, “Estimating cognitive load using remote eye tracking in a driving simulator,” in Symposium on Eye-tracking Research & Applications, pp. 141–144, 2010. [82] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, “Learning policies for partially observable environments: Scaling up,” in Machine Learning Proceedings 1995, pp. 362– 370, Elsevier, 1995. [83] M. T. Spaan, “Partially observable Markov decision processes,” in Reinforcement Learning, pp. 387–414, Springer, 2012. [84] A. Diederich and J. R. Busemeyer, “Simple matrix methods for analyzing diffusion models of choice probability, choice response time, and simple response time,” Journal of Mathematical Psychology, vol. 47, no. 3, pp. 304–322, 2003. 120 [85] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learn- ing,” Discrete Event Dynamic Systems, vol. 13, no. 1-2, pp. 41–77, 2003. [86] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. [87] P. Gupta and V. Srivastava, “Structural properties of optimal fidelity selection policies for human-in-the-loop queues,” arXiv preprint arXiv:2201.09990, 2022. [88] P. Gupta, S. D. Bopardikar, and V. Srivastava, “Incentivizing collaboration in hetero- geneous teams via common-pool resource games,” arXiv preprint arXiv: 1908.03938, Aug. 2019. [89] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in International Conference on Intelligent Robots and Systems, vol. 3, pp. 2149–2154, IEEE, 2004. [90] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: An open-source robot operating system,” in ICRA Workshop on Open Source Software, vol. 3, p. 5, Kobe, Japan, 2009. [91] Y. Bengio and P. Frasconi, “Input-output HMMs for sequence processing,” IEEE Trans- actions on Neural Networks, vol. 7, no. 5, pp. 1231–1249, 1996. [92] Y. Sakamoto, M. Ishiguro, and G. Kitagawa, “Akaike information criterion statistics,” Dordrecht, The Netherlands: D. Reidel, vol. 81, no. 10.5555, p. 26853, 1986. [93] A. A. Neath and J. E. Cavanaugh, “The Bayesian information criterion: Background, derivation, and applications,” Wiley Interdisciplinary Reviews: Computational Statis- tics, vol. 4, no. 2, pp. 199–203, 2012. [94] M. M. M. Manhães, S. A. Scherer, M. Voss, L. R. Douat, and T. Rauschenbach, “UUV simulator: A Gazebo-based package for underwater intervention and multi- robot simulation,” in OCEANS Monterey, pp. 1–8, 2016. [95] L. F. Bertuccelli and M. L. Cummings, “Operator choice modeling for collaborative uav visual search tasks,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 42, no. 5, pp. 1088–1099, 2012. [96] V. Srivastava, P. Holmes, and P. Simen, “Explicit moments of decision times for single- and double-threshold drift-diffusion processes,” Journal of Mathematical Psychology, vol. 75, no. 2016, pp. 96–109, 2016. Special Issue in Honor of R. Duncan Luce. [97] K. P. Murphy, “Conjugate Bayesian analysis of the Gaussian distribution,” tech. rep., University of British Columbia, BC, Oct 2007. [98] L. Li and M. Littman, “Prioritized sweeping converges to the optimal value function,” Tech. Rep. DCS-TR-631, Rutgers University, NJ, June 2008. 121 [99] S. 
Chib, “Markov chain Monte Carlo methods: computation and inference,” Handbook of Econometrics, vol. 5, pp. 3569–3649, 2001. [100] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge University Press, 2019. [101] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger, “Inequalities for the L1 deviation of the empirical distribution,” Hewlett-Packard Labs, Tech. Rep, 2003. [102] A. Mastin and P. Jaillet, “Loss bounds for uncertain transition probabilities in Markov decision processes,” in 51st IEEE Conference on Decision and Control, pp. 6708–6715, IEEE, 2012. [103] P. Gupta and V. Srivastava, “Deterministic sequencing of exploration and exploitation for reinforcement learning,” arXiv preprint arXiv: 2209.05408, Sept. 2022. [104] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher, “Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified,” arXiv preprint arXiv:1201.0559, 2012. [105] D. Aldous, L. Lovász, and P. Winkler, “Mixing times for uniformly ergodic Markov chains,” Stochastic Processes and their Applications, vol. 71, no. 2, pp. 165–185, 1997. [106] H. Chen and F. Zhang, “The expected hitting times for finite Markov chains,” Linear Algebra and its Applications, vol. 428, no. 11-12, pp. 2730–2749, 2008. [107] C. Szepesvári, Algorithms for Reinforcement Learning. Springer Nature, 2022. [108] M. Lopes, F. Melo, and L. Montesano, “Active learning for reward estimation in in- verse reinforcement learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31–46, Springer, 2009. [109] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.,” in AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008. [110] S. Levine, Z. Popovic, and V. Koltun, “Nonlinear inverse reinforcement learning with Gaussian processes,” Advances in Neural Information Processing Systems, vol. 24, 2011. [111] B. D. Ziebart, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010. [112] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Program- ming. John Wiley & Sons, 1994. [113] W. B. Knox and P. Stone, “Interactively shaping agents via human reinforcement: The TAMER framework,” in International Conference on Knowledge Capture, pp. 9– 16, 2009. 122 [114] W. B. Knox and P. Stone, “Augmenting reinforcement learning with human feedback,” in ICML Workshop on New Developments in Imitation Learning, vol. 855, p. 3, 2011. [115] W. B. Knox and P. Stone, “Reinforcement learning from simultaneous human and MDP reward.,” in AAMAS, vol. 1004, pp. 475–482, Valencia, 2012. [116] W. B. Knox and P. Stone, “Learning non-myopically from human-generated reward,” in International Conference on Intelligent User Interfaces, pp. 191–202, 2013. [117] C. G. Cassandras and S. Lafortune, Introduction to Discrete Event Systems. Springer Science & Business Media, 2009. [118] M. K. Jensen, “Aggregative games and best-reply potentials,” Economic Theory, vol. 43, no. 1, pp. 45–66, 2010. [119] B. Schipper, “Pseudo-potential games,” tech. rep., University of Bonn, Germany, 2004. Working paper. [120] R. Burden and J. Faires, “The bisection method,” Numerical Analysis, pp. 48–56, 2011. [121] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester, A Modern Intro- duction to Probability and Statistics: Understanding Why and How. Springer Science & Business Media, 2005. [122] A. 
[122] A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Springer, 2015.
[123] D. G. Luenberger and Y. Ye, Linear and Nonlinear Programming, vol. 2. Springer, 1984.
[124] C. Berge, Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces, and Convexity. Courier Corporation, 1997.

APPENDIX A: CHAPTER 2

Proof of Lemma 3

Let $w_k$ be the number of tasks that arrive during stage $k \in \{0, \dots, n-1\}$ with sojourn time $\tau_k$, in which the state transitions from $s_k = (q_k, cog_k)$ to $s_{k+1} = (q_{k+1}, cog_{k+1})$ and action $a_k$ is selected. Let $a_k$ be an optimal action at state $s_k$ and $\pi$ be the corresponding optimal policy such that $a_k = \pi(s_k)$. The optimal policy $\pi$, when applied from an initial state $s_0$, induces a sequence of states $\langle s_k \rangle$ and sojourn times $\langle \tau_k \rangle$, or equivalently $\zeta_{k+1}$, where $\zeta_{k+1} = \sum_{j=0}^{k} \tau_j$ and $\zeta_0 = 0$.

Similarly, let $\tilde{s}_0 = (\tilde{q}_0, cog_0)$ be another initial state with the same initial cognitive state, and $\tilde{q}_0 \geq q_0$. Apply a policy $\tilde{\pi}$ from the initial state $\tilde{s}_0$ such that $\tilde{\pi}(q, cog) = \pi(q + q_0 - \tilde{q}_0, cog)$ for any $(q, cog)$. Note that $\tilde{a}_0 = a_0$. The policy $\tilde{\pi}$, when applied from the initial state $\tilde{s}_0 = (\tilde{q}_0, cog_0)$, induces a sequence of realizations $\langle \tilde{s}_k \rangle$ and $\langle \tilde{\tau}_k \rangle$. Since the cognitive state and the sojourn time are independent of the current queue length, for the same action sequence applied from the initial states $s_0$ and $\tilde{s}_0$, the random process associated with the evolution of the cognitive state and sojourn time is almost surely the same except for the offset in the queue length. Hence, the probability of observing a sequence of realizations $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$, $\langle \tilde{a}_k \rangle$, and $\langle \tilde{\tau}_k \rangle$ when policy $\tilde{\pi}$ is applied from $\tilde{s}_0$ is equal to the probability of observing a sequence of realizations $\langle s_k = (q_k, cog_k) \rangle$, $\langle a_k \rangle$, and $\langle \tau_k \rangle$ when policy $\pi$ is applied from $s_0$, where $\tilde{q}_k - \tilde{q}_0 = q_k - q_0$, $\tilde{a}_k = a_k$, and $\tilde{\tau}_k = \tau_k$. Therefore, it is easy to show that

$$\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k] = \tilde{q}_0 - q_0. \quad (8.1)$$

Note that the realization of the sequence of actions $\langle a_k = \pi(s_k) \rangle$, which is optimal for $\langle s_k = (q_k, cog_k) \rangle$, might be sub-optimal for $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$. Recall that $\mathbb{E}_{\pi}[\cdot]$ and $\mathbb{E}_{\tilde{\pi}}[\cdot]$ represent $\mathbb{E}[\cdot \mid s_0, \pi]$ and $\mathbb{E}[\cdot \mid \tilde{s}_0, \tilde{\pi}]$, respectively. Let $\Delta q := \tilde{q}_0 - q_0$ and $Z := V^*_n(q_0, cog_0) - V^*_n(\tilde{q}_0, cog_0)$. We first show the upper bound on $Z$:

$$Z \leq V^*_n(q_0, cog_0) - J_{n,\tilde{\pi}}(\tilde{q}_0, cog_0) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k} R(s_k, a_k) - \gamma^{\zeta_n} C q_n\Big] - \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k} R(\tilde{s}_k, \tilde{a}_k) - \gamma^{\zeta_n} C \tilde{q}_n\Big]$$
$$= \mathbb{E}_{\pi}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k}\big\{r(a_k) - c\, \mathbb{E}[\tau_k \mid cog_k, a_k]\, q_k - \tfrac{c\lambda}{2} \mathbb{E}[\tau_k^2 \mid cog_k, a_k]\big\} - \gamma^{\zeta_n} C q_n\Big] - \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=0}^{n-1} \gamma^{\zeta_k}\big\{r(\tilde{a}_k) - c\, \mathbb{E}[\tau_k \mid cog_k, \tilde{a}_k]\, \tilde{q}_k - \tfrac{c\lambda}{2} \mathbb{E}[\tau_k^2 \mid cog_k, \tilde{a}_k]\big\} - \gamma^{\zeta_n} C \tilde{q}_n\Big]. \quad (8.2)$$
Using the statements of Lemma 2, the right-hand side of (8.2) is given by

$$\sum_{k=0}^{n-1} \mathbb{E}_{\pi}[\gamma^{\zeta_k} r(a_k)] - c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k\, \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k]\big] - \tfrac{c\lambda}{2} \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k^2\big] - C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n} \mathbb{E}_{\pi}[q_n \mid s_0, \zeta_n]\big] - \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}[\gamma^{\zeta_k} r(\tilde{a}_k)] + c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_k} \tau_k\, \mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k]\big] + \tfrac{c\lambda}{2} \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_k} \tau_k^2\big] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\zeta_n} \mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \zeta_n]\big]$$
$$\stackrel{(4^*)}{=} c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \zeta_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \zeta_k]\big\}\big] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n} \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \zeta_n] - \mathbb{E}_{\pi}[q_n \mid s_0, \zeta_n]\big\}\big], \quad (8.3)$$

where $(4^*)$ follows by recalling that the probability of observing a sequence of realizations $\langle \tilde{s}_k = (\tilde{q}_k, cog_k) \rangle$, $\langle \tilde{a}_k \rangle$, and $\langle \tilde{\tau}_k \rangle$ when policy $\tilde{\pi}$ is applied from $\tilde{s}_0$ is equal to the probability of observing a sequence of realizations $\langle s_k = (q_k, cog_k) \rangle$, $\langle a_k \rangle$, and $\langle \tau_k \rangle$ when policy $\pi$ is applied from $s_0$, where $\tilde{q}_k - \tilde{q}_0 = q_k - q_0$, $\tilde{a}_k = a_k$, and $\tilde{\tau}_k = \tau_k$. Substituting (8.1) in (8.3), we get

$$Z \leq \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k} \tau_k\big] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n}\big]\Big\} \Delta q \stackrel{(5^*)}{=} \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\pi}\big[\gamma^{\zeta_k}\big]\, \mathbb{E}_{\pi}[\tau_k] + C\, \mathbb{E}_{\pi}\big[\gamma^{\zeta_n}\big]\Big\} \Delta q \leq \Big\{c\, t_{\max} \sum_{k=0}^{n-1} \rho^k + C \rho^n\Big\} \Delta q, \quad (8.4)$$

where $(5^*)$ follows due to the independence of $\zeta_k = \sum_{i=0}^{k-1} \tau_i$ and $\tau_k$. Taking the infinite time limit in (8.4), we get $V^*(q_0, cog_0) - V^*(\tilde{q}_0, cog_0) \leq \frac{c\, t_{\max} \Delta q}{1 - \rho}$, for $\tilde{q}_0 \geq q_0$.

We now show the lower bound on $Z$. Let $\tilde{a}_k$ be optimal for $\tilde{s}_k = (\tilde{q}_k, cog_k)$, and choose $\langle a_k \rangle = \langle \tilde{a}_k \rangle$ for the sequence $\langle s_k = (q_k, cog_k) \rangle$, where $\tilde{s}_0 = (\tilde{q}_0, cog_0)$ and $s_0 = (q_0, cog_0)$. Note that (8.1) still holds. Analogous to (8.3), $Z$ is lower-bounded by

$$Z \geq J_{n,\pi}(q_0, cog_0) - V^*_n(\tilde{q}_0, cog_0) = c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_k} \tilde{\tau}_k \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_k \mid \tilde{s}_0, \tilde{\zeta}_k] - \mathbb{E}_{\pi}[q_k \mid s_0, \tilde{\zeta}_k]\big\}\big] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_n} \big\{\mathbb{E}_{\tilde{\pi}}[\tilde{q}_n \mid \tilde{s}_0, \tilde{\zeta}_n] - \mathbb{E}_{\pi}[q_n \mid s_0, \tilde{\zeta}_n]\big\}\big]. \quad (8.5)$$

Substituting (8.1) in (8.5), we get

$$Z \geq \Big\{c \sum_{k=0}^{n-1} \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_k}\big]\, \mathbb{E}_{\tilde{\pi}}[\tilde{\tau}_k] + C\, \mathbb{E}_{\tilde{\pi}}\big[\gamma^{\tilde{\zeta}_n}\big]\Big\} \Delta q \stackrel{(6^*)}{\geq} \Big\{c\, t_s \sum_{k=0}^{n-1} \gamma^{\mathbb{E}_{\tilde{\pi}}[\tilde{\zeta}_k]} + C \gamma^{\mathbb{E}_{\tilde{\pi}}[\tilde{\zeta}_n]}\Big\} \Delta q \geq \Big\{\frac{(1 - \gamma^{n t_{\max}})\, c\, t_s}{1 - \gamma^{t_{\max}}} + \gamma^{n t_{\max}} C\Big\} \Delta q, \quad (8.6)$$

where $(6^*)$ follows by applying Jensen's inequality [121] ($\mathbb{E}[g(x)] \geq g(\mathbb{E}[x])$) to the convex function $g(x) = \gamma^x$. Taking the infinite time limit in (8.6), we get $0 \leq \frac{c\, t_s \Delta q}{1 - \gamma^{t_{\max}}} \leq V^*(q_0, cog_0) - V^*(\tilde{q}_0, cog_0)$, for $\tilde{q}_0 \geq q_0$. $\square$

Proof of Lemma 4

Proof. We start by proving the first statement. In the following, we find conditions under which, if action N is the optimal choice at queue length $q_1$ for a given cognitive state $cog \leq cog^*$, then for all $q_2 > q_1$, N dominates H. Let N be the optimal action in state $s_1 = (q_1, cog)$. Let $F(s, a)$ denote the expected future reward received in state $s$ for taking action $a$ (the second term in the Bellman equation (2.3)). Then, we have

$$R(s_1, N) - R(s_1, H) + F(s_1, N) - F(s_1, H) > 0,$$
$$\implies M + \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V^*(cog', q_1 + q - 1)\big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big) > 0, \quad (8.7)$$

where $M := c\big(\mathbb{E}[\tau \mid cog, H] - \mathbb{E}[\tau \mid cog, N]\big) q_1 + r_N - r_H + \tfrac{c\lambda}{2}\big(\mathbb{E}[\tau^2 \mid cog, H] - \mathbb{E}[\tau^2 \mid cog, N]\big)$, and $\mathbb{P}(q_1 + q - 1 \mid q_1, \tau)$ is replaced by $\mathrm{Pois}(q \mid \tau)$, which is the Poisson probability of $q$ arrivals during service time $\tau$. Now, for the state $s_2 = (q_2, cog)$, with $q_2 > q_1$ and identical $cog$, under assumption (A1) we show that

$$R(s_2, N) - R(s_2, H) + F(s_2, N) - F(s_2, H) > 0. \quad (8.8)$$
The left-hand side of (8.8) is given by

$$X + M + \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V^*(cog', q_2 + q - 1)\big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big), \quad (8.9)$$

where $X := c\big(\mathbb{E}[\tau \mid cog, H] - \mathbb{E}[\tau \mid cog, N]\big)(q_2 - q_1)$. To show (8.8), we prove that the difference between the left-hand sides of (8.8) and (8.7) is positive. Subtracting the left-hand side of (8.7) from (8.9), we get

$$X - \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, V_D\, \big(\mathbb{P}(cog', \tau \mid cog, N) - \mathbb{P}(cog', \tau \mid cog, H)\big), \quad (8.10)$$

where $V_D := V^*(cog', q_1 + q - 1) - V^*(cog', q_2 + q - 1)$. From Lemma 3, we know that

$$0 \leq \beta := \frac{c\, t_s (q_2 - q_1)}{1 - \gamma^{t_{\max}}} \leq V_D \leq \frac{c\, t_{\max} (q_2 - q_1)}{1 - \rho} =: \alpha.$$

Therefore, (8.10) is lower bounded by

$$X + \beta \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, \mathbb{P}(cog', \tau \mid cog, H) - \alpha \sum_{\tau} \sum_{cog'} \sum_{q} \gamma^{\tau}\, \mathrm{Pois}(q \mid \tau)\, \mathbb{P}(cog', \tau \mid cog, N) \geq X + \beta\, \gamma^{\mathbb{E}[\tau \mid cog, H]} - \alpha\, \mathbb{E}[\gamma^{\tau} \mid cog, N], \quad (8.11)$$

where we utilized Jensen's inequality on the convex function $\gamma^{\tau}$ to obtain $\mathbb{E}[\gamma^{\tau} \mid cog, H] \geq \gamma^{\mathbb{E}[\tau \mid cog, H]}$. (8.11) is nonnegative when the condition in the first statement holds.

Now we prove the second statement. Using a similar analysis, it can be shown that if action S is the optimal choice at queue length $q_1$ for a given cognitive state $cog \leq cog^*$, then for every $q_2 > q_1$, S dominates H and N, respectively, under the following conditions:

$$\mathbb{E}[\tau \mid cog, H] - t_s + \frac{t_s\, \gamma^{\mathbb{E}[\tau \mid cog, H]}}{1 - \gamma^{t_{\max}}} \geq \frac{\gamma^{t_s}\, t_{\max}}{1 - \rho}, \quad (8.12)$$
$$\mathbb{E}[\tau \mid cog, N] - t_s + \frac{t_s\, \gamma^{\mathbb{E}[\tau \mid cog, N]}}{1 - \gamma^{t_{\max}}} \geq \frac{\gamma^{t_s}\, t_{\max}}{1 - \rho}, \quad (8.13)$$

respectively, where we have used $\mathbb{E}[\tau \mid cog, S] = t_s$ and $\mathbb{E}[\gamma^{\tau} \mid cog, S] = \gamma^{t_s}$ due to the constant time of skip. Since $\mathbb{E}[\tau \mid cog, H] > \mathbb{E}[\tau \mid cog, N]$, (8.12)-(8.13) can be combined to obtain the condition in the second statement under which action S dominates both H and N. $\square$

APPENDIX B: CHAPTER 4

Proof of Theorem 2

A key challenge we address is the time-dependence of the asynchronous adaptive Bellman updates, which adapt to the latest estimates of the service-time distribution. Using Lemmas 13 and 14, we upper-bound the difference between the optimal value functions for the intermediate SMDPs at subsequent time steps and the optimal value function for the uncertainty-free SMDP; these bounds are then used to establish the convergence result.

Theorem 9 (adapted from [122, Chapter 10]). $\mathcal{T}$ is a contraction mapping and therefore there exists a unique fixed point satisfying $\mathcal{T}(V^*) = V^*$, where $V^*$ is the optimal value function for the uncertainty-free SMDP $\Gamma$.

Let $V^*_t$ be the optimal value function for the SMDP $\hat{\Gamma}_t$ defined by the estimates $\hat{P}_t$ and $\hat{R}_t$. Therefore, $V^*_t = \hat{\mathcal{T}}_t(V^*_t)$. Let $\|\cdot\|$ be the max-norm given by $\|v\| = \max\{|v_1|, |v_2|, \dots, |v_n|\}$, for any vector $v = (v_1, v_2, \dots, v_n)$. Let $\tau_{\min} \geq 1$ be the minimum number of time steps spent in any state $s \in \mathcal{S}$, for any action $a \in \mathcal{A}_S$.

Lemma 13. For any state $s \in \mathcal{S}$, the following statements hold:
(i) $|\hat{\mathcal{T}}_t(V_1(s)) - \hat{\mathcal{T}}_t(V_2(s))| \leq \gamma^{\tau_{\min}} \|V_1 - V_2\|$ for any $s \in \mathcal{S}$, where the Bellman operator $\hat{\mathcal{T}}_t$ at any time $t$ is applied to value function estimates $V_1$ and $V_2$;
(ii) $|V_{t+1}(s) - V^*_t(s)| \leq \gamma^{\tau_{\min}} \|V_t - V^*_t\|$, if $s \in B_t$.

Proof. We prove the first statement. For any $s \in \mathcal{S}$, let $V := |\hat{\mathcal{T}}_t(V_1(s)) - \hat{\mathcal{T}}_t(V_2(s))|$. Therefore,

$$V \leq \max_{a \in \mathcal{A}_S} \Big| \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a)\big(V_1(s') - V_2(s')\big) \Big| \leq \|V_1 - V_2\| \sum_{\tau} \gamma^{\tau} \hat{P}_t(\tau \mid s, a) \leq \gamma^{\tau_{\min}} \|V_1 - V_2\|.$$

The second statement follows from the first by noting that $|V_{t+1}(s) - V^*_t(s)| = |\hat{\mathcal{T}}_t(V_t) - \hat{\mathcal{T}}_t(V^*_t)|$ if $s \in B_t$.
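The contraction in statement (i) of Lemma 13 is easy to check numerically. The following minimal Python sketch builds a small random SMDP kernel and verifies that the induced Bellman operator contracts by at least $\gamma^{\tau_{\min}}$ in the max-norm on random pairs of value vectors; the state and action counts, sojourn-time support, discount factor, rewards, and random kernel are illustrative assumptions only, not the model identified in the experiments.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy SMDP (illustrative sizes only).
n_states, n_actions = 6, 3
tau_support = np.array([2, 3, 4])          # sojourn times; tau_min = 2
gamma = 0.9

# Random joint kernel P(s', tau | s, a), normalized over (s', tau), and random rewards.
P = rng.random((n_states, n_actions, n_states, tau_support.size))
P /= P.sum(axis=(2, 3), keepdims=True)
R = rng.random((n_states, n_actions))

def bellman(V):
    """One application of the SMDP Bellman operator: max_a R(s,a) + E[gamma^tau V(s')]."""
    disc = gamma ** tau_support                       # gamma^tau for each sojourn time
    Q = R + np.einsum('sant,n,t->sa', P, V, disc)
    return Q.max(axis=1)

# Check max_s |T V1(s) - T V2(s)| <= gamma^{tau_min} ||V1 - V2||_inf on random instances.
tau_min = tau_support.min()
for _ in range(1000):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.abs(bellman(V1) - bellman(V2)).max()
    assert lhs <= gamma ** tau_min * np.abs(V1 - V2).max() + 1e-12
print("gamma^tau_min contraction verified on random instances")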
Lemma 14. Under Assumptions (A1)-(A5), for any given $\epsilon > 0$, there exists a time $\tilde{t}$ such that for any $t \geq \tilde{t}$, (i) $\|V^*_t - V^*\| \leq \epsilon$, and (ii) $\|V^*_{t+1} - V^*_t\| \leq 2\epsilon$, with probability 1.

Proof. Since the estimates $\hat{P}_t$ and $\hat{R}_t$ are assumed to be bounded at any time $t$ (Assumption (A2)), the value function estimate $V_t$ at any time $t$ also remains bounded. Furthermore, using Assumption (A3), $\hat{P}_t$ and $\hat{R}_t$ converge to their true values $P$ and $R$, respectively, with probability 1, i.e., for any $\epsilon_1, \epsilon_2 > 0$, there exists a time $\tilde{t}_0$ such that, for any $t \geq \tilde{t}_0$, $|\hat{R}_t - R| \leq \epsilon_1$ and $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$, where $\hat{p}^{ij}_t$ and $p^{ij}$ are the elements of $\hat{P}_t$ and $P$, respectively.

In asynchronous VI, only the values of the states $s \in B_t \subseteq \mathcal{S}$ are updated at any time $t$; however, each state is assumed to be updated infinitely often. Therefore, the sequence (4.6) converges with probability 1, i.e., for any $\epsilon_3 > 0$, there exists a time $\tilde{t}_1 \geq \tilde{t}_0$ such that $\|V_{t+1} - V_t\| \leq \epsilon_3$, for $t \geq \tilde{t}_1$.

Consider $s \in B_t$ such that $V_{t+1}(s) = \hat{\mathcal{T}}_t(V_t(s))$. Let $\mathcal{V}^*_t := \|V^*_t - V^*\|$, where we only consider the states $s \in B_t$ in the vectors $V^*_t$ and $V^*$. Therefore,

$$\mathcal{V}^*_t \leq \|V^*_t - V_{t+1}\| + \|V_{t+1} - V^*\| =: Z_1 + Z_2. \quad (8.14)$$

$Z_1 = \|V^*_t - V_{t+1}\|$ is upper bounded by

$$Z_1 \leq \|V^*_t - \hat{\mathcal{T}}_t(V_{t+1})\| + \|\hat{\mathcal{T}}_t(V_{t+1}) - \hat{\mathcal{T}}_t(V_t)\| =: Z_1^1 + Z_1^2. \quad (8.15)$$

Since $Z_1^1 = \|V^*_t - \hat{\mathcal{T}}_t(V_{t+1})\| = \|\hat{\mathcal{T}}_t(V^*_t) - \hat{\mathcal{T}}_t(V_{t+1})\|$, from statement (i) of Lemma 13, we get

$$Z_1^1 \leq \gamma^{\tau_{\min}} \|V^*_t - V_{t+1}\|, \quad \text{and} \quad Z_1^2 \leq \gamma^{\tau_{\min}} \|V_{t+1} - V_t\|. \quad (8.16)$$

Substituting (8.16) in (8.15), we get

$$Z_1 \leq \frac{\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\|. \quad (8.17)$$

$Z_2 = \|V^* - V_{t+1}\|$ is upper bounded by

$$Z_2 \leq \|V^* - \mathcal{T}(V_{t+1})\| + \|\mathcal{T}(V_{t+1}) - V_{t+1}\| =: Z_2^1 + Z_2^2. \quad (8.18)$$

Again using statement (i) of Lemma 13, we have

$$Z_2^1 \leq \gamma^{\tau_{\min}} \|V^* - V_{t+1}\| = \gamma^{\tau_{\min}} Z_2. \quad (8.19)$$

Furthermore, $Z_2^2 = \|\mathcal{T}(V_{t+1}) - V_{t+1}\| = \|\mathcal{T}(V_{t+1}) - \hat{\mathcal{T}}_t(V_t)\|$ is upper bounded by

$$Z_2^2 \leq \max_{a \in \mathcal{A}_S} \|R - \hat{R}_t\| + \max_{a \in \mathcal{A}_S} \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} P(s', \tau \mid s, a) V_{t+1}(s') - \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a) V_t(s') \Big\|. \quad (8.20)$$

Recall that for any $t \geq \tilde{t}_0$, $|\hat{R}_t - R| \leq \epsilon_1$ and $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$. Therefore, for $t \geq \tilde{t}_0$, (8.20) is upper bounded by

$$Z_2^2 \leq \epsilon_1 + \max_{a \in \mathcal{A}_S} \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \big|P(s', \tau \mid s, a) - \hat{P}_t(s', \tau \mid s, a)\big| V_{t+1}(s') \Big\| + \Big\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \hat{P}_t(s', \tau \mid s, a)\big(V_{t+1}(s') - V_t(s')\big) \Big\|$$
$$\stackrel{(1^*)}{\leq} \epsilon_1 + \|V_{t+1}\| \sum_{s'} \sum_{\tau} \gamma^{\tau} \epsilon_2 + \gamma^{\tau_{\min}} \|V_{t+1} - V_t\| = \epsilon_1 + \frac{\epsilon_2 N \gamma^{\tau_{\min}} \|V_{t+1}\|}{1 - \gamma} + \gamma^{\tau_{\min}} \|V_{t+1} - V_t\|, \quad (8.21)$$

where $N$ is the size of the finite state space $\mathcal{S}$, and $(1^*)$ follows from $|\hat{p}^{ij}_t - p^{ij}| \leq \epsilon_2$ and statement (i) of Lemma 13. Substituting (8.19) and (8.21) in (8.18), we get

$$Z_2 \leq \frac{\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\| + f(\|V_{t+1}\|), \quad (8.22)$$

where $f(\|V_{t+1}\|) := \frac{1}{1 - \gamma^{\tau_{\min}}}\Big(\epsilon_1 + \frac{\epsilon_2 N \gamma^{\tau_{\min}} \|V_{t+1}\|}{1 - \gamma}\Big)$ is bounded for bounded $\|V_{t+1}\|$, and $f(\|V_{t+1}\|) \to 0$ as $\epsilon_1, \epsilon_2 \to 0$. Substituting (8.17) and (8.22) in (8.14), we get

$$\mathcal{V}^*_t \leq \frac{2\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}} \|V_{t+1} - V_t\| + f(\|V_{t+1}\|). \quad (8.23)$$

Recall that for any $\epsilon_3 > 0$, there exists $\tilde{t}_1 \geq \tilde{t}_0$ such that $\|V_{t+1} - V_t\| \leq \epsilon_3$, for $t \geq \tilde{t}_1$. Choosing $\epsilon_1, \epsilon_2 \to 0$ and $\epsilon_3 = \frac{\epsilon(1 - \gamma^{\tau_{\min}})}{2\gamma^{\tau_{\min}}}$, we get that there exists $\tilde{t}_1$ such that $\|V_{t+1} - V_t\| \leq \frac{\epsilon(1 - \gamma^{\tau_{\min}})}{2\gamma^{\tau_{\min}}}$ and $\mathcal{V}^*_t = \|V^*_t - V^*\| < \epsilon$, for any $t \geq \tilde{t}_1$.

Recall that $\mathcal{V}^*_t$ only considers the states $s \in B_t$. However, since each state is updated infinitely often, there exists $\tilde{t}$ such that $\|V^*_t - V^*\| < \epsilon$, for any $t \geq \tilde{t}$. Furthermore, for $t \geq \tilde{t}$, $\|V^*_{t+1} - V^*_t\| \leq \|V^*_{t+1} - V^*\| + \|V^*_t - V^*\| \leq 2\epsilon$.
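Lemma 14 and the proof of Theorem 2 below concern asynchronous value iteration in which, at each time $t$, only a subset $B_t$ of states is backed up using the current estimates $\hat{P}_t$ and $\hat{R}_t$. A minimal Python sketch of such an update loop on a toy SMDP is given below; the assumed $1/t$ decay of the estimation error, the random choice of $B_t$, and all sizes and parameters are illustrative assumptions, not the estimator developed in Chapter 4.

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 6, 3
tau_support = np.array([2, 3, 4])
gamma = 0.9
disc = gamma ** tau_support

P_true = rng.random((n_states, n_actions, n_states, tau_support.size))
P_true /= P_true.sum(axis=(2, 3), keepdims=True)
R_true = rng.random((n_states, n_actions))

def backup(V, P, R, states):
    """Asynchronous Bellman backup: update only the states in B_t."""
    V = V.copy()
    Q = R + np.einsum('sant,n,t->sa', P, V, disc)
    V[states] = Q.max(axis=1)[states]
    return V

# Optimal value function of the uncertainty-free SMDP (synchronous VI to convergence).
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = backup(V_star, P_true, R_true, np.arange(n_states))

# Adaptive asynchronous VI: estimates improve over time (assumed 1/t noise) and only
# a random nonempty subset B_t of states is backed up at each step.
V = np.zeros(n_states)
for t in range(1, 3001):
    noise = 1.0 / t
    P_hat = P_true + noise * rng.random(P_true.shape)
    P_hat /= P_hat.sum(axis=(2, 3), keepdims=True)
    R_hat = R_true + noise * rng.normal(size=R_true.shape)
    B_t = rng.choice(n_states, size=rng.integers(1, n_states + 1), replace=False)
    V = backup(V, P_hat, R_hat, B_t)

print("final error ||V_t - V*||_inf =", np.abs(V - V_star).max())

Because each state is selected with positive probability at every step, every state is updated infinitely often with probability 1, which is the condition invoked in Lemma 14 and in the proof of Theorem 2.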
Proof of Theorem 2: For any state $s \in \mathcal{S}$, define a sequence $\{t^s_i\}_{i=1}^{\infty}$ of times at which state $s$ is updated by the asynchronous VI, and consider the updates after time $\tilde{t}$, i.e., consider the sequence $\{t^s_i\}_{i=k}^{\infty}$ such that $t^s_k \geq \tilde{t}$. Let $\mathcal{V}_t(s) := |V_{t+1}(s) - V^*_t(s)|$. Therefore, using statement (ii) of Lemma 13, $\mathcal{V}_{t^s_{i+1}}(s) = |V_{t^s_{i+1}+1}(s) - V^*_{t^s_{i+1}}(s)| \leq \gamma^{\tau_{\min}} \|V_{t^s_{i+1}} - V^*_{t^s_{i+1}}\|$, and is therefore upper bounded by

$$\mathcal{V}_{t^s_{i+1}}(s) \leq \gamma^{\tau_{\min}}\big(\|V_{t^s_{i+1}} - V^*_{t^s_i}\| + \|V^*_{t^s_i} - V^*_{t^s_{i+1}}\|\big) \stackrel{(1^*)}{\leq} \gamma^{\tau_{\min}}\big(\|\mathcal{V}_{t^s_i}\| + 2\epsilon\big), \quad (8.24)$$

for $i \geq k$, where $(1^*)$ follows from statement (ii) of Lemma 14. From (8.24), we get the following recursion:

$$\|\mathcal{V}_{t^s_{i+1}}\| \leq \gamma^{\tau_{\min}}\big(\|\mathcal{V}_{t^s_i}\| + 2\epsilon\big), \quad (8.25)$$

for $i \geq k$. Recursively applying (8.25) to obtain upper bounds on $\|\mathcal{V}_{t^s_j}\|$, for $j = k, \dots, i$, and substituting in (8.24), we get

$$\mathcal{V}_{t^s_{i+1}}(s) \leq \gamma^{(i+1)\tau_{\min}} \|\mathcal{V}_{t^s_k}\| + \frac{2\gamma^{\tau_{\min}}(1 - \gamma^{(i+1)\tau_{\min}})}{1 - \gamma^{\tau_{\min}}} \epsilon.$$

In the limit $i \to \infty$, $\mathcal{V}_{t^s_{i+1}} = |V_{t^s_{i+1}+1}(s) - V^*_{t^s_{i+1}}(s)| \leq \epsilon_4$, where $\epsilon_4 := \frac{2\gamma^{\tau_{\min}}}{1 - \gamma^{\tau_{\min}}}\epsilon$, and $\epsilon_4 \to 0$ as $\epsilon \to 0$. Since for each $s \in \mathcal{S}$, $V^*_{t^s_{i+1}}(s)$ converges to $V^*(s)$ (Lemma 14), and $\epsilon$ is arbitrary, $V_t(s)$ converges to $V^*(s)$ for any $s$. ■

Proof of Theorem 3

We prove Theorem 3 using the following Theorem 10.

Theorem 10. $\mathcal{T}_r$ is a contraction mapping, and hence, there exists a unique fixed point satisfying $\mathcal{T}_r(V) = V$.

Proof. The proof follows similarly to the case of robust MDPs [32].

Proof of Theorem 3: Since $\mathcal{T}_r$ is a contraction mapping (Theorem 10), and each state is updated infinitely often, the robust adaptive asynchronous VI converges to a fixed point $\mathcal{T}_r(V) = V$. Furthermore, bounded $\mathcal{P}^a$ implies that the value function at any time $t$ remains bounded. Once $\mathcal{P}^a$ converges to the singleton estimate $P$, the robust adaptive asynchronous VI reduces to the adaptive asynchronous VI. Hence, the proof follows using Theorem 2. ■

APPENDIX C: CHAPTER 7

Proof of Theorem 1 [Existence of PNE]

We prove Theorem 6 using Brouwer's fixed point theorem [66, Appendix C] applied to the best response mapping, with the help of the following lemmas (Lemmas 15-17). Recall that $b_i(\lambda^R_{-i})$ is the best response of player $i$ to the review admission rates $\lambda^R_{-i}$ of the other players. For brevity of notation, we will represent $r_R(\lambda^R_i, \lambda^R_{-i})$, $p(\lambda^R_i, \lambda^R_{-i})$, $f_i(\lambda^R_i, \lambda^R_{-i})$, and $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i})$ by $r_R$, $p$, $f_i$, and $\tilde{u}_i$, respectively. Furthermore, let $q'$ and $q''$, respectively, represent the first and second partial derivatives of a generic function $q$ with respect to $\lambda^R_i$.

Lemma 15 (Strict concavity of incentive). For the CPR game $\Gamma$, under Assumptions (A1)-(A2), the incentive function $f_i : \mathcal{S} \mapsto \mathbb{R}$ is strictly concave in $\lambda^R_i$, for $\lambda^R_i \in [0, \bar{\lambda}^R_i]$ and any fixed $\lambda^R_{-i}$. Equivalently, $f_i(x)$ is strictly concave in $x$ for $x \in [0, \mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \lambda^R_j]$.

Proof. Recall from (7.9) that $f_i(\lambda^R_i, \lambda^R_{-i}) = f_i(x) = r_R(x)(1 - p(x)) - h_i r_S$. The first and second partial derivatives of the incentive function $f_i$ with respect to $\lambda^R_i$ in the interval $\lambda^R_i \in [0, \bar{\lambda}^R_i]$ are given by

$$f_i' = (r_R)'(1 - p) - r_R\, p' = -a_i \frac{d f_i}{d x}, \quad (8.26a)$$
$$f_i'' = (r_R)''(1 - p) - 2 (r_R)' p' - r_R\, p'' = a_i^2 \frac{d^2 f_i}{d x^2}. \quad (8.26b)$$

From Assumptions (A1) and (A2), we have $f_i'' < 0$ and $\frac{d^2 f_i}{d x^2} < 0$ in the interval where the derivative of $f_i$ exists, thereby proving the strict concavity of $f_i$ in $\lambda^R_i$ and in $x$.

Lemma 16 (Best response mapping).
For the CPR game $\Gamma$, under Assumptions (A1)-(A2), the best response mapping $b_i(\lambda^R_{-i})$ is unique for any $\lambda^R_{-i} \in \mathcal{S}_{-i}$ and is given by

$$b_i(\lambda^R_{-i}) = \begin{cases} 0, & \text{if } f_i(\lambda^R_i, \cdot) \leq 0, \ \forall \lambda^R_i \in \mathcal{S}_i, \\ \alpha_i, & \text{if } \exists\, \alpha_i \in \mathcal{S}_i \text{ s.t. } \frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(\alpha_i) = 0 \text{ and } f_i(\alpha_i, \cdot) > 0, \\ \mu^R_i, & \text{otherwise}. \end{cases}$$

Proof. We establish the uniqueness of the best response mapping through the following three cases.

Case 1: $f_i(\lambda^R_i, \cdot) \leq 0$, for every $\lambda^R_i \in \mathcal{S}_i$. If, for a given $\lambda^R_{-i} \in \mathcal{S}_{-i}$, $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then from (7.8), $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i})$ admits a unique maximum at $\lambda^R_i = 0$, and therefore, $b_i(\lambda^R_{-i}) = 0$ is the unique best response.

Case 2: There exists a non-empty interval $\bar{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. For any given $\lambda^R_{-i} \in \mathcal{S}_{-i}$, recall that the system constraint (7.3) is violated for every $\lambda^R_i \in (\bar{\lambda}^R_i, \mu^R_i] \subset \mathcal{S}_i$, and $p(\lambda^R_i, \lambda^R_{-i}) = 1$. Therefore, for every $\lambda^R_i \in (\bar{\lambda}^R_i, \mu^R_i]$, we have

$$f_i = -h_i r_S < 0. \quad (8.27)$$

Therefore, $b_i(\lambda^R_{-i}) \in [0, \bar{\lambda}^R_i] \subset \mathcal{S}_i$, for any given $\lambda^R_{-i} \in \mathcal{S}_{-i}$. Furthermore, since $p$ is continuously differentiable with respect to $\lambda^R_i$ for each $\lambda^R_i \in (0, \bar{\lambda}^R_i)$, $\tilde{u}_i$ is a smooth function on the set $[0, \bar{\lambda}^R_i] \times \mathcal{S}_{-i}$. Hence, the best response, which is a global maximizer of $\tilde{u}_i$ on the interval $\lambda^R_i \in \mathcal{S}_i$ for a fixed $\lambda^R_{-i}$, either occurs at the boundary of $\mathcal{S}_i$ or satisfies the first order condition $\frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(b_i) = 0$ (see [123]). Let there exist $\alpha_i \in \mathcal{S}_i$ such that

$$f_i(\alpha_i, \cdot) > 0, \quad (8.28a)$$
$$\frac{\partial \tilde{u}_i}{\partial \lambda^R_i}(\alpha_i) = \alpha_i f_i'(\alpha_i, \cdot) + f_i(\alpha_i, \cdot) = 0. \quad (8.28b)$$

Since $f_i(\alpha_i, \cdot) > 0$ and $\alpha_i > 0$, (8.28b) has a solution only if $f_i'(\alpha_i, \cdot) < 0$. Furthermore, $f_i(\alpha_i, \cdot) > 0$ implies $\alpha_i \in [0, \bar{\lambda}^R_i]$ (see (8.27)). Therefore, the existence of $\alpha_i$ satisfying (8.28) implies that there exists a non-empty set $\bar{\mathcal{S}}_i \subset [0, \bar{\lambda}^R_i] \subset \mathcal{S}_i$ such that for each $\alpha_i \in \bar{\mathcal{S}}_i$, $f_i(\alpha_i, \cdot) > 0$ and $f_i'(\alpha_i, \cdot) < 0$. For any $\lambda^R_i \in \bar{\mathcal{S}}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, using Lemma 15, we get

$$\frac{\partial^2 \tilde{u}_i}{\partial {\lambda^R_i}^2} = \lambda^R_i f_i'' + 2 f_i' < 0. \quad (8.29)$$

Hence, for $\lambda^R_i \in \bar{\mathcal{S}}_i$, the expected utility $\tilde{u}_i$ is strictly concave with a unique global maximizer $\alpha_i \in \bar{\mathcal{S}}_i$ that satisfies $\alpha_i = \min\{-\frac{f_i(b_i, \cdot)}{f_i'(b_i, \cdot)}, \mu^R_i\}$ (see (8.28b)).

Case 3: There exists a non-empty interval $\tilde{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$, and $f_i'(\lambda^R_i, \cdot) \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$. Finally, consider the case that $f_i'(\lambda^R_i, \cdot) \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$, and there exists an interval $\tilde{\mathcal{S}}_i \subset \mathcal{S}_i$ where $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$. Since $f_i'(\lambda^R_i, \cdot) \geq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, $f_i(\lambda^R_i, \cdot)$ is increasing in $\lambda^R_i$, and therefore, $f_i(\lambda^R_i, \cdot)$ is maximized at $\lambda^R_i = \mu^R_i$. Since there exists a non-empty interval $\tilde{\mathcal{S}}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ for every $\lambda^R_i \in \tilde{\mathcal{S}}_i$, and $f_i(\lambda^R_i, \cdot)$ is monotonically increasing, it follows that $f_i(\mu^R_i, \cdot) > 0$. Therefore, (8.28b) has no solution and the expected utility of player $i$ is strictly increasing in $\lambda^R_i$, i.e., $\frac{\partial \tilde{u}_i}{\partial \lambda^R_i} > 0$, for every $\lambda^R_i \in \mathcal{S}_i$. Therefore, the best response is the unique maximum of $\tilde{u}_i$, which occurs at the boundary $\mu^R_i$.

We state some important intermediate results from the three cases of Lemma 16 as a corollary for later discussions.

Corollary 2 (Best response and incentive).
For the CPR game $\Gamma$, under Assumptions (A1)-(A3), the following statements hold:
(i) $b_i = 0$ if and only if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$; furthermore, $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$ implies $f_i'(\lambda^R_i, \cdot) < 0$ for every $\lambda^R_i \in \mathcal{S}_i$;
(ii) if there exists an interval $\bar{\mathcal{S}}_i \subset \mathcal{S}_i$ such that $f_i(\lambda^R_i, \cdot) > 0$ and $f_i'(\lambda^R_i, \cdot) < 0$, $\forall \lambda^R_i \in \bar{\mathcal{S}}_i$, then the unique best response for player $i$ satisfies the implicit equation $b_i = \min\{-\frac{f_i(b_i, \cdot)}{f_i'(b_i, \cdot)}, \mu^R_i\} \in \bar{\mathcal{S}}_i$; and
(iii) if $f_i'(\lambda^R_i, \cdot) \geq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then $b_i = \mu^R_i$.

Proof. We only establish the first statement of the corollary; the other statements are established in the proof of Lemma 16. We have already established in Lemma 16 that if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then the expected utility $\tilde{u}_i$ is maximized by $b_i = 0$. We now establish the "only if" part. Recall from (7.8) that $\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) = \mu^S_i r_S + \lambda^R_i f_i(\lambda^R_i, \lambda^R_{-i})$. Let $b_i = 0$ be the best response for player $i$ for a fixed $\lambda^R_{-i}$. If there exists $b \in \mathcal{S}_i$ such that $f_i(b, \cdot) > 0$, then $\tilde{u}_i(b, \cdot) > \tilde{u}_i(b_i, \cdot)$, and $b_i = 0$ cannot be a best response. Hence, $b_i = 0$ is the best response for player $i$ if and only if $f_i(\lambda^R_i, \cdot) \leq 0$, for every $\lambda^R_i \in \mathcal{S}_i$.

We now show that if $f_i(\lambda^R_i, \cdot) \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, then $f_i'(\lambda^R_i, \cdot) < 0$ for every $\lambda^R_i \in \mathcal{S}_i$. Since $f_i$ is strictly concave in $x$ (from Lemma 15) and $f_i(\mu^R_i, 0) = f_i(\mu^S_T - a_i \mu^R_i) > 0$ by Assumption (A3), there exist $\gamma_1, \gamma_2 \in \mathbb{R}$ such that $\gamma_1 < \mu^S_T - a_i \mu^R_i < \gamma_2$ and $f_i(x) > 0$ if and only if $x \in (\gamma_1, \gamma_2)$. If $f_i(\lambda^R_i, \lambda^R_{-i}) = f_i(x) \leq 0$ for each $\lambda^R_i \in \mathcal{S}_i$ and for a given $\lambda^R_{-i}$, then for each $\lambda^R_i \in \mathcal{S}_i$, either $x \leq \gamma_1$ or $x \geq \gamma_2$. Suppose $x \geq \gamma_2$ for each $\lambda^R_i \in \mathcal{S}_i$. However, for $\lambda^R_i = \mu^R_i$, $x = \mu^S_T - a_i \mu^R_i - \sum_{j \neq i} a_j \lambda^R_j \leq \mu^S_T - a_i \mu^R_i < \gamma_2$, which is a contradiction. Hence, $x \leq \gamma_1$, for each $\lambda^R_i \in \mathcal{S}_i$. Finally, from the strict concavity of $f_i$, $f_i$ is increasing in $x$ for $x \leq \gamma_1$. Equivalently, $f_i$ is decreasing in $\lambda^R_i$, i.e., $f_i'(\lambda^R_i, \cdot) < 0$, for every $\lambda^R_i \in \mathcal{S}_i$.

Theorem 11 (Berge Maximum Theorem, adapted from [124]). Let $\tilde{u}_i : \mathcal{S}_i \times \mathcal{S}_{-i} \mapsto \mathbb{R}$ be a continuous function on $\mathcal{S}_i \times \mathcal{S}_{-i}$, and let $C : \mathcal{S}_{-i} \mapsto \mathcal{S}_i$ be a compact-valued correspondence such that $C(\lambda^R_{-i}) \neq \emptyset$ for all $\lambda^R_{-i} \in \mathcal{S}_{-i}$. Define $\tilde{u}^*_i : \mathcal{S}_{-i} \mapsto \mathbb{R}$ by $\tilde{u}^*_i(\lambda^R_{-i}) = \max\{\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) \mid \lambda^R_i \in C(\lambda^R_{-i})\}$, and $b_i : \mathcal{S}_{-i} \mapsto \mathcal{S}_i$ by $b_i(\lambda^R_{-i}) = \operatorname{argmax}\{\tilde{u}_i(\lambda^R_i, \lambda^R_{-i}) \mid \lambda^R_i \in C(\lambda^R_{-i})\}$. If $C$ is continuous at $\lambda^R_{-i}$, then $\tilde{u}^*_i$ is continuous and $b_i$ is upper hemicontinuous with nonempty and compact values. Furthermore, if $\tilde{u}_i$ is strictly quasiconcave in $\lambda^R_i \in \mathcal{S}_i$ for each $\lambda^R_{-i}$ and $C$ is convex-valued, then $b_i(\lambda^R_{-i})$ is single-valued, and thus is a continuous function.

Lemma 17 (Continuity of best response mapping). For the CPR game $\Gamma$, under Assumptions (A1)-(A3), the best response mapping $b_i(\lambda^R_{-i})$ is continuous for each $\lambda^R_{-i} \in \mathcal{S}_{-i}$.

Proof. Let $z : \mathcal{S}_{-i} \mapsto \big[\frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \mu^R_j}{a_i}, \frac{\mu^S_T}{a_i}\big]$ be defined by

$$z(\lambda^R_{-i}) := \frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \lambda^R_j}{a_i}. \quad (8.30)$$

The mapping $z(\lambda^R_{-i})$ represents an upper bound on the value of $\lambda^R_i$ above which the system constraint (7.3) is violated. Therefore, for each $\lambda^R_i \in [z(\lambda^R_{-i}), \infty) \cap \mathcal{S}_i$, from (7.9), we get

$$f_i = -h_i r_S < 0, \quad \text{and} \quad f_i' = -r_R\, p' \leq 0, \quad (8.31)$$

where the latter follows from the monotonicity of $p$ (Assumption (A2)). The mapping $z(\lambda^R_{-i})$ defined in (8.30) is continuous on $\mathcal{S}_{-i}$ and linearly decreasing in $\lambda^R_j$, for every $j \in \mathcal{N} \setminus \{i\}$.
Therefore, to establish the continuity of the best response mapping $b_i(\lambda^R_{-i})$ on $\mathcal{S}_{-i}$, it is sufficient to show that $b_i$ is unique and varies continuously with $z(\lambda^R_{-i})$. To this end, we show that for each fixed value of $z(\lambda^R_{-i})$, $b_i(\lambda^R_{-i}) = \phi(z(\lambda^R_{-i}))$, for some continuous function $\phi : \big[\frac{\mu^S_T - \sum_{j \in \mathcal{N}, j \neq i} a_j \mu^R_j}{a_i}, \frac{\mu^S_T}{a_i}\big] \mapsto [0, \mu^R_i]$.

Let $\hat{\lambda}^+ : \mathcal{S}_{-i} \mapsto [0, \mu^R_i]$ be defined by

$$\hat{\lambda}^+(\lambda^R_{-i}) = \begin{cases} 0, & \text{if } f_i(\lambda^R_i, \lambda^R_{-i}) \leq 0, \ \forall \lambda^R_i \in \mathcal{S}_i, \\ \sup\{\lambda^R_i \in \mathcal{S}_i \mid f_i > 0\}, & \text{otherwise}. \end{cases} \quad (8.32)$$

The mapping $\hat{\lambda}^+(\lambda^R_{-i})$, when non-zero, represents the maximal admissible review admission rate for player $i$ that yields her a positive incentive to review the tasks. Fig. C.1 shows the best response of player $i \in \mathcal{N}$ for the three possible cases of $\hat{\lambda}^+$.

[Figure C.1: Best response of player $i$ with varying $\hat{\lambda}^+$. The red curve shows different possibilities for the strictly concave incentive function $f_i(\lambda^R_i, \cdot)$ on $\mathcal{S}_i = [0, \mu^R_i]$, based on the value of $z(\lambda^R_{-i})$: (a) $\hat{\lambda}^+ = 0$, where $f_i < 0$ and $f_i' < 0$ for all $\lambda^R_i \in \mathcal{S}_i$ and $b_i(\lambda^R_{-i}) = 0$; (b) $\hat{\lambda}^+ \in (0, \mu^R_i)$, where there exists a subset of $\mathcal{S}_i$ (shown in blue) on which $f_i > 0$ and $f_i' < 0$, and $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i$; (c) $\hat{\lambda}^+ = \mu^R_i$, where $f_i' \geq 0$ for any $\lambda^R_i \in \mathcal{S}_i$ and $b_i(\lambda^R_{-i}) = \mu^R_i$.]

Case 1: $z(\lambda^R_{-i}) \leq 0$. From (8.31) and statement (i) of Corollary 2, $b_i(\lambda^R_{-i}) = 0$ is the unique (continuous) best response for player $i$.

Case 2: $z(\lambda^R_{-i}) > 0$. From (8.31), $f_i < 0$ and $f_i' < 0$ for any $\lambda^R_i \geq z(\lambda^R_{-i})$. Hence, $\hat{\lambda}^+ < z(\lambda^R_{-i})$. We now consider three cases based on the value of $\hat{\lambda}^+$.

Case 2.1: $\hat{\lambda}^+ = 0$. In this case, $f_i \leq 0$ for every $\lambda^R_i \in \mathcal{S}_i$, and therefore, from statement (i) of Corollary 2, $b_i = 0$ is the unique best response and continuity holds trivially.

Case 2.2: $\hat{\lambda}^+ \in (0, \mu^R_i)$. In this case, for any $\lambda^R_{-i} \in \mathcal{S}_{-i}$, there exists an interval $\bar{\mathcal{S}}_i \subset [0, \hat{\lambda}^+]$ such that $f_i > 0$ and $f_i' < 0$ for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. Here, $f_i' < 0$ follows from the fact that the supremum in (8.32) corresponds to the decreasing segment of $f_i$. From statement (ii) of Corollary 2, there exists a unique $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i$ that maximizes $\tilde{u}_i$. Application of the Berge maximum theorem [124] yields the continuity of the unique maximizer. Since $b_i < \hat{\lambda}^+$, it follows that if $\hat{\lambda}^+ \to 0^+$, then $b_i \to 0^+$. Hence, continuity holds at $\hat{\lambda}^+ = 0$.

Case 2.3: $\hat{\lambda}^+ = \mu^R_i$. Since $\hat{\lambda}^+ < z(\lambda^R_{-i})$, $z(\lambda^R_{-i}) \in (\mu^R_i, \frac{\mu^S_T}{a_i}]$. If $f_i'(\mu^R_i, \lambda^R_{-i}) < 0$, then the continuity follows analogously to Case 2.2. Now consider the case $f_i'(\mu^R_i, \lambda^R_{-i}) = -\delta$, for $\delta > 0$. Since $f_i$ is concave in $\lambda^R_i$ and its derivative is decreasing, there exists $\epsilon > 0$ such that $f_i' < 0$ for $\lambda^R_i \in (\mu^R_i - \epsilon, \mu^R_i]$. Since $f_i(\lambda^R_i, \cdot)$ is strictly concave in $\lambda^R_i$ (Lemma 15), there exists at most one point $\lambda^R_i$ such that $f_i'(\lambda^R_i, \cdot) = 0$. Therefore, in the limit $\delta \to 0^+$, $\epsilon \to 0^+$. Hence, in this limiting case $\bar{\mathcal{S}}_i = (\mu^R_i - \epsilon, \mu^R_i]$, where $\epsilon \to 0^+$, and the best response $b_i(\lambda^R_{-i}) \in \bar{\mathcal{S}}_i = (\mu^R_i - \epsilon, \mu^R_i]$ converges to $\mu^R_i$. If $f_i'(\mu^R_i, \lambda^R_{-i}) \geq 0$, then it follows from the strict concavity of $f_i$ that $f_i'(\lambda^R_i, \lambda^R_{-i}) \geq 0$, for every $\lambda^R_i \in \mathcal{S}_i$. Using statement (iii) of Corollary 2, $b_i(\lambda^R_{-i}) = \mu^R_i$ is the unique (continuous) best response. Note that when $z(\lambda^R_{-i}) = \frac{\mu^S_T}{a_i}$, i.e., when no other player reviews any task ($\lambda^R_{-i} = 0$), from Assumption (A3), $f_i(\mu^R_i, 0) > 0$ and therefore $b_i(\lambda^R_{-i}) = \mu^R_i$. Hence, $b_i(\lambda^R_{-i})$ is continuous for every $z(\lambda^R_{-i})$, and therefore, is continuous for $\lambda^R_{-i} \in \mathcal{S}_{-i}$.
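Lemmas 15-17 also suggest a simple numerical procedure for approximating a fixed point of the best response mapping whose existence Theorem 6 establishes below. The following Python sketch approximates $b_i(\lambda^R_{-i})$ by a grid search over $\mathcal{S}_i$ and iterates sequential best responses; all functional forms ($r_R$, $p$), rates, and parameters are assumptions chosen only for illustration, and the iteration is a heuristic whose convergence is not claimed here.

import numpy as np

# Illustrative CPR game (assumed parameters, not the values used in Chapter 7):
# incentive f_i(x) = rR(x)(1 - p(x)) - h_i*rS, slack x = muS_T - sum_j a_j*lam_j,
# expected utility u_i = muS_i*rS + lam_i*f_i(x), as in (7.8)-(7.9).
rS = 0.2
muS = np.array([1.0, 1.2, 1.5])          # service admission rates mu^S_i
muR = np.array([0.8, 0.7, 0.5])          # maximum review admission rates mu^R_i
h = muS / muR                            # heterogeneity measures h_i
a = 1.0 + h                              # coefficients a_i appearing in the slack
muS_T = muS.sum()

def utility(lam_i, i, lam):
    lam = lam.copy(); lam[i] = lam_i
    x = muS_T - a @ lam
    if x <= 0:                           # system constraint (7.3) violated: p = 1
        return muS[i] * rS - lam_i * h[i] * rS
    f_i = (2.0 - 0.5 * x) * (1.0 - np.exp(-x)) - h[i] * rS   # assumed rR and p
    return muS[i] * rS + lam_i * f_i

def best_response(i, lam, grid=4001):
    """Approximate b_i(lam_{-i}) by maximizing u_i over a fine grid on S_i = [0, muR_i]."""
    cand = np.linspace(0.0, muR[i], grid)
    return cand[int(np.argmax([utility(l, i, lam) for l in cand]))]

# Sequential best-response dynamics: a heuristic search for a fixed point of the
# mapping M in (8.33); existence is guaranteed by Theorem 6, convergence of this
# particular iteration is not claimed.
lam = np.zeros_like(muR)
for _ in range(100):
    for i in range(len(muR)):
        lam[i] = best_response(i, lam)
print("approximate PNE review admission rates:", np.round(lam, 3))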
Proof of Theorem 6: To prove the existence of a PNE, define a mapping $M : \mathcal{S} \mapsto \mathcal{S}$ as follows:

$$M(\lambda^R_1, \lambda^R_2, \dots, \lambda^R_N) = \big(b_1(\lambda^R_{-1}), b_2(\lambda^R_{-2}), \dots, b_N(\lambda^R_{-N})\big). \quad (8.33)$$

The mapping $M$ is unique (Lemma 16) and continuous (Lemma 17), and maps the compact convex set $\mathcal{S}$ ($\mathcal{S}_i$ is convex and compact, $\forall i \in \mathcal{N}$) to itself. Hence, application of Brouwer's fixed point theorem [66, Appendix C] yields that there exists a strategy profile $\lambda^{R*} = \{\lambda^{R*}_i\}_{i \in \mathcal{N}} \in \mathcal{S}$ which is invariant under the best response mapping and is therefore a PNE of the game. $\square$

Proof of Corollary 1 [PNE]: Since a PNE is a best response which remains invariant under the best-response mapping $M$ given by (8.33), Corollary 1 is a direct consequence of Corollary 2, with the simplification that $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ at a PNE. Therefore, to prove Corollary 1, it is sufficient to show statement (i), i.e., $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ for any player $i \in \mathcal{N}$ at a PNE, which we prove by contradiction. Let there exist a player $j$ such that $f_j'(\lambda^{R*}_j, \lambda^{R*}_{-j}) \geq 0$ at a PNE. From (8.26a), it can be seen that the sign of $f_i'$ remains the same for all players at a PNE. Therefore, $f_j' \geq 0$ implies $f_i' \geq 0$ for all $i \in \mathcal{N}$ at that PNE. In such a case, (8.28b) implies that the expected utility of each player $i$ with $\lambda^{R*}_i > 0$ (and therefore $f_i > 0$) is increasing in $\lambda^R_i$ at that PNE, and therefore, each of these players can improve their expected utility by unilaterally increasing their review admission rate. Therefore, $\lambda^{R*}$ cannot be a PNE, which is a contradiction. Hence, $f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i}) < 0$ for any player $i \in \mathcal{N}$ at a PNE, and the corollary follows. $\square$

Proof of Proposition 1 [Structure of PNE]

Let $\lambda^{R*}_{k_1}$ and $\lambda^{R*}_{k_2}$ be the review admission rates at a PNE for players $k_1$ and $k_2$, respectively, with $h_{k_1} \leq h_{k_2}$ and $a_{k_1} \leq a_{k_2}$. By proving $a_{k_1} \lambda^{R*}_{k_1} \geq a_{k_2} \lambda^{R*}_{k_2}$, the relation $\lambda^{R*}_{k_1} \geq \lambda^{R*}_{k_2}$ is established trivially since $a_{k_1} \leq a_{k_2}$. We assume $a_{k_1} \lambda^{R*}_{k_1} < a_{k_2} \lambda^{R*}_{k_2}$ and prove the first statement by establishing a contradiction argument using the two cases discussed below. Furthermore, the proof of the second statement is contained within Case 1 below.

Case 1: $\lambda^{R*}_{k_1} = 0$. From statement (ii) of Corollary 1, $f_{k_1}(\lambda^{R*}_{k_1}, \lambda^{R*}_{-k_1}) \leq 0$. From (7.9), the incentives $f_{k_1}$ and $f_{k_2}$ for players $k_1$ and $k_2$ at a PNE satisfy

$$f_{k_2} = f_{k_1} + (h_{k_1} - h_{k_2}) r_S \leq 0.$$

Therefore, utilizing statement (ii) of Corollary 1 again implies $\lambda^{R*}_{k_2} = 0$, which is a contradiction. This case also proves the second statement.

Case 2: $\lambda^{R*}_{k_1} > 0$. By assumption, $a_{k_1} \lambda^{R*}_{k_1} < a_{k_2} \lambda^{R*}_{k_2}$; from statement (iii) of Corollary 1, $\lambda^{R*}_i$, where $i \in \{k_1, k_2\}$, satisfies the implicit equation

$$\lambda^{R*}_i = \min\Big\{-\frac{f_i(\lambda^{R*}_i, \lambda^{R*}_{-i})}{f_i'(\lambda^{R*}_i, \lambda^{R*}_{-i})}, \mu^R_i\Big\}.$$

We assume that $\lambda^{R*}_{k_1} < \mu^R_{k_1}$, and therefore, $\lambda^{R*}_{k_1} = -\frac{f_{k_1}}{f_{k_1}'}$. Using (7.9) and (8.26a), we get

$$a_{k_2} \lambda^{R*}_{k_2} = \min\Big\{-\frac{a_{k_2} f_{k_2}}{f_{k_2}'}, a_{k_2} \mu^R_{k_2}\Big\} \leq -\frac{a_{k_2} f_{k_2}}{f_{k_2}'} = -\frac{a_{k_1}\big(f_{k_1} + (h_{k_1} - h_{k_2}) r_S\big)}{f_{k_1}'} \leq a_{k_1} \lambda^{R*}_{k_1},$$

which is a contradiction. Hence, if $\lambda^{R*}_{k_1} < \mu^R_{k_1}$, then $a_{k_1} \lambda^{R*}_{k_1} \geq a_{k_2} \lambda^{R*}_{k_2}$ and $\lambda^{R*}_{k_1} \geq \lambda^{R*}_{k_2}$, for each $k_2 > k_1$. $\square$

Proof of Theorem 2 [Uniqueness of PNE]

Suppose that the CPR game $\Gamma$ has multiple PNEs. We define the support of a PNE as the total number of players with non-zero review admission rate. Let $\mathrm{PNE}_1 = \lambda^1 = [\lambda^1_1, \lambda^1_2, \dots, \lambda^1_N]$ and $\mathrm{PNE}_2 = \lambda^2 = [\lambda^2_1, \lambda^2_2, \dots, \lambda^2_N]$ be two different PNEs with distinct supports $m_1$ and $m_2$, respectively. For brevity of notation, we have removed the superscript $R$ from the two PNEs and replaced it by their unique identifier. Without loss of generality, let $m_2 > m_1$.
Let $x_1 = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^1_i$ and $x_2 = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^2_i$ be the slackness parameters at $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$, respectively. We prove the uniqueness of the PNE using a six step process.

Step 1: We first show that if there exist two different PNEs with distinct supports $m_1$ and $m_2$ ($m_1 < m_2$), then $x_1 < x_2$. If $m_1$ and $m_2$ are the supports of $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$, respectively, then $\lambda^1_i = 0$ and $\lambda^2_j = 0$ for each $i > m_1$ and $j > m_2$, respectively (Proposition 1). Additionally, $\lambda^1_i > 0$ and $\lambda^2_j > 0$, for each $i \leq m_1$ and $j \leq m_2$. Hence, $m_2 > m_1$ implies $\lambda^1_{m_2} = 0$, while $\lambda^2_{m_2} > 0$. From statement (i) of Corollary 2, $b_i = 0$ if and only if $f_i \leq 0$ and $f_i' < 0$ (equivalently, $\frac{d f_i}{d x} > 0$) for all $\lambda^R_i \in \mathcal{S}_i$. Therefore, $\lambda^1_{m_2} = 0$ implies $f^1_{m_2} := f_{m_2}(\lambda^1) \leq 0$, while $\lambda^2_{m_2} > 0$ implies $f^2_{m_2} := f_{m_2}(\lambda^2) > 0$. Since $f^2_{m_2} > 0 \geq f^1_{m_2}$ and $\frac{d f_{m_2}}{d x} > 0$ everywhere, it follows that $x_1 < x_2$.

Step 2: We now show that $x_1 \geq x_2$ using Steps 2-5, which is a contradiction to the result of Step 1, and consequently $m_1 = m_2$. From statement (iii) of Corollary 1, the review admission rate of any player $i$, $i \leq m_1$, at $\mathrm{PNE}_k$, $k \in \{1, 2\}$, satisfies

$$\lambda^k_i = \min\Big\{-\frac{f^k_i}{f^{k\prime}_i}, \mu^R_i\Big\}. \quad (8.34)$$

Step 3: We show that $f^2_i > f^1_i$ for any player $i$, $i \leq m_1$. From (7.9), the incentives $f_i$ and $f_j$ for any two distinct players $i$ and $j$ with $j > i$ at a $\mathrm{PNE}_k$, $k \in \{1, 2\}$, satisfy

$$f^k_i - f^k_j = (h_j - h_i) r_S > 0, \quad \forall j > i.$$

Notice that the right-hand side of the above equation is independent of $\lambda^R_i$ and is therefore a constant for both PNEs. Hence, for every $i < m_2$,

$$f^1_i - f^1_{m_2} = f^2_i - f^2_{m_2}.$$

Therefore, $f^2_{m_2} > f^1_{m_2}$ implies $f^2_i > f^1_i$, for every $i \leq m_1 < m_2$.

Step 4: We show that $f^{1\prime}_i < f^{2\prime}_i$, for every player $i$, $i \leq m_1$. Recall that $f_i$ is strictly concave in $x$ (Lemma 15). Therefore, $x_1 < x_2$ (Step 1) implies $\frac{d f^1_i}{d x} > \frac{d f^2_i}{d x}$. Therefore, from (8.26a), $f^{1\prime}_i < f^{2\prime}_i$, for any player $i$, $i \leq m_1$.

Step 5: We now show that $x_1 \geq x_2$, which is a contradiction to the result of Step 1, and consequently $m_1 = m_2$. Since for all players $i$, $i \leq m_1$, $f^2_i > f^1_i$ (Step 3) and $-f^{1\prime}_i > -f^{2\prime}_i$ (Step 4), (8.34) implies $\lambda^2_i \geq \lambda^1_i$, for each $i \leq m_1$. Therefore, $\sum_{i=1}^{N} a_i \lambda^2_i \geq \sum_{i=1}^{m_1} a_i \lambda^2_i \geq \sum_{i=1}^{m_1} a_i \lambda^1_i = \sum_{i=1}^{N} a_i \lambda^1_i$, which implies $x_1 \geq x_2$, a contradiction to the result of Step 1. Hence, $m_1 = m_2$.

Step 6: We now show that the value of the slackness parameter $x$ at any PNE is unique. Steps 1 to 5 show that, at a PNE, the number of players with non-zero review admission rate is unique. Therefore, let $m$ be the identical support of $\mathrm{PNE}_1$ and $\mathrm{PNE}_2$. Without loss of generality, let $x_1 > x_2$. Let $g_i : \mathbb{R} \mapsto \mathbb{R}$, for $i \leq m$, be defined by $g_i(x) = -\frac{f_i(x)}{f_i'(x)}$. Differentiating $g_i(x)$ with respect to $x$, we get

$$\frac{d g_i(x)}{d x} = \frac{\big(\frac{d f_i(x)}{d x}\big)^2 - f_i(x) \frac{d^2 f_i(x)}{d x^2}}{a_i \big(\frac{d f_i(x)}{d x}\big)^2}.$$

Recall from statement (iii) of Corollary 1 that players have a non-zero review admission rate at a PNE if and only if $f_i > 0$ at the PNE. Strict concavity of $f_i$ (Lemma 15) implies $\frac{d g_i(x)}{d x} > 0$. Consequently, at a PNE, the review admission rate of any player $i$, $i \leq m$, is increasing in $x$. Therefore, the assumption $x_1 > x_2$ implies $\lambda^1_i \geq \lambda^2_i$, for each player $i \leq m$. Consequently, $x_1 = \mu^S_T - \sum_{i=1}^{m} a_i \lambda^1_i \leq \mu^S_T - \sum_{i=1}^{m} a_i \lambda^2_i = x_2$, which is a contradiction. Therefore, $x_1 = x_2$.

We now show the uniqueness of the PNE. Steps 1 to 6 show that, at a PNE, the number of players with non-zero review admission rate and the slackness parameter $x$ are unique. Therefore, the first order conditions (8.34) give the unique review admission rate for each player $i$ for the unique slack parameter $x$, thereby implying the uniqueness of the PNE. $\square$
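The six-step argument above also suggests a direct one-dimensional computation of the PNE through its slackness parameter: for reviewing players the first order conditions (8.34) give $a_i \lambda^R_i = \min\{f_i(x)/\frac{d f_i}{d x}(x),\, a_i \mu^R_i\}$, each such term is increasing in $x$, and the balance $x + \sum_i a_i \lambda^R_i(x) = \mu^S_T$ can therefore be solved by bisection [120]. The Python sketch below illustrates this under the same assumed functional forms used earlier; the bracketing interval and the handling of corner cases are simplifications for illustration, not a general-purpose implementation.

import numpy as np

rS = 0.2
muS = np.array([1.0, 1.2, 1.5]); muR = np.array([0.8, 0.7, 0.5])
h = muS / muR; a = 1.0 + h; muS_T = muS.sum()

def f(x, i):                       # assumed incentive f_i(x) = rR(x)(1 - p(x)) - h_i*rS
    return (2.0 - 0.5 * x) * (1.0 - np.exp(-x)) - h[i] * rS

def dfdx(x, eps=1e-6):             # df_i/dx is the same for every i
    return (f(x + eps, 0) - f(x - eps, 0)) / (2 * eps)

def bisect(g, lo, hi, iters=200):  # root of g on [lo, hi] with a sign change
    s = np.sign(g(hi))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if np.sign(g(mid)) == s else (mid, hi)
    return 0.5 * (lo + hi)

x_bar = bisect(lambda x: -dfdx(x), 1e-6, muS_T)     # maximizer of f_i (df/dx = 0)

def load(x):                       # sum_i a_i*lam_i(x) from the first order conditions (8.34)
    return sum(min(f(x, i) / dfdx(x), a[i] * muR[i])
               for i in range(len(muS)) if f(x, i) > 0)

x_star = bisect(lambda x: x + load(x) - muS_T, 1e-6, x_bar - 1e-6)
lam_star = [min(f(x_star, i) / dfdx(x_star), a[i] * muR[i]) / a[i] if f(x_star, i) > 0 else 0.0
            for i in range(len(muS))]
print("PNE slack x* ~", round(x_star, 4), "; review rates ~", np.round(lam_star, 4))

With the assumed parameters, the slack found here matches the profile returned by the best-response iteration sketched earlier, as the uniqueness result predicts.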
Proof of Lemma 11 [Non-increasing best response]

We prove this lemma by considering the three cases of the best response mapping in Lemma 16 (Appendix C):

Case 1: $b_i = 0$. Recall that $x = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^R_i$. In this case, from statement (i) of Corollary 2, $f_i \leq 0$ and $f_i' < 0$ (equivalently, $\frac{d f_i}{d x} > 0$), for all $\lambda^R_i \in \mathcal{S}_i$. Since $x$ can be re-written as $x = \mu^S_T - a_i \lambda^R_i - \sigma_i(\lambda^R_{-i})$, $\frac{d f_i}{d x} > 0$ implies $\frac{\partial f_i}{\partial \sigma_i} < 0$. Hence, with an increase in $\sigma_i(\lambda^R_{-i})$, $b_i = 0$ remains the best response.

Case 2: $b_i = -\frac{f_i(b_i, \sigma_i(\lambda^R_{-i}))}{f_i'(b_i, \sigma_i(\lambda^R_{-i}))}$. In this case, from statement (ii) of Corollary 2, $b_i \in \bar{\mathcal{S}}_i$ such that $f_i > 0$ and $f_i' < 0$, for every $\lambda^R_i \in \bar{\mathcal{S}}_i$. Thus,

$$\frac{d b_i}{d \sigma_i} = \frac{-f_i'^2 + f_i'' f_i}{a_i f_i'^2} < 0. \quad (8.35)$$

Hence, $b_i$ is strictly decreasing in $\sigma_i(\lambda^R_{-i})$.

Case 3: $b_i = \mu^R_i$. Since $b_i \in \mathcal{S}_i = [0, \mu^R_i]$, $b_i$ either decreases or remains constant with an increase in $\sigma_i(\lambda^R_{-i})$. $\square$

Proof of Theorem 8 [Analytic bounds on PNE Inefficiency]

We first establish the analytic upper bound on the PoA, followed by upper bounds on $\eta_{TRI}$ and $\eta_{LI}$. Let $\mathcal{G}$ be the family of CPR games parameterized by the ratios of the maximum service and review admission rates of each player $i \in \mathcal{N}$. Therefore, the CPR game $\Gamma \in \mathcal{G}$, with the corresponding ratio for player $i$ given by $h_i$. Define a set of homogeneous CPR games $\mathcal{G}^H \subset \mathcal{G}$, in which each player has a constant ratio $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} =: h$, and $\min_i \{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$. The superscript $H$ is used to distinguish the maximum service and review admission rates of the homogeneous game $\Gamma^H$ from those of the CPR game $\Gamma$. We obtain the bounds on the PoA by comparing the social utility of the heterogeneous CPR game with that of a corresponding homogeneous game in which all players have the largest heterogeneity measure, i.e., $h_i = h_N$, for every $i \in \mathcal{N}$. For any CPR game in $\mathcal{G}$, the PoA is given by

$$PoA = \frac{(\Psi)_{SW}}{(\Psi)_{PNE}} = \frac{\big[\sum_{i=1}^{N} \mu^S_i r_S + \sum_{i=1}^{N} \lambda^R_i f_i(x)\big]_{SW}}{\big[\sum_{i=1}^{N} \mu^S_i r_S + \sum_{i=1}^{N} \lambda^R_i f_i(x)\big]_{PNE}}. \quad (8.36)$$

We now provide an analytic upper bound on the PoA for the CPR game $\Gamma$ using the following lemmas.

Lemma 18 (PNE solution for homogeneous CPR game). For any homogeneous CPR game $\Gamma^H \in \mathcal{G}^H$, such that for each player $i \in \mathcal{N}$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h$ and $\min_i \{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$, each player participates in the review process with equal review admission rate $\lambda^H_i = \lambda^H$ at the PNE. Let $\lambda^H_T$ be the total review admission rate at the PNE for $\Gamma^H$. The unique PNE solution is given by $\lambda^H = \frac{\lambda^H_T}{N}$, where $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$ and $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T$.

Proof. For the homogeneous CPR game $\Gamma^H$, each player has an equal incentive $f(x)$ to review the tasks. If $f(x) \leq 0$ at the PNE, all players have $\lambda^H_i = 0$ (statement (ii) of Corollary 1), which contradicts Assumption (A3). Hence, at the PNE, each player has $\lambda^H_i > 0$. Recall that $\min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$ for $\Gamma^H$, and that at the PNE, $x > 0$. Let $\mathcal{D} \subseteq \mathcal{N}$ be a non-empty set of player indices such that for any $i \in \mathcal{D}$, $\{\mu^R_i\}^H \leq -\frac{f(x)}{f'(x)}$. At the PNE,

$$\lambda^H_T = \sum_{i \in \mathcal{D}} \{\mu^R_i\}^H + \sum_{i \in \mathcal{N}\setminus\mathcal{D}} -\frac{f(x)}{f'(x)} \geq N \min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{1+h}.$$

Therefore, at the PNE, $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T \leq \{\mu^S_T\}^H - (1+h) N \min_i\{\{\mu^R_i\}^H\} \leq 0$, which is a contradiction. Hence, $\mathcal{D}$ is an empty set and each player has an equal review admission rate at the PNE, given by $\lambda^H_i = \lambda^H = \frac{\lambda^H_T}{N}$.
Hence, each player, being a maximizer of her expected utility, maximizes

$$\tilde{u}_i = \{\mu^S_i\}^H r_S + \frac{\lambda^H_T}{N} f(x), \quad (8.37)$$

where $x = \{\mu^S_T\}^H - (1+h)\lambda^H_T$. Setting $\frac{\partial \tilde{u}_i}{\partial \lambda^H_T} = 0$, we get $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$.

Lemma 19 (PoA = 1 for homogeneous CPR game). For any homogeneous CPR game $\Gamma^H \in \mathcal{G}^H$, such that for each player $i \in \mathcal{N}$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h$ and $\min_i\{\{\mu^R_i\}^H\} \geq \frac{\{\mu^S_T\}^H}{N(1+h)}$, PoA = 1.

Proof. For the homogeneous CPR game $\Gamma^H$, the social welfare function $\Psi^H$ in (7.15) depends only on $\lambda^R_T$ and is given by

$$\Psi^H = \{\mu^S_T\}^H r_S + \lambda^R_T f(x), \quad (8.38)$$

where $x = \{\mu^S_T\}^H - (1+h)\lambda^R_T$, and $f(x)$ is the uniform incentive function for each player. Note that $\frac{d \Psi^H}{d \lambda^R_T} > 0$ when $\frac{d f}{d x} \leq 0$, and $\frac{d^2 \Psi^H}{d (\lambda^R_T)^2} < 0$ in the interval where $\frac{d f}{d x} > 0$. It is easy to show that $\Psi^H$ is maximized by any $\lambda^R_T$ satisfying $\frac{d \Psi^H}{d \lambda^R_T} = 0$, obtained by setting $\lambda^R_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$, with $\frac{d f}{d x} > 0$ at the maximizer. Let $\lambda^H_T$ be the total review admission rate at the PNE for $\Gamma^H$. The unique PNE satisfies $\lambda^H_T = \frac{f(x)}{(1+h)\frac{d f}{d x}}$ (Lemma 18), and hence maximizes the social welfare utility, resulting in PoA = 1.

Corresponding to the CPR game $\Gamma$, construct a homogeneous game $\Gamma^H_N \in \mathcal{G}^H$ with $\{\mu^S_i\}^H = \mu^S_i$ and $\{\mu^R_i\}^H = \frac{\mu^S_i}{h_N}$, for each player $i \in \mathcal{N}$. Note that for the homogeneous players in $\Gamma^H_N$, $\frac{\{\mu^S_i\}^H}{\{\mu^R_i\}^H} = h = h_N$, and $\sum_{i=1}^{N} \{\mu^S_i\}^H = \mu^S_T$. Furthermore, the assumption $\min_i\{\mu^S_i\} > \frac{\mu^S_T h_N}{N(1+h_N)}$ implies $\min_i\{\{\mu^R_i\}^H\} > \frac{\{\mu^S_T\}^H}{N(1+h)}$. Hence, $PoA = 1$ for $\Gamma^H_N$ (Lemma 19).

To obtain analytic bounds on the PoA, we now compute a lower bound on the social utility obtained at the unique PNE, $\Psi^\Gamma_{PNE}$, for the CPR game $\Gamma$. In Lemma 20, we show that the social utility obtained at the PNE of the homogeneous game $\Gamma^H_N$, $\Psi^H_{PNE}$, lower bounds $\Psi^\Gamma_{PNE}$. For any $x \in [0, \mu^S_T]$, homogeneous players with ratio $h_N$ in $\Gamma^H_N$ have a lower cumulative incentive ($\sum f$) to review tasks than the players in $\Gamma$, and therefore a lower social utility at the PNE, i.e., $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE}$. We further lower bound $\Psi^\Gamma_{PNE}$ by computing a lower bound on $\Psi^H_{PNE}$.

Lemma 20 (Lower bound on social welfare at PNE). Let $\Gamma^H_N$ be a homogeneous game corresponding to the CPR game $\Gamma$, with each player $i \in \mathcal{N}$ having $\{\mu^S_i\}^H = \mu^S_i$ and $\{\mu^R_i\}^H = \frac{\mu^S_i}{h_N}$. Let $\Psi^\Gamma_{PNE}$ and $\Psi^H_{PNE}$ be the social welfare functions for $\Gamma$ and $\Gamma^H_N$, respectively, evaluated at their unique PNEs. Then $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x})$, where $\bar{x}$ is the unique maximizer of $f_i$, i.e., $\frac{d f_i}{d x}(\bar{x}) = 0$.

Proof. Let $\lambda^* = [\lambda^*_1, \dots, \lambda^*_N]$ (Corollary 1) and $\lambda^H = [\lambda^H, \dots, \lambda^H]$ (Lemma 18) be the unique PNEs of the CPR games $\Gamma$ and $\Gamma^H_N$, respectively. Let $x^* = \mu^S_T - \sum_{i=1}^{N} a_i \lambda^*_i$ and $x^H = \mu^S_T - N a_N \lambda^H$ be their slackness parameters at the respective PNEs.

Step 1: We show that $x^H \geq x^*$ using contradiction. Let $x^* > x^H$. Recall that at a PNE, $\frac{d f_i}{d x} > 0$ (Corollary 1). Using the strict concavity of $f_i$ (Lemma 15), we have $\frac{d f_i}{d x}(x^H) > \frac{d f_i}{d x}(x^*) > 0$. Therefore, $\frac{f_N(x^*)}{\frac{d f_N}{d x}(x^*)} > \frac{f_N(x^H)}{\frac{d f_N}{d x}(x^H)} = a_N \lambda^H_N$. Recall that $f_1(x) \geq \dots \geq f_N(x)$ for any $x$, and that $\frac{d f_i}{d x} = \frac{d r_R(x)}{d x}(1 - p(x)) - r_R(x) \frac{d p(x)}{d x}$ is independent of $i$. Hence, $\frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)} \geq \frac{f_N(x^*)}{\frac{d f_N}{d x}(x^*)} > a_N \lambda^H_N$, for any $i$. Using $h_N \geq h_i$, we get $a_i \mu^R_i = \mu^S_i + \frac{\mu^S_i}{h_i} \geq \mu^S_i + \frac{\mu^S_i}{h_N} = a_N \{\mu^R_i\}^H > a_N \lambda^H_N$. Therefore, $a_i \lambda^*_i = \min\big\{\frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)}, a_i \mu^R_i\big\} > a_N \lambda^H_N$, for any $i$. Hence, $\sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$ and $x^* < x^H$, which is a contradiction. Therefore, $x^H \geq x^*$ (equivalently, $\sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$) and $\frac{d f_i}{d x}(x^*) \geq \frac{d f_i}{d x}(x^H) > 0$.
Step 2: We show that $\sum_i f_i(x^*) \geq N f_N(x^H)$ and $\sum_{i=1}^{N} \lambda^*_i \geq N \lambda^H_N$. Let $d \leq N$ be the support of $\lambda^*$. Therefore, $\lambda^*_i = 0$ for any $i > d$, and $\lambda^*_i = \min\big\{\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)}, \mu^R_i\big\}$ for every $i \leq d$. Therefore,

$$\sum_{i=1}^{d} \frac{f_i(x^*)}{\frac{d f_i}{d x}(x^*)} \geq \sum_{i=1}^{d} a_i \lambda^*_i = \sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N = \frac{N f_N(x^H)}{\frac{d f_N}{d x}(x^H)}.$$

Using $\frac{d f_i}{d x}(x^*) \geq \frac{d f_N}{d x}(x^H) > 0$ (and that $\frac{d f_i}{d x}$ is independent of $i$), we get $\sum_{i=1}^{d} f_i(x^*) \geq N f_N(x^H)$. Additionally, since $a_i \leq a_N$, $a_N \sum_{i=1}^{N} \lambda^*_i \geq \sum_{i=1}^{N} a_i \lambda^*_i \geq N a_N \lambda^H_N$ implies $\sum_{i=1}^{d} \lambda^*_i \geq N \lambda^H_N$.

Step 3: We show that $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE}$. Using $h_N \geq h_i$, we have $\mu^R_i = \frac{\mu^S_i}{h_i} \geq \frac{\mu^S_i}{h_N} = \{\mu^R_i\}^H > \lambda^H_N$. Let $d_1 \leq d$ be the largest player index satisfying $\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)} > \lambda^H_N$. Since $\frac{f_N(x^*)}{a_N \frac{d f_N}{d x}(x^*)} < \lambda^H_N$, $d_1 < N$. Therefore, $\lambda^*_i$ satisfies

$$\lambda^*_i = \begin{cases} \min\big\{\frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)}, \mu^R_i\big\} > \lambda^H_N, & \text{for } i \leq d_1, \\ \frac{f_i(x^*)}{a_i \frac{d f_i}{d x}(x^*)} \leq \lambda^H_N, & \text{for } d_1 + 1 \leq i \leq d. \end{cases} \quad (8.39)$$

Hence,

$$\Psi^\Gamma_{PNE} = \mu^S_T r_S + \sum_{i=1}^{d} \lambda^*_i f_i(x^*) \stackrel{(1^*)}{=} \mu^S_T r_S + \lambda^H_N \sum_{i=1}^{d} f_i(x^*) + \sum_{i=1}^{d_1} (\lambda^*_i - \lambda^H_N) f_i(x^*) + \sum_{i=d_1+1}^{d} (\lambda^*_i - \lambda^H_N) f_i(x^*)$$
$$\stackrel{(2^*)}{\geq} \mu^S_T r_S + \lambda^H_N N f_N(x^H) + \sum_{i=1}^{d_1} (\lambda^*_i - \lambda^H_N) f_{d_1+1}(x^*) + \sum_{i=d_1+1}^{d} (\lambda^*_i - \lambda^H_N) f_{d_1+1}(x^*)$$
$$= \mu^S_T r_S + \lambda^H_N N f_N(x^H) + f_{d_1+1}(x^*) \sum_{i=1}^{d} (\lambda^*_i - \lambda^H_N) \stackrel{(3^*)}{\geq} \mu^S_T r_S + \lambda^H_N N f_N(x^H) = \Psi^H_{PNE},$$

where $(1^*)$ follows by adding and subtracting $\lambda^H_N \sum_{i=1}^{d} f_i(x^*)$; $(2^*)$ follows from $\sum_{i=1}^{d} f_i(x^*) \geq N f_N(x^H)$ (Step 2), (8.39), and the fact that $f_1(x^*) \geq \dots \geq f_N(x^*)$; and $(3^*)$ follows by recalling that $\sum_{i=1}^{d} \lambda^*_i \geq N \lambda^H_N$ (Step 2).

Step 4: We show that $\Psi^\Gamma_{PNE} \geq \Psi^H_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x})$. Let $\bar{x}$ be the unique maximizer of $f_i$. From Lemma 19, $\Psi^H_{PNE} = \Psi^H_{SW} = \max\{\Psi^H\} \geq \mu^S_T r_S + \lambda^R_T f_N(\mu^S_T - a_N \lambda^R_T)$ for any $\lambda^R_T$. Choosing $\lambda^R_T = \frac{\mu^S_T - \bar{x}}{a_N}$, we obtain the desired bound.

Proof of Theorem 8: The global optimum of the social welfare function is upper bounded by

$$\Psi^\Gamma_{SW} = \mu^S_T r_S + \max_{\lambda^R_i} \Big\{\sum_{i=1}^{N} \lambda^R_i f_i(x)\Big\} \leq \mu^S_T r_S + \max_{\lambda^R_i} \big\{\lambda^R_T \max_x\{f_i(x)\}\big\} \stackrel{(1^*)}{\leq} \mu^S_T r_S + \mu^S_T \max_x \{f_i(x)\} \leq \mu^S_T \big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big), \quad (8.40)$$

where $(1^*)$ is obtained using the system constraint $x > 0$, which implies $\mu^S_T \geq \sum_{i=1}^{N} a_i \lambda^R_i \geq \lambda^R_T$. From Lemma 20, $\Psi^\Gamma_{PNE}$ is lower bounded by

$$\Psi^\Gamma_{PNE} \geq \mu^S_T r_S + \frac{\mu^S_T - \bar{x}}{a_N} f_N(\bar{x}) = \frac{1}{a_N}\big(\bar{x}\, a_N r_S + (\mu^S_T - \bar{x})\big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big)\big) \geq \frac{1}{a_N}(\mu^S_T - \bar{x})\big(r_S + r_R(\bar{x})(1 - p(\bar{x}))\big). \quad (8.41)$$

Using (8.40) and (8.41), we get

$$PoA = \frac{\Psi^\Gamma_{SW}}{\Psi^\Gamma_{PNE}} \leq \frac{\mu^S_T\, a_N}{\mu^S_T - \bar{x}}.$$

Now we establish the bounds on $\eta_{TRI}$ and $\eta_{LI}$. Let $x_{PNE}$ and $x_{SW}$ be the slackness parameters corresponding to the PNE and the social welfare optimum, respectively. Recall that $\frac{d f_i}{d x} > 0$ (Corollary 1) for $x \in \{x_{PNE}, x_{SW}\}$. Hence, using the strict concavity of $f_i$, we have $x_{PNE}, x_{SW} \in (0, \bar{x})$. Therefore, $\mu^S_T - \bar{x} < \sum_{i=1}^{N} a_i \{\lambda^R_i\}_{PNE} < \mu^S_T$ and $\mu^S_T - \bar{x} < \sum_{i=1}^{N} a_i \{\lambda^R_i\}_{SW} < \mu^S_T$. Hence, $\eta_{TRI}$ and $\eta_{LI}$ are upper bounded by

$$\eta_{TRI} = \frac{(\lambda^R_T)_{SW}}{(\lambda^R_T)_{PNE}} < \frac{\mu^S_T\, a_N}{(\mu^S_T - \bar{x})\, a_1}, \quad \text{and} \quad \eta_{LI} = \frac{\big(\sum_{i=1}^{N} a_i \lambda^R_i\big)_{PNE}}{\big(\sum_{i=1}^{N} a_i \lambda^R_i\big)_{SW}} < \frac{\mu^S_T}{\mu^S_T - \bar{x}}. \quad \square$$
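The analytic bounds of Theorem 8 are straightforward to evaluate once the maximizer $\bar{x}$ of the common term $r_R(x)(1 - p(x))$ is known. The short Python sketch below evaluates the three bounds for assumed illustrative choices of $r_R$, $p$, and the player rates; these choices are not the ones used in Chapter 7.

import numpy as np

# Evaluate the PoA, eta_TRI, and eta_LI upper bounds of Theorem 8 under the same
# assumed functional forms and rates used in the earlier sketches.
rS = 0.2
muS = np.array([1.0, 1.2, 1.5]); muR = np.array([0.8, 0.7, 0.5])
h = muS / muR; a = 1.0 + h; muS_T = muS.sum()
order = np.argsort(h)                       # index players so that h_1 <= ... <= h_N
a1, aN = a[order[0]], a[order[-1]]

g = lambda x: (2.0 - 0.5 * x) * (1.0 - np.exp(-x))   # assumed rR(x)*(1 - p(x))
xs = np.linspace(1e-4, muS_T, 200001)
x_bar = xs[np.argmax(g(xs))]                # maximizer x_bar of f_i (independent of i)

poa_bound = aN * muS_T / (muS_T - x_bar)
eta_tri_bound = (muS_T * aN) / ((muS_T - x_bar) * a1)
eta_li_bound = muS_T / (muS_T - x_bar)
print("x_bar ~ %.3f, PoA <= %.3f, eta_TRI < %.3f, eta_LI < %.3f"
      % (x_bar, poa_bound, eta_tri_bound, eta_li_bound))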