ADVERSARIAL MODELING IN GAME-THEORETIC FRAMEWORKS FOR SECURING
CYBER-PHYSICAL SYSTEMS

By

Sandeep Banik

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Electrical and Computer Engineering—Doctor of Philosophy

2023

ABSTRACT

In an era where intricate cyber and physical systems are integrated into daily life, the imperative

for controlling and optimizing them emerges as a critical avenue. This avenue addresses global

challenges such as climate change, healthcare equity, security, and resilience, shaping the swiftly

evolving concepts of resource sharing and the shared economy. This complexity is particularly

evident in critical infrastructures, spanning electrical grids, building management systems, solar

farms, autonomous vehicles, and other Cyber-Physical Systems (CPS). Furthermore, securing

CPS through decision-making inherently involves collaboration or competition, engaging multiple

stakeholders with diverse perspectives and interests.

Game theory provides a powerful analytical framework to model strategic conflicts among

decision makers to assure security and resilience. The realm of security methodologies in CPS

is vast and has garnered considerable attention over the past two decades. Different adversarial

models impact CPS in various ways by targeting one or multiple security attributes. Therefore, a

fundamental aspect of securing a CPS is the characterization of the adversary type and developing

corresponding defense strategies.

In the first part of the thesis, we introduce a game-theoretic decision-making framework that

captures the interaction between a defender and two types of adversaries: a deterministic adver-

sary, and a stochastic adversary capable of both benign and adversarial actions. We analyze this

framework under different information structures and focus on characterizing the Nash equilibrium

of the game, particularly emphasizing closed-form solutions. We illustrate how this framework

can be applied in various domains such as path planning, motion planning, and in the context of

resilient estimation.

In the second part of the thesis, we design a game-theoretic framework to encompass state-

dependent decision-making and develop defensive strategies against an adversary capable of a

complete takeover of a dynamical system. We employ tools from optimization, control theory, and

backward induction to solve for the takeover strategies and control policies of both players. We

demonstrate the application of this framework in linear dynamical systems. Finally, we present a

domain-aware data-driven framework to determine defensive strategies by simulating an adversary

in a high-fidelity CPS. We illustrate the application of this data-driven framework in a smart

building system. In conclusion, we discuss potential future extensions and the integration of the

game-theoretic framework with the data-driven approach.

Copyright by
SANDEEP BANIK
2023

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to Prof. Shaunak D. Bopardikar for affording me the

invaluable opportunity to embark on this Ph.D. journey. His unwavering support has served as a

fount of inspiration and a wellspring of innovative ideas. Engaging discussions with him have been

both stimulating and enlightening, consistently providing deeper insights into the topics at hand.

My sincere thanks extend to Prof. Hayder Radha, Prof. Betty H.C. Cheng, and Prof. Bahare

Kiumarsi for their meticulous review of my thesis and for their roles on my Ph.D. committee. Their

constructive feedback has significantly contributed to broadening the horizons of my work.

The accomplishments detailed in this thesis are the fruit of collaboration with exceptional

individuals.

I am particularly indebted to Arnab Bhattacharya and Thiagarajan Ramachandran

from the Pacific Northwest National Laboratory (PNNL) for their joint efforts in the second part

of this thesis. Their mentorship and patience have been invaluable, not only in the thesis but also

during my enriching summer internship at PNNL.

Reflecting on my years at MSU, the transformative journey has been shaped by remarkable

individuals. Joining MSU as graduate students in 2019, Sankhadeep Basu and Aakash Khandelwal

were not just roommates but cherished companions in unforgettable movie nights, pizza indulgences,

and game sessions. The camaraderie extended to weekend coffees, the madness of preparing weekly

meals in a single evening, Friday binges, birthday celebrations, ping-pong games, and the ambitious

yet elusive gym plans. The shared experiences, summer walks, and festive celebrations are etched

in my memory. Likewise, Shivam Bajaj, with his camaraderie in coffee breaks, shared lunches,

Spartan village strolls, and whimsical whiteboard doodles, has left an indelible mark. I also extend

my gratitude to labmates Christopher Calle, Ethan Lau, Bhargav Jha, Pouria Tooranjipour, and

Richard Frost for their support and guidance during challenging times.

I owe immeasurable thanks to my family for their unwavering support. To Chetan M Rao, my

childhood friend, words cannot capture the depth of gratitude for the shared trips, laughter, and

meals. To Shivangi Agarwal, my wife, your steadfast support during tough times has been my

anchor. Our shared interests in travel, specialty coffee, South Indian cuisine, and TV series have

v

been a source of bliss. Ishani Banik, my sister, has been my inspiration since my bachelor’s degree,

motivating me to pursue research and offering invaluable lessons. My heartfelt appreciation goes

to my parents, whose unwavering faith and encouragement propelled me toward higher education.

In conclusion, I acknowledge the support received from the NSF Award CNS-2134076 under

the Secure and Trustworthy Cyberspace (SaTC) program and the NSF CAREER Award ECCS-

2236537, which played a pivotal role in advancing this research.

vi

LIST OF TABLES .

. .

LIST OF FIGURES .

. .

.

.

.

.

.

.

.

.

.

.

TABLE OF CONTENTS

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER 1

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.1 Part I: State Independent Adversarial Models
1.2 Part II: State Based Adversarial Models

.
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .

x

1
2
4

CHAPTER 2

.

.

.

.

Introduction .

DETERMINISTIC ADVERSARY - STOPPING STATE GAMES AND
THEIR APPLICATION TO PATH PLANNING . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6
2.1
6
2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
. 15
2.3 Full Information Edge Game with a Termination Threshold . . . . . . . . . .
2.4 Full Information Edge Game with an Arbitrary Termination Threshold . . . . . 20
. 23
2.5 Partial Information Edge Game with a Termination Threshold . . . . . . . . .
2.6 Solution of the Meta-game . . . . . . . . . . . . . . . . . . . . . . . . . . .
. 29
2.7 Robotic Simulation of PIE-game . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Summary . .
. 42
2.9 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. .

. .

CHAPTER 3

.

.

.

.

Introduction .

STOCHASTIC ADVERSARY - STOCHASTIC STOPPING STATE
GAMES AND THEIR APPLICATION TO MOTION PLANNING . . . 47
3.1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Solution to the M-SSG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Solution to SSG𝑚×𝑛 .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
.
.
3.5 Application .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
.
.
.
3.6 Summary .

. .
.
.

.
.

CHAPTER 4

.
.

.
.

.
.

Introduction .

TAKEOVER ADVERSARY - FLIPDYN: RESOURCE TAKEOVER
GAMES .
.

. 88
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3 FlipDyn for general systems
. . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 FlipDyn for LQ Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5 FlipDyn-Con for LQ Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 118
. 145
4.6 Summary . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. .

. .

CHAPTER 5

.

.

.

DATA-DRIVEN ADVERSARIAL MODEL . . . . . . . . . . . . . . .

5.1
.
Introduction .
5.2 Model Formulation .
5.3 Solution Approaches
5.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5 Summary . .

. 147
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
. 171
. 179

. .

. .

vii

CHAPTER 6

FUTURE DIRECTION . . . . . . . . . . . . . . . . . . . . . . . . . . 180

BIBLIOGRAPHY . .

. .

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

viii

LIST OF TABLES

Table 2.1 Performance of the meta-game vs. Algorithm 1 averaged over 100 runs with

an average degree between [2,3] for every vertex of the roadmap. . . . . . . . . . 36

Table 2.2 Performance of the meta-game vs. Algorithm 1 averaged over 100 runs in a

fully connected roadmap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Table 5.1 Cyber exploits and their corresponding probability of success.

. . . . . . . . . . 174

Table 5.2 Description of the variables in the building model.

. . . . . . . . . . . . . . . . 174

ix

LIST OF FIGURES

Figure 1.1

Illustration of attacks on a CAV. The dotted line and dashed lines represents
potentials spots for a cyber-attack. The solid line represent an authentic user
communicating with the CAV. . . . . . . . . . . . . . . . . . . . . . . . . . . .

2

Figure 2.1 The full information edge-game with 𝐿 = 1, along the edge 𝜈𝜉 for given
graph with 𝑘 𝜈𝜉 as the number of stages. The information set for the adversary
and defender is indicated by the dotted line and nodes taking on a value
𝑉𝑖 for 𝑖 ∈ {0, 1, . . . 𝐾𝑒 − 1}, respectively. Actions of the adversary (resp.
defender) abbreviated as {𝐴, 𝑁 𝐴} (resp. {𝐷, 𝑁 𝐷}) for {Attack, No attack}
(resp. {Defend, No defend}). The stopping state is indicated by 𝑆𝑆. . . . . . . . 11

Figure 2.2 The partial information edge-game along edge 𝜈𝜉 for the given graph. The
information set for the adversary and defender is given by the dotted line
between the nodes of a stage and on nodes respectively, indicating the uncer-
tainty of each player. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Figure 2.3

Figure 2.4

(a) Value of a FIE-game vs. stages 𝐾𝑒 for a given set of 𝑟2 and 𝑟1 with a
termination threshold of to 𝐿 = 1. (b) Policy of the defender at 𝑘 = 0 of
a FIE-game vs. stages 𝐾𝑒 for the same set of 𝑟2 and 𝑟1 with a termination
threshold of to 𝐿 = 1.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

(a) Policy of the adversary at the start (𝑘 = 0) of a FIE-game vs. stages 𝐾𝑒
for the same conditions as in Figure 2.3a. (b) Percentage error between the
approximate value (equation (2.16)) and recursive value (equation (2.14)) of
the FIE-game with 𝐿 = 1 for a set of game parameters. . . . . . . . . . . . . .

. 20

Figure 2.5 The FIE-game along with termination threshold of 𝐿 = 2. The dynamic
game shown can tolerate an action pair of {Defend, Attack} twice followed
by disabling the adversary. The information set for the adversary and de-
fender is indicated by the dotted line and nodes taking on a value 𝑉 𝑗
for
𝑖
𝑖 ∈ {0, 1, . . . 𝐾𝑒 − 1} and 𝑗 ∈ 0, 1, . . . , 3 respectively. The actions of the
adversary (resp. defender) is abbreviated as {𝐴, 𝑁 𝐴} (resp. {𝐷, 𝑁 𝐷}) for
{Attack, No attack} (resp. {Defend, No defend}). The termination states are
denoted by 𝑆𝑆 (blue colored node).

. . . . . . . . . . . . . . . . . . . . . . . . 21

Figure 2.6

Figure 2.7

(a) Value of the FIE-game across multiple 𝑟2, stages 𝐾𝑒 and termination
threshold 𝐿 for given 𝑟1 = 1.5. (b) Value of the FIE-game across multiple 𝑟2,
stages 𝐾𝑒 and termination threshold 𝐿 for given 𝑟1 = 3.0.

. . . . . . . . . . . . 24

(a) Policy of the defender at start stage (𝑘 = 0) of FIE-game for increasing 𝑟2
and the number of attacks 𝐿 across given stages 𝐾𝑒.(b) Policy of the adversary
at start stage (𝑘 = 0) of FIE-game for increasing 𝑟2 and the number of attacks
𝐿 across given stages 𝐾𝑒.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

x

Figure 2.8 The PIE-game with a termination threshold of 𝐿 = 1 with 𝐾𝑒 = 2. The
leaf node 𝑆𝑆 represents the stopping state. The dotted line indicates the
information set for the corresponding player. The notation 𝛼𝑘 (resp. 𝛽𝑘 )
represents information set for the defender (resp. adversary) for the stage 𝑘 ∈
1, 2. The value of each leaf node is represented by 𝑄𝑚 for 𝑚 ∈ {1, 2, . . . , 8}.
The leaf node values are presented in Section 2.5.1.

. . . . . . . . . . . . . . . 25

Figure 2.9

Illustration of the PIE-game matrix 𝐴 structure for any given number of
stages, 𝐾𝑒. The solid square blocks indicate the leaf node entries, the triangle
blocks indicate the solution from the preceding stage game, the diamond block
indicates the value 𝑉𝐾𝑒 with 𝐾𝑒 = 2, and the empty space indicates zeros. For
a given number of stages 𝐾𝑒, the game matrix 𝐴 is recursively solved from
𝐴(2) to 𝐴(𝐾𝑒). .

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Figure 2.10 (a) Value of the PIE-game for a set of 𝑟1 and 𝑟2 for given stages 𝐾𝑒. (b) Policy
of the adversary with the attack action at the start stage (𝑘 = 0) of a PIE-game
for given 𝑟1, 𝑟2, and stages 𝐾𝑒.

. . . . . . . . . . . . . . . . . . . . . . . . . . 30

Figure 2.11 (a) Policy of the defender with defend action at the start stage (𝑘 = 0) of a
PIE-game for given 𝑟1, 𝑟2, and stages 𝐾𝑒.(b) The value of a PIE-game and
FIE-game vs. stages 𝐾𝑒 for the same set of 𝑟1 and 𝑟2. . . . . . . . . . . . . . . . 30

Figure 2.12 (a) Illustration of a simple graph with 3 vertices and 3 edges. The start and
end vertex is indicated with 𝜈 and 𝜉 respectively. The number of stages
between the nodes 𝑖 and 𝑗 are given by 𝑘𝑖, 𝑗 . (b) The simple network (figure
2.12a) with stages over the edge, 𝑘 𝜈1 = 𝑘1𝜉 = 3 and 𝑘 𝜈𝜉 = 6. The shortest
path is calculated over the edge weights. (c) The solution of the simple graph
meta-game with the defender probabilities over the paths. The shortest path
is indicated with a larger arrow as compared to others and with lighter shade
of vertex. (d) The solution of the simple graph meta-game with the adversary
probabilities over the edges. . . . . . . . . . . . . . . . . . . . . . . . . . . .

. 32

Figure 2.13 (a) The sensitivity of choosing the shortest path (𝜋def) with changing 𝑟1 and 𝑟2
with fixed stages over each edge. (b) The sensitivity of choosing the shortest
path edge (𝑒att) with changing 𝑟1 and 𝑟2 with fixed stages over each edge (c)
The sensitivity of choosing the shortest path (𝜋def) with changing number
of stages over the edges 𝐾𝜋𝑆1,1𝑇 and 𝐾𝜋𝑆𝑇 given a fixed stage cost. (d) The
sensitivity of choosing the shortest path edge (𝑒att) with changing number of
stages over the edges 𝐾𝜋𝑆1,1𝑇 and 𝐾𝜋𝑆𝑇 given a fixed stage cost. . . . . . . . . . . 33

xi

Figure 2.14 (a) A graph consisting of 10 nodes which is sparsely connected. The output
of Algorithm 1 is path 1 − 10. Of all paths available, the path 1 − 2 − 10 has
highest likelihood of getting selected. (b) Edge 1 − 10 has the least chance of
being attacked, while edge 2 − 10 has the highest chance of getting attacked.
(c) The probability of choosing the shortest path for graphs with an average
vertex degree in the interval [2, 3]. (d) Probability of choosing the shortest
path in a fully connected graph.

. . . . . . . . . . . . . . . . . . . . . . . . .

. 34

Figure 2.15 Illustration of a vehicle attacked from extended view while performing a V2V

or V2X communication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Figure 2.16 (a) An attack realized on ROS with Turtlebot3 burger. The attack is obstacles
(vehicles) in formation causing a larger deviation in normal trajectory. (b)
Influence of an attack (obstacle) on the deviation of path causing an increase in
time to destination (security loss). (c) The PIE-game evaluated experimental
and expected theoretical value of the PIE-game.

. . . . . . . . . . . . . . . . . 39

Figure 2.17 (a) The initial position of the robot with the planned trajectory indicated
(b) Attack along the trajectory causing a change in
with the dotted line.
deviation along the covered and planned trajectory. The covered trajectory is
represented by the solid line. (c) The change in trajectory after the defender
has intercepted the attack and recovery of the planned trajectory.(d) The final
position of the robot with the covered trajectory indicated in solid line.

. . . .

. 39

Figure 2.18 Illustration of a vehicle attacked from extended view under a V2V or V2X

communication in a single/two lane road. . . . . . . . . . . . . . . . . . . . . . 40

Figure 2.19 (a) PIE-game on single traffic lane attack scenario. The figure above is a
simulation with TurtleBot3 burger in a Gazebo environment and the one
below is the corresponding experimental setup. (b) Velocity profile of the
vehicle under different actions such as A(resp. NA) as attack (resp. no attack)
and D(resp. ND) as defend(resp. no defend). . . . . . . . . . . . . . . . . . . . 41

Figure 2.20 (a) The value of the PIE-game evaluated in ROS using TurtleBot3 burger and
gazebo over multiple simulations and compared with the expected value of the
PIE-game. (b) The value of the PIE-game evaluated in ROS with TurtleBot3
burger over multiple experiments and compared with the expected value of
the PIE-game.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

.

.

.

.

Figure 2.21 Illustration of a vehicle attacked from extended view under a V2V or V2X
communication in a single/two lane road. The dotted line represent the
planned trajectory and the solid line represents the trajectory followed. The
solid blocks around the robot at all the time instances represent the boundary
and the solid block in from of the robot at time instant 7.89 s represents a
spoofed trailer. .

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

xii

Figure 3.1 An M-SSG consisting of 𝐾 stages a termination threshold of 𝐿 = 2. The in-
formation set for the defender and second player is indicated by the dotted line
and nodes taking value 𝑉 𝑖
𝑘 for 𝑘 ∈ {1, 2, . . . 𝐾 }, 𝑖 ∈ {0, 1, . . . , 𝐿}. The value
of an M-SSG under an adversarial intent is indicated by 𝑉 𝑘 , 𝑘 ∈ {0, . . . , 𝐾 }
(see Remark 3.2.1). At every stage, the game branches with probability 𝜌𝑘
to indicate an adversarial player. Actions of an adversary (resp. defender)
abbreviated as {𝐴, 𝑁 𝐴} (resp.
{𝑆𝐷, 𝑊 𝐷}) for {Attack, No attack} (resp.
{Strong Defense, Weak defense}). SS indicates the stopping state.

. . . . . . . 53

Figure 3.2 An SSG𝑚×𝑛 refers to a stochastic stopping state game where there are 𝑚
possible actions for the defender and 𝑛 possible actions for the second player.
At stage 𝑘, when the game diverges with a probability of 1 − 𝜌𝑘 , it signifies a
non-adversarial player scenario where solely the actions of the defender are
applicable.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

.

.

.

.

.

.

.

Figure 3.3

Figure 3.4

Figure 3.5

(a) Value of an M-SSG vs. edge-game (𝜌 = 1) over stages 𝐾 for ˜𝑠2,𝑘 = 0.3
and ˜𝑠1,𝑘 = 1.25, ∀𝑘 ∈ {1, 2, . . . , 𝐾 } with termination threshold of 𝐿 = 2 and
4. (b) Probability parameter ˜𝜌𝐿

𝑘 for the same set of parameters.

. . . . . . . . . 64

(a) Probability of choosing strong defense action when 𝑖 = 𝐿 for the M-SSG
and edge-game (𝜌 = 1), solved for 𝐾 = 20 using the same set of parameters,
˜𝑠2,𝑘 , ˜𝑠1,𝑘 , and 𝐿. (b) Probability of choosing attack action when 𝑖 = 𝐿 for
the M-SSG and edge-game (𝜌 = 1) with the same parameters of stages 𝐾,
˜𝑠2,𝑘 , ˜𝑠1,𝑘 , and 𝐿.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

.

.

.

.

(a) Nash Equilibrium policy of the defender (rows 2,3 and 4) for an SSG𝑚×𝑛
for a range of 𝜌 solved over a total of 𝐾 = 20 stages. The SSG𝑚×𝑛 was
solved with stage cost matrix entries 𝑠1,𝑘 = 1.0, 𝑠2,𝑘 = 1.2 and 𝑠3,𝑘 = 0.3, 𝑘 ∈
{1, 2, . . . , 𝐾 }. (b) Nash Equilibrium policy of the second player (column 1
and 4) for the same stage cost parameters and total number of stage. . . . . . . . 72

Figure 3.6

(a) Ego and non-ego vehicle policy averaged over 50 experiment runs for
𝐾 = 25 with 𝜌 = 0.25. (b) Simulated ego and non-ego vehicle policy with
defined stage costs and 𝜌 = 0.25.

. . . . . . . . . . . . . . . . . . . . . . . . . 73

Figure 3.7

(a) Sample policy of the defender and attack for a given experimental run. (b)
Sampled and expected value of the SSG compared with the theoretical value
of the SSG. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

.

.

.

.

.

Figure 3.8

Illustration of trajectory under scaled and unscaled control input 𝛾𝑢𝑘 and 𝑢𝑘
respectively for a finite horizon 𝑇. . . . . . . . . . . . . . . . . . . . . . . . . . 76

Figure 3.9

(a) Simulated trajectory of an ego and non-ego vehicle in a lane change
scenario over 50 time steps with 𝜌 = 0.1 and 𝜌 = 1.0, and a final time of
25s. (b) Expected trajectory of the ego and non-ego vehicle averaged over 50
experiment runs, with 50 time steps for corresponding parameters of 𝜌.

. . . . 77

xiii

Figure 3.10 (a) Ego and Non-ego vehicle policy averaged over 50 experiment runs with a
nominal speed of 0.15 m/s and 0.18 m/s. (b) Ego and non-ego vehicle policy
averaged over 50 simulation runs for the same set of nominal speeds (0.15
and 0.18 m/s).

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

.

.

.

Figure 3.11 (a) Simulated policy of an ego and non-ego vehicle (possible adversary) in
lane change scenario over a range of sample time with 𝜌 = 0.1. (b) Simulated
policy of an ego and non-ego vehicle for the same scenario over a range of
nominal speeds with 𝜌 = 0.1.

. . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Figure 3.12 A typical feedback control system with an estimator. The control law is a
function of the estimates. The estimator performance is dependent on the
channel used to communicate the data observed from a sensor. An adversary
might be present in the feedback loop impacting the performance of the
estimates by injecting noise on different channels.

. . . . . . . . . . . . . . . . 82

Figure 3.13 Value of the SSG𝑚×𝑛 for a range of fixed probability 𝜌 solved over a total of

𝐾 = 20 stages with a engagement budget of 𝐿 = 1.

. . . . . . . . . . . . . . . . 84

Figure 3.14 (a) Probability of the defender actions; defense 2,3 and 4 (rows 2,3 and 4) for
an SSG𝑚×𝑛 for the corresponding probability and stages 𝐾. (b) Probability of
the second player actions; attack 1,2 and 3 (column 1, 2 and 3) for the same
SSG𝑚×𝑛 for the corresponding probability and stages 𝐾.

. . . . . . . . . . . . . 84

Figure 4.1

Figure 4.2

Figure 4.3

Figure 4.4

(a) Closed-loop system with adversaries present at various locations infecting
the reference values, actuator, plant, measurement output and control input.
(b) Closed-loop system with the adversary present between the controller and
actuator trying to takeover the control signals. The takeover action at time 𝑘
of the defender (resp. adversary) is given by 𝜋0
𝑘 ). A FlipIt is setup
over the control signal between the defender and adversarial control.

𝑘 (resp. 𝜋1

. . . . . . 89

(a) Coefficient of the parameterized value function, p0 and p1 for a 1-
dimensional system where the state is bounded (𝐹 ≤ 1) over a horizon length
of 𝐿 = 50. (b) Attack and defense policy corresponding to the value function
in Figure 4.2a for the given set of costs. . . . . . . . . . . . . . . . . . . . . . . 109

(a) Coefficient of the parameterized value function, p0 and p1 for an un-
bounded (𝐹 ≥ 1) 1-dimensional system with a horizon length of 𝐿 = 50. (b)
Policy of defense and attack for the obtained parameterized value function
indicated in Figure 4.3a. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. 109

(a) Minimum eigenvalue of the saddle-point value parameters, 𝜆𝑛 ( ˆ𝑃0) and
𝜆𝑛 ( ˆ𝑃1), given 𝑒 = 0.99 over a horizon length of 𝐿 = 100.
(b) Takeover
strategies corresponding to the saddle-point value in Figure 4.4a for a given
initial state 𝑥1 and FlipDyn state 𝛼𝑘 = 0, ∀𝑘 ∈ K. . . . . . . . . . . . . . . . . . 116

xiv

Figure 4.5

(a) Minimum eigenvalue of the saddle-point value parameters, 𝜆𝑛 ( ˆ𝑃0) and
𝜆𝑛 ( ˆ𝑃1), given 𝑒 = 1.01 over the same horizon length of 𝐿 = 100. (b) Takeover
strategies corresponding to the saddle-point value in Figure 4.5a for a given
initial state 𝑥1 and FlipDyn state 𝛼𝑘 = 0, ∀𝑘 ∈ K.

. . . . . . . . . . . . . . . . 117

Figure 4.6 Saddle-point value parameters p𝑖

𝑘 , 𝑘 ∈ {1, 2, . . . , 𝐿}, 𝑖 ∈ {0, 1} for state tran-
sition constant (a) 𝐸 = 0.85, (b) 𝐸 = 1.0. The parameters p𝑖
𝑘 ,M-NE corre-
sponds to the parameters of the saddle-point under a mixed NE takeover over
the entire time horizon.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Figure 4.7 Defender takeover strategies 𝛽𝑘 and adversary takeover strategies 𝛾𝑘 for state
transition (a) 𝐸 = 0.85 and (b) 𝐸 = 1.0. M-NE corresponds to the mixed NE
policy.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

.

.

.

.

.

.

.

.

Figure 4.8 Maximum eigenvalues (𝜆1(𝑃

𝛼
𝑘 , 𝑘 ∈
{0, 1, . . . , 𝐿 + 1}, 𝛼 ∈ {0, 1} for state transition constant (a) 𝑒 = 0.85, (b)
𝑒 = 1.0.

𝛼
𝑘 )) of saddle-point value parameters 𝑃

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

.

.

.

.

.

.

.

.

Figure 4.9 The parameters 𝑃𝑖

𝑘 ,M-NE corresponds to saddle-point value parameter re-
cursion under a mixed NE takeover over the entire time horizon. Defender
takeover strategy 𝛽𝑘 and adversary takeover strategy 𝛾𝑘 for state transition (a)
𝑒 = 0.85 and (b) 𝑒 = 1.0. M-NE corresponds to the mixed NE policy. . . . . . . 143

Figure 5.1 A hybrid attack graph for a single-zone building with four cyber nodes (in
red) and one physical node (in blue) [28]. An adversary infiltrates the leaf
node (node 1) and progressively secures additional security attributes (nodes
2-4) before attacking the zone temperature controller by perturbing sensor
measurements at the root node 5.

. . . . . . . . . . . . . . . . . . . . . . . . . 153

Figure 5.2 Switching control graph with nodes 1, 4 and 6 representing adversary control,

and nodes 2, 3, 5 representing defender control.

. . . . . . . . . . . . . . . . . 170

Figure 5.3

Figure 5.4

(a) An HAG inspired from a ransomware attack graph [126]. The source
node 1 is represented by the dashed circle and the physical node (sink node)
9 is represented by concentric circles. (b) Trajectories of Zone 1 temperature
(Zone 1) along with the outside air temperature (Outside T) over a year with
upper (T max) and lower temperature (T min) comfort bounds.

. . . . . . . . . 173

(a) Defender’s (𝐽def) and Attacker’s (𝐽att) objective with a hardening cost
factor 𝑑𝑒 := 0.1, where min(𝑖) is defined as the 𝑖𝑡ℎ argument minimum of
𝐽def/att. (b) Defender’s (𝐽def) and Attacker’s (𝐽att) objective with a hardening
cost factor of 𝑑𝑒 := 0.5. (c) Defender’s (𝐽def) and Attacker’s (𝐽att) objective
with a hardening cost factor 𝑑𝑒 := 1.

. . . . . . . . . . . . . . . . . . . . . . . 176

xv

Figure 5.5

Figure 5.6

(a) Average time steps required to reach the physical node for the adversary
for the hardening factor of 𝑑𝑒 := 0.1.
(b) Average time steps to reach the
physical node with 𝑑𝑒 := 0.5 (c) Average time steps to reach the physical
node with 𝑑𝑒 := 1.0.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

.

.

(a) Cyber exploits weights obtained from the result of Algorithm 3 with a
cyber cost factor of 𝑑𝑒 = 0.1, 0.5 and 1.0. (b) Time to reach the physical node
9 for varying hardening cost factor and compared with the expected time to
reach (𝐽AMC) obtained from (5.11).

. . . . . . . . . . . . . . . . . . . . . . . . 178

Figure 5.7 Sample node trajectory obtained from an attack policy with a hardening cost
factor of (a) 𝑑𝑒 = 1.0, (b)𝑑𝑒 = 0.5, and (c) 𝑑𝑒 = 0.1, where with null action
corresponding to no action taken by the adversary.

. . . . . . . . . . . . . . . . 178

Figure 5.8 Discomfort corresponding under the optimal policy {𝜋∗, 𝜶∗} for the hardening

cost factor (a) 𝑑𝑒 := 1.0, (b) 𝑑𝑒 := 0.5, and (c) 𝑑𝑒 := 0.1. . . . . . . . . . . . . . 179

xvi

CHAPTER 1

INTRODUCTION

Decision making is an integral part of our daily lives, and it plays a crucial role in shaping the

outcomes we experience. Whether we are individuals, organizations, or governments, we are

constantly faced with a myriad of choices that have the potential to impact our lives and the lives

of those around us. The complexity of decision making becomes even more pronounced in the real

world, where we are often confronted with multifaceted problems that require careful analysis and

consideration.

Real-world decision making encompasses a wide range of contexts, including business, health-

care, finance, public policy, and security. In these domains, decision makers must navigate through

a maze of uncertainties, conflicting objectives, limited resources, and evolving circumstances.

This complexity is particularly evident in critical infrastructures such as electrical grids, building

management systems, solar farms, autonomous vehicles, and other Cyber-Physical Systems (CPS).

Furthermore, real-world decision making is inherently collaborative or competitive, involving mul-

tiple stakeholders with diverse perspectives and interests. To better understand and navigate the

strategic interactions among decision makers, game theory provides a powerful framework. By

analyzing the choices and behaviors of individuals or organizations in situations where the outcome

of each participant’s decision depends not only on their own actions but also on the actions of others,

game theory offers valuable insights into decision-making processes.

In this thesis, we address decision-making related to the security of CPS using tools from game

theory, control theory, and data-driven methods against diverse adversarial models. We split the

thesis into two parts. In part one, we focus on state independent adversaries, where there is no

explicit dependence on the underlying state of a CPS. We characterize defense strategies for this type

of adversarial model and demonstrate its application. In part two, we develop defensive strategies

against adversarial models which take into account the underlying dynamics of the systems.

1

1.1 Part I: State Independent Adversarial Models

Attacks in path planning are inspired by two specific models. The first is when an adversary

can replace the messages received by a vehicle (man-in-the-middle (MITM) or communication

attack [71]). The second is when the adversary spoofs the perception module of a vehicle to

introduce fake obstacles in the vehicle’s occupancy grid [34]. Adversarial methods have been used

to fool a multi-object tracking system, thereby causing a large deviation in vehicle’s trajectory

and impacting its safety [72]. The vulnerabilities in LiDAR-based perception architectures have

been targeted via spoofing attacks with a 80% mean success rate [138]. The vehicle can verify

authenticity of the messages through advanced encryption methods to detect an MITM attack

or can query the infrastructure (V2I) under a perception-based attack. However, doing so will

consume resources (energy or bandwidth) and introduce delays. Therefore, it is important to

find the right energy allocation between security and mobility. An illustration of different attacks

on autonomous vehicles is shown in Figure 1.1.

In Chapter 2, we study the problem of path

Figure 1.1 Illustration of attacks on a CAV. The dotted line and dashed lines represents potentials
spots for a cyber-attack. The solid line represent an authentic user communicating with the CAV.

planning on a graph, where a defender endeavors to chart an optimal course from a source to

a destination vertex. This endeavor unfolds in the presence of a deterministic adversary. We

frame this scenario as a zero-sum multi-stage game played over an edge, wherein the defender and

adversary engage in a strategic dance with a crucial element known as the stopping state. The

analysis unfolds under two distinct information paradigms: full information, granting each player

complete insight into the opponent’s past actions, and partial information structure, wherein the

defender gains a comprehensive understanding of the adversary’s actions only upon deploying the

2

Vehicle to  Vehicle attack - V2VVehicle to  Infrastructure attack - V2I countermeasure. We characterize the Nash equilibrium for this edge-game under both information

structures, accounting for the fact that each player possesses two strategic actions. The fundamental

advantage of this chapter lies in the construction of a meta-game for determining a path resilient

to attacks. This endeavor is juxtaposed against a novel heuristic, with a particular emphasis on

scenarios where the number of attacked edges is constrained.

Ascertaining adversarial behavior can be thought of as nature choosing one sub-tree of a

game at every instance. Bayesian games [50, 57, 112, 65, 153] are used to capture the impact of

nature and multiple types of adversaries. The application of Bayesian games extends to discerning

deceptive actions by adversaries and employing defensive deception techniques as a deterrent [65].

Notably, Garnaev et al. [57] delve into the game type played by an adversary, whether simultaneous

or sequential, aiming to maximize defender payoff. Horák et al.[63] introduced the concept of

cyber deception as a partially observable stochastic game, incorporating one-sided information to

evaluate the robustness of defense strategies. Chen et al.[37] proposed a stochastic game within

a data-fusion framework for asymmetric threat detection and prediction, relying on advanced

knowledge infrastructure. Salem et al. [128] present a two-state stochastic game involving a user

aiding a sophisticated intrusion detection system (IDS) and an adversary capable of launching

eavesdropping or jamming attacks, providing explicit solutions and numerical results.

In the

context of honeypot modeling, Tian et al. [141] frame it as a Bayesian game, determining Nash

equilibrium strategies under resource constraints. In the next chapter, we assume nature impacts

the game not in a Bayesian setting, but rather in governing the presence or absence of an adversary

at every stage of the game.

In Chapter 3, we introduce the assumption that the defender wields a classifier, which provides

a probability reflecting the second player’s potential adversarial intent. This introduces a new

layer of complexity in the decision-making process, necessitating an understanding of the potential

impact of adversarial actions over multiple planning stages (finite-horizon). Our proposed model

treats the second player as adversarial with a certain probability at each stage, until the point where

adversarial intent is confirmed after a set of actions is played for a specified number of times

3

(referred to as a budget). This pivotal event leads the game to a stopping state and results in its

termination.

We proceed to analytically characterize the Nash equilibria of this game, focusing specifically

on the case of two actions per player with an engagement budget—a model termed as M-SSG. This

initial framework is subsequently extended to incorporate two additional types of budget: an attack

and a defense budget. Furthermore, we broaden the action space for both players beyond the initial

two actions, and also provide an analytical condition governing the transition to a pure policy.

1.2 Part II: State Based Adversarial Models

The setup in Chapter 4 is inspired by the cybersecurity game of stealthy takeover known as

FlipIt [145]. FlipIT is a two-player game between an adversary and defender competing to control

a shared resource. The resource can be represented as a critical digital system such as a computing

device, virtual machines or a cloud service [30].

In particular Chapter 4 takes us further along the path an adversaries’ potential to seize control

of a system entirely. Picture an adversary endowed with the knowledge of a prototypical Cyber-

Physical Systems (CPS) control loop, capable of executing a takeover at various critical junctures.

These junctures encompass the reference inputs, actuator, state, sensor, and control output, each

bearing the potential to sway the system’s performance. Unlike conventional adversaries that

tamper with either the system’s states (actuator attack) or measurements (integrity attack) [1], this

chapter envisions a scenario where the adversary commandeers a resource, wielding the power to

transmit arbitrary values originating from the controlled resource. The focus of this chapter extends

beyond static systems, delving into the dynamic landscape of resource takeovers within a CPS. It

confronts the challenge of devising effective strategies to combat adversaries while navigating the

delicate balance between operational costs and system performance.

The core challenge in CPS security is the tight (often nebulous) integration of the cyber, physical

and computational elements. Such an integration, which can expand the CPS to arbitrary dimensions

proportional to the complexity of the real-world system, necessitates a scalable framework for

developing defense policies. Riding on recent successes, Machine Learning (ML)-based methods

4

use parametric representations to create computational models to represent multi-level abstraction

from data. ML has replaced hand-engineered tasks with computational models that offer high

accuracy and performance. Although ML is being increasingly used in specific aspects of CPS

security, such as anomaly detection [78], malware detection, intrusion detection [32], prevention of

blackouts, attacks and destruction [148], the explicit consideration of the hybrid dynamics governing

a CPS is relatively unexplored.

In Chapter 5, we introduce a data-driven domain-aware, optimization-based approach that

revolutionizes the way we fortify Cyber-Physical Systems (CPS) against potential threats. By

emulating a strategic adversary within the system, exploiting vulnerabilities, interconnections, and

the dynamics of physical components, we engineer an automated defense strategy. Our approach

leverages an adversarial decision-making model founded on a Markov Decision Process (MDP).

This model orchestrates the optimal cyber (discrete) and physical (continuous) attack actions across

a CPS attack graph. The defense planning problem takes shape as a non-zero-sum game, an intricate

dance between the adversary and defender. To solve the adversary’s problem, we employ a model-

free reinforcement learning technique, dynamically adapting to the chosen defense strategy. Next,

we employ Bayesian optimization to discern an approximate best-response for the defender, arming

the network against the ensuing adversary policy. This iterative process refines the strategies for

both players, creating a dynamic and adaptable defense against potential threats.

Finally, in Chapter 6, we discuss some future directions corresponding to both parts of the

thesis.

5

CHAPTER 2

DETERMINISTIC ADVERSARY - STOPPING STATE GAMES AND THEIR
APPLICATION TO PATH PLANNING

In this chapter, we introduce a persistent adversary in the CPS, termed as a deterministic adversary,

alongside the decision making framework of stopping state games.

Our objective is to introduce a mathematical framework to reason about a diverse range of

adversaries and deploy effective defense strategies to counter and possibly capture such adver-

saries. The aim of such a framework is to bring together tools from game-theory, optimization

and backward induction with an emphasis on closed-form solution. Such closed-form solutions

enable computational efficiency and easier transitions to real-world deployment. The concepts and

definitions we introduce in this chapter will be subsequently carried forward in the later chapters. In

particular, the focus of this chapter will be on discussing defense and attack strategies of a defender

and a deterministic adversary, respectively. The outcome of this chapter is to ensure security and

resilience in path planning problems while striking a balance between performance and costs.

2.1

Introduction

In this chapter, we consider a path planning problem on a graph wherein a vehicle (defender)

seeks to find an optimal path from a source to a destination vertex in the presence of a deterministic

adversary. The defender is equipped with a countermeasure that can detect and permanently disable

the attack if it occurs concurrently. We model the problem over an edge as a zero-sum multi-stage

game played between the defender and the adversary with a stopping state, termed as the edge-

game. We analyze this game under full information, in which each player has complete knowledge

of the past actions taken by the opponent at every stage. We also analyze the game under a partial

information structure, wherein the defender obtains complete knowledge of the attacker’s actions

only when the defender uses the countermeasure. We characterize the Nash equilibrium of the

edge-game in both information structures with two actions per player and analyze its sensitivity to

the game parameters. We then construct a meta-game using the edge-game solutions to determine

an attack-resilient path and compare it with an efficient novel heuristic with a constraint on the

6

number of edges attacked.

Attack resilience is an essential attribute for mobile robots, and has garnered a lot of attention

in the recent years. Several methods have been proposed to improve resilience, such as designing

robust estimators in presence of process noise and modeling errors [110, 82]. In the context of

input-output attacks and known disturbance bounds for a linear dynamical system, Hespanha et

al. [62] employed game-theoretic methods to compute locally optimal solutions. Liu et al.[91]

contributed by deriving secure trajectories for robotic systems navigating from a source to a

destination, elucidating conditions under which attacks can remain undetected. Furthermore, the

study conducted by Bianchin et al.[29] focuses on the localization and navigation of a robot in the

presence of attacks, exploring the conditions under which both detectable and undetectable attacks

may exist.

Highlighting the significance of communication in preventing damages and system manipula-

tion, Agarwal et al. [2] emphasize the importance of conveying information. In response to the

rising prevalence of swarm-robotics applications, a distributed robust sub-modular optimization

algorithm is proposed in [160]. Addressing security concerns in swarm-robotics motion planning,

Tsiamis et al. [142] contributed by implementing security measures to safeguard mobile robots

against eavesdropping. Furthermore, design of a sensor network has been used to address security

concerns in disaster relief applications, such as deploying a helicopter in a flood-hit region for

search and rescue operations [77].

Game theory can be used to model strategic decision-making in vehicular networks such as

the communication links, hardware and software [7]. Attacks have been studied over the wireless

communication channels [71] and inter-connectivity between different components of CPS [97],

where risk is modeled corresponding to a threat profile in conjunction with the communication

links, hardware and software. A MITM attack [122] was demonstrated on commercial UAV(s)

which showed that mission critical tasks are susceptible to such attacks and can be secured through

an appropriate set of countermeasures.

There have been several works on game theory applied to network interdiction [147], which

7

models the interaction between an evader attempting to travel between two nodes in presence

of an edge interdictor as a two-person zero-sum game. Similar network interdiction works have

been conducted between an evader and an interdictor under a budget [69]. A subsequent work

extends this line of inquiry, introducing asymmetric information between the evader and interdictor

and formulating the problem as a mixed-integer nonlinear bilevel challenge [24]. Addressing the

computational efficiency of managing asymmetric information in network interdiction, an efficient

approach has been proposed [87]. Additionally, variants of network interdiction exploit physical

flows [115, 43], employing multi-level optimization techniques [133].

Exploring attack-resilient path planning in mobile robotics, such as works from Sanjab et

al. [130, 129], address this emerging challenge. These studies present a comprehensive framework

for analyzing the security of drone delivery systems. Central to their approach is the formulation of a

zero-sum interdiction game involving a defender (e.g., the drone operator) and a malicious adversary.

In this strategic interplay, the drone aims to minimize delivery time, while the adversary strategically

identifies locations for interdiction, maximizing delivery time. Another key contribution involves

integrating prospect theory to capture nuanced perceptions of success and achievable delivery times

for both the defender and the adversary relative to a specified delivery time. The research extends

incorporating new concepts from cumulative prospect theory (PT) [129]. The game undergoes

analysis both with and without PT, leading to the development of innovative algorithms to attain

equilibria under PT and the interdiction game.

We conducted a validation of the partial information model using a ground robot traveling

from a source pose to a destination pose. This validation was performed through the utilization

of ROS in conjunction with the Gazebo simulation environment [74]. In a second scenario, we

simulated and experimented with the ground robot, representing a single traffic lane. This allowed

us to demonstrate how an attack could be executed to deceive the robot’s perception by introducing

spurious obstacles in front of it while in motion. Subsequently, we applied the derived solution

to make informed decisions regarding when to query the infrastructure for information validation.

The effectiveness of our theoretical solution was evaluated across multiple epochs of the attack

8

simulation and experiments.

The primary contributions of this chapter are four-fold.

1. Game-theoretic modeling: We model the interplay between costs related to mobility and

security in an attack-resilient path planning problem using the framework of dynamic zero-

sum games with a stopping state, i.e., the game terminates if the players play out of a given

subset of their actions at any stage (cf. Figure 2.1). The attack based on the MITM model

leads to a dynamic game with full information structure, i.e., every player has complete

knowledge of the past actions of the opponent at every stage. In contrast, the attack based on

sensor spoofing yields a partial information structure, i.e., only the adversary has complete

information of the past actions of the defender, but the adversary is constrained to attack at

all subsequent stages if it decides to spoof the sensor at any stage. The proposed models can

be considered as dual versions of the classic Chicken game or the War of attrition [61] with

an additive stage cost that models look-ahead and the novel aspect of partial (asymmetric)

information.

2. Solutions and parameter sensitivity: For both information structures, we first characterize

the Nash equilibria and present the solutions to be played over the edges of the roadmap mod-

eled as graph. For ease of exposition and presentation, we report the analysis and numerical

results for the special case of two actions per player, although our solution techniques are

applicable for any number of actions per player. Additionally, we study the sensitivity of the

obtained solution and the player strategies to: (i) the relative costs of mobility and security,

and (ii) the number of stages along with the extension to multiple attacks in the full informa-

tion model. We show that the partial information model leads to a linear programming-based

formulation yielding an efficient solution to the game. This solution has been inspired by

analogous techniques to solve games of incomplete information [119].

3. Computation of an optimal attack-resilient path: We use the solutions to the two classes

of games to construct a meta-game over a given roadmap. In the meta-game, the objective

9

of the defender vehicle is to go from a source to a destination vertex while minimizing the

impact of attack. The adversary’s objective is to target the most vulnerable set of edges over

the given set of feasible path(s). We assume that the adversary is resource constrained, and

is thus restricted to be active on only one edge of the graph. The assumption of a single edge

attack arises from the fact that, if an attack is detected along an edge of a given path, the

defender becomes highly vigilant along the current path and will either take an alternate path

or ensure higher security in the current path making the adversary less efficient. However,

this assumption can be relaxed by allowing the adversary to be active over multiple edges.

Making the adversary more capable (multi-edge attack) only changes the size of the meta-

game leading to an increased computation. Over a simple graph, we quantify the sensitivity

of the choice of the resulting path (resp. edges for the adversary) to the costs of mobility and

security.

4. Comparison with competing heuristics: For large-sized roadmaps, we compare the solution

of the meta-game against a simple heuristic proposed based on computing the shortest path

constrained to a single attack and defense scenario over an edge. We calculate the shortest path

by replacing edge weights with the corresponding solutions of the full or partial information

game. We observe that in sparse graphs, the solutions of the meta-game and the heuristic are

comparable, whereas in dense graphs, the meta-game yields a reduced cost but with longer

computation times.

Outline: This chapter is organized as follows. We present the problem formulation of the

zero-sum multi-stage game played over any edge and the meta-game representing the resilient

path planning problem in Section 2.2. This is followed by Section 2.3 and Section 2.4, wherein

we characterize the solution to the full information structure as a function of the stage costs, the

number of stages, and a threshold on the number of attacks. We present the solution to the partial

information structure in Section 2.5. Then, we present the solution to the meta-game in Section 2.6

on a simple graph and evaluate the sensitivity of choosing the shortest path versus the alternatives

10

Figure 2.1 The full information edge-game with 𝐿 = 1, along the edge 𝜈𝜉 for given graph with 𝑘 𝜈𝜉
as the number of stages. The information set for the adversary and defender is indicated by the
dotted line and nodes taking on a value 𝑉𝑖 for 𝑖 ∈ {0, 1, . . . 𝐾𝑒 − 1}, respectively. Actions of the
adversary (resp. defender) abbreviated as {𝐴, 𝑁 𝐴} (resp. {𝐷, 𝑁 𝐷}) for {Attack, No attack} (resp.
{Defend, No defend}). The stopping state is indicated by 𝑆𝑆.

Figure 2.2 The partial information edge-game along edge 𝜈𝜉 for the given graph. The information
set for the adversary and defender is given by the dotted line between the nodes of a stage and on
nodes respectively, indicating the uncertainty of each player.

as a function of the stage cost parameters and number of stages along an edge. Furthermore, we

simulate the meta-game over large graphs and compare it against the shortest path heuristic. We

implement the partial information game in an open path and single traffic lane scenario, and validate

the approach with simulations and experiments in Section 2.7. Finally, we conclude this chapter in

Section 2.8. We present the proofs of all mathematical claims in the appendix.

11

2.2 Problem Formulation

Consider a roadmap modeled as a graph 𝐺 with a set of vertices 𝑉 and directed edges 𝐸. Given

a directed edge 𝑒 ∈ 𝐸, we model the interaction between an adversary and a defender as a multi-

stage zero-sum finite game called an edge-game (e-game). Each stage is associated with a matrix

𝑆 whose rows represent the actions available to the defender and columns as actions available to

the adversary, with a row-column entry representing the payoff. Each stage is also associated with

multiple states and the current state of the game at a stage is a function of the past actions of each

player. An e-game is characterized by a sequence of matrices {𝑆1, 𝑆2, . . . , 𝑆𝐾𝑒 }, where 𝐾𝑒 is the

number of stages on edge 𝑒. For ease of exposition, we assume that at any stage 𝑘, the number of

actions for the defender (resp. adversary) are limited to two – defend or no defend (resp. attack or

no attack for the adversary), implying 𝑆𝑘 ∈ R2×2, although the model easily extends to more than

two actions per player. The actions of the players and the corresponding entries of the stage cost

matrices are given as,

attack

no attack

defense

no defense

𝑠11,𝑘

𝑠21,𝑘

𝑠12,𝑘

𝑠22,𝑘









.









At each stage 𝑘, the defender (row player) and the adversary (column player) simultaneously

select their respective actions 𝑖, 𝑗 from the set {Defend, No Defend} and {Attack, No Attack},

leading to an adversary payoff of 𝑠𝑖𝑘 𝑗𝑘,𝑘 ∈ 𝑆𝑘 . We consider that an attack is detected at any stage

𝑘 ≤ 𝐾𝑒, whenever the action pair {Defend, Attack} is played simultaneously. The game stops at

any stage 𝑘 if the adversary gets detected a total of 𝐿 ≥ 1 times, defined as the stopping state. We

call the parameter 𝐿 as the termination threshold. The edge cost is the net payoff to the adversary.

2.2.1 Full and partial information e-game

We consider two variations of e-games namely; full information and partial information e-

game. In the full information e-game, termed as FIE-game, the game state at any stage is common

knowledge to both players. In the partial information e-game, referred to as the PIE-game, if, at

any given stage, the defender chooses not to defend, they remain unaware of the action taken by

12

the adversary, meaning the game state remains unknown. In contrast, in a PIE-game, we assume

that the adversary is fully aware of the game state. However, if the adversary chooses to attack at

any stage of a PIE-game, then it must continue to play its action of attack until the game reaches a

stopping state or till the last stage 𝐾𝑒 is reached.

An illustration of a simple roadmap with FIE-game and PIE-game are shown in Figures 2.1

and 2.2, respectively. The stopping state (𝑆𝑆) models the fact that after getting detected a total of

𝐿 times, the adversary gets permanently disabled. To define engagement of an e-game formally,

consider an indicator function at stage 𝑘 defined by:




The game terminates in a stopping state if there exists a stage 𝑡 ≤ 𝐾𝑒 for which (cid:205)𝑡

if {𝑖𝑘 , 𝑗𝑘 } = {𝐷, 𝐴},

1(𝑖𝑘 , 𝑗𝑘 ) =

otherwise.

1,

0,

𝑘=1

(2.1)

𝐼 (𝑖𝑘 , 𝑗𝑘 ) = 𝐿.

Now, given a sequence of player actions {(𝑖1, 𝑗1), . . . , (𝑖𝐾𝑒, 𝑗𝐾𝑒)}, the net payoff to the adversary is

given by:

𝑡
∑︁

𝐽𝐾𝑒 =

𝑠𝑖𝑘 𝑗𝑘,𝑘 +

𝐾𝑒∑︁

𝑠11,𝜅,

(2.2)

𝜅=𝑡+1
since the game stops at stage 𝑡 ≤ 𝐾𝑒. The quantity (cid:205)𝐾𝑒

𝑘=1

𝜅=𝑡+1

𝑠11,𝜅 represents the cost-to-go from

a stopping state to the final stage, which is the additive mobility cost. This chapter analyzes the

FIE-game for any 𝐿 ≥ 1. For ease of exposition, we restrict ourselves to 𝐿 = 1 for the PIE-game,

although the approach can be extended to 𝐿 ≥ 1.

For both FIE-game and PIE-game, we consider the space of behavioral policies. A multi-

stage behavioral policy [61] for the defender and adversary is a set of probability distributions

Y𝑒 := {𝑦1, . . . , 𝑦𝐾𝑒 } ∈ Δ

𝐾𝑒
𝐾𝑒
2 and Z𝑒 := {𝑧1, . . . , 𝑧𝐾𝑒 } ∈ Δ
2 , respectively, where Δ2 is the probability
𝐾𝑒
𝐾𝑒
→ R to the adversary with respect
simplex in 2 dimensions. The net expected cost 𝐽𝐸 : Δ
× Δ
2
2

to the behavioral policies {Y𝑒, Z𝑒} is given by:

𝐽𝐸𝑒 (Y𝑒, Z𝑒) =

𝐾𝑒∑︁

𝑘=1

𝑘 𝑆𝑘 𝑧𝑘 .
𝑦T

(2.3)

13

It can be shown that 𝐽𝐸 is obtained from the forward recursive equation, given by:

𝐽𝑎 =

𝑎
∑︁

𝑘=1

𝑘 𝑆𝑘 𝑧𝑘 −
𝑦T

𝑎−1
∑︁

𝑏=1

𝑦𝑏,1𝑧𝑏,1𝐽𝑎−𝑏,

(2.4)

where 𝐽𝑎 is the expected pay-off at stage 𝑎 ∈ {1, 2, . . . , 𝐾𝑒}. If the adversary attacks an edge 𝑒 ∈ 𝐸,

then the cost over the edge 𝑒 is defined by a pair of behavioral policies (Y∗

𝑒 , Z∗

𝑒 ) that are in Nash

equilibrium [61], i.e., ∀Y, Z ∈ Δ

𝐾𝑒
2 , they satisfy

𝐽𝐸𝑒 (Y∗

𝑒 , Z𝑒) ≤ 𝐽𝐸𝑒 (Y∗

𝑒 , Z∗

𝑒 ) ≤ 𝐽𝐸𝑒 (Y𝑒, Z∗

𝑒 ).

We denote the outcome of the e-game as 𝐽∗

𝐸𝑒 := 𝐽𝐸𝑒 (Y∗

𝑒 , Z∗

𝑒 ).

2.2.2 Attack-resilient path planning

The cost of traversing an edge 𝑒 ∈ 𝐸 is contingent upon whether the edge has been attacked or

not, defined as:

𝑤𝑒 =

,

𝐽∗
𝐸𝑒

if edge 𝑒 is attacked,

(cid:205)𝐾𝑒
𝑘=1

𝑠22,𝑘 ,

otherwise,

(2.5)





where the no attack condition corresponds to the mobility cost over the edge. Let 𝑒𝑖 𝑗 ∈ 𝐸 denote

the directed edge connecting vertices 𝑖 and 𝑗. Let 𝜈, 𝜉 ∈ 𝑉 denote a start and destination pair of

vertices. A path from 𝜈 to 𝜉 is a collection of at most |𝑉 | − 1 directed edges. A sample path is

defined as 𝜋𝜈𝜉 := {𝑒𝜈𝑢, . . . , 𝑒𝑣𝜉 }, ∀𝑢, 𝑣 ∈ 𝑉 \ {𝜈, 𝜉}. The set of paths from 𝜈 to 𝜉 is denoted as 𝑃𝜈𝜉.

The cost of a path 𝜋𝜈𝜉 ∈ 𝑃𝜈𝜉 is defined as

𝑤𝜋𝜈 𝜉 =

∑︁

𝑒∈𝜋𝜈 𝜉

𝑤𝑒.

(2.6)

The cost 𝑤𝑒 can then be used to define a meta-game played between the path defender and an edge

adversary. In this game, the adversary selects a subset E ⊂ 𝐸 from a total of |𝐸 |𝑃𝜂 edges, where 𝜂

is the total number of edge attacks, while the defender selects a pathh 𝜋𝑆𝑇 ∈ 𝑃𝑆𝑇 .

Assumption 2.2.1 [Single edge attack] We focus on the scenario of a single possible attack edge.

Consequently, the total number of edges corresponds to |𝐸 |𝑃1 = |𝐸 |.

14

This allows for representing the meta-game equivalently through the entries of a matrix 𝑊

whose number of rows and columns equal the cardinality of |𝑃𝑆𝑇 | and |𝐸 |, respectively. A mixed

policy for the defender (resp. adversary) in the meta-game is a probability distribution ˆ𝑦 (resp. ˆ𝑧)

over the set of paths 𝑃𝑆𝑇 (resp. the subsets of edges E ⊂ 𝐸).In the e-game, we determine the

policies for both the defender and adversary over an edge 𝑒. In contrast, in attack-resilient path

planning, we establish a meta-level policy for the defender and an attack policy. Our objective is to

compute a Nash equilibrium for the meta-game, defined as:

𝑊𝑁 𝐸 = min
ˆ𝑦∈Δ| 𝑃𝑆𝑇 |

max
ˆ𝑧∈Δ|𝐸 |

ˆ𝑦𝑇𝑊 ˆ𝑧,

(2.7)

with the resulting optimal policies for each player computed as:

ˆ𝑦∗ ∈ arg min
ˆ𝑦∈Δ| 𝑃𝑆𝑇 |

ˆ𝑦T𝑊 ˆ𝑧∗,

ˆ𝑧∗ ∈ arg max
ˆ𝑧∈Δ| 𝐸 |

ˆ𝑦∗T𝑊 ˆ𝑧.

The optimal policies ˆ𝑦∗ and ˆ𝑧∗ represent the probabilities of picking a resilient path 𝜋𝑆𝑇 and

attacking an edge 𝑒 ∈ 𝐸, respectively. We expect the complexity of this approach to scale

undesirably (exponentially in the case of dense graphs) with the size of the roadmap. Therefore, a

second objective in this chapter is to design a computationally efficient approach to find a resilient

path.

2.3 Full Information Edge Game with a Termination Threshold

In this section, we analyze the full information game over an edge 𝑒 with a termination threshold

of 𝐿 = 1, i.e., the adversary is disabled immediately when the action pair {defend, attack} is played

simultaneously. Furthermore, we assume a fixed stage cost matrix across all the stages, i.e.,

𝑆𝑘 = 𝑆, ∀𝑘 ∈ {1, 2, . . . , 𝐾𝑒}. In particular, we will derive a method to compute the expected payoff

𝐽𝐸 (2.3), resulting in a Nash equilibrium for the FIE-game. As shown in Figure 2.1, the game stops

either in the states indicated by 𝑆𝑆 or at the final stage 𝐾𝑒.

2.3.1 Nash equilibria and value of the game

We define a matrix,

𝐷 =

0 1

1 1









,









15

which encodes if the action pair {defend, attack} was used in stage 𝑘 of the FIE-game. The value of

zero-sum matrix X is given by Val(𝑋) := min𝑦𝑘 ∈Δ2 max𝑧𝑘 ∈Δ2

𝑘 𝑋 𝑧𝑘 , where 𝑦𝑘 and 𝑧𝑘 are the space
𝑦T

of defender and adversary policies, respectively. For a full information game, a standard technique

to solve such games using the cost-to-go function (e.g., see [61]) is to compute the solution of the

Bellman equation backward in time of the form:

𝑉𝑘−1 = Val(𝑉𝑘 𝐷 + 𝑆),

(cid:18)
𝑉𝑘

= Val

+

0 1











1 1


(cid:32)
(cid:32)


(cid:125)
(cid:123)(cid:122)
(cid:124)
𝐷








(cid:124)

𝑠11 + (𝐾𝑒 − 𝑘)𝑠22

𝑠21
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:123)(cid:122)
ˆ𝑆𝑘

(cid:19)

𝑠12





𝑠22


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:125)

= min
𝑦𝑘 ∈Δ2

max
𝑧𝑘 ∈Δ2

𝑦T
𝑘

(cid:16)
𝑉𝑘 𝐷 + ˆ𝑆

(cid:17)

𝑧𝑘 ,

(2.8)

where 𝑘 ∈ {1, 2, . . . , 𝐾𝑒} denotes the stage, 𝑉𝑘 is the expected value of the game at the 𝑘 𝑡ℎ stage,

𝑆𝑘 is the stage cost matrix. The expected value of the game at any stage 𝑘 is given by:

𝑉𝑘−1 = 𝑦∗
𝑘

T (cid:16)

𝑉𝑘 𝐷 + ˆ𝑆

(cid:17)

𝑧∗
𝑘 ,

(2.9)

where {𝑦∗

𝑘 , 𝑧∗

𝑘 } is a Nash equilibrium policy at stage 𝑘.

The following assumption enables us to analyze the e-game and derive closed-form expressions

for the value of the game and player policies.

Assumption 2.3.1 The following stage cost inequalities hold at any stage 𝑘 of the FIE-game,

𝑠21 > 𝑠12 ≥ 𝑠11 > 𝑠22 ≥ 0

Assumption 2.3.1, is commonly encountered in security-related problems, implying the cost of

defense is lower against a no defense under an attack action. Consequently, the cost corresponding

to a defense is higher than not defending against an attack-free scenario.

The following theorem summarizes the analytic expressions for the Nash equilibrium policies

and the corresponding value at stage 𝑘.

16

Theorem 2.3.2 Under Assumption 2.3.1, the unique Nash equilibrium at any stage 𝑘 for a full

information edge game (FIE-game) with a termination threshold of 𝐿 = 1 is given by:

(cid:20)

(cid:20)

𝑦∗
𝑘 =

𝑧∗
𝑘 =

𝑠22 − 𝑠21
𝑠11 − 𝑠12 − 𝑠21 + (𝐾𝑒 − 𝑘 + 1)𝑠22 − 𝑉𝑘

𝑠11 − 𝑠12 + (𝐾𝑒 − 𝑘)𝑠22 − 𝑉𝑘
𝑠11 − 𝑠12 − 𝑠21 + (𝐾𝑒 − 𝑘 + 1)𝑠22 − 𝑉𝑘

(cid:21) T

, (2.10)

𝑠22 − 𝑠12
𝑠11 − 𝑠12 − 𝑠21 + (𝐾𝑒 − 𝑘 + 1)𝑠22 − 𝑉𝑘

𝑠11 − 𝑠21 + (𝐾𝑒 − 𝑘)𝑠22 − 𝑉𝑘
𝑠11 − 𝑠12 − 𝑠21 + (𝐾𝑒 − 𝑘 + 1)𝑠22 − 𝑉𝑘

(cid:21) T

,

(2.11)

with the boundary condition 𝑉𝐾𝑒 := 0. The value of the game is given by:

𝑉𝑘−1 = 𝑉𝑘 +

det(𝑆) + 𝑠22((𝐾𝑒 − 𝑘)𝑠22 − 𝑉𝑘 )
𝑠11 − 𝑠12 − 𝑠21 + (𝐾𝑒 − 𝑘 + 1)𝑠22 − 𝑉𝑘

,

where det(𝑆) is the determinant of the matrix 𝑆.

(2.12)

□

Please refer to 2.9 for the proof of Theorem 2.3.2. The derived Theorem 2.3.2 yields a closed-

form expression to compute the solution of a FIE-game recursively, and is computationally efficient

to evaluate even over very large number of stages 𝐾𝑒.

In order to study sensitivity of the solution of the FIE-game with respect to security and mobility

costs, we parameterize the stage cost matrix 𝑆 with two ratios 𝑟1 and 𝑟2. The parameterized stage

cost matrix is given by:

𝑆 = 𝑠11

1






𝑟1



1

𝑟2









, where 𝑟1 :=

𝑠21
𝑠11

, and 𝑟2 :=

𝑠22
𝑠11

.

(2.13)

The motivation for such a parameterization stems from the fact that the payoff for defense

remains independent of the action taken by the adversary. The cost of defense is represented by 𝑠11,

whereas the loss of security is represented by 𝑠21, i.e., the attack drains resources from the vehicle

and goes unnoticed. The mobility cost is denoted by 𝑠22, i.e., the cost incurred by the defender

vehicle when it goes from current to next stage under no attack. Furthermore, the assumption of

𝑟1 ≥ 1 and 𝑟2 < 1 carries over from Assumption 2.3.1. The condition of 𝑟1 ≥ 1 naturally fits the

incentive of an adversary to cause a loss and, 𝑟2 < 1 represents the minimum cost of mobility. The

parameterized matrix (2.13) results in the following recursive value of the game:

𝑉𝑘−1 = 𝑠11

(cid:18)
𝑉𝑘 +

𝑟2 − 𝑟1 + 𝑟2((𝐾𝑒 − 𝑘)𝑟2 − 𝑉𝑘 )
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

(cid:19)

.

(2.14)

17

With the game well-defined in the final stage 𝐾𝑒 with 𝑉𝐾𝑒 := 0, the Nash equilibrium at the final

stage 𝐾𝑒 is given by the pair of policies:

𝑦∗
𝐾𝑒 =

(cid:105) T

(cid:104)

1 0

, 𝑧∗

𝐾𝑒 =

(cid:20) 1 − 𝑟2
𝑟1 − 𝑟2

(cid:21) T

.

𝑟1 − 1
𝑟1 − 𝑟2

(2.15)

Thus, we can determine the limiting probabilities of attack and defense using (2.15) and assign

appropriate costs to balance between performance and cost. The following result summarizes

another key property of the FIE-game.

Corollary 2.3.3 Given a FIE-game with a constant stage cost 𝑆, the mixed policy for the defender

and adversary at the start of a FIE-game in the limit as the number of stages 𝐾𝑒 → ∞ satisfies,

lim
𝐾𝑒→∞

𝑦∗
1 =

(cid:104)

0 1

(cid:105) T

,

lim
𝐾𝑒→∞

𝑧∗
1 =

(cid:104)

0 1

(cid:105) T

.

In short, this means that both players begin with not defending and not attacking, respectively,

at the beginning of the FIE-game, and gradually (monotonically) shift the weights toward defending

and attacking as the stages progress.

Our next result examines what happens when the multi-stage game is played with a very small

inter-stage period.

Proposition 2.3.4 (Approximate solution) In the limit as the inter-stage time interval tends to

zero, the value of an FIE-game at any stage 𝑘 satisfies

𝑉𝑘 ≈ −𝑟1 +

√︃

𝑟 2
1

+ (2𝑟1𝐾𝑒 + 2𝑟1(1 − 𝑘) + 1) + 𝑟2(𝐾𝑒 − 𝑘)

(2.16)

□

We present the proof in 2.9. Given the number of stages 𝐾𝑒, we determine the value of a

FIE-game at 𝑘 = 0, i.e., 𝑉0 using the solution of the game (equation (2.4)). From (2.16), we observe

that the value of the FIE-game increases sublinearly with 𝐾𝑒 for sufficiently small values of 𝑟2 and

linearly with 𝑟2.

We now investigate the sensitivity of the FIE-game solution to the values of 𝑟1 and 𝑟2 assuming

a unit defense cost, i.e., 𝑠11 = 1. Figure 2.3a presents the FIE-game solution for various 𝑟1 and

18

𝑟2 values. The ratio 𝑟2 has a greater impact on determining the FIE-game’s value compared to

𝑟1, as 𝑟2 represents the minimum possible cost to pay. As 𝑟2 increases, we observe that the value

of the FIE-game grows approximately linearly with the number of stages, in accordance with the

attack 𝑧∗

approximate solution in Equation (2.16). The Nash equilibrium probability of defense 𝑦∗

0(1) and
0(1) at the start of an FIE-game are shown in Figures 2.3b and 2.4a, respectively. For a
given 𝑟2 and 𝑟1, the probability of an attack is more likely for smaller values of 𝑟1 and 𝑟2, indicative

of a cautious adversary. In contrast, the defender defends with a lower chance for small values of

𝑟1 and 𝑟2. Furthermore, we observe that the defender and adversary probabilities monotonically

decrease with increasing stages 𝐾𝑒, which aligns with Corollary 2.3.3.

Finally, we compare the approximate value in Equation (2.16) with the recursive value of the

FIE-game in Equation (2.14) and plot the percentage error relative to the recursive value of the

FIE-game in Figure 2.4b. We notice that the approximation accuracy improves with decreasing 𝑟2

and 𝑟1, and the error for any given 𝑟1 and 𝑟2 tends to zero as the number of stages 𝐾𝑒 increases.

(a)

(b)

Figure 2.3 (a) Value of a FIE-game vs. stages 𝐾𝑒 for a given set of 𝑟2 and 𝑟1 with a termination
threshold of to 𝐿 = 1. (b) Policy of the defender at 𝑘 = 0 of a FIE-game vs. stages 𝐾𝑒 for the same
set of 𝑟2 and 𝑟1 with a termination threshold of to 𝐿 = 1.

19

5101520Stages Ke51015Value of the fie-game5101520Stages Ke00.20.40.60.81Defender policy (y0)(a)

(b)

Figure 2.4 (a) Policy of the adversary at the start (𝑘 = 0) of a FIE-game vs. stages 𝐾𝑒 for the same
conditions as in Figure 2.3a. (b) Percentage error between the approximate value (equation (2.16))
and recursive value (equation (2.14)) of the FIE-game with 𝐿 = 1 for a set of game parameters.

2.4 Full Information Edge Game with an Arbitrary Termination Threshold

We now extend the FIE-game for the case of 𝐿 > 1. A FIE-game with 𝐿 = 2 is illustrated

in Figure 2.5. With increasing 𝐿, the game tree extends further where it would have originally

terminated for lower values of 𝐿.

2.4.1 Value of the game and player policy

For a FIE-game with an arbitrary number of termination threshold, we introduce a second

matrix given by:








which encodes the value of the game corresponding to the action pair {defend, attack} at stage

0 0

1 0

𝐸 =









,

𝑘 + 1. We build on the procedure described in Section 2.3, and determine a recursive equation for

20

5101520Stages Ke0.10.20.30.40.50.6Attacker policy (z0)2004006008001000Stages Ke0102030Percentage errorFigure 2.5 The FIE-game along with termination threshold of 𝐿 = 2. The dynamic game shown
can tolerate an action pair of {Defend, Attack} twice followed by disabling the adversary. The
information set for the adversary and defender is indicated by the dotted line and nodes taking
on a value 𝑉 𝑗
for 𝑖 ∈ {0, 1, . . . 𝐾𝑒 − 1} and 𝑗 ∈ 0, 1, . . . , 3 respectively. The actions of the
𝑖
adversary (resp. defender) is abbreviated as {𝐴, 𝑁 𝐴} (resp. {𝐷, 𝑁 𝐷}) for {Attack, No attack}
(resp. {Defend, No defend}). The termination states are denoted by 𝑆𝑆 (blue colored node).

value of the game given by:

𝑉 𝑖
𝑘−1 = Val(𝑉 𝑖+1

𝑘 𝐸 + 𝑉 𝑖

𝑘 𝐷 + 𝑆),

(cid:18)
𝑉 𝑖+1
𝑘

= Val

+𝑉 𝑖
𝑘



1 0








0 0


(cid:32)
(cid:32)


(cid:125)
(cid:123)(cid:122)
(cid:124)
𝐸

(cid:19)

+𝑆



0 1








1 1


(cid:32)
(cid:32)


(cid:125)
(cid:123)(cid:122)
(cid:124)
𝐷

(2.17)

𝑦T
𝑘

1 0

max
𝑧𝑘 ∈Δ2

= min
𝑦𝑘 ∈Δ2








where the number of attacks defended prior to a stopping state is given by 𝑖 ∈ {𝐿 − 1, 𝐿 − 2, . . . , 1},
𝑘 is the expected value of the FIE-game at the 𝑘 𝑡ℎ stage after the 𝑖-th instance of an attack detection.
𝑉 𝑖
For 𝑖 = 𝐿 − 1, we use the solution of FIE-game with single termination threshold.

(cid:169)
𝑉 𝑖+1
(cid:173)
𝑘
(cid:173)
(cid:171)

+ 𝑆(cid:170)
(cid:174)
(cid:174)
(cid:172)

+ 𝑉 𝑖
𝑘

0 1

0 0

1 1

























𝑧𝑘 ,

At every stage, only one out of two events can occur – either the attack is detected and therefore

the value of 𝑖 increments by one, or the attack goes undetected and thus, 𝑖 remains unchanged. The

21

expected value of the game for stage 𝑘 after the 𝑖-th attack gets defended is given by:

𝑘−1 = 𝑦𝑖∗T
𝑉 𝑖

𝑘

(cid:16)
𝑘 𝐸 + 𝑉 𝑖
𝑉 𝑖+1

𝑘 𝐷 + 𝑆

(cid:17)

𝑧𝑖∗
𝑘 ,

(2.18)

where {𝑦𝑖∗

𝑘 𝑧𝑖∗

𝑘 } is the corresponding mixed Nash equilibrium policy. When 𝑖 = 𝐿, it corresponds

to immediate engagement of the adversary, thus represented by the recursive equation of a single

termination threshold (2.8). We now present the results for FIE-game with an arbitrary finite

termination threshold:

Corollary 2.4.1 Under Assumption 2.3.1, the unique Nash equilibrium policy at any stage 𝑘 after

𝑖 instances of {attack, defense} action pairs of the FIE-game with a termination threshold of 𝐿 is

given by:

(cid:34)

(cid:34)

𝑦𝑖∗
𝑘 =

𝑧𝑖∗
𝑘 =

𝑠22 − 𝑠21

𝑠11 − 𝑠12 − 𝑠21 + 𝑠22 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘

𝑠11 − 𝑠12 + 𝑠22 + 𝑉 𝑖+1
𝑠11 − 𝑠12 − 𝑠21 + 𝑠22 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘
𝑘 − 𝑉 𝑖
𝑘

𝑠22 − 𝑠12

𝑠11 − 𝑠12 − 𝑠21 + 𝑠22 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘

𝑠11 − 𝑠21 + 𝑠22 + 𝑉 𝑖+1
𝑠11 − 𝑠12 − 𝑠21 + 𝑠22 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘
𝑘 − 𝑉 𝑖
𝑘

(cid:35) T

(cid:35) T

,

,

(2.19)

(2.20)

with the boundary condition 𝑉 𝑖

𝐾𝑒 := 0, ∀𝑖 ∈ {1, . . . , 𝐿 − 1}. The value of the FIE-game is given by:
det(𝑆) + 𝑠22(−𝑉 𝑖
𝑘 + 𝑉 𝑖+1
)
𝑘
𝑘 + 𝑉 𝑖+1
𝑠11 − 𝑠12 − 𝑠21 + 𝑠22 − 𝑉 𝑖
𝑘

(2.21)

𝑘 +

.

𝑘−1 = 𝑉 𝑖
𝑉 𝑖

□

We skip the proof for Corollary 2.4.1, as the proof is analogous to Theorem 2.3.2 with a

change in the zero-sum matrix. The mixed policies in Equation (2.19) and (2.20) are defined for

attacks when the number of stages 𝐾𝑒 ≥ 𝐿. When 𝐾𝑒 < 𝐿, we determine the policies only for

𝑖 ∈ {1, 2, . . . , 𝐾𝑒}.

2.4.2 Parameterized stage cost and numerical evaluation

Using the parameterized cost matrix (2.13), the expected value of the FIE-game in equa-

tion (2.21) satisfies:

𝑉 𝑖
𝑘−1 = 𝑠11

(cid:32)
𝑉 𝑖

𝑘 + +

𝑟2 − 𝑟1 + 𝑟2(−𝑉 𝑖
𝑟2 − 𝑟1 − 𝑉 𝑖

𝑘 + 𝑉 𝑖+1
𝑘
𝑘 + 𝑉 𝑖+1
𝑘

(cid:33)

)

(2.22)

22

Equation (2.22) is defined for instances of 𝑖 ∈ {1, 2, . . . , 𝐿 − 1}. For 𝑖 = 𝐿, we resort to the

recursive expected value of the FIE-game defined in Equation (2.14). From Equation (2.22), we

observe that the value of the game at any stage 𝑘 − 1 after the 𝑖-th instance is dependent on the value

of the game at 𝑘 under instances 𝑖 and 𝑖 + 1. This dependency can also be observed in Figure 2.5,

where the 𝑖𝑡ℎ attack node branches into the (𝑖 + 1)𝑡ℎ attack. Following the previous section, we study

the effect of 𝑟1 and 𝑟2 under unit defense cost, i.e., 𝑠11 = 1. The value of the FIE-game for varying

numbers of stages 𝐾𝑒 and a set of 𝑟1 values are shown in Figures 2.6a and 2.6b. We observe that

the value of the FIE-game depends strongly on the termination threshold 𝐿; for a given 𝑟1 and 𝑟2,

the value of the FIE-game increases by a significant amount with increasing 𝐿. Furthermore, with

a larger termination threshold, the value of the FIE-game becomes independent of 𝑟1.

From the analysis of 𝐿 = 1, we observed that at equilibrium, the probability of defense increases

and that of attack decreases with increasing value of 𝑟1. Therefore, in this section, we observe the

player policies for a fixed 𝑟1. The equilibrium policies at the start of FIE-game with 𝐿 termination

threshold are shown in Figure 2.7a and 2.7b. We observe that the probability of defense increases

with a larger termination threshold 𝐿. This is reflective of a defender accounting for multiple attacks

before engagement. In contrast, we observe that the attack probability at any stage is much lower

for a large 𝐿. This indicates the adversary being aware of multiple attack possibilities and wants

to gain as much as possible. Additionally, we observe that the optimal attack policy decreases at a

lower rate when compared to smaller value of 𝐿.

2.5 Partial Information Edge Game with a Termination Threshold

We now present the solution to the PIE-game over an edge 𝑒 with 𝐾𝑒 stages. For ease of

exposition, we assume 𝐿 = 1, although by following steps similar to those outlined in Section 2.4, it

is possible to extend the approach to a general value of 𝐿 with careful book-keeping. Recall that the

defender has partial information since it is uncertain about the game state whenever it chooses not

to defend. This causes the information sets to span across different branches of the game tree (cf.

Figure 2.8). Consequently, this model introduces a constraint on the adversary; if it attacks at any

stage, then it is constrained to continue to attack at subsequent stages until it gets caught (reaches a

23

(a)

(b)

Figure 2.6 (a) Value of the FIE-game across multiple 𝑟2, stages 𝐾𝑒 and termination threshold 𝐿 for
given 𝑟1 = 1.5. (b) Value of the FIE-game across multiple 𝑟2, stages 𝐾𝑒 and termination threshold
𝐿 for given 𝑟1 = 3.0.

(a)

(b)

Figure 2.7 (a) Policy of the defender at start stage (𝑘 = 0) of FIE-game for increasing 𝑟2 and the
number of attacks 𝐿 across given stages 𝐾𝑒.(b) Policy of the adversary at start stage (𝑘 = 0) of
FIE-game for increasing 𝑟2 and the number of attacks 𝐿 across given stages 𝐾𝑒.

stopping state) or the game reaches its final stage. This constraint arises from the perspective of a

defender (CAV). In other words, when an attack occurs (spoofed vehicle) followed by a no-attack

stage (removing the spoofed vehicle), the defender would be alerted to the presence of an adversary

in the system. In a realistic scenario, if a CAV were to observe a vehicle (spoofed) signal toggling

24

5101520Stages Ke51015Value of the fie-game5101520Stages Ke51015Value of the fie-game5101520Stages Ke0.40.60.811.2Defender policy (y0)5101520Stages Ke0.10.20.30.40.5Attacker policy (z0)on and off, this would reveal the existence of an adversary in the current path.

Figure 2.8 The PIE-game with a termination threshold of 𝐿 = 1 with 𝐾𝑒 = 2. The leaf node 𝑆𝑆
represents the stopping state. The dotted line indicates the information set for the corresponding
player. The notation 𝛼𝑘 (resp. 𝛽𝑘 ) represents information set for the defender (resp. adversary) for
the stage 𝑘 ∈ 1, 2. The value of each leaf node is represented by 𝑄𝑚 for 𝑚 ∈ {1, 2, . . . , 8}. The
leaf node values are presented in Section 2.5.1.

2.5.1 Formulation and solution of a 2 stage game

We will illustrate a procedure to solve the PIE-game with 𝐾𝑒 = 2 and use mathematical induction

to solve any PIE-game with an arbitrary finite number of stages 𝐾𝑒. From Figure 2.8, we observe

that the PIE-game consists of 8 leaf nodes with values defined as:

𝑄1 = 𝑠11 + 𝑠21, 𝑄2 = 𝑠12 + 𝑉 1
1

, 𝑄3 = 𝑠21 + 𝑠11, 𝑄4 = 2𝑠21,

𝑄5 = 𝑠22 + 𝑠11, 𝑄6 = 𝑠22 + 𝑠12, 𝑄7 = 𝑠22 + 𝑠21, 𝑄8 = 2𝑠22.

Let 𝑦𝛼𝑘
𝑖

and 𝑧𝛽𝑘
𝑖

represent the defender and adversary policy with 2 actions, 𝑖 ∈ {1, 2} and at stage

𝑘 ∈ {1, 2, ...𝐾𝑒}. The information sets for the adversary and defender at stage 𝑘 are represented by

𝛼𝑘 and 𝛽𝑘 , respectively. The expected value of the 2 stage game is given by:

𝑉0(𝑦, 𝑧) = 𝑦𝛼1
1
𝑧𝛽1
2

𝑦𝛼1
2

𝑧𝛽1
1
𝑦𝛼2
1

1

𝑄1 + 𝑦𝛼1
𝑧𝛽2
1

𝑄5 + 𝑦𝛼1

𝑧𝛽1
2

2

𝑄2 + 𝑦𝛼1
2
𝑧𝛽2
𝑧𝛽1
𝑦𝛼2
2
1
2

𝑧𝛽1
1

𝑦𝛼2
1

𝑄3 + 𝑦𝛼1
2
𝑧𝛽1
𝑦𝛼2
2
2

𝑧𝛽1
1
𝑧𝛽2
1

2

𝑦𝛼2
2

𝑄4+
𝑄7 + 𝑦𝛼1

2

𝑄6 + 𝑦𝛼1

𝑧𝛽1
2

𝑦𝛼2
2

𝑧𝛽2
2

𝑄8,

(2.23)

25

where 𝑦, 𝑧 are the probability distributions of the defender and adversary actions given by:

𝑦 =

(cid:104)

𝑦𝛼1
1

(cid:105) T

𝑦𝛼1
2

, 𝑧 =

(cid:104)

𝑧𝛽1
1

(cid:105) T

.

𝑧𝛽1
2

(2.24)

By a change of variables, (2.23) can be re-written as,

𝑉0( ˜𝑦, ˜𝑧) = ˜𝑦1 ˜𝑧1𝑄1 + ˜𝑦1 ˜𝑧2𝑄2 + ˜𝑦2 ˜𝑧1𝑄3 + ˜𝑦3 ˜𝑧1𝑄4 + ˜𝑦2 ˜𝑧3𝑄5 + ˜𝑦2 ˜𝑧4𝑄6 + ˜𝑦3 ˜𝑧3𝑄7 + ˜𝑦3 ˜𝑧4𝑄8,

(2.25)

where ˜𝑦, ˜𝑧 are multinomial probability distributions over the defender and adversary actions, given

by:

˜𝑦 =

(cid:104)

𝑦𝛼1
1

𝑦𝛼1
2

𝑦𝛼2
1

𝑦𝛼1
2

𝑦𝛼2
2

(cid:105) T

, ˜𝑧 =

(cid:104)

𝑧𝛽1
1

𝑧𝛽1
2

𝑧𝛽1
2

𝑧𝛽2
1

𝑧𝛽1
2

𝑧𝛽2
2

(cid:105) T

.

(2.26)

The Nash equilibrium policy of the defender and adversary, and the value of the PIE-game are

determined by solving the following zero-sum matrix game,

𝑉0 (cid:17) min
˜𝑦∈Δ3

max
˜𝑧∈R4
≥0

˜𝑦T

0

0


𝑄1 𝑄2



𝑄3



𝑄4


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

0 𝑄5 𝑄6








0 𝑄7 𝑄8


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:123)(cid:122)
(cid:125)
𝐴

˜𝑧.

(2.27)

Equation (2.27) can be posed as a linear program to solve for the security level and actions at each

stage of both the adversary and defender.

2.5.2 Formulation and solution of a 𝐾𝑒 stage PIE-game

In order to solve a PIE-game with an arbitrary number of stages 𝐾𝑒, we need to determine the

total number of leaf nodes and then generate the corresponding zero-sum matrix (analogous to

Equation (2.27)). Using an induction argument, it follows that the structure of the game matrix for

varying numbers of stages is illustrated in Figure 2.9. For a given number of stages 𝐾𝑒 ≥ 2, the

total number of leaf nodes 𝑇 (entries in the game matrix) is given by:

𝑇 = 4𝐾𝑒 +

(𝐾𝑒 − 1) (𝐾𝑒 − 2)
2

.

(2.28)

26

The multinomial probability distribution for the defender and adversary comprising of the

behavioral policies for 𝐾𝑒 stages are defined as:

˜𝑦 = [𝑦𝛼1
1

𝑦𝛼1
2

𝑦𝛼2
1

𝑦𝛼1
2

𝑦𝛼2
2

𝑦𝛼3
1

˜𝑧 = [𝑧𝛽1
1

𝑧𝛽1
2

𝑧𝛽1
2

𝑧𝛽2
1

𝑧𝛽1
2

𝑧𝛽2
2

. . .

. . .

(cid:16)(cid:206)𝐾𝑒−1
𝑚=1
(cid:16)(cid:206)𝐾𝑒−1
𝑚=1

𝑦𝛼𝑚
2

𝑧𝛽𝑚
2

(cid:17)

(cid:17)

𝑦𝛼𝐾𝑒
1

𝑧𝛽𝐾𝑒
1

(cid:206)𝐾𝑒

𝑚=1

𝑦𝛼𝑚
2

]T,

(cid:206)𝐾𝑒

𝑚=1

𝑧𝛽𝑚
2

]T.

(2.29)

The dimensions of multinomial probability distribution for the defender and adversary are ˜𝑦 ∈ R𝐾𝑒+1

and ˜𝑧 ∈ R2𝐾𝑒, respectively. Similar to the FIE-game, for a given number of stages 𝐾𝑒, the value

of PIE-game at any stage 𝑘 is a function of the stage cost 𝑆 and the value of PIE-game at the next

stage, recursively defined as:

𝑉𝑘 (cid:17) 𝑓 (𝑉𝑘+1, 𝑆) = Val( 𝐴(𝐾𝑒 − 𝑘)), 𝑘 = {0, 1, . . . , 𝐾𝑒 − 1},

where 𝐴(𝐾𝑒 − 𝑘) ∈ R𝑘+1×2𝑘 is the zero-sum matrix at stage 𝑘. When 𝑘 = 𝐾𝑒 − 1, the zero-sum

matrix 𝐴(1) = 𝑆.

Figure 2.9 Illustration of the PIE-game matrix 𝐴 structure for any given number of stages, 𝐾𝑒. The
solid square blocks indicate the leaf node entries, the triangle blocks indicate the solution from the
preceding stage game, the diamond block indicates the value 𝑉𝐾𝑒 with 𝐾𝑒 = 2, and the empty space
indicates zeros. For a given number of stages 𝐾𝑒, the game matrix 𝐴 is recursively solved from
𝐴(2) to 𝐴(𝐾𝑒).

A pictorial representation in constructing the zero-sum matrix 𝐴(𝐾𝑒 − 𝑘) is illustrated in

Figure 2.9. The solution(s) of the previous stage(s) is (are) indicated by the triangular blocks. The

27

only exception is for a 2 stage game indicated by a diamond block, where we use the expected

solution of the stage cost matrix in a minimax setting. Similar to the 2 stage setting, we populate

the game matrix for any given number of stages 𝐾𝑒 to fill the entries of a game matrix for any given

number of stages 𝐾𝑒. We then obtain the value of the PIE-game by solving the problem:

The security level of the defender and corresponding probability distribution ˜𝑦 for a PIE-game with

𝑉0 (cid:17) min
˜𝑦∈Δ𝐾𝑒+1

max
˜𝑧∈R2𝐾𝑒
≥0

˜𝑦T 𝐴(𝐾𝑒) ˜𝑧.

(2.30)

𝐾𝑒 stages are computed from the solution of the linear program:

𝑣

min
𝑣∈R, ˜𝑦∈Δ𝐾𝑒+1
subject to 𝐴(𝐾𝑒)T ˜𝑦 ≤ 𝑣1𝐾𝑒+1,

(2.31)

where 1𝐾𝑒+1 denotes the vector of ones of size 𝐾𝑒 + 1. Similarly, the security level of the adversary

and corresponding probability distribution ˜𝑧 for the PIE-game with 𝐾𝑒 stages are obtained from the

linear program:

max
𝑣∈R, ˜𝑧∈R2𝐾𝑒
≥0

𝑣

subject to 𝐴(𝐾𝑒) ˜𝑧 ≥ 𝑣1𝐾𝑒+1,

˜𝑧1 + ˜𝑧2 = 1,

˜𝑧3 + ˜𝑧4 = ˜𝑧2,
...

˜𝑧2𝐾𝑒−1 + ˜𝑧2𝐾𝑒 = ˜𝑧2𝐾𝑒−2.

(2.32)

Let the solutions obtained from (2.31) and (2.32) be ˜𝑦∗ and ˜𝑧∗, respectively. Thus, the solution of

PIE-game for 𝐾𝑒 stages is:

To solve the PIE-game for 𝐾𝑒 stages, we recursively solve a zero-sum matrix game from stages 2 to

𝑉0 (cid:17) ˜𝑦∗T 𝐴(𝐾𝑒) ˜𝑧∗.

(2.33)

𝐾𝑒 − 1, to construct the matrix 𝐴(𝐾𝑒) as illustrated in Figure 2.9.

2.5.3 Parametric stage cost and numerical illustration

We now evaluate the value of the PIE-game with parametric stage costs and compare it against

different parameters. We use the same parametric stage costs 𝑆 as defined in Section 2.3 with a

28

unit defense cost 𝑠11 = 1. The values of the PIE-game for different sets of 𝑟1 and 𝑟2 are shown in

Figure 2.10a. Similar to the FIE-game, we observe that 𝑟2 impacts the value of the PIE-game to

a greater degree compared to 𝑟1. In other words, the mobility cost has a greater influence on the

value of the game compared to the security cost. For a given number of stages 𝐾𝑒, the adversary

and defender policies for attack and defense actions at the start of the game (first stage) are shown

in Figures 2.10b and 2.11a. The probability of attack decreases monotonically for all the values of

𝑟1 and 𝑟2. A larger security loss leads to a lower attack probability, indicating that the adversary

wants to prolong the game without being caught.

The defender policy, on the other hand, shows a monotonic decrease when the mobility cost is

zero and a decrease followed by an increase with the number of stages 𝐾𝑒 for non-zero mobility

cost. For the monotonically decreasing case, the defender opts to maintain a minimum net cost

from the no defense action as the probability of attack is low. This is contrary to the behavior

under a non-zero mobility cost, where the probability of defending decreases in the first few stages

and then maintains the same probability till the last stage, due to the partial information structure.

Finally, we compare the solution of the PIE-game with the FIE-game for the same values of 𝑟1

and 𝑟2 in Figure 2.11b. We observe that the value of the PIE-game and FIE-game are identical.

However, the policies of both games are significantly different due to the difference in information

structures between the games.

2.6 Solution of the Meta-game

In sections 2.3 and 2.5, we solved the edge-game under the full (FIE-game) and partial (PIE-

game) information structures, respectively. Since we observed the expected value of FIE-game and

PIE-game to be identical, in this section, we use either of the edge-game solutions from Section 2.3

and 2.5 to determine a secure path.

Consider the roadmap 𝐺 with vertices 𝑉 and directed edges 𝐸, with each edge 𝑒 ∈ 𝐸 being

associated with a finite number of stages 𝐾𝑒. For each edge 𝑒 with stages 𝐾𝑒 we determine the

solution to a FIE/PIE-game. The solutions are then used to populate a meta-game matrix 𝑊 (from

Section 2.2.2) that represents the choice of a path taken by the defender (row of 𝑊) and the choice

29

(a)

(b)

Figure 2.10 (a) Value of the PIE-game for a set of 𝑟1 and 𝑟2 for given stages 𝐾𝑒. (b) Policy of the
adversary with the attack action at the start stage (𝑘 = 0) of a PIE-game for given 𝑟1, 𝑟2, and stages
𝐾𝑒.

(a)

(b)

Figure 2.11 (a) Policy of the defender with defend action at the start stage (𝑘 = 0) of a PIE-game
for given 𝑟1, 𝑟2, and stages 𝐾𝑒.(b) The value of a PIE-game and FIE-game vs. stages 𝐾𝑒 for the
same set of 𝑟1 and 𝑟2.

of the edge to attack (column of 𝑊). With a slight abuse of notation, we denote 𝜋𝑖 ∈ 𝑃𝜈,𝜉 as the 𝑖th

path out of 𝑚 paths, such that |𝑃𝜈,𝜉 | = 𝑚. Similarly, we use 𝑒 𝑗 ∈ 𝐸 to denote the 𝑗 th edge out of 𝑛

edges, such that |𝐸 | = 𝑛. For 𝑚 possible paths and 𝑛 attack edges, the meta-game matrix is given

30

5101520Stages, Ke5101520Value of the gamer1:=1.5,r2:=0.0r1:=1.5,r2:=0.5r1:=3.0,r2:=0.0r1:=3.0,r2:=0.55101520Stages, Ke0.10.20.30.4Probability , z11(1)A, r1:=1.5,r2:=0.0A, r1:=1.5,r2:=0.5A, r1:=3.0,r2:=0.0A, r1:=3.0,r2:=0.55101520Stages, Ke0.40.60.81.01.2Probability , y11(1)D, r1:=1.5,r2:=0.0D, r1:=1.5,r2:=0.5D, r1:=3.0,r2:=0.0D, r1:=3.0,r2:=0.55101520Stages, Ke5101520Value of the gamePIF,r1:=1.5,r2:=0.5FIF,r1:=1.5,r2:=0.5PIF,r1:=3.0,r2:=0.5FIF,r1:=3.0,r2:=0.5by:

𝑊 =

. . .

𝑊𝜋1𝑒1 𝑊𝜋1𝑒2

𝑊𝜋2𝑒1 𝑊𝜋2𝑒2















𝑊𝜋𝑚𝑒1 𝑊𝜋𝑚𝑒2



. . .

. . .

. . .

. . . 𝑊𝜋1𝑒𝑛

. . . 𝑊𝜋2𝑒𝑛

. . .

. . .

. . .

. . .

. . . 𝑊𝜋𝑚𝑒𝑛


















.

A path 𝜋𝑖 contains 𝑝𝑖 ⊆ 𝐸 linked edges. 𝑊𝜋𝑖 𝑒 𝑗 represents the sum of edge costs on the path 𝜋𝑖

given the adversary attacks edge 𝑒 𝑗 , and is given by:

𝑊𝜋𝑖 𝑒 𝑗 =





(cid:205)𝑥∈𝑝𝑖 (cid:205)𝐾𝑒𝑥
𝑘=1

𝑆22,𝑘 ,

if 𝑒 𝑗 ∉ 𝜋𝑖

(cid:205)𝑒 𝑗 ∈𝜋𝑖 𝑤𝑒 𝑗 ,

otherwise.

(2.34)

The if condition refers to the cost of mobility under the assumption that the entire path is free

of any attack. The latter condition pertains to the cost of a path while under attack along one of

its edges, as defined in Equation (2.6). The zero-sum meta-game 𝑊 is solved using a standard

linear programming technique [61] to obtain an attack-resilient path. The policies obtained for the

defender and adversary in the FIE/PIE-game correspond to the actions over an edge 𝑒, whereas

here, the meta-policy of the defender and adversary provides the probability of selecting paths and

edges, respectively. However, the computational complexity of this approach scales undesirably

with the number of paths in the roadmap.

To address the scalability aspect, we propose the following heuristic: replace every edge of

the roadmap by the Nash equilibrium value of the e-game. Then, use any standard shortest path

algorithm (e.g., Dĳkstra algorithm [42] for directed acyclic roadmaps) to compute an optimal path.

For the resultant heuristic path, we determine the attack edge which maximizes the path cost. The

approach is summarized in Algorithm 1.

The meta-game solution 𝑊𝑁 𝐸 is compared against the length 𝐿𝑆𝐸 𝐴 of the shortest path 𝜋𝑆𝑃

(shortest path heuristic) following the constraints that only one of the edge 𝑒 in the graph 𝐺 can be

attacked. Here, 𝜋𝑆𝑃 denotes the shortest path.

31

(a)

(b)

(c)

(d)

Figure 2.12 (a) Illustration of a simple graph with 3 vertices and 3 edges. The start and end vertex
is indicated with 𝜈 and 𝜉 respectively. The number of stages between the nodes 𝑖 and 𝑗 are given
(b) The simple network (figure 2.12a) with stages over the edge, 𝑘 𝜈1 = 𝑘1𝜉 = 3 and
by 𝑘𝑖, 𝑗 .
𝑘 𝜈𝜉 = 6. The shortest path is calculated over the edge weights. (c) The solution of the simple graph
meta-game with the defender probabilities over the paths. The shortest path is indicated with a
larger arrow as compared to others and with lighter shade of vertex. (d) The solution of the simple
graph meta-game with the adversary probabilities over the edges.

Algorithm 1: Shortest path edge attack

Input: G(graph)
Output: 𝐿SEA
for every 𝑒 ∈ 𝐸 do

Set 𝑤𝑒 = 𝑉0 for edge 𝑒 ;

end
¯𝜋 = Dijkstra (𝑉, {𝑤𝑒1, . . . , 𝑤𝑒 | 𝐸 | })
Determine the row 𝑊 ¯𝜋 ∈ 𝑊 corresponding to the path ¯𝜋 ∈ 𝑃𝜈𝜉.
𝐿SEA = arg max𝑥∈𝐸 𝑊 ¯𝜋𝑥

Figure 2.12a illustrates this algorithm on a graph consisting of two paths namely; 𝑃𝜈𝜉 =

{{𝑒𝜉𝜈}, {𝑒𝜈1, 𝑒1𝜉 }}, i.e., from vertex 𝜈 to 𝜉, and from vertex 𝜈 → 1 followed by 1 → 𝜉. The set of

attack edges is given as 𝐸 = {𝑒𝜈𝜉, 𝑒𝜈1, 𝑒1𝜉 }. We assume the fixed stage cost matrix given by:

𝑆 =

,


30 30




70 10











for the graph, and solve the FIE-game and meta-game. We summarize the results of the simple

graph in Figure 2.12b. The figure also depicts the shortest path which resulted from Algorithm 1.

The defender and adversary probabilities are shown in Figures 2.12c and 2.12d, respectively. We

observe that the probability of choosing the shortest path by the defender and an attack edge on

the same path is higher compared to the alternate path. The obtained policies are dependent on the

32

S1T77.63158138.194177.63158Shortest pathS1T77.63158138.194177.631580.37860.6214Defender probabilitiesShortest pathS1T77.63158138.194177.631580.31070.3786Attacker probabilitiesShortest pathstages 𝐾𝑒 and stage costs along each edge 𝑒, thus motivating us to study the game parameters.

2.6.1 Sensitivity of optimal policies to the game parameters

We first study the sensitivity of defender (paths) and adversary (edges) policies for the simple

graph 𝐺 (Figure 2.12a) as a function of stage cost entries and stages 𝐾𝑒 along an edge 𝑒. In the

first scenario, we examine the sensitivity over stage costs. The stage cost is parameterized with two

ratios also defined in equation (2.13) as,





𝑟1



We abbreviate the shortest path as 𝜋def and edge along the same as 𝑒att. The sensitivity plot for

𝑆 = 𝑠11









𝑟2

1

1

,

both 𝜋def and 𝑒att for changing 𝑟1 and 𝑟2 are shown in Figures 2.13a and 2.13b, respectively. The

probability of choosing the shortest path and edge decreases with increasing 𝑟1, indicating that

the defender (resp. adversary) is aware of risks and chooses alternate paths as opposed to the

shortest path. With increasing 𝑟2, we observe that the probability of choosing the shortest path also

decreases. This relates to a high cost of mobility, i.e., under no defense and no attack, the payoff is

high, and therefore, the defender prefers alternate path(s).

(a)

(b)

(c)

(d)

Figure 2.13 (a) The sensitivity of choosing the shortest path (𝜋def) with changing 𝑟1 and 𝑟2 with fixed
stages over each edge. (b) The sensitivity of choosing the shortest path edge (𝑒att) with changing
𝑟1 and 𝑟2 with fixed stages over each edge (c) The sensitivity of choosing the shortest path (𝜋def)
with changing number of stages over the edges 𝐾𝜋𝑆1,1𝑇 and 𝐾𝜋𝑆𝑇 given a fixed stage cost. (d) The
sensitivity of choosing the shortest path edge (𝑒att) with changing number of stages over the edges
𝐾𝜋𝑆1,1𝑇 and 𝐾𝜋𝑆𝑇 given a fixed stage cost.

Next, we characterize the sensitivity of choosing 𝜋def and 𝑒att with varying number of stages

𝐾𝑒 along an edge 𝑒, i.e., along the shortest path and along the alternate path which consists of

33

12345r100.20.40.60.8r20.360.380.4Defender Path probabilities (def)12345r100.20.40.60.8r20.360.380.4Attacker Edge probabilities (eatt)12345KST12345KS1,1T00.51Defender Path probabilities (def)12345KST12345KS1,1T00.51Attack Edge probabilities (eatt)(a)

(b)

(c)

(d)

Figure 2.14 (a) A graph consisting of 10 nodes which is sparsely connected. The output of
Algorithm 1 is path 1 − 10. Of all paths available, the path 1 − 2 − 10 has highest likelihood of
getting selected. (b) Edge 1 − 10 has the least chance of being attacked, while edge 2 − 10 has
the highest chance of getting attacked. (c) The probability of choosing the shortest path for graphs
with an average vertex degree in the interval [2, 3]. (d) Probability of choosing the shortest path in
a fully connected graph.

two edges. We increase the number of stages on both edges equally. From Figure 2.13c, it can be

inferred that the probability of choosing the shortest path 𝜋def (resp. alternate path) monotonically

increases (resp. decreases) with the number of stages. Similarly, from Figure 2.13d, the probability

of choosing the shortest path edge 𝑒att is directly proportional to the number of stages over the

edge 𝑒𝜈𝜉 and is inversely proportional to the number of stages along the path 𝑒𝜈1 ∪ 𝑒1𝜉. Thus, if

the number of stages over a path is significantly higher than over other paths, then the defender’s

probability of selecting such a path is higher.

We conclude from the sensitivity analysis that the ratios 𝑟1 and 𝑟2 govern the defender’s

propensity to be either risk-seeking or risk-averse. That is, when the costs of mobility and security

loss are high, the defender is less likely to choose the shortest path, indicating risk aversion;

otherwise, it is risk-seeking. Furthermore, the influence of edge stages 𝐾𝑒 strongly governs the

defender and adversary policies. Alternate paths with multiple edges comprising a lower number

of stages significantly deviate the policy away from the shortest path. In the next subsection, we

will examine how the solution of the meta-game compares with that from Algorithm 1 over larger

sized graphs. We will also assess whether the shortest path obtained from Algorithm 1 can serve

as a reasonable attack-resilient route.

34

468101214Vertices0.650.70.750.80.850.9Probability (Shortest Path)Meta-Game468101214Vertices0.40.50.6Probability (Shortest Path)Meta-Game2.6.2 Comparisons on larger roadmaps

In this section, we solve the meta-game (Section 2.6) played over roadmaps of varying size

to determine an optimal attack-resilient path and compare the result against the solution provided

by Algorithm 1, which is treated as a baseline. The shortest path, along with the probabilities

of choosing the paths and edges on a sparsely connected graph with 10 vertices, is shown in

Figure 2.14a and Figure 2.14b. The shortest path is indicated by a square block vertex with an

arrow. The path of interest is from the source vertex 1 to the destination vertex 10. We observe

a higher probability of picking an alternate path as opposed to the shortest path. However, the

probability of choosing an attack edge is distributed across multiple paths. These results indicate

that even for a sparse graph, an attack-resilient path is not necessarily the shortest path.

In general, for densely connected directed acyclic graphs (DAG) with 𝑁 vertices, the possible

paths scale as 2𝑁−2 with the total number of edges being

𝑁 (𝑁+1)
2

− 𝑁. Therefore, the size of a

meta-game increases exponentially with the number of vertices leading to a meta-game matrix,
𝑊 ∈ R(2𝑁 −2)×

. Now, we will investigate the solutions of the meta-game from (2.7) and

(cid:16) 𝑁 ( 𝑁 +1)
2

−𝑁

(cid:17)

compare it against the output of Algorithm 1 for a given graph. The connectivity of a graph is

characterized by the degree of each vertex. For a sparse graph, the degree of each vertex is less

than the number of nodes (assuming no self-loops). A sparse graph is generated by uniformly

sampling 𝑁 vertices from a unit square and randomly connecting them such that a desired degree

for each vertex is obtained. The number of stages 𝐾𝑒 over an edge 𝑒 is proportional to the Euclidean

distance between the connected vertices. Finally, all the stage cost matrices along any edge of the

graph are set to a constant value (defined in Section 2.3). The computation times and costs of both

the meta-game and Algorithm 1 for sparse and fully connected graphs are reported in Tables 2.1

and 2.2, respectively. The average degree of each vertex in the sparse graph is set between the

limits 2 and 3.

From Table 2.1, we observe that the ratio of average time taken to solve the meta-game to that

by Algorithm 1 decreases with an increasing number of nodes, but with a decrease in benefit in

terms of cost optimality of Algorithm 1. This decrease in computation time is a consequence of the

35

average degree per vertex across graphs of various sizes, thereby increasing the sparsity of graphs

with a large number of vertices. In contrast, from Table 2.2, we observe that in dense graphs, the

ratio of cost performance between the two approaches decreases with the graph size, but at the

expense of increasing the ratio of computation times. Probabilities of picking the shortest path are

reported in Figures 2.14c and 2.14d. From both figures, we observe that the probabilities of picking

the shortest path corresponding to Algorithm 1 increase with the sparsity of a graph as opposed to

a densely connected graph. This implies that the defender becomes risk-seeking over sparse graphs

and risk-averse over densely connected graphs.

Table 2.1 Performance of the meta-game vs. Algorithm 1 averaged over 100 runs with an average
degree between [2,3] for every vertex of the roadmap.

Vertices

Time performance,
Time(𝑊𝑁 𝐸 )/Time(𝐿𝑆𝐸 𝐴)

Cost performance,
𝑊𝑁 𝐸 /𝐿𝑆𝐸 𝐴

4
6
8
10
12
14

173.833
145.143
148.500
146.556
111.556
112.222

0.83
0.85
0.87
0.89
0.92
0.92

Table 2.2 Performance of the meta-game vs. Algorithm 1 averaged over 100 runs in a fully
connected roadmap.

Vertices

Time performance,
Time(𝑊𝑁 𝐸 )/Time(𝐿𝑆𝐸 𝐴)

Cost performance,
𝑊𝑁 𝐸 /𝐿𝑆𝐸 𝐴

4
6
8
10
12
14

152.000
179.857
165.429
187.625
283.429
773.100

0.82
0.79
0.80
0.77
0.78
0.77

2.7 Robotic Simulation of PIE-game

In this section, we demonstrate the framework of the PIE-game applied to autonomous vehicle

navigation via simulations and experiments implemented in a robotic simulation engine. For the

36

setup, we use Robot Operating System (ROS) in conjunction with Gazebo [74]. The defender

vehicle in our simulations and experiments is a TurtleBot3 burger. We assume the existence of an

architecture, such as a camera network or a positioning and localization system, which provides

uncorrupted global knowledge of the environment. This architecture forms the infrastructure for

vehicle-to-infrastructure (V2I) communication.

In this context, we focus on the vulnerability of the CAV at the perception level and show its

impact on the time taken to reach a destination. The Turtlebot3, representing a CAV, is equipped

with a set of sensors, including a LiDAR, a camera, and/or a radar, to detect obstacles in its

own vicinity (local view). The robot relies on vehicle-to-vehicle (V2V) or vehicle-to-everything

(V2X) communication to gain information from the environment beyond its local view, known

as the extended view. We consider the presence of an adversary communicating false data, such

as position or velocity, from an extended view, as illustrated in Figure 2.15. We assume that

the infrastructure can verify any malicious data, like spoofed obstacles or vehicles present in the

environment, but at a cost (e.g., delay).

Figure 2.15 Illustration of a vehicle attacked from extended view while performing a V2V or V2X
communication.

2.7.1 Open Path Attack

We construct a PIE-game with the objective of traveling from a source pose (vertex) to a

destination pose (vertex) in the presence of an attack. We model the attack as an action that creates

fake obstacles in the occupancy grid of the robot (information passed via V2V communication).

The actions of the defender (vehicle) are to either communicate with the infrastructure (V2I),

analogous to validating the information received from any vehicles/malicious agents, or to do

nothing, equivalent to relying on the received information. We assume that the message exchange

37

Local viewExtendedviewAttackerwith the infrastructure occurs at a much lower rate compared to the vehicle’s controller. Therefore,

the vehicle slows down or does not accelerate during V2I communication to ensure safety. When

an attack is successful, the vehicle deviates from its planned trajectory by a certain amount, thus

adding to the time required to reach the destination. The stage cost matrix represents the time

required by the vehicle at every stage and is summarized as:

𝑆 =

𝜙Δ𝑑

𝜙Δ𝑑

Δ𝑑 + Δ𝑎 Δ𝑑

















,

(2.35)

where Δ𝑑 is the time spent per decision epoch, 𝜙 is the factor by which Δ𝑑 is increased during a V2I

communication (𝜙 > 1), and Δ𝑎 is the additional time spent per decision epoch under a successful

attack.

Although we can solve the PIE-game for varying stage costs, for the sake of clarity and to

maintain consistency with the methodology analyzed earlier in this section, we have established a

ROS environment with a constant stage cost. In Figures 2.16a and 2.16b, we observe an attack in

ROS using a TurtleBot3 burger. It illustrates how a deviation is linked to the attack and contributes

to additional time required for travel to a destination. Given a source pose, target pose, and the

planned velocity of the robot, we estimate an approximate number of decision epochs needed. This

determined number of decision epochs is then employed as the number of stages 𝐾𝑒 to solve the

PIE-game and to define the player policies for use in the simulation.

Simulations

We conducted multiple simulations of the PIE-game, guiding a TurtleBot3 burger in conjunction

with Gazebo from a source pose to a target pose. In Figure 2.17, you can see snapshots of a sample

robot trajectory. The planned trajectory is represented by the dotted line, while the solid line

depicts the actual trajectory covered by the robot. The attack induces a deviation from the normal

trajectory, as shown in Figure 2.17b. This deviation is subsequently corrected once the adversary

is apprehended, allowing the robot to safely reach its destination, as demonstrated in Figure 2.17c

and 2.17d respectively.

38

(a)

(b)

(c)

Figure 2.16 (a) An attack realized on ROS with Turtlebot3 burger. The attack is obstacles (vehicles)
in formation causing a larger deviation in normal trajectory. (b) Influence of an attack (obstacle) on
the deviation of path causing an increase in time to destination (security loss). (c) The PIE-game
evaluated experimental and expected theoretical value of the PIE-game.

Lastly, we present the average time taken by the robot to reach the destination over multiple

runs, comparing it against the value predicted by the PIE-game in Figure 2.16c. It’s evident that

the solution obtained from the simulations closely aligns with the theoretical value predicted by the

PIE-game.

(a)

(b)

(c)

(d)

Figure 2.17 (a) The initial position of the robot with the planned trajectory indicated with the dotted
line. (b) Attack along the trajectory causing a change in deviation along the covered and planned
trajectory. The covered trajectory is represented by the solid line. (c) The change in trajectory after
the defender has intercepted the attack and recovery of the planned trajectory.(d) The final position
of the robot with the covered trajectory indicated in solid line.

39

05101520Epoch18202224Time to destinationSimulationsPIE Game TheoreticalPIE Game Simulation-2-10X coordinate-2-101Y coordinatetime = 0.01 s-2-10X coordinate-2-101Y coordinatetime = 5.22 s-2-10X coordinate-2-101Y coordinatetime = 7.88 s-2-10X coordinate-2-101Y coordinatetime = 22.97 sFigure 2.18 Illustration of a vehicle attacked from extended view under a V2V or V2X communi-
cation in a single/two lane road.

2.7.2 One/Two Lane Attack

We now extend the previously described attack on an open path to a PIE-game on a TurtleBot

inspired by a realistic attack scenario, i.e., on a traffic lane. We assume that the vehicle (robot) is

traveling on a single or two lane road where passing another vehicle is unfeasible or unsafe. By

spoofing the sensor, the adversary can create a fake large object, such as a trailer in front of a vehicle

causing it to slow down. Such a scenario is illustrated in Figure 2.18. The allowable actions of

the defender are exactly as described in subsection 2.7.1 – either to validate with the infrastructure

(V2I) or do nothing.

We model the stage cost matrix for this scenario as:

𝑆 =

𝜙Δ𝑑 𝜙Δ𝑑

˜𝜙Δ𝑑 Δ𝑑









,









(2.36)

where ˜𝜙 is the additional time spent per decision epoch when the vehicle slows down due to an

attack and the remaining parameters are presented in Equation (2.35). In the given scenario, ˜𝜙 > 𝜙

to maintain Assumption 2.3.1.

Experiments and simulations

To realize the described scenario, we constructed the environment depicted in Figure 2.19a

for both simulation (top) and experiments (bottom). The environment was designed to emulate a

constant stage cost matrix as accurately as possible. Figure 2.19a also illustrates the execution of an

attack when moving from a source to a destination pose. Similar to subsection 2.7.1, we determined

an approximate number of decision instants 𝐾𝑒 to solve the PIE-game and establish player policies,

40

Local ViewSpoofedVehicle(a)

(b)

Figure 2.19 (a) PIE-game on single traffic lane attack scenario. The figure above is a simulation with
TurtleBot3 burger in a Gazebo environment and the one below is the corresponding experimental
setup. (b) Velocity profile of the vehicle under different actions such as A(resp. NA) as attack
(resp. no attack) and D(resp. ND) as defend(resp. no defend).

utilizing the maximum allowable velocity of the TurtleBot3 burger and the distance to be traveled.

Figure 2.21 displays a snapshot of a sample robot trajectory under successful attacks. The dotted

and solid lines denote the planned and actual trajectories, respectively.

A successful attack causes the vehicle to operate at a lower velocity compared to the nominal.

The action of V2I communication by the defender also reduces the vehicle’s velocity, and in the

absence of any active actions by either player or if the game concludes (shaded region), the vehicle

reverts back to its nominal velocity, as illustrated in Figure 2.20a. The time taken by the robot over

individual experimental and simulation runs (epochs) is presented in Figure 2.20a, 2.20b, and is

compared to the value of the PIE-game. We observe that the average time taken by the vehicle

closely matches the value of the PIE-game in simulations. However, the average time taken by the

robot is observed to be slightly lower than the value of the PIE-game due to several reasons, such

as 1) approximating a constant stage cost matrix, 2) using precomputed policies when the vehicle

slows down, as opposed to re-evaluating the game and using new policies, and 3) noise in the motion

of the vehicle. In conclusion, with appropriate models of the environment, the fie/PIE-game can

be successfully applied to a range of problems.

41

0246810120.150.20VelocityND,AD,NA/AND,NA024681012Time instant0.000.501.00ActionDefendAttack(a)

(b)

Figure 2.20 (a) The value of the PIE-game evaluated in ROS using TurtleBot3 burger and gazebo
over multiple simulations and compared with the expected value of the PIE-game. (b) The value of
the PIE-game evaluated in ROS with TurtleBot3 burger over multiple experiments and compared
with the expected value of the PIE-game.

2.8 Summary

In this chapter, we addressed a prototypical path planning problem defined over a roadmap,

where a vehicle aims to find an attack-resilient path from a given source to a destination in the

presence of an adversary capable of launching an attack on an edge of the roadmap. The defender

(vehicle) can take an action to detect an attack at the expense of some cost (energy) and disable the

attack permanently if detected multiple times. We formulated this scenario using the framework of

a zero-sum multi-stage game, with a stopping state being played simultaneously by the adversary

and defender.

We characterized the Nash equilibria of an edge-game and provided a detailed analysis for the

case where both the defender and adversary are limited to only two actions. Additionally, we con-

ducted a comprehensive study of two edge-game variants, namely the fie and PIE-games, defined in

terms of the information structure induced by constraints on the type of attack. We also investigated

the sensitivity of the edge-game with respect to (i) the cost of using the countermeasure, (ii) the cost

of motion, and (iii) the benefit of disabling the attack. Moreover, we demonstrated how the results

of either edge game can be used to create a zero-sum meta-game for a given roadmap, and compared

42

02040Epoch141618Time to destinationSimulationPIE Game TheoreticalPIE Game Simulation02040Epoch10111213Time to destinationExperimentsPIE Game TheoreticalPIE Game ExperimentsFigure 2.21 Illustration of a vehicle attacked from extended view under a V2V or V2X commu-
nication in a single/two lane road. The dotted line represent the planned trajectory and the solid
line represents the trajectory followed. The solid blocks around the robot at all the time instances
represent the boundary and the solid block in from of the robot at time instant 7.89 s represents a
spoofed trailer.

the meta-game solution with the result of a novel shortest path heuristic. Finally, we reported three

sets of numerical validations: (i) computation time and cost optimality of the proposed approaches,

(ii) implementation of the PIE-game solution in a robotic simulation engine, and (iii) realization of

the PIE-game in a robotic experiment. An initial version of the result consisting of full information

stopping state game with a single termination threshold appeared in [14]. The results of the partial

information structure along with the experiments were demonstrated in [15].

2.9 Supplementary Materials

Proof of Theorem 2.3.2

Since the edge-game consists of only 2 players, the following method can be used to determine the

policy for each player at any stage 𝑘. Equation (2.9) can be represented as a zero-sum matrix given

43

0X coordinate-101Y coordinatetime = 0.01 s0-101time = 7.89 s0-101time = 15.45 sby:

𝑉𝑘−1 = 𝑦T
𝑘

(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)

𝑎11,𝑘 𝑎12,𝑘

(cid:170)
(cid:174)


(cid:174)


(cid:174)


(cid:174)


(cid:174)
𝑎21,𝑘 𝑎22,𝑘


(cid:174)


(cid:174)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)


(cid:174)
(cid:125)
(cid:123)(cid:122)
(cid:124)
Ξ(𝑘)
(cid:172)

𝑧𝑘 .

(2.37)

Policy for Defender - The expected value of a zero-sum matrix Ξ(𝑘) at any stage 𝑘 is given by:

𝑉𝑘 (Ξ(𝑘)) (cid:17) min
𝑦𝑘 ∈Δ2

max
𝑧𝑘 ∈Δ2

𝑘 Ξ(𝑘)𝑧𝑘 ,
𝑦T

= min

𝑦∈{𝑦1,𝑘,𝑦2,𝑘 }

max
𝑧∈{𝑧1,𝑘,𝑧2,𝑘 }

(𝑦1,𝑘 𝑎11,𝑘 + 𝑦2,𝑘 𝑎21,𝑘 )𝑧1,𝑘 +

(𝑦1,𝑘 𝑎12,𝑘 + 𝑦2,𝑘 𝑎22,𝑘 )𝑧2,𝑘

(cid:169)
(cid:173)
(cid:173)
(cid:171)

,

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(2.38)

where, Δ2 is probability simplex in two dimensions, 𝑦𝑖,𝑘 and 𝑧𝑖,𝑘 , 𝑖 ∈ {1, 2} represent the 𝑖th element

of the probability vector 𝑦𝑘 and 𝑧𝑘 , respectively. The policy of any player is the space of mixed

policies can determined analytically or through a graphical approach [61]. Given the policy of

second player, the first player’s policy does not deviate unilaterally if the expected outcome over

any of its action result in the same outcome. From the probability simplex of dimension 2, we get

𝑦2 = 1 − 𝑦1. This leads to the following,

𝑉𝑘 (Ξ(𝑘)) = min

𝑦1,𝑘 ∈[0,1]

max

𝑦1,𝑘 𝑎11,𝑘 + (1 − 𝑦1,𝑘 )𝑎21,𝑘

𝑦1,𝑘 𝑎12,𝑘 + (1 − 𝑦1,𝑘 )𝑎22,𝑘









.









(2.39)

Equation (2.39) yields the policy for player 1/defender is given as:

𝑦1,𝑘 𝑎11,𝑘 + 𝑎21,𝑘 − 𝑎21,𝑘 𝑦1,𝑘 = 𝑦1,𝑘 𝑎12,𝑘 + 𝑎22,𝑘 − 𝑎22,𝑘 𝑦1,𝑘 ,

⇒ 𝑦∗

1,𝑘 =

𝑎22,𝑘 − 𝑎21,𝑘
(𝑎11,𝑘 − 𝑎21,𝑘 − 𝑎12,𝑘 + 𝑎22,𝑘 )

.

The probability of choosing the second action is,

𝑦∗
2,𝑘 =

𝑎11,𝑘 − 𝑎12,𝑘
(𝑎11,𝑘 − 𝑎21,𝑘 − 𝑎12,𝑘 + 𝑎22,𝑘 )

.

Substituting the values from matrix Ξ(𝑘), equation (2.10) yields the optimal policy for defender.

44

Policy for Attacker - The mixed policy for attacker satisfies,

𝑉𝑘 (Ξ(𝑘)) = min max
𝑧𝑘 ∈{𝑧1,𝑘 }

𝑎11,𝑘 𝑧1,𝑘 + 𝑎12,𝑘 (1 − 𝑧1,𝑘 ),

𝑎21,𝑘 𝑧1,𝑘 + 𝑎22,𝑘 (1 − 𝑧1,𝑘 )

(cid:169)
(cid:173)
(cid:173)
(cid:171)

.

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(2.40)

Equation (2.40) gives us the policy for attacker as,

𝑧1,𝑘 =

𝑎22,𝑘 − 𝑎12,𝑘
(𝑎11,𝑘 − 𝑎21,𝑘 − 𝑎12,𝑘 + 𝑎22,𝑘 )

, 𝑧2,𝑘 =

𝑎11,𝑘 − 𝑎21,𝑘
(𝑎11,𝑘 − 𝑎21,𝑘 − 𝑎12,𝑘 + 𝑎22,𝑘 )

This yields the mixed policy of the attacker in equation (2.11).

Value of the game

The value of the game at any stage ‘𝑘’ is given by,

𝑉𝑘−1 = 𝑦T∗

𝑘 Ξ(𝑘)𝑧∗
𝑘 ,

Expanding the terms,

𝑉𝑘−1 = 𝑉𝑘 +

det(𝑆𝑘 ) + 𝑠22,𝑘 ((𝐾𝑒 − 𝑘)𝑠22,𝑘 − 𝑉𝑘 )
(𝑠11,𝑘 − 𝑠21,𝑘 − 𝑠12,𝑘 + (𝐾𝑒 − 𝑘 + 1)𝑠22,𝑘 − 𝑉𝑘 )

.

(2.41)

Proof of Proposition 2.3.4

To determine an approximate solution of an edge-game at any stage 𝑘, we begin with the analysis

of equation (2.14). It is observed that the recursive equation can be divided into two parts namely;

when 𝑟2 = 0 and, when 𝑟2 > 0 with increasing number of stages 𝐾𝑒. Using the parametric recursive

equation with unit defense cost (𝑠11 = 1),

𝑉𝑘−1 = 𝑉𝑘 +

𝑟2 − 𝑟1
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

+

−𝑟2((𝐾𝑒 − 𝑘)𝑟2 − 𝑉𝑘 )
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

⇒ 𝑉𝑘−1 − 𝑉𝑘 =

𝑟2 − 𝑟1
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

+

−𝑟2((𝐾𝑒 − 𝑘)𝑟2 − 𝑉𝑘 )
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

.

,

(2.42)

We will first investigate the case where 𝑟2 = 0. It is observed that the recursive equation can be

formulated by a continuous version using Taylor series expansion at time instant 𝑘

𝑉 (𝑘 − Δ𝑘) = 𝑉 (𝑘) − Δ𝑘𝑉 ′(𝑘),

⇒

⇒ lim
Δ𝑘→0

𝑉 (𝑘) − 𝑉 (𝑘 − Δ𝑘)
Δ𝑘
𝑉 (𝑘) − 𝑉 (𝑘 − Δ𝑘)
Δ𝑘

= 𝑉 ′(𝑘),
−𝑟1
𝑉𝑘 + 𝑟1

=

.

45

We obtain the continuous form of equation (2.14) as,

𝑑𝑉
𝑑𝑘

=

−𝑟1
𝑉𝑘 + 𝑟1

.

Integrating with respect to 𝑉 and 𝑘,

∫ 𝑉𝐼

𝑉

𝑉𝑘 + 𝑟1 𝑑𝑉 = −

∫ 𝐾𝑒

𝑘

𝑟1 𝑑𝑠,

⇒

𝑉 2
𝐼
2

+ 𝑟1𝑉𝐼 −

𝑉 2

2

− 𝑟1𝑉 = −𝑟1(𝐾𝑒 − 𝑘),

where 𝑉𝐼 = 1, initial condition. Substituting the value of 𝑉𝐼 in the equation (2.44),

1
2

+ 𝑟1 −

𝑉 2

2

− 𝑟1𝑉 = −𝑟1(𝐾𝑒 − 𝑘).

The solution of equation (2.45) yields the desired result given by,

𝑉𝑘 = −𝑟1 +

√︃

𝑟 2
1

+ (2𝑟1𝐾𝑒 + 2𝑟1(1 − 𝑘) + 1).

(2.43)

(2.44)

(2.45)

Similarly, we now determine the solution for 𝑟2 > 0. For a given 𝐾𝑒 the value 𝑉𝑘 at stage 𝑘

monotonically increases. Therefore, for a large 𝐾𝑒 as 𝑘 → 0, equation (2.42) can be approximated

as,

𝑉𝑘−1 − 𝑉𝑘 ≈

−𝑟2((𝐾𝑒 − 𝑘)𝑟2 − 𝑉𝑘 )
𝑟2(𝐾𝑒 − 𝑘 + 1) − 𝑟1 − 𝑉𝑘

,

⇒ 𝑉𝑘−1 − 𝑉𝑘 ≈ −𝑟2

(2.46)

Using the Taylor series expansion method as described in equation (2.43), we obtain the following

solution,

𝑉𝑘−1 ≈ 𝑟2(𝐾𝑒 − 𝑘).

(2.47)

Therefore, combining the solutions when 𝑟2 = 0 given by equation (2.45) and equation (2.47), we

obtain the following approximation,

𝑉𝑘−1 ≈ −𝑟1 +

√︃

𝑟 2
1

+ (2𝑟1𝐾𝑒 + 2𝑟1(1 − 𝑘) + 1) + 𝑟2(𝐾𝑒 − 𝑘).

46

CHAPTER 3

STOCHASTIC ADVERSARY - STOCHASTIC STOPPING STATE GAMES AND THEIR
APPLICATION TO MOTION PLANNING

In the previous chapter, we introduced a deterministic adversary that aims to maximize it’s payoff

from a defender over a finite-stage. Such an interaction between the defender and adversary is

modeled as a zero-sum multi-stage game. Furthermore, the adversary is disabled if a specific pair

of actions is played 𝐿 times, and the game reaches a stopping state. We characterized the Nash

equilibrium of such a game and demonstrated its application on a path planning problem.

However, in many real-world applications, a defender may not encounter a deterministic ad-

versary, rather a second player which has a certain probability of moving adversarially over each

stage of a finite-stage process. We term such an adversary as a stochastic adversary, which can

acts in a benign or in an adversarial manner. Such an adversary model generalizes the previous

model of a deterministic adversary. Setting the probability of adversarial behavior to 1, retrieves

the deterministic adversary model.

3.1

Introduction

In this chapter, we develop a framework to plan the actions of a defender accounting for both,

the cost of safety and minimum cost per stage in presence of a second player, which might act

adversarially. We assume that the defender uses a classifier that outputs a probability indicating

the adversarial intent of the second player. Thus, the decision-making process for the defender

needs to model the impact of possible adversarial actions over multiple planning stages (finite-

horizon). The proposed model treats the second player as adversarial with a certain probability at

each stage of the decision-making. Once the second player becomes adversarial, it continues to

select adversarial actions for the remaining stages of the game. The adversarial intent of the second

player is confirmed only after a set of actions are played for a specified number of times (termed

as termination threshold) by both players. This event causes the game to reach a stopping state

and the game terminates. We analytically characterize the Nash equilibrium of this game for the

case of two actions per player with a termination threshold, termed as M-SSG. We then expand

47

the action space of both players to an arbitrary number of actions, termed as SSG𝑚×𝑛, and solve

it using linear programming for a termination threshold of unity. Furthermore, for SSG𝑚×𝑛, we

also characterize an analytical condition under which the players’ transition to a pure policy. We

demonstrate the application of M-SSG via two autonomous motion planning applications. The first

involves maintaining a safe distance from a non-ego vehicle ahead, modeled using fixed stage costs.

The second involves safe lane-changing with costs that are stage dependent. In both scenarios,

we provide a comparison between the analytic/simulated and experimental results using ground

robots. Finally, we apply our framework with a larger action space to address a resilient estimation

problem employing a Kalman filter.

Works such as [104, 144, 143] delve into the formulation of a game between an estimator, striving

to minimize worst-case error, and an attacker manipulating measurements. This game is explored in

both complete [104] and partial information [143] settings, accommodating an arbitrary number of

sensors [144]. The study presented by Huang et al. [66] introduces a framework to analyze cross-

layer coordinated attacks on Cyber-Physical Systems (CPS). Here, the defender has the option

to dispense observation, while an adversary can launch a jamming attack aimed at degrading

estimation performance. Additional insights into attacks on CPS are provided by Mahmoud et al.

in the survey [96]. In this work, along with the exploration of a defensive policy, considerations

are made for an adversary’s termination scenario (stopping state) once its presence is ascertained.

There have been a number of works addressing the problem of estimation and control in

the presence of an adversary. A hybrid game was proposed between a defender choosing a set

of controllers, detector, and estimator against an adversary capable of manipulating the sensor

measurements [100], where the authors derived a sub-optimal value iteration method along with a

moving horizon approach. The dynamic landscape of security in shared communication networks

is explored in the study presented by Xing et al.[150], where a dynamic non-zero-sum game

with asymmetric information engages multiple sensors in deciding their investments in security.

Addressing a multi-sensor transmission control problem within a signal-to-noise-and-interference

communication channel, Li et al.[89] contribute to the understanding of efficient sensor network

48

management. A comprehensive overview of security issues in Cyber-Physical Systems (CPS)

is provided in the survey by Zhu et al. [161], summarizing various game-theoretic approaches.

Notably, these prior works typically assume a deterministic adversary. In contrast, the present work

considers the presence of an adversary in a probabilistic manner.

Our prior works [14] and [15] focused on full and partial information scenarios for a single robot

with a route planning problem serving as the overall objective. This chapter extends the method-

ology from our prior works to: 1) probability of adversarial intent in the information structure of

the game, 2) condition on the existence of mixed and pure policy equilibria corresponding to the

new structure, 3) the case where the defender requires a finite number of detections to ascertain the

adversarial intent of the other player, and 4) the consideration of finitely many number of actions per

player. Specifically, in this chapter, we introduce the concept of probabilistic adversary and a finite

termination threshold, which represents a limited number of instances in which a specific pair of

actions can be played before the game concludes. The termination threshold refers to the number of

times a specific set of action pairs (e.g., strong defense being used against an attack) is played, after

which the game is terminated. In other words, a termination threshold of 𝐿 corresponds to playing

a specific action pair a total of 𝐿 times. The termination threshold can be viewed as a dual version

of the war of attrition or the chicken game [61], where the game continues only when a specific

pair of actions is chosen and terminates otherwise. The termination threshold acts as a counter;

once the counter reaches the count 𝐿, the multi-stage zero-sum game terminates. The use of such a

termination threshold in the presence of a deterministic adversary has been demonstrated in navi-

gation and path planning [15]. Additionally, the termination threshold can be modified to included

limited attack or defense which are applicable in settings with constrained energy [155, 88], limited

energy resources [114], denial-of-service attacks [155, 88], and remote state estimation [114]. A

formal definition of termination threshold is provided in Section 3.2 (Definition 1). The concept

of termination threshold allows us to consider the impact of error in detecting the player type (for

example, false positives). Pure policy Nash equilibrium are preferable since they greatly reduce the

computational complexity, which is known to increase polynomially with the number of actions.

49

The contributions of this chapter are as follows,

1. Modeling adversarial intent through a stochastic game with a termination threshold: We

model the interaction between a defender and a second player, which has a given probability

of turning adversarial at any stage of a multi-stage stochastic zero-sum game (M-SSG). We

assume that the stage cost matrices and the probability of turning adversarial are known. We

begin with a M-SSG with two actions per player. An M-SSG captures two main features:

(i) a probable adversarial intent of the second player, i.e., once the second player turns

adversarial, it continues to act adversarially for the remaining stages of the game, and (ii)

a balance between security and minimum per stage cost via stopping states. Furthermore,

we incorporate a termination criteria for the zero-sum game when a specified pair of actions

are played 𝐿 times (known as the termination threshold) and completely characterize the

M-SGG.

2. Arbitrary finite number of actions per player with a switching policy: We extend to the

case of finitely many number of actions per player and provide a numerical method to solve

the game. We then characterize analytic conditions on the problem parameters under which

the defender and second player switches to a pure policy of weak defense and no attack,

respectively.

We demonstrate three applications of the proposed model. The first application is a leader-follower

scenario, where a follower (ego vehicle) acts to maintain a safe distance from the leader (non-ego

vehicle). The second application is a lane-change scenario, where the velocity of the ego vehicle

is regulated while maintaining a balance between safety and the cost incurred to reach its goal.

For the motion planning application, we compare the results obtained from analytical/simulation

with the experiments. Finally, we demonstrate an application of with a large action space in the

context of resilient estimation. This application involves utilizing a Kalman filter with multiple

sensor feedback channels in the presence of an adversary that can strategically inject noise into

50

the measurement. Such a decision-making framework empowers a defender to strike a balance

between security and performance while operating in the presence of a probable adversary.

Outline: The chapter is organized as follows. In Section 3.2, we formulate the decision-making

problem as a stochastic multi-stage zero-sum game (M-SSG) with budgets.

In Section 3.3, we

characterize the solution of the described M-SSG with a engagement, defense and attack budget

along with numerical examples. We extend the solution methodology to an arbitrary number of

actions per player setting in Section 3.4. Finally, in Section 3.5, we present an application of the

M-SSG on i) motion planning problems, and ii) on resilient estimation using a Kalman filter.

3.2 Problem Formulation

We consider a finite-horizon decision-making problem between a defender and a second player,

whose type (whether non-adversarial or adversarial) is initially unknown. At each decision instant,

the second player continues to act in a benign manner or switches to playing adversarially, according

to a Bernoulli process. Once it reveals adversarial intent, it continues to act adversarially for all

subsequent stages of the problem. The Bernoulli parameter 𝜌 captures the adversarial intent of

the second player. We model this finite stage interaction between a probabilistic adversary and the

defender as a multi-stage zero-sum game, termed as the stochastic stopping state game (SSG).

In this chapter, we extend our previous setup [18] consisting of two actions per player to (i)

incorporate a finite termination threshold termed as M-SSG, and (ii) finitely many actions per player,

termed as SSG𝑚×𝑛, which consists of 𝑚 defender and 𝑛 second player actions with a termination

threshold of 1. Every stage 𝑘 ∈ {1, 2, . . . , 𝐾 } in a 𝐾 horizon M-SSG and SSG𝑚×𝑛 are associated

with two matrices:

1. 𝑆𝑘 ∈ R𝑚×𝑛: The rows and columns of this matrix correspond to the available actions for

the defender and the second player if it acts adversarially. Each row represents a defender’s

action, and each column represents a second player’s action. A pair of actions 𝑖 and 𝑗 chosen

by the defender and adversarial player, respectively, results in a stage cost by the defender,

given by the (𝑖, 𝑗)th entry of 𝑆𝑘 , denoted by 𝑠𝑖 𝑗,𝑘 .

51

2. 𝑅𝑘 ∈ R𝑚×1: This matrix contains a single column, wherein rows represent the defender’s

actions along with their associated costs. A single column of costs correspond to a game

where the second player’s action do not impact the defender’s cost. For the defender’s action

𝑖, the stage cost is determined by the 𝑖th row of 𝑅𝑘 denoted as 𝑟𝑖,𝑘 .

The discrete state of an M-SSG indicates whether the second player is playing adversarially or not.

At any stage 𝑘 ∈ {1, 2, . . . , 𝐾 } and state of an M-SSG, the uncertainty of player type (non-adversarial

or adversarial) is captured when the game branches out with probability 𝜌𝑘 of continued adversarial

intent and the complementary probability of 1 − 𝜌𝑘 representing a non-adversarial player. For the

case of 𝑚 = 𝑛 = 2, an M-SSG considers two actions per player, i.e., 𝑆𝑘 ∈ R2×2 and 𝑅𝑘 ∈ R2×1,

𝑘 ∈ {1, . . . , 𝐾 }. The actions of the players and the corresponding entries of the stage cost matrices

are given as,

attack

no attack

no attack

strong defense








Next, we formally define the term termination threshold and introduce the three types used in


𝑟1,𝑘



𝑟2,𝑘




strong defense

weak defense

weak defense

𝑠22,𝑘

𝑠21,𝑘

𝑠12,𝑘

𝑠11,𝑘

















,

.

the work.

Definition 1 Let T denote a given subset of action pairs corresponding to the stage cost matrices

of a 𝐾 stage zero-sum matrix game. A termination threshold of 𝐿 is defined as the number of times

an action pair from T can be played after which the game is terminated.

Figure 3.1 shows a game tree where the intent of the second player is uncertain when the set

T := {{strong defense, attack}}. Here, the termination threshold 𝐿 = 2. Figure 3.2 shows a similar

game tree with the addition of finitely many number of actions available to each player with a

termination threshold of 𝐿 = 1, given T := {{𝐷1, 𝐴1}, {𝐷1, 𝐴2}, . . . , {𝐷1, 𝐴𝑛}}. States which

realize an adversarial intent (under the branch with probability 𝜌𝑘 ) continue to branch out since the

second player continues to play adversarially for the rest of the game.

52

Figure 3.1 An M-SSG consisting of 𝐾 stages a termination threshold of 𝐿 = 2. The information
set for the defender and second player is indicated by the dotted line and nodes taking value 𝑉 𝑖
𝑘 for
𝑘 ∈ {1, 2, . . . 𝐾 }, 𝑖 ∈ {0, 1, . . . , 𝐿}. The value of an M-SSG under an adversarial intent is indicated
by 𝑉 𝑘 , 𝑘 ∈ {0, . . . , 𝐾 } (see Remark 3.2.1). At every stage, the game branches with probability 𝜌𝑘
to indicate an adversarial player. Actions of an adversary (resp. defender) abbreviated as {𝐴, 𝑁 𝐴}
(resp. {𝑆𝐷, 𝑊 𝐷}) for {Attack, No attack} (resp. {Strong Defense, Weak defense}). SS indicates
the stopping state.

Both the M-SSG and SSG𝑚×𝑛 are characterized by the sequence of matrices over 𝐾 stages

given by {{𝑆1, 𝑅1}, {𝑆2, 𝑅2}, . . . , {𝑆𝐾, 𝑅𝐾 }}. At any stage 𝑘, the number of actions available to
the defender and the second player are 𝑚𝑘 and 𝑛𝑘 , respectively, i.e., 𝑆𝑘 ∈ R𝑚𝑘×𝑛𝑘 , and 𝑅𝑘 ∈ R𝑚𝑘×1,

𝑘 ∈ {1, . . . , 𝐾 }. At every stage 𝑘, the defender (row player) and the second (column player)

simultaneously select their respective actions (𝑖, 𝑗), leading to an expected cost 𝜌𝑘 𝑠𝑖 𝑗,𝑘 + (1− 𝜌𝑘 )𝑟𝑖,𝑘 ,

where 𝑠𝑖 𝑗,𝑘 ∈ 𝑆𝑘 , 𝑟𝑖,𝑘 ∈ 𝑅𝑘 , conditioned on the intent of the second player being unknown in the

previous stages of the game. We assume that the game terminates at any stage 𝑘 ≤ 𝐾, whenever a

set of action pairs (𝑖, 𝑗) ∈ T is played a total of 𝐿 times.

In an M-SSG or SSG𝑚×𝑛, the current game state at any stage and the probability 𝜌𝑘 governing

the presence of an adversary are common knowledge to both players. Since the second player’s

type is governed by a Bernoulli process, we can determine the probability of revealing adversarial

intent at any stage 𝑘 using the parameter 𝜌𝑘 . For instance, revealing adversarial intent is equivalent

53

SSAdversarial IntentIntent uncertain SSAdversaryconfirmedSSFigure 3.2 An SSG𝑚×𝑛 refers to a stochastic stopping state game where there are 𝑚 possible actions
for the defender and 𝑛 possible actions for the second player. At stage 𝑘, when the game diverges
with a probability of 1 − 𝜌𝑘 , it signifies a non-adversarial player scenario where solely the actions
of the defender are applicable.

to flipping a coin and obtaining a heads (H), with the probability of getting heads being 𝜌𝑘 . When

the SSG is played only for three stages, i.e., 𝐾 = 3, the set of possible events at the final stage

are given by {TTT,TTH,THT,THH,HTT,HTH,HHT,HHH}, with T indicating tails (benign type).

The event of revealing adversarial intent given the second player is of type (T) in stages 1 and 2

is TTH. Therefore, when 𝐿 = 1 and 𝜌 = 𝜌𝑘 , ∀𝑘, the probability of revealing adversarial intent at

stage 3 is (1 − 𝜌)2𝜌, where (1 − 𝜌)2 corresponds to the probability of remaining benign until stage

2. Likewise, when 𝐿 = 2, the events TTH, THH and HTH describe the presence of an adversary

in stage 3. Therefore, when 𝐿 = 2 and 𝜌 = 𝜌𝑘 , ∀𝑘, the probability of revealing adversarial intent at
(cid:1) (1 − 𝜌)2𝜌 + (1 − 𝜌)2𝜌. In general, any prior presence of adversary is accounted through
stage 3 is (cid:0)2
1
the probability of (cid:0)𝐾−1
(cid:1) (1 − 𝜌) 𝑘−1𝜌, 𝑥 ∈ {1, 2, . . . , 𝐿 − 1}. Using induction, for a stage varying
𝑥

probability 𝜌𝑘 , we obtain the presence of an adversary any stage 𝑘 using the indicator function

54

SSSSSSAdversarial IntentIntent uncertain SSAdversaryconfirmedSSSS1 : {1, 2, . . . , 𝑚𝑘 } × {1, 2, . . . , 𝑛𝑘 } → {0, 1} as:

1(𝑖𝑘 , 𝑗𝑘 ) =






with probability

1,

if {𝑖𝑘 , 𝑗𝑘 } ∈ T ,

0,

otherwise.

(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)

𝐿−1
∑︁

𝑥=1
(cid:124)

(cid:18)𝐾 − 1
𝑥

(cid:19) 𝑘−1
(cid:214)

𝑘−1
(cid:214)

(1 − 𝜌𝑞) +

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

𝑞=1

𝑞=1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(1 − 𝜌𝑞)

𝜌𝑘 ,

(3.1)

(cid:123)(cid:122)
𝜚𝑘

(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:125)
(cid:172)

The game terminates at a stopping state if there exists a stage 𝑡 ≤ 𝐾 such that (cid:205)𝑡

𝑘=1

1(𝑖𝑘 , 𝑗𝑘 ) = 𝐿.

At any stage 𝑘, for a given pair of actions (𝑖𝑘 , 𝑗𝑘 ) the expected cost for the defender is computed
(cid:1). Here, the term 𝜚𝑘 𝜌𝑘 𝑠𝑖𝑘, 𝑗𝑘 corresponds to the expected cost of the
as, 𝜚𝑘 (cid:0)𝜌𝑘 𝑠𝑖𝑘, 𝑗𝑘 + (1 − 𝜌𝑘 )𝑟𝑖𝑘

defender when the second player is adversarial, while the term 𝜚𝑘 (1− 𝜌𝑘 )𝑟𝑖𝑘 represents the expected

cost for the defender when facing a non-adversarial player. Now, conditioned on a sequence of player

actions {(𝑖1, 𝑗1), . . . , (𝑖𝐾, 𝑗𝐾)}, the expected cost (with respect to the Bernoulli random variable
that defines adversarial intent) 𝐽𝐾 : (cid:206)𝐾
𝑘=1{1, 2 . . . , 𝑚𝑘 } × {1, 2 . . . , 𝑛𝑘 } → R for the defender is

given by

𝐽𝐾 ({(𝑖1, 𝑗1), . . . , (𝑖𝐾, 𝑗𝐾)}) =

𝐾
∑︁

𝜚𝑘 (cid:0)𝜌𝑘 𝑠𝑖𝑘, 𝑗𝑘 + (1 − 𝜌𝑘 )𝑟𝑖𝑘

(cid:1) .

(3.2)

𝑘=1
For both the players, we consider the space of behavioral policies. A multi-stage behavioral

policy [61] for the defender and second player (non-adversarial or adversarial) are defined by a set

of probability distributions Y := {𝑦1, . . . , 𝑦𝐾 } ∈ {Δ𝑚1

, Δ𝑚2

, . . . , Δ𝑚𝐾 } and Z := {𝑧1, . . . , 𝑧𝐾 } ∈

, . . . , Δ𝑛𝐾 }, respectively, where Δ𝑖𝑘 is the probability simplex in 𝑖 dimensions at stage 𝑘.

, Δ𝑛2

{Δ𝑛1
In particular, for the M-SSG, Y := {𝑦1, . . . , 𝑦𝐾 } ∈ {Δ21

, Δ22

, . . . , Δ2𝐾 } and Z := {𝑧1, . . . , 𝑧𝐾 } ∈

{Δ21

, Δ22

, . . . , Δ2𝐾 }.

Remark 3.2.1 When the probability 𝜌𝑘 , ∀𝑘 ∈ {1, 2, . . . , 𝐾 } is set to 1, an M-SSG or SSG𝑚×𝑛

reduces to an edge-game from our previous work [14].

The net expected cost 𝐽𝐸 : (cid:206)𝐾

𝑖=1 Δ𝑚𝑖 × (cid:206)𝐾

𝑖=1 Δ𝑛𝑖 → R for the defender with respect to the

55

behavioral policies {Y, Z} is given by,

𝐽𝐸 (Y, Z) =

(cid:16)

𝜚𝑘

𝐾
∑︁

𝑘=1

𝜌𝑘 𝑦′

𝑘 𝑆𝑘 𝑧𝑘 + (1 − 𝜌𝑘 )𝑦′

𝑘 𝑅𝑘

(cid:17)

.

(3.3)

The goal of this chapter is to find a pair of behavioral policies (Y∗, Z∗) that are in Nash

equilibrium [61] satisfying

𝐽𝐸 (Y∗, Z) ≤ 𝐽𝐸 (Y∗, Z∗) ≤ 𝐽𝐸 (Y, Z∗).

Since this a complete information full feedback game, there always exists a behavioral saddle

point [61]. We denote the outcome of any SSG (M-SSG or SSG𝑚×𝑛) as

𝐸 := 𝐽𝐸 (Y∗, Z∗).
𝐽∗

In the following sections we will derive an analytical and numerical method to solve both the

M-SSG and SSG𝑚×𝑛.

3.3 Solution to the M-SSG

In this section, we will solve and analyze the M-SSG. Limiting the number of actions to two

enables us to determine a closed-form expression for the value of the game and the corresponding

player policies. The set T indicates the condition for terminating the M-SSG. For the M-SSG,

we define the set T := {{strong defense, attack}}. For the given set T and a finite termination

threshold 𝐿, we will present a procedure to compute the outcome 𝐽𝐸 defined in equation (3.3),

resulting in a Nash equilibrium. The value of a zero-sum matrix game defined by the matrix 𝑋 is

given by Val(𝑋) := min𝑦𝑘 ∈Δ𝑚𝑘 max𝑧𝑘 ∈Δ𝑛𝑘
second player policies, respectively. For any 𝑖 ≥ 0, let 𝑉 𝑖

𝑘−1 denote the mixed value of an M-SSG
with a termination threshold of 𝐿 − 𝑖 at time instant 𝑘 − 1. Then, there are two possibilities – either

𝑦′
𝑘 𝑋 𝑧𝑘 , where 𝑦𝑘 and 𝑧𝑘 are the space of defender and

the pair of actions played at time 𝑘 belong to the set T or not. Thus, using backward iteration,
𝑘−1 is a linear combination of 𝑉 𝑖+1
𝑉 𝑖
the value of an edge-game, 𝑉 𝑘 [14] with probability 𝜌𝑘 and 𝑉 𝐿

𝑘 . When 𝑖 = 𝐿, the value of the M-SSG depends on
𝑘 with probability 1 − 𝜌𝑘 . The next

and 𝑉 𝑖

𝑘

subsubsection formalizes this intuitive description.

56

3.3.1 Nash equilibria and value of the game

For the given set T , we define two matrices,

𝐹 =

1 0

0 0

















, and 𝐷 =

0 1

1 1









,









which encodes the event that the action pair from T was used or not used at any stage of the M-SSG,

respectively (cf. Figure 3.1 for visualization). Notice the entry of 1 in the matrix 𝐹 and 0 in the

matrix 𝐷 correspond to the action pair in the set T . The following Bellman equation will show

how the matrices 𝐹 and 𝐷 are incorporated with the stage cost matrices. A standard technique to

solve such games using the cost-to-go function (e.g., see [61]) is to compute the solution of the

Bellman equation backward in time,

𝑉 𝑖
𝑘−1 =

(cid:16)

(cid:16)

Val

Val






𝜌𝑘 (𝑉 𝑖+1

𝑘 𝐹 + 𝑉 𝑖

𝑘 𝐷 + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖

𝑘 1 + ˜𝑅𝑘 )

(cid:17)

,

for 𝑖 ≠ 𝐿,

(3.4)

𝜌𝑘 (𝑉 𝑘 𝐷 + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖

𝑘 1 + ˜𝑅𝑘 )

(cid:17)

,

𝑖 = 𝐿,

where 𝑘 ∈ {𝐾, 𝐾 − 1, . . . , 1} is the stage, 𝑖 ∈ {0, 1, . . . , 𝐿} is the number of times an action pair

from T was used, 𝑉 𝑘 is the value of the edge-game, 𝑉 𝑖

𝑘 is the value of M-SSG with a threshold

of 𝐿 − 𝑖 respectively, 1 ∈ R2×2 is the matrix of ones, ˜𝑅𝑘 ∈ R2×2 is a matrix with repeated column

entries to convert the vector 𝑅𝑘 to a matrix.

The following mild assumption enables us to analyze the M-SSG and derive closed-form

expressions for the value of the game and player policies.

Assumption 3.3.1 The following stage cost inequalities hold at any stage 𝑘 of the M-SSG,

𝑠21,𝑘 > 𝑠12,𝑘 ≥ 𝑠11,𝑘 > 𝑠22,𝑘 ≥ 0 and 𝑟1,𝑘 > 𝑟2,𝑘 ≥ 0.

Assumption 3.3.1 is naturally applicable in security-related scenarios.

It implies that the cost

associated with implementing a strong defense is lower than the cost of using a weak defense

against an adversarial player. Similarly, the cost corresponding to a strong defense is higher than a

weak defense against a non-adversarial player.

57

Remark 3.3.2 To incorporate costs once a stopping state is reached, we can optionally augment

the stage cost entries corresponding to the action pair in T for 𝑉 𝐿
cost over the remaining stages, such as (cid:205)𝐾

𝑘 , i.e., augment 𝑠11,𝑘 with a fixed
𝑗=𝑘 𝑠22,𝑘 . In the analysis of the M-SSG, we do not consider

such an augmentation of stage costs.

To solve (3.4), we must first determine the value of an edge-game corresponding to the case of

𝑖 = 𝐿. This case is equivalent to the setting of 𝜌𝑘 = 1, ∀𝑘 ∈ {1, 2, . . . , 𝐾 }. For the case of two

actions per player, under Assumption 3.3.1, the value of the edge-game is computed recursively [14]

using

𝑠11,𝑘 𝑠22,𝑘 − 𝑠12,𝑘 𝑠21,𝑘 − 𝑠22,𝑘𝑉 𝑘
𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑘
with 𝑉𝐾 = 0. Next, the recursion (3.4) for the case of 𝑖 ≠ 𝐿 is given by

𝑉 𝑘−1 = 𝑉 𝑘 +

,

(3.5)

+

(cid:33)

.

(3.6)

𝑉 𝑖
𝑘−1 = Val

(cid:32)

𝜌𝑘𝑉 𝑖+1
𝑘

+𝜌𝑘𝑉 𝑖
𝑘

1 0











0 0


(cid:32)
(cid:32)


(cid:123)(cid:122)
(cid:125)
(cid:124)
𝐹

+𝜌𝑘



0 1








1 1


(cid:32)
(cid:32)


(cid:123)(cid:122)
(cid:125)
(cid:124)
𝐷

𝑠11,𝑘





𝑠21,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

𝑠12,𝑘





𝑠22,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:125)

(1 − 𝜌𝑘 )𝑉 𝑖
𝑘

1 1

1 1

















+ (1 − 𝜌𝑘 )

(cid:123)(cid:122)
𝑆𝑘

𝑟1,𝑘





𝑟2,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:125)


𝑟1,𝑘



𝑟2,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

(cid:123)(cid:122)
˜𝑅𝑘

Similarly, when 𝑖 = 𝐿, the value of an M-SSG at any stage 𝑘 is given by

𝑉 𝐿
𝑘−1 = Val

(cid:32)

𝜌𝑘𝑉 𝑘

+𝜌𝑘



0 1








1 1


(cid:32)
(cid:32)


(cid:123)(cid:122)
(cid:125)
(cid:124)
𝐷

𝑠11,𝑘





𝑠21,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

+

𝑠12,𝑘





𝑠22,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:125)

(cid:123)(cid:122)
𝑆𝑘

(1 − 𝜌𝑘 )𝑉 𝐿
𝑘

+ (1 − 𝜌𝑘 )

1 1

1 1


















𝑟1,𝑘



𝑟2,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

𝑟1,𝑘





𝑟2,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:125)

(cid:123)(cid:122)
˜𝑅𝑘

(cid:33)

.

(3.7)

A quick comparison of (3.6) and (3.7) reveals that due to the termination threshold, the quantities

𝑉 𝑘 and 𝑉 𝑖

𝑘 get coupled, ∀𝑘 ∈ {1, 2, . . . , 𝐾 } and ∀𝑖 ∈ {1, 2, . . . , 𝐿}. Thus, the expected value of the

58

M-SSG at any stage 𝑘 is given by,

𝑉 𝑖
𝑘−1 =

′ (cid:16)

′ (cid:16)

𝑦𝑖,∗
𝑘

𝑦𝑖,∗
𝑘





𝜌𝑘 (𝑉 𝑖+1

𝑘 𝐹 + 𝑉 𝑖

𝑘 𝐷 + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖
𝑘 1 + ˜𝑅𝑘 )
(cid:17)

𝜌𝑘 (𝑉 𝑘 𝐷 + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖

𝑘 1 + ˜𝑅𝑘 )

𝑧𝑖,∗
𝑘 ,

(cid:17)

𝑧𝑖,∗
𝑘 ,

if 𝑖 ≠ 𝐿,

otherwise,

(3.8)

where {𝑦𝑖,∗

𝑘 , 𝑧𝑖,∗
𝑖 ∈ {0, 1, . . . , 𝐿}.

𝑘 } is a Nash equilibrium pair at stage 𝑘 when the termination threshold equals

While a mixed Nash equilibrium always exists, it is computationally more efficient to identify

whether a pure Nash equilibrium exists at any given stage. Therefore, to aid the search of a pure

Nash equilibrium, we present the following result. We derive a general result of switching between

mixed and pure policies, corresponding to the stage cost matrices and the Bernoulli parameter 𝜌𝑘 ,

which will be used in the subsequent results.

Lemma 3.3.3 Given a Bernoulli parameter 𝜌 and stage cost matrices:








where 𝐵 corresponds to costs when playing against an adversarial player and 𝐶 corresponds costs

𝑏21 𝑏22

𝑏11 𝑏12

𝑐1 𝑐1

𝑐2 𝑐2

, 𝐶 =

𝐵 =

























when playing against a non-adversarial player, satisfying the inequalities:

𝑏21 > 𝑏12 ≥ 𝑏11 > 𝑏22 ≥ 0 and 𝑐1 > 𝑐2 ≥ 0.

If

where

0 < 𝜌 < ˆ𝜌,

ˆ𝜌 :=

𝑐1 − 𝑐2
𝑏21 − 𝑏11 − 𝑐2 + 𝑐1

.

(3.9)

Then, there exists a pure policy Nash equilibrium action pair of {weak defense, attack} for the zero

sum game defined by the 𝜌𝐵 + (1 − 𝜌)𝐶,

□

59

Proof: When 𝜌 = 1, the matrix 𝜌𝐵 + (1 − 𝜌)𝐶 simplifies to just the matrix 𝐵, which leads to a

mixed strategy Nash equilibrium [132] (no row or column domination). When 𝜌 = 0, the second

player’s action does not impact the cost. When 0 < 𝜌 < 1, we examine the entries of the matrix









𝜌𝑏11 + (1 − 𝜌)𝑐1 𝜌𝑏12 + (1 − 𝜌)𝑐1

𝜌𝑏21 + (1 − 𝜌)𝑐2 𝜌𝑏22 + (1 − 𝜌)𝑐2

.









(3.10)

Based on the entries in (3.10), the entry 𝜌𝑏22 + (1 − 𝜌)𝑐2 is the smallest entry of the matrix. As

𝜌 → 0, the defender switches to the pure policy of weak defense. Thus, we need to determine the

value of 𝜌 for which a row domination (weak defense) occurs. This condition corresponds to:

𝜌𝑏21 + (1 − 𝜌)𝑐2 < 𝜌𝑏11 + (1 − 𝜌)𝑐1,

⇒ 𝜌(𝑏21 − 𝑏11 − 𝑐2 + 𝑐1) < 𝑐1 − 𝑐2,

which upon further simplification, leads to equation (3.9).

□

Lemma 3.3.3 outlines a condition that dictates the shift from a mixed policy to a pure policy

Nash equilibrium. This result will help us derive the conditions for a termination threshold in the

following Theorem. Theorem 4 summarizes the analytic expressions for the Nash equilibria in

behavioral policies and the corresponding value at any stage 𝑘 for any given 𝜌𝑘 .

Theorem 3.3.4 The Nash equilibrium policies at any stage 𝑘 ∈ {1, . . . , 𝐾 } for a given M-SSG with

a termination threshold of 𝐿, stage cost matrices 𝑆𝑘 , and

˜𝑅𝑘 :=


𝑟1,𝑘



𝑟2,𝑘




𝑟1,𝑘

𝑟2,𝑘









=

𝑠12,𝑘

𝑠22,𝑘









𝑠12,𝑘

𝑠22,𝑘

,









60

under Assumption 3.3.1 are given by:

𝑦𝑖,∗
𝑘 =

𝑧𝑖,∗
𝑘 =






























𝑠22,𝑘 − 𝑠21,𝑘
(cid:101)𝑉𝑘

𝑠11,𝑘 − 𝑠12,𝑘 + 𝑠22,𝑘 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘

(cid:101)𝑉𝑘
𝑠22,𝑘 − 𝑠21,𝑘
(cid:98)𝑉𝑘
𝑠11,𝑘 − 𝑠12,𝑘 − 𝑉 𝑘
(cid:98)𝑉𝑘












,












[0

1]′ ,

,

if 𝑖 ≠ 𝐿, and 𝜌𝑘 ≥ ˜𝜌𝑖
𝑘 ,












if 𝑖 = 𝐿 and 𝜌𝑘 ≥ ˜𝜌𝐿
𝑘 ,

(3.11)

𝑠22,𝑘 − 𝑠12,𝑘
𝜌𝑘 (cid:101)𝑉𝑘

𝑠11,𝑘 − 𝑠12,𝑘 + 𝑠22,𝑘 + 𝑉 𝑖+1

𝑘 − 𝑉 𝑖
𝑘

𝜌𝑘 (cid:101)𝑉𝑘
𝑠22,𝑘 − 𝑠12,𝑘
𝜌𝑘 (cid:98)𝑉𝑘
𝑠11,𝑘 − 𝑠21,𝑘 − 𝑉 𝑘
𝜌𝑘 (cid:98)𝑉𝑘












,












[1

0]′ ,

otherwise,

,

if 𝑖 ≠ 𝐿, and 𝜌𝑘 ≥ ˜𝜌𝑖
𝑘 ,












if 𝑖 = 𝐿 and 𝜌𝑘 ≥ ˜𝜌𝐿
𝑘 ,

(3.12)

otherwise,

where 𝑉𝐾 = 0, (cid:101)𝑉𝑘 := 𝑉 𝑖+1


𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖

𝑘 , (cid:98)𝑉𝑘 := 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑘 ,

and

˜𝜌𝑖
𝑘 :=

𝑠22,𝑘 − 𝑠12,𝑘
(cid:101)𝑉𝑘

,

˜𝜌𝐿
𝑘 :=

𝑠22,𝑘 − 𝑠12,𝑘
(cid:98)𝑉𝑘

, ∀𝑖 ∈ {1, 2, . . . , 𝐿}.

(3.13)

The value of the game at stage 𝑘 satisfies
𝑘 − 𝑉 𝑖

𝑠22,𝑘 (𝑉 𝑖+1

𝑘 ) + det(()𝑆𝑘 )

𝑉 𝑖

𝑘 +

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖
𝑉 𝑖+1
𝑘
det(()𝑆𝑘 ) − 𝑠22,𝑘𝑉 𝑘
(cid:98)𝑉𝑘

𝜌𝑘𝑉 𝑘 + (1 − 𝜌𝑘 )𝑉 𝑖

𝑘 +

,

𝑉 𝑖
𝑘 + 𝜌𝑘 𝑠21,𝑘 + (1 − 𝜌𝑘 )𝑠22,𝑘 ,

,

if 𝑖 ≠ 𝐿, and 𝜌𝑘 ≥ ˜𝜌𝑖
𝑘

if 𝑖 = 𝐿 and 𝜌𝑘 ≥ ˜𝜌𝐿
𝑘 ,

if 𝑖 ≠ 𝐿, and 𝜌𝑘 < ˜𝜌𝑖
𝑘 ,

𝑉 𝑖
𝑘−1 =

𝜌𝑘 (𝑠21,𝑘 + 𝑉 𝑘 ) + (1 − 𝜌𝑘 ) (𝑠22,𝑘 + 𝑉 𝐿

𝑘 ),

otherwise,

where det(() 𝑋) is the determinant of the matrix 𝑋.

61

(3.14)

□






Proof: We derive this result by first considering the case of 𝑖 ≠ 𝐿. Given any 2 × 2 zero-sum

game matrix

𝑈 =


𝑢1 𝑢2



𝑢3 𝑢4












that does not admit any row or column domination, the unique Nash equilibrium mixed policy of

the row (𝜋∗

row) and column player(𝜋∗

col), and the value of the game (see e.g., [132]) are given by

𝑢4−𝑢2
𝑢1−𝑢2+𝑢4−𝑢3

𝑢4−𝑢3
𝑢1−𝑢2+𝑢4−𝑢3

𝑢1−𝑢2
𝑢1−𝑢2+𝑢4−𝑢3

𝜋∗
row =









Val(𝑈) = 𝜋∗′









𝑈𝜋∗
col =

row

, 𝜋∗

col =










𝑢1−𝑢3
𝑢1−𝑢2+𝑢4−𝑢3
𝑢1𝑢4 − 𝑢2𝑢3
𝑢1 − 𝑢2 + 𝑢4 − 𝑢3










,

.

(3.15)

(3.16)

Rewriting the matrix in the argument of the Val(.) operator from (3.6) in a compact form, we obtain

the terms

𝑢1 = 𝜌𝑘 (𝑉 𝑖+1

𝑘 − 𝑉 𝑖

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 ) + 𝑉 𝑖

𝑘 + 𝑠12,𝑘 ,

𝑢2 = 𝑉 𝑖

𝑘 + 𝑠12,𝑘 ,

𝑢3 = 𝜌𝑘 (𝑠21,𝑘 − 𝑠22,𝑘 ) + 𝑉 𝑖

𝑘 + 𝑠22,𝑘 ,

𝑢4 = 𝑉 𝑖

𝑘 + 𝑠22,𝑘 .

(3.17)

Substituting (3.17) in (3.15) and canceling the common terms including the probability 𝜌𝑘 in both

the numerator and denominator, we obtain the defender policy for 𝑖 ≠ 𝐿 as,

𝑦𝑖∗
𝑘 =











(𝑠21,𝑘 − 𝑠22,𝑘 )
𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖
𝑘 )

(𝑉 𝑖+1

(𝑉 𝑖+1

𝑘 − 𝑉 𝑖

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 )

(𝑉 𝑖+1

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖
𝑘 )

,











and the second player policy in (3.12) (derivation skipped for brevity). Similarly, substituting (3.17)

in (3.16) we obtain the value of game at stage 𝑘 and for any 𝑖 ≠ 𝐿,

(cid:16)
𝑘 (cid:101)𝑉𝑘 +𝑠22,𝑘 (𝑉 𝑖+1
𝑉 𝑖

𝑘 − 𝑉 𝑖

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 ) − 𝑠12,𝑘 (𝑠21,𝑘 − 𝑠22,𝑘 )

(cid:17)

(cid:26)(cid:26)𝜌𝑘

𝑉 𝑖
𝑘−1 =

=

𝑘 (cid:101)𝑉𝑘 + 𝑠22,𝑘 (𝑉 𝑖+1
𝑘 − 𝑉 𝑖
𝑉 𝑖
(cid:101)𝑉𝑘

(cid:26)(cid:26)𝜌𝑘 (cid:101)𝑉𝑘
𝑘 ) + det(𝑆𝑘 )

,

62

where (cid:101)𝑉𝑘 := 𝑉 𝑖+1
for the case of 𝑖 ≠ 𝐿. Notice the dependency on probability 𝜌𝑘 is eliminated in computing the value

𝑘 and upon further simplification, we obtain (3.14)

𝑘 + 𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖

of M-SSG for 𝑖 = 1, 2, . . . , 𝐿 − 1. The dependency is coupled only via the case 𝑖 = 𝐿, which is

solved using the results in [18] (Theorem 4.4) and are applied to (3.7).

𝜌(𝑉 𝑖+1

Next, we derive the expressions for the probabilities ˜𝜌𝑖
𝑘 𝐷 + 𝑆𝑘 ) for matrix 𝐵, and (1 − 𝜌) (𝑉 𝑖
𝑘 𝐹 +𝑉 𝑖

𝑘 . For 𝑖 ≠ 𝐿, we substitute the matrix
𝑘 1 + ˜𝑅𝑘 ) for matrix 𝐶 in Lemma 3.3.3, to obtain ˜𝜌𝑖.
Similarly, for the case where 𝑖 = 𝐿, we utilize the matrix 𝜌𝑘 (𝑉 𝑘 𝐷 + 𝑆𝑘 ) in place of matrix 𝐵, and
𝑘 1 + ˜𝑅𝑘) in place of matrix 𝐶 in Lemma 3.3.3 to yield the probability ˜𝜌𝐿. Combining

(1 − 𝜌𝑘 )(𝑉 𝑖

both scenarios, we arrive at the expression (3.13). The value of the game corresponding to the pure

policy pair are given by

𝜌(𝑉 𝑖

𝑘 + 𝑠21,𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖

𝑘 + 𝑠22,𝑘 ),

𝑖 ≠ 𝐿,

𝜌(𝑉 𝑘 + 𝑠21,𝑘 ) + (1 − 𝜌𝑘 ) (𝑉 𝑖

𝑘 + 𝑠22,𝑘 ),

𝑖 = 𝐿.

Now we show that ˜𝜌𝐿

𝑘 ∈ [0, 1]. The probability threshold ˜𝜌𝐿

𝑘 from Theorem 3.3.4 is defined as

˜𝜌𝐿
𝑘 :=

𝑠22,𝑘 − 𝑠12,𝑘
𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑘

.

Under Assumption 3.3.1 we have

𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 < 0 and 𝑠22,𝑘 − 𝑠12,𝑘 < 0.

For the final stage 𝐾, 𝑉 𝐾 = 0, therefore,

𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 < 𝑠22,𝑘 − 𝑠12,𝑘 ≤ 0.

From (3.5) for the stage 𝐾 − 1, we have

𝑉 𝐾−1 =

𝑠11,𝑘 𝑠22,𝑘 − 𝑠12,𝑘 𝑠21,𝑘
𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘

.

Under Assumption 3.3.1, 𝑠11,𝑘 𝑠22,𝑘 − 𝑠12,𝑘 𝑠21,𝑘 < 0, and from the recursion (3.5), we infer that 𝑉 𝑘

is a monotonically increasing function from stage 𝑘 = 𝐾 to 𝑘 = 0. Therefore,

𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑘 < 𝑠22,𝑘 − 𝑠12,𝑘 ≤ 0.

63

Thus, we conclude that ˜𝜌𝐿

𝑘 ∈ [0, 1]. Under Assumption 3.3.1, ˜𝜌𝑖

𝑘 ∈ [0, 1], 𝑖 ∈ {1, 2, . . . , 𝐿 − 1} if

𝑠11,𝑘 − 𝑠12,𝑘 − 𝑠21,𝑘 + 𝑠22,𝑘 − 𝑉 𝑖

𝑘 + 𝑉 𝑖+1

𝑘 < 𝑠22,𝑘 − 𝑠12,𝑘 .

(3.18)

The value of the game and player policies for the case of 𝑖 = 𝐿 are obtained analogous to the

case of 𝑖 ≠ 𝐿. Using the zero-sum matrix from (3.7) in (3.16) and (3.15), we obtain (3.11), (3.12)

and (3.14) for the case of 𝑖 = 𝐿. Combining both the cases of 𝑖 ≠ 𝐿 and 𝑖 = 𝐿 along with the

probabilities ˜𝜌𝑖, we obtain the complete case of (3.11), (3.12) and (3.14).

□

Theorem 3.3.4 provides a closed-form M-SSG solution along with the player policies with a

termination threshold 𝐿. Such a solution provides computational efficiency, and a switching policy

indicates a clear trade-off between costs and security.

(a)

(b)

Figure 3.3 (a) Value of an M-SSG vs. edge-game (𝜌 = 1) over stages 𝐾 for ˜𝑠2,𝑘 = 0.3 and
˜𝑠1,𝑘 = 1.25, ∀𝑘 ∈ {1, 2, . . . , 𝐾 } with termination threshold of 𝐿 = 2 and 4.
(b) Probability
parameter ˜𝜌𝐿

𝑘 for the same set of parameters.

We can further simplify the recursion obtained from Theorem 3.3.4 by parameterizing the stage

cost matrices in terms of a uniform strong defense cost. The value of the M-SSG, player policies,

and a numerical evaluation for the parameterized stage costs are discussed as follows.

64

5101520 6 W D J H V K5101520V0010.911.011.17.58.08.59.09.5 = 0.50, L = 2 = 0.50, L = 4 = 0.75, L = 2 = 0.75, L = 4 = 1.00, L = 2 = 1.00, L = 451015200.20.40.60.81 = 0.50 = 0.75 = 1.00(a)

(b)

Figure 3.4 (a) Probability of choosing strong defense action when 𝑖 = 𝐿 for the M-SSG and
edge-game (𝜌 = 1), solved for 𝐾 = 20 using the same set of parameters, ˜𝑠2,𝑘 , ˜𝑠1,𝑘 , and 𝐿.
(b)
Probability of choosing attack action when 𝑖 = 𝐿 for the M-SSG and edge-game (𝜌 = 1) with the
same parameters of stages 𝐾, ˜𝑠2,𝑘 , ˜𝑠1,𝑘 , and 𝐿.

3.3.2 Parameterized stage cost and numerical evaluation

Following Theorem 3.3.4, we parameterize the stage cost matrix 𝑆𝑘 , for 𝑘 ∈ {1, 2, . . . , 𝐾 } with

˜𝑠1,𝑘 and ˜𝑠2,𝑘 as,

𝑆𝑘 = 𝑠11,𝑘

1

1

˜𝑠1,𝑘

˜𝑠2,𝑘

















, where ˜𝑠1,𝑘 :=

𝑠21,𝑘
𝑠11,𝑘

, and ˜𝑠2,𝑘 :=

𝑠22,𝑘
𝑠11,𝑘

,

(3.19)

𝑟1,𝑘 = 𝑠11,𝑘 , and 𝑟2,𝑘 = ˜𝑠2,𝑘 . A uniform strong defense cost indicates that the amount of resources

spent corresponding to such an action is independent of the action taken by the adversarial player.

The condition of ˜𝑠1,𝑘 ≥ 1 and ˜𝑠2,𝑘 < 1 follows from Assumption 3.3.1. The parameterized matrix

65

510152000.20.40.60.81 = 1.00, L =4 = 1.00, L =2 = 0.75, L =4 = 0.75, L =2 = 0.50, L =4 = 0.50, L =2510152000.20.40.60.81 = 1.00, L =4 = 1.00, L =2 = 0.75, L =4 = 0.75, L =2 = 0.50, L =4 = 0.50, L =2(3.19) with a unit strong defense 𝑠11,𝑘 = 1, ∀𝑘 results in the following recursive equation,

𝑘 −𝑉 𝑖
𝑘 + ˜𝑠2,𝑘− ˜𝑠1,𝑘+ ˜𝑠2,𝑘 (𝑉 𝑖+1
𝑘)
𝑉 𝑖
𝑘+𝑉 𝑖+1
˜𝑠2,𝑘− ˜𝑠1,𝑘−𝑉 𝑖
𝑘

,

𝜌𝑘𝑉 𝑘 + (1 − 𝜌𝑘 )𝑉 𝑖

𝑘 +

˜𝑠2,𝑘 − ˜𝑠1,𝑘 − ˜𝑠2,𝑘𝑉 𝑘
˜𝑠2,𝑘 − ˜𝑠1,𝑘 − 𝑉 𝑘

,

𝑉 𝑖
𝑘−1 =

𝑉𝑘 + 𝜌𝑘 ˜𝑠1,𝑘 + (1 − 𝜌𝑘 ) ˜𝑠2,𝑘 ,

if

if

if

𝑖 ≠ 𝐿

and 𝜌𝑘 ≥ ˜𝜌𝑖
𝑘 ,

𝑖 = 𝐿

and 𝜌𝑘 ≥ ˜𝜌𝑖
𝑘 ,

(3.20)

𝑖 ≠ 𝐿

and 𝜌𝑘 < ˜𝜌𝑖
𝑘 ,

𝜌𝑘 ( ˜𝑠1,𝑘 + 𝑉 𝑘 ) + (1 − 𝜌𝑘 ) ( ˜𝑠2,𝑘 + 𝑉 𝑖

𝑘 ),

otherwise.






Similarly, we can derive the parameterized policies for both players, but leave those details out for

brevity.

We now compare an M-SSG with an edge-game for a fixed set of probabilities, i.e., 𝜌𝑘 = 𝜌, ∀𝑘 ∈

{1, 2, . . . , 𝐾 }, and a termination threshold 𝐿 under a unit strong defense cost, i.e., 𝑠11,𝑘 = 1, ∀𝑘 ∈

{1, 2, . . . , 𝐾 }. The value of an M-SSG (𝑉 0

0 ) over the number of stages is shown in Figure 3.3a.
We observe that the value of M-SSG increases with increasing values of 𝐿 and probability 𝜌. For

instance, the value 𝑉 0

0 for a fixed 𝜌 = 1.0 is higher for 𝐿 = 4 compared to 𝐿 = 2. Similarly, the
0 for a fixed values of 𝐿 = 2 is higher for 𝜌 = 0.5 compared to 𝜌 = 1.0. In summary, as the
likelihood of an adversary increase, so does the value of M-SSG. The Nash equilibrium policies

value 𝑉 0

for an M-SSG with the corresponding probability 𝜌 and termination threshold 𝐿 are shown in

Figure 3.4a and 3.4b, respectively. We observe that the defender (resp. second player) switches to a

pure policy weak defense (resp. attack) at the stages when 𝜌𝑘 is below ˜𝜌𝑘 , indicated in Figure 3.3b.

In particular, the probabilities 𝜌 = 0.4, 0.65 are below ˜𝜌𝑘 for the stages 20 and 19. As we iterate

𝑖 = {𝐿 − 1, 𝐿 − 2, . . . , 1}, the termination threshold enables greater instances of the policy pair

{weak defense, attack}. Finally, for a given a termination threshold 𝐿 with a fixed probability 𝜌,

once a player switches to a pure policy, it continues to play the pure policy until the final stage 𝐾.

Remark 3.3.5 The presented analysis and numerical evaluation of the M-SSG corresponds to

a particular set T . However, with a change in the set T , we can derive corresponding Nash

66

equilibrium policies and value of the game, representing different game structures. For instance,

when

T := {{strong defense, attack}, {strong defense, no attack}},

the termination threshold corresponds to a limit on the number of times the defender can resort to

strong defense actions. Similarly, if

T := {{strong defense, attack}, {weak defense, attack}},

the termination threshold would correspond to a limit on the number of attack actions.

To summarize, in this section, we studied how the termination threshold of 𝐿 affects the solution

of the M-SSG. We derived a recursive equation for both the value of the M-SSG and player policies.

The numerical study provides insight into the player policies as a function of 𝐿 and the number of

stages of the game. In the next section, we will extend the SSG model to a larger action space, i.e.,

more than two actions per player setting and derive a condition similar to Theorem 3.3.4 to switch

to a pure player policy.

3.4 Solution to SSG𝑚×𝑛

In this section, we analyze the model and solution for an SSG𝑚×𝑛 with a termination threshold

set to 𝐿 = 1, i.e., the game reaches a stopping state when both the defender and the second

player jointly select an action pair from the set T . While the model SSG𝑚×𝑛 can be extended to

accommodate termination thresholds greater than 1, for ease of exposition, we present the case

of 𝐿 = 1. Similar to our approach with the M-SSG, we develop a methodology to compute the

outcome 𝐽𝐾 defined in equation (3.3), resulting in a Nash equilibrium for the SSG𝑚×𝑛. With 𝐿 = 1,

the expected value of an SSG𝑚×𝑛 at stage 𝑘 − 1 will be a function of an edge-game with value 𝑉 𝑘

[14] at stage 𝑘 with probability 𝜌𝑘 and 𝑉𝑘 (The superscript 1 has been omitted for clarity) with

probability 1 − 𝜌𝑘 .

The actions of the players and the corresponding entries of the stage cost matrices 𝑆𝑘 and 𝑅𝑘

67

are given as,

attack 1 . . . no attack

no attack

defense 1

defense 2

. . .

defense m

. . .

. . .

. . .

. . .

𝑠21,𝑘

𝑠11,𝑘












𝑠𝑚1,𝑘


(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:124)

. . .

(cid:123)(cid:122)
𝑆𝑘

𝑠1𝑛,𝑘

𝑠2𝑛,𝑘

. . .














(cid:125)

defense 1

defense 2

,

. . .

defense 𝑚

,

𝑟2,𝑘

𝑟1,𝑘























𝑟𝑚,𝑘




(cid:124)(cid:123)(cid:122)(cid:125)
𝑅𝑘

. . .

𝑠𝑚𝑛,𝑘
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

where 𝑆𝑘 ∈ R𝑚𝑘×𝑛𝑘 and 𝑅𝑘 ∈ R𝑚𝑘 . Similar to M-SSG, the stage cost matrix ˜𝑅𝑘 ∈ R𝑚𝑘×𝑛𝑘 is a

matrix whose columns are all equal to 𝑅𝑘 , representing a zero-sum matrix with 𝑚𝑘 defender actions

and 𝑛𝑘 second player actions.

As per Definition 1, let {𝛼, 𝛽} represent the set of action pair indices corresponding to the set

T . In other words, the game stops at any stage when the players choose actions that belong to this

set. We define a matrix D such that

D𝛾,𝛿 = 0, ∀{𝛾, 𝛿} ∈ {𝛼, 𝛽},

D𝛾,𝛿 = 1,

otherwise,





where D𝛾,𝛿 is (𝛾, 𝛿)th entry of the matrix D. Analogous to the M-SSG, the value of an SSG𝑚×𝑛 at

stage 𝑘 is

𝑉𝑘−1 = Val(𝜌𝑘 (𝑉 𝑘 D + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉𝑘 1 + ˜𝑅𝑘 )),

(3.21)

where 𝑆𝑘 ∈ R𝑚𝑘×𝑛𝑘 and ˜𝑅𝑘 ∈ R𝑚𝑘×𝑛𝑘 are stage cost matrices corresponding to adversarial and non-

adversarial player type. At stage 𝑘, the defender and second player have 𝑚𝑘 and 𝑛𝑘 number of actions,

respectively. The solution to (3.21) depends on the edge-game at every stage 𝑘 ∈ {1, 2, . . . , 𝐾 },

which takes the form

𝑉 𝑘−1 = Val(𝑉 𝑘 D + 𝑆𝑘 ) := min
𝑦𝑘 ∈Δ𝑚𝑘

max
𝑧𝑘 ∈Δ𝑛𝑘

𝑦′
𝑘 (𝑉 𝑘 D + 𝑆𝑘 )𝑧𝑘 .

(3.22)

Problem (3.22) can be formulated as a linear program [61] with (𝑉 𝑘 D + 𝑆𝑘 ) as the zero-sum matrix,

whose outcome are 𝑉 𝑘−1, 𝑦𝑘 and 𝑧𝑘 . Similar to (3.22), we can establish a recursive form for an

68

SSG𝑚×𝑛 (3.21) given by

𝑉𝑘−1 = min
𝑦𝑘 ∈Δ𝑚𝑘

max
𝑧𝑘 ∈Δ𝑛𝑘

(cid:16)

𝑦′
𝑘

𝜌𝑘 (𝑉 𝑘 D + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉𝑘 1 + ˜𝑅𝑘 )

(cid:17)

𝑧𝑘 ,

(3.23)

where 1 ∈ R𝑚𝑘×𝑛𝑘 is a matrix of ones. Analogous to (3.22), we can formulate (3.23) as a linear

program, with (𝜌𝑘 (𝑉 𝑘 D + 𝑆𝑘 ) + (1 − 𝜌𝑘 ) (𝑉𝑘 1 + ˜𝑅𝑘 )) serving as a zero-sum matrix. The outcome

of this linear program yields the Nash equilibrium policies and the value of the game (𝑉𝑘 ) at every

stage 𝑘 ∈ {1, 2, . . . , 𝐾 }.

While it is possible to compute the solution for a given SSG𝑚×𝑛 and determine the corresponding

player policies, the computational load can increase significantly for large number of stages 𝐾.

Hence, in the subsequent result, we derive a sufficient condition to determine when to switch from

a numerical solution to an analytical one.

Assumption 3.4.1 We assume the following stage cost inequality holds for any stage 𝑘 for the

SSG𝑚×𝑛,

𝑟1,𝑘 > 𝑟𝑖,𝑘 ≥ 𝑟𝑚,𝑘 ≥ 0, ∀𝑖 ∈ {2, 3, . . . , 𝑚 − 1}

𝑠𝑚1,𝑘 > 𝑠𝑖𝑏,𝑘 ≥ 𝑠𝑚𝑛,𝑘 ≥ 0, ∀𝑏 ∈ {2, 3, . . . , 𝑛 − 1},

𝑠𝑚1,𝑘 > 𝑠𝑖 𝑗,𝑘 , ∀𝑖 ∈ {1, 2, . . . , 𝑚 − 1}, 𝑗 ∈ {1, 2, . . . , 𝑛},

𝑠𝑚𝑛,𝑘 ≤ 𝑠𝑖 𝑗,𝑘 , ∀𝑖 ∈ {1, 2, . . . , 𝑚 − 1}, 𝑗 ∈ {1, 2, . . . , 𝑛},

Assumption 3.4.1 can be considered as an extension of Assumption 3.3.1.

It signifies four

conditions:

i) Within the matrix 𝑅𝑘 , the defense costs rise monotonically from the last row of

defense (𝑚) to the first row of defense (1).

ii) In the matrix 𝑆𝑘 , the costs corresponding to

row m decreases across the column, i.e., from the action of attack 1 to no attack.

iii) The cost

corresponding to the action pair {defense 𝑚, attack 1} is the largest entry of the matrix 𝑆𝑘 . iv) The

cost corresponding to the action pair {defense 𝑚, no attack } is the smallest entry of the matrix

𝑆𝑘 . Under the specified Assumption 3.4.1, the following result summarizes a switching condition

under which a pure Nash equilibrium exists at a given stage.

69

Theorem 3.4.2 The Nash equilibrium policies at any stage 𝑘 ∈ {1, 2, . . . , 𝐾 } for a given SSG𝑚×𝑛

with a termination threshold of 𝐿 = 1, stage cost matrices {𝑆𝑘 , and ˜𝑅𝑘 } under Assumption 3.4.1 is

given by

where

(cid:20)

0 . . . 1

(cid:21) ′

,

Solve (3.23) as a

linear program [61],

𝑦∗
𝑘 =






(cid:20)

1 . . . 0

(cid:21) ′

,

Solve (3.23) as a

linear program [61],

𝑧∗
𝑘 =






if 𝜌𝑘 < ˆ𝜌𝑘 ,

otherwise,

if 𝜌𝑘 < ˆ𝜌𝑘 ,

otherwise,

ˆ𝜌𝑘 =

𝑟 𝑗 ∗,𝑘 − 𝑟𝑚,𝑘
𝑠𝑚1,𝑘 − 𝑠 𝑗 ∗1,𝑘 + 𝑟 𝑗 ∗,𝑘 − 𝑟𝑚,𝑘 + 𝑉 𝑘 − D𝑚, 𝑗 ∗𝑉 𝑘

,

𝑗 ∗ := arg min

𝑖∈{1,2,...,𝑚−1}

𝑠𝑖1,𝑘

The value of the game at stage 𝑘 is given by

𝑉𝑘−1 =






𝜌𝑘 (𝑠𝑚1,𝑘 + 𝑉 𝑘 ) + (1 − 𝜌𝑘 ) (𝑟𝑚,𝑘 + 𝑉𝑘 ),

if 𝜌𝑘 < ˆ𝜌𝑘

Solve (3.23) as a

linear program [61],

otherwise.

(3.24)

(3.25)

(3.26)

(3.27)

□

Proof: The proof closely follows from Lemma 3.3.3. When 𝜌𝑘 = 1, 𝑉 𝑘 D + 𝑆𝑘 is the zero-sum

matrix under consideration, where a linear program [61] is used to solve for the value of the

game and the player policies. In contrast, when 𝜌𝑘 = 0, the zero-sum matrix is 𝑉𝑘 1 + ˜𝑅𝑘 , which

70

corresponds to a repeated column matrix with a pure policy of weak defense, and where the cost is

invariant of the second player’s action. Hence, under Assumption 3.4.1 and when 0 < 𝜌𝑘 < 1, we

seek to determine a value of 𝜌𝑘 that causes a switch from a linear program solution to a pure policy

action of weak defense (row domination). Furthermore, when a row domination is encountered,

the second player also switches to a pure policy of attack 1. The pure Nash equilibrium of

(cid:104)

𝑦∗
𝑘 =

0 . . . 1

(cid:105) ′

(cid:104)

𝑧∗
𝑘 =

1 . . . 0

(cid:105) ′

.

arises when

𝜌𝑘 (𝑠𝑚1,𝑘 + 𝑉 𝑘 )+

𝜌𝑘 (𝑠𝑚 𝑗 ∗,𝑘 + D𝑚, 𝑗 ∗𝑉 𝑘 )+

<

(1 − 𝜌𝑘 )(𝑉𝑘 + 𝑟𝑚,𝑘 )

(1 − 𝜌𝑘 ) (𝑉𝑘 + 𝑟 𝑗 ∗,𝑘 ),

(3.28)

where 𝑗 ∗ := arg min𝑖∈{1,2,...,𝑚−1} 𝑠𝑖1,𝑘 . By rearranging the terms related to 𝜌𝑘 to the left-hand side,

the inequality becomes:

𝜌𝑘 (𝑠𝑚1,𝑘 + 𝑉 𝑘 − 𝑠𝑚 𝑗 ∗,𝑘 − D𝑚, 𝑗 ∗𝑉 𝑘 + 𝑟𝑚,𝑘 − 𝑟 𝑗 ∗,𝑘 ) < 𝑟𝑚,𝑘 − 𝑟 𝑗 ∗,𝑘 .

Further simplification leads to (3.26). We derive (3.27) for case of 𝜌𝑘 < ˜𝜌𝑘 from the left hand side

of the inequality (3.28) and the linear programming-based solution otherwise.

□

Theorem 3.4.2 provides us with a recursive approach that combines numerical and analytical

solutions, effectively reducing the computational burden when solving SSG𝑚×𝑛. Next, we illustrate

the results of Theorem 3.4.2 using a numerical example for a chosen set of stage cost matrices and

probability 𝜌𝑘 . This allows us to observe how the value of SSG𝑚×𝑛 and the corresponding player

policies change with varying probability 𝜌𝑘 .

Numerical example

We evaluate an SSG𝑚×𝑛 with four actions for both the defender and second player, i.e, 𝑆𝑘 ∈ R4×4

and ˜𝑅𝑘 ∈ R4×4, ∀𝑘 ∈ {1, 2, . . . , 𝐾 }. In this numerical example, we use the matrix D as:

D =

03×3 13×1

11×3

1









,









71

(a)

(b)

Figure 3.5 (a) Nash Equilibrium policy of the defender (rows 2,3 and 4) for an SSG𝑚×𝑛 for a range
of 𝜌 solved over a total of 𝐾 = 20 stages. The SSG𝑚×𝑛 was solved with stage cost matrix entries
𝑠1,𝑘 = 1.0, 𝑠2,𝑘 = 1.2 and 𝑠3,𝑘 = 0.3, 𝑘 ∈ {1, 2, . . . , 𝐾 }. (b) Nash Equilibrium policy of the second
player (column 1 and 4) for the same stage cost parameters and total number of stage.

where 0𝑎×𝑏 and 1𝑎×𝑏 correspond to a matrix of zeros or ones of size 𝑎 × 𝑏. We parameterize the

stage-cost matrices 𝑆𝑘 and ˜𝑅𝑘 using three terms, 𝑠1,𝑘 , 𝑠2,𝑘 , and 𝑠3,𝑘 to obtain:

𝑆𝑘 =

𝑠1,𝑘
𝑠1,𝑘+𝑠2,𝑘
2
𝑠1,𝑘+𝑠2,𝑘
2
𝑠2,𝑘















𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠2,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2

𝑠1,𝑘
𝑠1,𝑘+𝑠2,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2

𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠3,𝑘















, ˜𝑅𝑘 =

𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠3,𝑘















𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠3,𝑘

𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠3,𝑘

.

𝑠1,𝑘
𝑠1,𝑘+𝑠3,𝑘
2
𝑠1,𝑘+𝑠3,𝑘
2
𝑠3,𝑘















Both the stage cost matrices 𝑆𝑘 and ˜𝑅𝑘 follow Assumption 3.4.1.

We evaluate an SSG𝑚×𝑛 using the parameters 𝑠1,𝑘 = 1.0, 𝑠2,𝑘 = 1.2 and 𝑠3,𝑘 = 0.3, ∀𝑘 ∈
{1, 2, . . . , 𝐾 }, for a total of 𝐾 = 20 stages with fixed probabilities 𝜌𝑘 = 𝜌, ∀𝑘 ∈ {1, 2, . . . , 𝐾 }. The

defender and second player policies are shown in Figure 3.5a, and 3.5b, respectively. Notably, due

to the nearly identical entries in the second and third rows of the parameterized stage cost matrix

𝑆𝑘 , the defender’s policy of playing defense 2 or 3 is indistinguishable, as seen in Figure 3.5a.

As the probability 𝜌 decreases, we observe the defender shifts to a pure policy of defense 4 in

the later stages of the SSG𝑚×𝑛. This shift is attributed to the reduced likelihood of having an

adversarial player in the game. The second player’s policy involves selecting extreme column

72

00.5105101520Stages, k00.51 = 0.2 = 0.5 = 0.700.5105101520Stages, k00.51 = 0.2 = 0.5 = 0.7choices, specifically attacking 1 or no attack, as indicated in Figure 3.5b.

In other words, the

second player mixes between attack 1 and no attack, while not playing semi-attack (attack 2 and

3) actions. Similar to the defender’s policy, a decrease in the probability 𝜌 leads to an increased

likelihood of attack 1 in the later stages of the SSG𝑚×𝑛 with a switch to the pure policy of attack

1. This numerical example demonstrates the framework of an SSG beyond two actions per player

and provide insights into symmetric policies for a parameterized version of stage cost matrices.

Furthermore, it provides an analytical condition under which a player can switch from a numerical

solution to a pure policy.

In the next section, we will apply the framework of SSG𝑚×𝑛 on an

estimation problem and demonstrate the player policies through a numerical example.

3.5 Application

We now apply an M-SSG to two motion planning scenarios that involve making decisions to

detect and counter adversarial intent with an engagement budget of 𝐿 = 1. Through these scenarios,

we demonstrate how to incorporate mobility aspects of an autonomous vehicle in the framework of

a M-SSG.

(a)

(b)

Figure 3.6 (a) Ego and non-ego vehicle policy averaged over 50 experiment runs for 𝐾 = 25 with
𝜌 = 0.25. (b) Simulated ego and non-ego vehicle policy with defined stage costs and 𝜌 = 0.25.

73

01020Time (s)00.20.40.60.81p(Defend/Attack)Var(Ego)Var(Non-ego)Ego,  = 0.25Non-ego,  = 0.2505101520Time (s)00.20.40.60.81p(Defend\Attack)Ego,  = 0.25Non-ego,  = 0.25(a)

(b)

Figure 3.7 (a) Sample policy of the defender and attack for a given experimental run. (b) Sampled
and expected value of the SSG compared with the theoretical value of the SSG.

3.5.1 Leader-Follower Game with Fixed Stage Cost Matrix

In this scenario, the ego vehicle aims to maintain a safe distance from a potentially adversarial

vehicle on the road, akin to a cruise control behavior [68]. The ego vehicle is assumed to be

equipped with sensors like LiDARs, cameras, and radar, providing information in the vicinity of

the ego vehicle, including the likelihood of any other non-ego vehicle behaving adversarially. This

likelihood information is used to derive the stochastic parameter 𝜌 of the M-SSG.

The adversary has the option to brake, potentially causing the ego vehicle to slow down, or can

continue traveling at nominal speeds. The ego vehicle, on the other hand, can choose to brake,

ensuring a safe distance is maintained, or can opt to travel at a nominal speed.

When a non-ego vehicle chooses to brake and is successful while the ego vehicle is traveling at

nominal speed, the distance between the vehicles is reduced, necessitating a slowdown by the ego

vehicle. This leads to additional time required to reach its goal. This scenario is modeled through

fixed stage cost matrices:

74

510152025Time (s)00.20.40.60.81ActionDefendAttack01020304050Experiment epoch26272829Value of SSGExperimentsSSG TheoreticalSSG ExperimentsBrake

𝑆 =

Nominal speed

Brake

Nominal
speed

𝜙Δ

˜𝜙Δ









𝜙Δ

Δ









Brake

,

𝑅 =

Nominal speed

Nominal
speed

Nominal
speed

𝜙Δ

Δ









𝜙Δ

Δ

.









where Δ represents the time required per decision instant for the ego vehicle. Here, 𝜙 > 1

represents the additional time factor required during a braking action applied by the ego vehicle,

and ˜𝜙 > 𝜙 represents the increased time factor when the ego vehicle moves at nominal speed and

the non-ego vehicle acts adversarially.

Simulation and Experiments: We conducted experiments using Turtlebot3 burgers to represent

both the ego and non-ego vehicles, maintaining a safe distance between them. An OptiTrack

Motion Capture system was used for localization and as an infrastructure. The Robot Operating

System (ROS) was employed for implementing the M-SSG policies and the corresponding actions

on both the ego and non-ego vehicles. Given the nominal velocity of the ego vehicle, along with

starting and goal positions, we determined the total number of stages 𝐾 for the M-SSG. A total of

50 experiments were carried out and compared against the numerical solution from Theorem 3.3.4.

The nominal linear velocity of the robot is set to 0.14 m/s in the absence of any brake action

by either the ego or non-ego vehicle. The linear velocity is reduced to 0.10 m/s under a successful

attack (brake action by the non-ego vehicle) and to 0.12 m/s under a brake action by the ego vehicle

when the non-ego vehicle travels at nominal speed. These nominal velocities were arbitrarily

chosen within the constraints feasible for the Turtlebot3. The change in velocity for each action was

determined to conform to the stage cost matrix structure (Assumption 3.3.1). The M-SSG is played

at 1 Hz, where Δ equals the safe distance (1 m) from the adversary agent ahead of the defender

divided by the velocity, representing the time required to cover that distance. The entries of the

stage cost matrix also account for the cost of returning to nominal velocity. The numerical entries

of the stage cost matrices are:

75

𝑆 =










1.18 1.18







1.38

1.0

, 𝑅 =

1.18 1.18

1.0

1.0









.









The actions sampled from the Nash equilibrium policies realized in the experiments and nu-

merical evaluation are averaged and illustrated in Figure 3.6a and 3.6b. We observe a constant

difference between averaged M-SSGs from the experiments and the numerical evaluation primarily

due to uncertainties stemming from lack of synchronization, delays, and the accuracy of the robot’s

position. Furthermore, we observe that both the non-ego and ego vehicles increase the probability

of braking as the game progresses over the stages, with the exception of the ego vehicle dropping

out at the end. A sample policy realization of both the ego vehicle and adversary (non-ego vehicle)

is shown in Figure 3.7a, where the shaded region represents the game being active and reaching

a stopping state, otherwise. Finally, the value of M-SSG from Theorem 3.3.4, i.e., the total time

taken to cover 𝐾 stages is compared against the averaged SSG from experiments in Figure 3.7b,

indicating a difference between the theoretical and experiments arising from the uncertainties as

indicated earlier. These findings show that with an appropriate choice of stage cost matrices and

scenarios, we can apply the framework of M-SSG with constant stage cost matrices to reason about

the possible actions in the presence of a probabilistic adversary.

Figure 3.8 Illustration of trajectory under scaled and unscaled control input 𝛾𝑢𝑘 and 𝑢𝑘 respectively
for a finite horizon 𝑇.

76

Difference between scaled andunscaled state - State under scaled control input.3.5.2 Lane-merging scenario

The M-SSG framework also supports varying stage costs, motivating us to consider dynamic

scenarios commonly encountered in multi-robot problems. We assume the existence of a path

prediction mechanism for the ego vehicle (defender), capable of generating its own trajectory and

that of any surrounding agents. Similar to the constant stage cost game, we assume that the defender

is equipped with an array of sensors. This prediction mechanism also provides the likelihood of a

predicted path, enabling us to determine the stochastic parameter 𝜌. Furthermore, once adversarial

intent is ascertained, the non-ego vehicle is modeled as an adversary in all subsequent stages.

(a)

(b)

Figure 3.9 (a) Simulated trajectory of an ego and non-ego vehicle in a lane change scenario over
50 time steps with 𝜌 = 0.1 and 𝜌 = 1.0, and a final time of 25s. (b) Expected trajectory of the
ego and non-ego vehicle averaged over 50 experiment runs, with 50 time steps for corresponding
parameters of 𝜌.

The M-SSG framework accommodates dynamic scenarios commonly encountered in multi-

robot problems. It relies on two key inputs: (i) a predicted path for the defender and surrounding

agents, and (ii) the likelihood of these predicted paths. We will analyze actions related to braking

within the predicted trajectories, comparing unscaled versus scaled control inputs.

Figure 3.8 illustrates the distinction between scaled and unscaled inputs over the last three time

steps within a horizon of 𝑇 ≥ 3. At time instant 𝑇 − 3, under a scaled control input 𝛾𝑢𝑇−3, a vehicle

reaches the shaded state to the left of 𝑥𝑇−2. This implies the existence of a control input that enables

77

-2-1012XEgo vehicleNon-egovehicle = 0.1Ego vehicleNon-ego vehicle = 1.0Difference inposition-1012XEgo vehicleNon-egovehicle = 0.1tend = 24.85Ego vehicleNon-ego vehicle = 1.0tend = 24.92Difference inposition(a)

(b)

Figure 3.10 (a) Ego and Non-ego vehicle policy averaged over 50 experiment runs with a nominal
speed of 0.15 m/s and 0.18 m/s. (b) Ego and non-ego vehicle policy averaged over 50 simulation
runs for the same set of nominal speeds (0.15 and 0.18 m/s).

the vehicle to reach 𝑥𝑇−1 from the shaded state at time 𝑇 − 2.

In particular, we make the following assumption:

Assumption 3.5.1 Given a predicted path, when a scaled control input is applied at any time

instant 𝑡 to reach a corresponding new state at 𝑡 + 1, there exists a control input at time instant 𝑡 + 1

to reach the exact predicted state at 𝑡 + 2.

This assumption empowers us to employ the principle of optimality. Essentially, if the vehicle

can reach the next state from either a scaled or an unscaled state, the vehicle incorporates the cost

difference between the current state and the next state into the cost-to-go. The stage cost at any

time instant comprises a state cost along with a safety cost that governs the distance between the

defender vehicle and any adversarial vehicles in its vicinity. In this work, we adopt a logarithmic

function for the safety cost:

𝜓𝑘 (𝑥𝑎

𝑘 , 𝑥𝑑

𝑘 , ¯𝑑) = −𝜆 log

(cid:32) ||𝑥𝑎

𝑘 − 𝑥𝑑
𝑘 ||
¯𝑑

(cid:33)

,

(3.29)

where 𝑥𝑎

𝑘 and 𝑥𝑑
𝑘 − 𝑥𝑑

Here, ¯𝑑 > ||𝑥𝑎

𝑘 represent the predicted positions of the adversary and defender, respectively.
𝑘 || is the minimum safe distance, and 𝜆 is a scaling factor for safety.

78

0510152025Time (s)00.51p(Defend/Attack)Ego, v = 0.15 m/sEgo, v = 0.18 m/sNon-ego, v = 0.15 m/sNon-ego, v = 0.18 m/s0510152025Time (s)00.51p(Defend/Attack)Ego, v = 0.15 m/sEgo, v = 0.18 m/sNon-ego, v = 0.15 m/sNon-ego, v = 0.18 m/sLet 𝑓𝑘 (𝑥𝑎

𝑘 , 𝑥𝑑

𝑘 , 𝛾𝑢𝑎

𝑘 , 𝛾𝑢𝑑

𝑘 ) denote the current payoff. Thus, the stage cost matrices for the Stochas-

tic Stopping Game (SSG) are chosen as:

𝑓𝑘 (𝑥𝑎

𝑘 , 𝑥𝑑

𝑘 , 𝛾𝑢𝑎
𝑓𝑘 (𝑥𝑎

𝑘 , 𝛾𝑢𝑑
𝑘 , 𝑥𝑑

𝑘 ) + ˜𝑓𝑘 (𝑥𝑑
𝑘 , 𝑢𝑑
𝑘 )

𝑘 , 𝛾𝑢𝑎

𝑘 , 𝛾𝑢𝑑
𝑘 )

𝑓𝑘 (𝑥𝑎

𝑘 , 𝑥𝑑
𝑘 , 𝑥𝑑

𝑘 , 𝑢𝑎
𝑘 , 𝑢𝑎

𝑘 , 𝛾𝑢𝑑
𝑘 )
𝑘 , 𝑢𝑑
𝑘 )

𝑓𝑘 (𝑥𝑎

𝑥𝑎
𝜓𝑘 ((cid:101)
𝑘+1
𝑥𝑎
𝜓𝑘 ((cid:101)
𝑘+1

,

𝑥𝑑
(cid:101)
𝑘+1
, 𝑥𝑑

𝑘+1

, ¯𝑑) 𝜓𝑘 (𝑥𝑎

𝑘+1

, ¯𝑑) 𝜓𝑘 (𝑥𝑎

𝑘+1

,

𝑥𝑑
(cid:101)
𝑘+1
, 𝑥𝑑

𝑘+1

, ¯𝑑)

, ¯𝑑)

,









,









(3.30)

˜𝑆 =









𝑆𝑘 = ˜𝑆 +

𝑅𝑘 =








𝑠12,𝑘

𝑠22,𝑘









𝑠12,𝑘

𝑠22,𝑘

,









𝑘+1 represents the adversary state with a scaled control 𝛾𝑢𝑎
𝑥𝑎
where (cid:101)
𝑘 , 𝛾𝑢𝑑

𝑥𝑑
𝑘+1 represents the defender
𝑘 , (cid:101)
𝑘 ) denotes the cost to go from a stopping state, i.e., the

state with scaled control 𝛾𝑢𝑑

𝑘 , and ˜𝑓𝑘 (𝑥𝑑

defender continues to travel using scaled control input for the remaining stage. The dynamic M-SSG

is implemented in an online manner, akin to Model Predictive Control (MPC). The implementation

is summarized in Algorithm 2.

Simulations and experiments: To demonstrate the framework’s capability, we consider a lane

merging scenario. In this scenario, an ego vehicle (defender) is traveling straight in a lane, while a

non-ego vehicle (other agent) is planning to merge into the defender’s lane. Since the ego vehicle

is unaware of the other agent’s intent, we assume the possibility of adversarial behavior. However,

the non-ego vehicle may not necessarily act adversarially. Therefore, we approach the Stochastic

Stopping Game (SSG) completely from the ego vehicle’s perspective, justifying the zero-sum nature

of the game. To ensure realistic and feasible trajectories, we construct predicted trajectories for

both the ego and non-ego vehicles, which reflect the typical planning behavior of autonomous

vehicles [9]. For our experiments, we utilize Turtlebot3 burgers equipped with ROS for both the

ego and non-ego vehicles.

We set the prediction horizon (number of stages) as 𝐾 = 30 and the safety distance parameter

as 𝜆 = 5.0. The algorithm (Algorithm 2) is executed at a frequency of 2 Hz for 50 iterations,

resulting in a total of 50 experimental runs. We introduce a stochastic parameter 𝜌 = 0.1, chosen

79

Algorithm 2: Dynamic M-SSG

𝑘:𝑘+𝐾, 𝑥𝑎

𝑘:𝑘+𝐾 - predicted trajectories

Input: 𝐾 (Time horizon), Δ𝑇 (Sample time), 𝛾 control scaling factor, k (current time)
repeat Every Δ𝑇
Obtain 𝑥𝑑
Obtain 𝜌 - likelihood of an adversary
Compute 𝑥𝑑
𝑘:𝑘+𝐾 (𝛾𝑢𝑎
Compute 𝑆𝑘:𝑘+𝐾 and 𝑅𝑘:𝑘+𝐾 defined in (3.30)
Compute policy (3.11), (3.12) by solving the SSG
Set, 𝑑𝑘 ∼ 𝑦∗
Adversary ∼ 𝜌𝑘

𝑘:𝑘+𝐾 (𝛾𝑢𝑑

𝑘:𝑘+𝐾), 𝑥𝑎

𝑘 , 𝑎𝑘 ∼ 𝑧∗
𝑘

𝑘:𝑘+𝐾)

if 𝑑𝑘 = 0
otherwise

defender action =

(cid:40)𝑢𝑑
𝑘 ,
𝛾𝑢𝑑
𝑘 ,
(cid:40)𝑢𝑎
𝑘 ,

𝛾𝑢𝑎
𝑘 ,


Update the states, 𝑥𝑑
𝑘+1
Increment time, 𝑘 = 𝑘 + 1

adversary action =

,

if 𝑎𝑘 = 0
otherwise
𝑢𝑎
𝑘 ,
(defender action), 𝑥𝑎

if Adversary

otherwise

(adversary action)

𝑘+1

until Goal;

arbitrarily to observe the algorithm’s performance. In real-world scenarios, we anticipate that the

parameter 𝜌 would be dynamically updated at each time instant, a situation easily accommodated

by Algorithm 2.

(a)

(b)

Figure 3.11 (a) Simulated policy of an ego and non-ego vehicle (possible adversary) in lane change
scenario over a range of sample time with 𝜌 = 0.1. (b) Simulated policy of an ego and non-ego
vehicle for the same scenario over a range of nominal speeds with 𝜌 = 0.1.

80

00.51p(Defend)Ego05101520Time (s)00.51p(Attack)Non-ego T = 0.2 T = 0.4 T = 0.6 T = 0.8 T = 100.51p(Defend)Ego0510152025Time (s)00.51p(Attack)Non-egov = 0.1m/sv = 0.12m/sv = 0.14m/sv = 0.16m/sv = 0.18m/sThe expected trajectories in both simulations and experiments are depicted in Figures 3.9a

and 3.9b. In these figures, dashed circles indicate the starting pose of the TurtleBots, while solid

circles represent the end pose assuming an uncertain adversarial intent (𝜌 = 0.1) and a deterministic

adversary (𝜌 = 1.0). It’s important to note that the experiment runs differ from the simulations due

to factors like model discrepancies, measurement inaccuracies, and localization errors. Overall, we

observe that the experiments closely approximate the simulations, demonstrating the effectiveness

of the Stochastic Stopping Game (SSG) framework in reasoning about such navigation tasks. Under

the assumption of a deterministic adversary, we observe that the ego vehicle adopts a defensive

stance, resulting in a shorter distance covered in the same duration of time. The discrepancy in

the final positions of the TurtleBots between the simulation and experiments arises from an initial

delay in applying control commands and localization inaccuracies.

The policies obtained from 50 runs of experiments and simulations, conducted at nominal

speeds of 𝑣 = 0.15 m/s and 𝑣 = 0.18 m/s, are displayed in Figures 3.10a and 3.10b. We notice

a slight shift in the policies obtained from experiments, approximately 2 seconds compared to

the simulation. This discrepancy is primarily attributed to the initial delays experienced by the

TurtleBots in synchronizing the control commands. Nevertheless, the trend observed in the policies

from experiments closely aligns with those from simulations, reflecting the online nature of the

Stochastic Stopping Game and accounting for model and measurement noise.

Furthermore, we conducted simulations for a range of sample times and nominal speeds, as

illustrated in Figures 3.11a and 3.11b respectively. We observe that with decreasing sample time,

the policies of both the ego and non-ego vehicles converge to pure policies, i.e., always travel

at reduced speed and nominal speed, respectively. Conversely, with increasing speed for a fixed

sample time, the duration of active braking by the ego vehicle (or brake by the non-ego vehicle)

decreases (or increases), indicating a faster merger with higher speeds. These simulations and

experiments provide valuable insights into the performance of the dynamic SSG framework across

a spectrum of parameters and scenarios.

81

3.5.3 Resilient estimation

We will now apply the SSG𝑚×𝑛 framework to address a resilient estimation problem using a

Kalman filter operating in a possibly adversarial environment. This example demonstrates how to

incorporate the framework of SSG𝑚×𝑛 in a typical cyber-physical system application.

Figure 3.12 A typical feedback control system with an estimator. The control law is a function of
the estimates. The estimator performance is dependent on the channel used to communicate the
data observed from a sensor. An adversary might be present in the feedback loop impacting the
performance of the estimates by injecting noise on different channels.

A standard feedback control system is depicted in Figure 3.12 accompanied by a potential

adversary. In this scenario, there are 𝑚 feedback channels to communicate the sensor information

to an estimator. Each channel choice corresponds to a choosing an estimator with a specific

sampling frequency. Equivalently, it can be viewed as choosing different plant models for different

channels. Opting for a channel with a high sampling frequency could yield a low error covariance,

but might be susceptible to a high measurement noise injected by an adversary. The steady-state

posterior error covariance for a channel 𝑖 ∈ {1, 2, . . . , 𝑚} and injected noise 𝑗 ∈ {1, 2, . . . , 𝑛} is

obtained by solving the discrete time algebraic Riccati equation (DARE):

(cid:16)

P∞ = F𝑖

P∞ − P∞HT (M)−1 HP∞

(cid:17)

𝑖 + Q𝑖,
FT

:= 𝑓 (F𝑖, R 𝑗 ),

(3.31)

where M := (cid:0)HP∞HT + R 𝑗 (cid:1) ∈ R𝑚×𝑚, P∞ ∈ R𝑛×𝑛 is the steady state covariance, F𝑖 ∈ R𝑛×𝑛 is

82

PlantControllerSensorEstimator Channel 1Channel mAdversarythe state transition model using the channel 𝑖, H ∈ R𝑚×𝑛 is the observation model, R 𝑗 ∈ R𝑚×𝑚

is the covariance of the observation noise and Q𝑖 ∈ R𝑛×𝑛 is the covariance of the process noise

corresponding to the 𝑖th channel. We assume that the pairs (F𝑖, H) is detectable and (F𝑖, Q1/2) is

controllable on and inside the unit circle [134], so that there exists at least one positive definite

P∞ := 𝑓 (F𝑖, R 𝑗 ), resulting in a stable steady-state Kalman filter. In this context, every stage can be

perceived as an episode composed of numerous time steps. We configure the stage cost matrices

𝑆𝑘 = 𝑆 and ˜𝑅𝑘 = ˜𝑅 for all 𝑘 ∈ {1, 2, . . . , 𝐾 }, considering 𝑚 defender channels and 𝑛 potential

noise levels.

In this section, we consider four channels for the defender and four potential noise levels

introduced by a probable adversary. Each noise level imparts different observation noise across

the various defender channels. For this analysis, we assume fixed stage cost matrices 𝑆𝑘 = 𝑆 and

𝑅𝑘 = 𝑅, ∀𝑘 ∈ {1, 2, . . . , 𝐾 } defined as,

𝑆 =

˜𝑅 =




























Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1))
Tr( 𝑓 (F2, R2)) Tr( 𝑓 (F2, R2)) Tr( 𝑓 (F2, R2′ )) Tr( 𝑓 (F2, R2))
Tr( 𝑓 (F3, R3)) Tr( 𝑓 (F3, R3′ )) Tr( 𝑓 (F3, R3)) Tr( 𝑓 (F3, R3))
Tr( 𝑓 (F4, R4)) Tr( 𝑓 (F4, R5)) Tr( 𝑓 (F4, R6)) Tr( 𝑓 (F4, R7))

Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1)) Tr( 𝑓 (F1, R1))

Tr( 𝑓 (F2, R2)) Tr( 𝑓 (F2, R2)) Tr( 𝑓 (F2, R2)) Tr( 𝑓 (F2, R2))

Tr( 𝑓 (F3, R3)) Tr( 𝑓 (F3, R3)) Tr( 𝑓 (F3, R3)) Tr( 𝑓 (F3, R3))

Tr( 𝑓 (F4, R7)) Tr( 𝑓 (F4, R7)) Tr( 𝑓 (F4, R7)) Tr( 𝑓 (F4, R7))















,















,

(3.32)

where R𝑥 := r𝑥 𝐼, 𝑥 ∈ {1, 2, 2′, 3, 3′, 4, 5, 6, 7}, i.e., the observation covariance matrix is diagonal.

Similarly, the diagonal process noise matrix is given by Q𝑥 := q𝑥 𝐼, 𝑥 ∈ {1, 2, 3, 4}. The matrix ˜𝑅

is a repeated column matrix of 𝑅, which is defined by the last column of the matrix 𝑆. The state

transition is defined as

F𝑥 :=

, 𝑥 ∈ {1, 2, 3, 4}.

The defender channel 𝑚, which corresponds to the fourth row, is the estimator with highest sampling


1 Λ𝑥




0



1









83

Figure 3.13 Value of the SSG𝑚×𝑛 for a range of fixed probability 𝜌 solved over a total of 𝐾 = 20
stages with a engagement budget of 𝐿 = 1.

(a)

(b)

Figure 3.14 (a) Probability of the defender actions; defense 2,3 and 4 (rows 2,3 and 4) for an SSG𝑚×𝑛
for the corresponding probability and stages 𝐾. (b) Probability of the second player actions; attack
1,2 and 3 (column 1, 2 and 3) for the same SSG𝑚×𝑛 for the corresponding probability and stages 𝐾.

rate but is the least secure, captured via the inequality r4 > r5 ≥ r6 ≥ r7. The rows above 𝑚

correspond to slower estimator rates but are secure, with the highest security corresponding to the

entries of the first row, captured via the observation covariance r4 > r𝑥, ∀𝑥 ∈ {1, 2, 2′, 3, 3′, 5, 6, 7}

and Λ1 > Λ2 > Λ3 > Λ4. The adversary has the ability to introduce varying noise levels when

the defender opts for the 𝑚-th row, which is the least secure channel. However, as the defender

picks more secure channels (rows < 𝑚) the adversarial impact degrades and completely diminishes

84

05101520Stages K51015202530 = 0.2 = 0.45 = 119202526270.000.501.000.000.501.0005101520Stages, k0.000.501.00 = 0.2 = 0.45 = 10.000.501.000.000.501.0005101520Stages, k-0.05 0.00 0.05 0.10 0.15 = 0.2 = 0.45 = 1against the first row. For rows ≠ 𝑚, adversary can exert maximum influence by selecting specific

column or action (r3′ ≥ r3 and r2′ ≥ r2). On the other hand, when the defender picks the first
row, the costs remain unaffected by the adversary’s actions. The matrices 𝑆 and ˜𝑅 in (3.32)

follow Assumption 3.4.1. We summarize the observation covariance and sampling frequency of

the estimators in the following inequalities:

r4 > r5 ≥ r6 ≥r3′ ≥ r2′ ≥ r3 ≥ r2 ≥ r7 > r1 ≥ 0

Λ1 > Λ2 > Λ3 > Λ4 > 0.

We used the following parameters,

Λ1 = 0.6, Λ2 = 0.5, Λ3 = 0.3, Λ4 = 0.1,

r4 = 2.0, r5 = 1.5, r6 = 1.2, r3′ = 0.45, r2′ = 0.35,

r3 = 0.25, r2 = 0.15, r7 = 0.2, r1 = 0.1.

Although the adversarial injected noise degrades from the least (𝑚) to the most secure channel

(1), the impact of process noise covariance has the opposite effect. It increases as we move from

the least secure channel (𝑚) to the most secure channel (1). We use the following parameters for

process noise covariance,

q1 = 0.6, q2 = 0.4, q3 = 0.3, q4 = 0.25.

Once an adversarial action is detected, the game reaches a stopping state (i.e., 𝐿 = 1). The cost-to-

go from a stopping state equals the number of remaining stages multiplied by the cost of using the

most secure channel. Finally, we define matrix D as

D =

03×3 13×1

11×3

1









.









The value of the SSG𝑚×𝑛 is shown in Figure 3.13 along with the corresponding defender and

second player policies in Figure 3.14a and 3.14b. Similar to the evaluated numerical example for

the M-SSG, we can observe that the value of the SSG𝑚×𝑛 increases as the probability 𝜌 increases.

Such an increase indicates that the cost incurred by the defender increases with an increase in

85

adversarial presence. The defender policy is supported over actions defense 2,3 and 4. This

suggests that the defender does not choose the most secure channel due to the associated increased

cost. Moreover, when the probability of adversarial behavior is low, specifically when 𝜌 = 2, the

defender selects defense 4 (the weakest defense) and only resorts to a more secure channel in the final

stage. In essence, as the probability 𝜌 decreases, the defender’s policy shifts towards improving the

posterior error covariance by selecting less secure channels. Below a certain probability threshold

( ˆ𝜌1 = 0.223), we observe the defender deterministically chooses the least secure channel. This

corresponds to a pure policy switch as indicated by Theorem 3.4.2. We observe a similar trend in

the second player’s policy, where the player policy is supported over actions of attack 1,2 and 3.

When 𝜌 = 1, the second player chooses over attack 2 and 3. As the probability 𝜌 decreases, the

adversarial policy shifts towards a pure policy of highest level of injected noise (attack 1).

This application effectively demonstrates the utility of the SSG framework. Solving the SSG𝑚×𝑛

empowers the defender to navigate the trade-off between security and lower posterior error covari-

ance in a environment influenced by potentially adversarial actions.

3.6 Summary

In this chapter, we modeled a finite-horizon decision-making process between a defender and

a second player, who has a probabilistic intent of turning into an adversarial player, as a multi-

stage zero-sum game with stopping states having a finite termination threshold. We analyzed the

M-SSG for an arbitrary finite termination threshold. We characterized the Nash equilibria and

value of the M-SSG for the case of two actions per player. We provided a detailed analysis with

respect to the termination threshold and a stage-dependent probability of the second player turning

adversarial. We derived conditions under which either player opts to play a pure policy. We then

extended the M-SSG to an arbitrary finite number of actions per player, termed as SSG𝑚×𝑛 with a

termination threshold of one. We characterized a condition for the SSG𝑚×𝑛 under which there exists

a pure Nash equilibrium as a function of the game parameters. We applied the SSG framework

to a cyber-physical system in a potentially adversarial environment, involving estimation using a

Kalman filter in the presence of a probable adversary. The initial set of results consisting of the

86

stochastic adversarial model along with it’s application to motion planning problems was presented

in [19]. More recently, the extension to large action space and its application to resilient estimation

is under review in [20].

87

CHAPTER 4

TAKEOVER ADVERSARY - FLIPDYN: RESOURCE TAKEOVER GAMES

In the previous chapter, we extended the deterministic adversarial model to a stochastic adversary.

Such an adversarial model can act as benign player or as an adversary governed by a Bernoulli

process. However, once it switches to an adversarial mode, it remains an adversary till the end of

the game. Similar to the deterministic adversary, we model the interaction between the defender

and stochastic adversary as a zero-sum multi-stage game. In addition to the model, we introduced

the concept of budgets, which allow for multiple action pairs from a defined set to be played before

terminating the game. We characterized the Nash equilibrium of such a game and demonstrated its

application on a motion planning problem and resilient estimation.

In this chapter, we further enhance the capability of an adversary to completely takeover the

system. Consider an adversary that has access to a prototypical CPS control loop and can achieve

a takeover at various points shown in Figure 4.1a. These points include the (i) reference inputs, (ii)

actuator, (iii) state, (iv) sensor and (v) control output, thereby affecting the system performance.

As opposed to conventional adversaries perturbing the states of the system (actuator attack) or

measurements (integrity attack) [52], this chapter supposes that an adversary completely takes over

a resource and can transmit arbitrary values originating from the controlled resource.

4.1

Introduction

We present the FlipDyn, a dynamic game in which two opponents (a defender and an adversary)

choose strategies to optimally takeover a resource that involves a dynamical system. At any time

instant, each player can take over the resource and thereby control the dynamical system after

incurring a state-dependent and a control-dependent costs. The resulting model becomes a hybrid

dynamical system where the discrete state (FlipDyn state) determines which player is in control of

the resource (cf. Figure 4.1b). Our objective is to compute the Nash equilibria of this dynamic

zero-sum game.

The security attributes of any CPS are broadly classified into three categories; confidentiality,

integrity, and availability [12]. Any type of attack on CPS impacts one or multiple of these

88

(a)

(b)

Figure 4.1 (a) Closed-loop system with adversaries present at various locations infecting the
reference values, actuator, plant, measurement output and control input. (b) Closed-loop system
with the adversary present between the controller and actuator trying to takeover the control signals.
The takeover action at time 𝑘 of the defender (resp. adversary) is given by 𝜋0
𝑘 ). A FlipIt
is setup over the control signal between the defender and adversarial control.

𝑘 (resp. 𝜋1

attributes. Confidentiality in a CPS is focused on preventing adversaries from deducing the

states of the dynamical system or its measurements. This is achieved by safeguarding against

eavesdropping activities that may occur between different components of the CPS. Integrity, on the

other hand, revolves around upholding the system’s operational goals. This is done by discouraging

and identifying deceptive attacks on the information exchanged between various components within

the CPS. Lastly, availability signifies the system’s capability to sustain its operational objectives.

This is ensured by actively countering denial-of-service (DOS) attacks that may target different

components of the CPS.

In this work, an adversary targets both confidentiality and integrity by taking control of a

dynamical system when the system is in a vulnerable state. The adversary can then send malicious

control signals to drive the system to undesirable states. Such actions can lead to permanent

damage, disrupting services and causing operational losses. Therefore, it becomes imperative to

develop defensive strategies to continuously scan and act against adversarial behavior while striking

a balance between operating costs and system performance. This paper puts forth a framework to

formally model the problem of dynamic resource takeovers, design effective defense policies and

analyze their performance.

89

PlantActuatorReferenceSensorControllerPlantSensorDefenderControllerAdversaryControllerThe concept of resource takeovers, embodied in the FlipIT game, was first introduced by

Van et al.[145]. This game models a zero-sum conflict between a defender and an adversary

striving to take over a common resource, such as a computing device, virtual machine, or a

cloud service[30]. Subsequent advancements in the FlipIT framework include the incorporation

of a dynamic environment with varying costs and success probabilities of attacks, as explored

by Johnson et al. [73]. The model of FlipIT was extended to multiple resources, termed as

FlipThem [81], which consisted of two models: an AND model, where all resources must be

compromised, and an OR model, where a single resource is chosen for compromise. There are

variations of FlipThem, such as a defender which configures the resources such that an adversary

has no incentive to attack beyond a certain number of resources [84]. References [157] and [158]

introduced resource constraints in a two-player non-zero-sum game of multiple resource takeovers.

Similar to the FlipIt model, a threshold based takeover was introduced with operational dynamics

of critical infrastructure as a part of the zero-sum game [33]. Reference [123] introduced FlipNet, a

graph based representation of FlipIT, investigating the graph structure, complexity of best response

strategies and Nash equilibria. Beyond cybersecurity, [92] introduced the model of FlipIT in

supervisory control and data acquisition (SCADA) to evaluate the impact of cyberattacks with

insider assistance. A diverse array of applications for the FlipIT model have been explored in

secure a variety of systems [30]. Notably, the aforementioned works primarily focused on resource

takeovers within a static system, lacking consideration for the dynamic evolution of physical

systems.

In contrast, our work incorporates the dynamics of a physical system in the game of

resource takeovers between an adversary and a defender.

The framework outlined by Ding et al.[47] for analyzing probabilistic reachability in discrete-

time stochastic hybrid systems addresses the synthesis of safety controls in a finite-horizon zero-

sum stochastic game. Our work also deals with a discrete-time game involving two hybrid states.

However, a key distinction is that only one player has control over the system in a hybrid state at any

given time, while allowing for a potential switch to the other hybrid state. A similar investigation

into safe controller designs within two hybrid states was conducted by Dallal et al.[44], formulating

90

a game between a controller aiming to enforce a safety property and an environment seeking to

violate it. The solution proposed by Dallal et al.[44] is confined to finite states and actions, whereas

our work extends to continuous states and actions. Fiscko et al.[55] introduce a multi-player game

with a superplayer controlling a parameterized utility of all the players, resulting in a cost-optimal

policy derived from dynamic programming. Building upon this, Fiscko et al.[53] focus on systems

with multiple agents that can be clustered, with the superplayer applying a cluster-based policy.

The work by Fiscko et al.[54] generalizes the cluster-based approach of a multi-player game with a

superplayer to a Transition Independent Markov Decision Process (TI-MDP), proposing a Clustered

Value Iteration method to solve the TI-MDP. The work presented in this chapter can be mapped

to the case of two clusters, albeit with the added challenge of determining control policies in the

presence of coupling between the clusters.

The application of game theory to formulate security policies in cyber-physical control systems

was addressed by Zhu et al.[161]. The setup introduced by Kontouras et al.[75] closely resembles the

game of resource takeover in the dynamical system FlipDyn[16]. However, Kontouras et al.[75] do

not address the action of takeovers; they assume takeovers can occur periodically and focus solely

on deriving control policies for both the defender and adversary, limited to a one-dimensional

control input. Expanding on this, Kontouras et al.[76] incorporate a multi-dimensional control

input, solving for a contractive control against covert attacks subject to control and state constraints.

The authors in[135] introduce a covert misappropriation of a plant, where a feedback structure

allows an attacker to take over control of the plant while remaining hidden from the supervisory

system. Similar covert attacks have also been explored that can take control of a load frequency

control (LFC) system using a covert reference signal [105]. In contrast to previous research, this

chapter provides a feedback signal to infer who is in control and offers the ability to take control of

the plant at any instant of time, balancing a trade-off between operational cost and performance.

Our prior work [16] presented a game of resource takeovers in a dynamical system. However,

we assumed the control policies were time-invariant for both the defender and adversary. In this

chapter, we relax this assumption of static control policies and solve for both the takeover strategies

91

and control policies. The contributions of this work are four-fold:

1. Takeover strategies for any discrete-time dynamical system: We formulate a two-player

zero-sum takeover game involving a defender and an adversary seeking to control a dynamical

system in discrete-time. This game encompasses dynamic takeover scenarios, considering

costs that are contingent on the system’s state and control inputs. Assuming knowledge of

the control policies, we establish analytical expressions for the NE takeover strategies and

saddle-point values, in the space of pure and mixed strategies.

2. Optimal linear state-feedback control policies: For a linear dynamical system with

quadratic takeover, state, and control costs, we derive an analytic state-feedback control

policy for both the defender and adversary. Furthermore, we provide sufficient conditions

under which the game admits a saddle-point in the space of feedback control policies that are

affine in the state.

3. Exact takeover strategies and saddle-point value parameters for scalar/1− dimensional

system: For a linear dynamical system in one dimension with quadratic takeover, state and

control costs, we derive the corresponding analytical state-feedback control policies for both

the defender and adversary.

In particular, we derive closed-form expressions for the NE

takeover and parameterized value of the game independent of the state.

4. Approximate takeover strategies and saddle-point value parameters for 𝑛− dimensional

system: For a linear dynamical system in 𝑛 dimensions with quadratic takeover, state and

control costs, we derive upper and lower bounds on the defender and the attacker value

functions, respectively, when both players use a linear state-feedback control policy. Using

these bounds, we derive approximate NE takeover strategies and the corresponding value of

the game in a parameterized form.

We illustrate our results for the scalar/1−dimensional and 𝑛−dimensional systems through

numerical examples.

92

Outline: This chapter is structured as follows. Section 4.2 formally defines the FlipDyn

problem, considering unknown control policies with state and control-dependent costs. In Sec-

tion 4.3, we outline a solution methodology applicable to discrete-time dynamical systems with

non-negative costs, under the assumption of known control policies. Section 4.5.1 presents a

solution for determining optimal linear state-feedback control policies, specifically designed for

linear discrete-time dynamical systems featuring quadratic costs. In Section 4.5.2, we delve into the

takeover strategies and saddle-point value parameters for the scalar/1−dimensional system. Finally,

Section 4.5.3 addresses the approximate takeover strategies and saddle-point value parameters for

the 𝑛−dimensional system. The chapter concludes with a discussion in Section 4.6.

4.2 Problem Formulation

Consider a discrete-time dynamical system, whose state evolution is given by:

𝑥𝑘+1 = 𝐹0

𝑘 (𝑥𝑘 , 𝑢𝑘 ),

(4.1)

where 𝑘 denotes the discrete time index, taking values from the integer set K := {1, 2, . . . , 𝐿} ⊂ N,

𝑥𝑘 ∈ X ⊆ R𝑛 is the state of the system with X denoting the Euclidean state space, 𝑢𝑘 ∈ U𝑘 ⊂ R𝑚 is

the control input of the system with U𝑘 as the Euclidean control input space at time instant 𝑘, and

𝐹0
𝑘 : X × U𝑘 → X is the state transition function. We consider a single adversary trying to takeover

the dynamical system resource. In particular, we assume the adversary to be located between the

controller and actuator. The FlipDyn state, 𝛼𝑘 ∈ {0, 1} indicates whether the defender (𝛼𝑘 = 0)

or the adversary (𝛼𝑘 = 1) has taken over the system at time 𝑘. We describe a takeover through
the action 𝜋 𝑗

𝑘 ∈ {0, 1}, which denotes the action of the player 𝑗 ∈ {0, 1} at time 𝑘, where 𝑗 = 0
denotes the defender and 𝑗 = 1 denotes the adversary. The binary FlipDyn state update based on

the player’s takeover action satisfies

𝛼𝑘+1 =

𝛼𝑘 ,

if 𝜋1

𝑘 ,
𝑘 = 𝜋0

𝑗,

if 𝜋 𝑗

𝑘 = 1.





(4.2)

The FlipDyn update (4.2) states that if both players act to takeover the resource at the same time

instant, then their actions are nullified, and the FlipDyn state remains unchanged. However, if the

93

resource is under control by one of the players who does not exert a takeover action, while the other

player moves to gain control at time 𝑘 + 1, then the FlipDyn state toggles at time 𝑘 + 1. Finally, if a

player is already in control and continues the takeover while the other player remains idle, then the

FlipDyn state is unchanged. Thus, the FlipDyn dynamics is compactly described by

(cid:16)

𝛼𝑘+1 =

𝑘 ¯𝜋1
¯𝜋0

𝑘 + 𝜋0

𝑘 𝜋1
𝑘

(cid:17)

𝛼𝑘 + ¯𝜋0

𝑘 (𝜋0

𝑘 + 𝜋1

𝑘 ),

(4.3)

where a binary variable ¯𝑥 := 1 − 𝑥. Takeovers are mutually exclusive, i.e., only one player is in

control of the system at any given time. The continuous state 𝑥𝑘+1 at time 𝑘 + 1 is dependent on

𝛼𝑘+1. The inclusion of an adversary modifies the state evolution (4.1) resulting in:

𝑥𝑘+1 = (1 − 𝛼𝑘+1)𝐹0

𝑘 (𝑥𝑘 , 𝑢𝑘 ) + 𝛼𝑘+1𝐹1

𝑘 (𝑥𝑘 , 𝑤 𝑘 ),

(4.4)

where 𝐹1

𝑘 : X × W𝑘 → X is the state transition function under the adversary’s control, 𝑤 𝑘 ∈ W𝑘 ⊂

R𝑝 is the attack input with W𝑘 as the Euclidean attack input space.

In this work, we aim to determine an optimal control input for the dynamical system along with

the corresponding takeover strategy for each player. Given a non-zero initial state 𝑥1, we pose the

resource takeover and dynamic system control problem as a zero-sum dynamic game described by

the dynamics (4.4) and (4.3) over a finite-horizon 𝐿, where the defender aims to minimize a net

cost given by:

𝐽 (𝑥1, 𝛼1, {𝜋1

L}, {𝜋0

L}, 𝑢∗

L

, 𝑤∗

L) =

𝐿
∑︁

𝑡=1

𝑔𝑡 (𝑥𝑡, 𝛼𝑡) + 𝜋0

𝑡 𝑑𝑡 (𝑥𝑡) + ¯𝛼𝑡𝑚𝑡 (𝑢𝑡) − 𝜋1

𝑡 𝑎𝑡 (𝑥𝑡) − 𝛼𝑡𝑛𝑡 (𝑤𝑡)

𝑔𝐿+1(𝑥𝐿+1, 𝛼𝐿+1),

(4.5)

where 𝑔𝑡 (𝑥𝑡, 𝛼𝑡) : R𝑛 ×{0, 1} → R represents the state cost with 𝑔𝐿+1(𝑥𝐿+1, 𝛼𝐿+1) : R𝑛 ×{0, 1} → R
representing the terminal state cost, 𝑑𝑡 (𝑥𝑡) : R𝑛 → R and 𝑎𝑡 (𝑥𝑡) : R𝑛 → R are the instantaneous

takeover costs for the defender and adversary, respectively. The terms 𝑚𝑡 (𝑢𝑡) : R𝑚 → R and

𝑛𝑡 (𝑤𝑡) : R𝑝 → R, are control costs corresponding to the defender and adversary, respectively.
The notations {𝜋 𝑗

𝐿 }, 𝑗 ∈ {0, 1}, 𝑢L := {𝑢1, . . . , 𝑢𝐿 }, and 𝑤L := {𝑤1, . . . , 𝑤 𝐿 }. In

, . . . , 𝜋 𝑗

L} := {𝜋 𝑗

1

contrast, the adversary aims to maximize the cost function (4.5) leading to a zero-sum dynamic

game, termed as the FlipDyn game [16] with control.

94

We seek to find Nash Equilibrium (NE) solutions of the game (4.5). To guarantee the existence

of a pure or mixed NE takeover strategy, we expand the set of player policies to behavioral strategies

– probability distributions over the space of discrete actions at each time step [61]. Specifically, let

(cid:104)

𝑦𝛼𝑘
𝑘 =

1 − 𝛽𝛼𝑘
𝑘

𝛽𝛼𝑘
𝑘

(cid:105) T

and 𝑧𝛼𝑘

𝑘 =

(cid:104)

1 − 𝛾𝛼𝑘
𝑘

𝛾𝛼𝑘
𝑘

(cid:105) T

,

(4.6)

be the behavioral strategies for the defender and adversary at time instant 𝑘 for the FlipDyn state
𝛼𝑘 , such that 𝛽𝛼𝑘

𝑘 ∈ [0, 1], respectively. The takeover actions

𝑘 ∈ [0, 1] and 𝛾𝛼𝑘

𝑘 ∼ 𝑦𝛼𝑘
𝑘 ,
𝜋0

𝑘 ∼ 𝑧𝛼𝑘
𝑘 ,
𝜋1

of each player at any time 𝑘 are sampled from the corresponding behavioral strategy. The behavioral
strategies, 𝑦𝛼𝑘
horizon 𝐿, let 𝑦L := {𝑦𝛼1

𝑘 ∈ Δ2, where Δ2 is the probability simplex in two dimensions. Over the finite

, . . . , 𝑦𝛼𝐿

, . . . , 𝑧𝛼𝐿

𝑘 , 𝑧𝛼𝑘

𝐿 } ∈ Δ𝐿

𝐿 } ∈ Δ𝐿

2 and 𝑧L := {𝑧𝛼1

2 be the sequence

, 𝑦𝛼2
2

, 𝑧𝛼2
2

1

1

of defender and adversary behavioral strategies. Thus, the expected outcome of the zero-sum

game (4.5) is given by

𝐽𝐸 (𝑥1, 𝛼1, 𝑦L, 𝑧L, 𝑢L, 𝑤L) := E[𝐽 (𝑥1, 𝛼1, {𝜋1

𝐿 }, {𝜋0

𝐿 }, 𝑢L, 𝑤L)],

(4.7)

where the expectation is computed with respect to the distributions 𝑦L and 𝑧L. Specifically, we

seek a saddle-point solution (𝑦∗
L

, 𝑧∗
L

, 𝑢∗
L

, 𝑤∗

L) in the space of behavioral strategies and control inputs

such that for any non-zero initial state 𝑥0 ∈ X, 𝛼0 ∈ {0, 1},

𝐽 𝐸 ≤ 𝐽𝐸 (𝑥0, 𝛼0, 𝑦∗
L

, 𝑧∗
L

, 𝑢∗
L

, 𝑤∗

L) ≤ 𝐽 𝐸 .

where 𝐽 𝐸 := 𝐽𝐸 (𝑥0, 𝛼0, 𝑦∗
L

, 𝑧L, 𝑢∗
L

, 𝑤L) and 𝐽 𝐸 := 𝐽𝐸 (𝑥0, 𝛼0, 𝑦L.𝑧∗
L

, 𝑢L, 𝑤∗

L). The FlipDyn game

with control of dynamical system - FlipDyn-Con, is completely defined by the expected cost (4.7)

and the space of player takeover strategies and control input policies subject to the dynamics in (4.4)

and (4.3). In the next section, we derive the outcome of the FlipDyn game with control for both the

FlipDyn state of 𝛼 = 0 and 𝛼 = 1 for general systems.

95

4.3 FlipDyn for general systems

We will begin by deriving the NE takeover strategies of the FlipDyn game, given any control

policy pair 𝑢L, 𝑤L, in each of the two takeover. Our approach begins by defining the saddle-point

value of the game.

4.3.1 Saddle-point value

At time instant 𝑘 ∈ K, given an initial FlipDyn state, the saddle-point value consists of the

instantaneous state and control-dependent cost and an additive cost-to-go based on the players

takeover actions. The cost-to-go is determined via a cost-to-go matrix in each of the FlipDyn

∈ R2×2 and Ξ1

state, represented by Ξ0

𝑘+1 ∈ R2×2 for the FlipDyn state 𝛼𝑘 = 0 and 𝛼𝑘 = 1,
𝑘 (𝑥, 𝑤 𝑘 , Ξ1
𝑘+1) be the saddle-point values at time instant
𝑘 with the continuous state 𝑥 for a given control policy pair 𝑢𝑘 and 𝑤 𝑘 and cost-to-go matrices,

𝑘+1
𝑘 (𝑥, 𝑢𝑘 , Ξ0

respectively. Let 𝑉 0

) and 𝑉 1

𝑘+1

corresponding to the FlipDyn state of 𝛼 = 0 and 1, respectively. The entries of the cost-to-go

matrix Ξ0

𝑘+1 corresponding to each pair of takeover actions are given by

Idle

𝑣0

𝑘+1

Idle

Takeover








(cid:124)

𝑣0
𝑣0
+ 𝑑𝑘 (𝑥)
𝑘+1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

Takeover

𝑣1
𝑘+1 − 𝑎𝑘 (𝑥)
+ 𝑑𝑘 (𝑥) − 𝑎𝑘 (𝑥)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

,








(cid:125)

𝑘+1
(cid:123)(cid:122)
Ξ0
𝑘+1

where 𝑣0

𝑘+1 := 𝑉 0
𝑘+1

(cid:16)

𝑘 (𝑥, 𝑢𝑘 ), 𝑢𝑘+1, Ξ0
𝐹0

𝑘+2

(cid:17)

,

𝑘+1 := 𝑉 1
𝑣1

𝑘+1(𝐹1

𝑘 (𝑥, 𝑤 𝑘 ), 𝑤 𝑘+1, Ξ1

𝑘+2).

(4.8)

(4.9)

(4.10)

The matrix entries corresponding to Ξ0

𝑘+1 are determined using the defender and adversary control
policies, and the dynamics (4.4) and (4.3). 𝑋 (𝑖, 𝑗) corresponds to the (𝑖, 𝑗)-th entry of the matrix

𝑋. The diagonal entries Ξ0

𝑘+1

(1, 1) and Ξ0

𝑘+1

(2, 2) correspond to both the defender and adversary

acting idle and taking over, respectively. The off-diagonal entries correspond to one player taking

over the resource. The entries of the cost-to-go matrix couple the value functions in each FlipDyn

state. Thus, at time 𝑘 for a given control policy 𝑢𝑘 , state 𝑥 and 𝛼𝑘 = 0, the saddle-point value

96

satisfies

𝑘 (𝑥, 𝑢𝑘 , Ξ0
𝑉 0

𝑘+1) = 𝑔𝑘 (𝑥, 0) + 𝑚𝑘 (𝑢𝑘 ) + Val(Ξ0

𝑘+1),

(4.11)

where Val(𝑋 𝛼𝑘
𝑘+1

) := min𝑦 𝛼𝑘

𝑘

max𝑧 𝛼𝑘

𝑘

𝑘 𝑋𝑘+1𝑧𝛼𝑘
𝑦𝛼𝑘 T

𝑘 , represents the (mixed) saddle-point value of the

zero-sum matrix 𝑋𝑘+1 for the FlipDyn state 𝛼𝑘 , and Ξ0

𝑘+1

∈ R2×2 is the cost-to-go zero-sum matrix.

The defender’s (row player) and adversary’s (column player) action results in either an entry within

Ξ0

𝑘+1 (if the matrix has a saddle point in pure strategies) or in the expected sense, resulting in a

cost-to-go from state 𝑥 at time 𝑘.

Similarly, for 𝛼𝑘 = 1, ∀𝑘, the cost-to-go matrix entries Ξ1

𝑘+1 and the saddle-point value are given

by:

Idle

Takeover

Idle

Takeover








(cid:124)

𝑣1


𝑘+1 − 𝑎𝑘 (𝑥)
𝑣1

𝑘+1


𝑣0
𝑣1
+ 𝑑𝑘 (𝑥)
𝑘+1 + 𝑑𝑘 (𝑥) − 𝑎𝑘 (𝑥)

𝑘+1

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:123)(cid:122)
(cid:125)
Ξ1
𝑘+1

,

with 𝑉 1

𝑘 (𝑥, 𝑤 𝑘 , Ξ1

𝑘+1) = 𝑔𝑘 (𝑥, 1) − 𝑛𝑘 (𝑤 𝑘 ) + Val(Ξ1

𝑘+1).

(4.12)

(4.13)

With the saddle-point values established in each of the FlipDyn states, in the following subsec-

tion, we will characterize the NE takeover strategies and the saddle-point values for the entire time

horizon 𝐿.

4.3.2 NE takeover strategies of the FlipDyn game

In order to characterize the saddle-point value of the game, we restrict the cost functions to a

particular domain, stated in the following mild assumption.

Assumption 4.3.1 [Non-negative costs] For any time instant 𝑘 ∈ K, the state and control depen-

dent costs 𝑔𝑘 (𝑥, 𝛼), 𝑑𝑘 (𝑥), 𝑎𝑘 (𝑥), 𝑚𝑘 (𝑢𝑘 ), 𝑛𝑘 (𝑤 𝑘 ), for all 𝑥 ∈ X, 𝑢𝑘 ∈ U𝑘 , 𝑤 ∈ W𝑘 , and 𝛼 ∈ {0, 1}

are non-negative (R≥0).

Assumption 4.3.1 enables us to compare the entries of the cost-to-go matrix without changes in

the sign of the costs, thereby, characterizing the strategies of the players (pure or mixed strategies).

97

Under this assumption, we derive the following result to compute a recursive saddle-point value for

the entire horizon length and the NE takeover strategies for both the players.

Theorem 4.3.2 (Case 𝛼𝑘 = 0) Under Assumption 4.3.1, for a given choice of control policies, 𝑢L

and 𝑤L, the unique NE takeover strategies of the FlipDyn-Con game (4.7) at any time 𝑘 ∈ K,

subject to the continuous state dynamics (4.4) and FlipDyn dynamics (4.3) are given by:

𝑦0∗
𝑘 =

𝑧0∗
𝑘 =










(cid:20) 𝑎𝑘 (𝑥)
ˇΞ𝑘+1

1 −

(cid:21) T

,

𝑎𝑘 (𝑥)
ˇΞ𝑘+1

if

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

(cid:20)

1

0

(cid:20)

(cid:20)

(cid:20)

1 −

𝑑𝑘 (𝑥)
ˇΞ𝑘+1

𝑑𝑘 (𝑥)
ˇΞ𝑘+1

0

1

1

0

(cid:21) T

(cid:21) T

(cid:21) T

(cid:21) T

,

,

,

,

otherwise,

if

if

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

ˇΞ𝑘+1 ≤ 𝑑𝑘 (𝑥),

,

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

otherwise,

(4.14)

(4.15)

where ˇΞ𝑘+1 := 𝑉 1

𝑘+1(𝐹1
The saddle-point value is given by:

𝑘 (𝑥, 𝑤 𝑘 ), 𝑤 𝑘+1, Ξ1

𝑘+2) − 𝑉 0
𝑘+1

(𝐹0

𝑘 (𝑥, 𝑢𝑘 ), 𝑢𝑘+1, Ξ0

𝑘+2

).

𝑘 (𝑥, 𝑢𝑘 , Ξ0
𝑉 0

𝑘+1) =






𝑔𝑘 (𝑥, 0) + 𝑣0

𝑘+1 + 𝑚𝑘 (𝑢𝑘 ) + 𝑑𝑘 (𝑥) − 𝑎𝑘 (𝑥)𝑑𝑘 (𝑥)

ˇΞ𝑘+1

𝑔𝑘 (𝑥, 0) + 𝑚𝑘 (𝑢𝑘 ) + 𝑣1

𝑘+1 − 𝑎𝑘 (𝑥),

𝑔𝑘 (𝑥, 0) + 𝑣0

𝑘+1

+ 𝑚𝑘 (𝑢𝑘 ),

98

,

if

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

if

ˇΞ𝑘+1 ≤ 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

otherwise.

(4.16)

(4.17)

(4.18)

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥)

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥)

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥)

ˇΞ𝑘+1 ≤ 𝑎𝑘 (𝑥)

(Case 𝛼𝑘 = 1) The unique NE takeover strategies are

(cid:20)

(cid:20)

(cid:20)










𝑦1∗
𝑘 =

𝑧1∗
𝑘 =

1 −

𝑎𝑘 (𝑥)
ˇΞ𝑘+1

(cid:21) T

,

𝑎𝑘 (𝑥)
ˇΞ𝑘+1

0

1

1

0

(cid:20) 𝑑𝑘 (𝑥)
ˇΞ𝑘+1

1 −

𝑑𝑘 (𝑥)
ˇΞ𝑘+1

(cid:21) T

(cid:21) T

(cid:21) T

,

,

,

if

if

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 ≤ 𝑎𝑘 (𝑥),

otherwise,

if

ˇΞ𝑘+1 > 𝑑𝑘 (𝑥),

ˇΞ𝑘+1 > 𝑎𝑘 (𝑥),

(cid:20)

1

(cid:21) T

,

0

otherwise.

The saddle-point value is given by:

𝑔𝑘 (𝑥, 1) + 𝑣1

𝑘+1 − 𝑛𝑘 (𝑤 𝑘 ) − 𝑎𝑘 (𝑥) + 𝑎𝑘 (𝑥)𝑑𝑘 (𝑥)

ˇΞ𝑘+1

,

if

𝑔𝑘 (𝑥, 1) − 𝑛𝑘 (𝑤 𝑘 ) + 𝑣0

𝑘+1 + 𝑑𝑘 (𝑥),

if

𝑘 (𝑥, 𝑤 𝑘 , Ξ1
𝑉 1

𝑘+1) =






𝑔𝑘 (𝑥, 1) + 𝑣1

𝑘+1 − 𝑛𝑘 (𝑤 𝑘 ),

otherwise.

The boundary condition at 𝑘 = 𝐿 is given by:

𝑢𝐿+1 := 0𝑚, 𝑤 𝐿+1 := 0𝑝, Ξ1

𝐿+2 := 02×2, Ξ0

𝐿+2 := 02×2,

where 0𝑖× 𝑗 ∈ R𝑖× 𝑗 represents a matrix of zeros.

(4.19)

(4.20)

□

Proof: We will only derive the NE takeover strategies and saddle-point value for case of 𝛼𝑘 = 0.

We leave out the derivations for 𝛼 = 1 as they are analogous to 𝛼 = 0. There are three cases to

consider for the 2 × 2 matrix game defined by the matrix in (4.8). We start by identifying the NE

takeover in pure strategies in the cost-to-go matrix Ξ0

𝑘 (4.11).

99

i) Pure strategy: Both the defender and adversary choose the action of staying idle. First, we

determine the conditions under which the defender always chooses to play idle. Under Assump-

tion 4.3.1, we compare the entries of Ξ0

𝑘+1 when the adversary opts to remain idle to obtain the

condition:

𝑣0
𝑘+1 ≤ 𝑣0

𝑘+1 + 𝑑𝑘 (𝑥).

(4.21)

Similarly, when the adversary opts to takeover, if the condition

𝑘+1 ≤ 𝑣0
𝑣1

𝑘+1 + 𝑑𝑘 (𝑥),

⇒𝑣1

𝑘+1 − 𝑣0

𝑘+1 ≤ 𝑑𝑘 (𝑥)

holds, then defender always remains idle. Next, we determine the conditions for the adversary to

always remain idle. Under Assumption 4.3.1, when the defender chooses to takeover, we compare

the entries of Ξ0

𝑘+1 to infer the condition

𝑘+1 + 𝑑𝑘 (𝑥) ≥𝑣0
𝑣0

𝑘+1 + 𝑑𝑘 (𝑥) − 𝑎𝑘 (𝑥)

⇒ 0 ≥ − 𝑎𝑘 (𝑥)

always holds. Finally, when the defender opts to remain idle, if the condition,

𝑘+1 ≤ 𝑣0
𝑣1

𝑘+1 + 𝑎𝑘 (𝑥),

holds, then the adversary always remains idle. The saddle-point value corresponding to the pure

strategy of playing idle by both the players, is the entry Ξ0

𝑘+1

(1, 1), given by:

𝑘 (𝑥, 𝑢𝑘 , Ξ0
𝑉 0

𝑘+1) = 𝑔𝑘 (𝑥, 0) + 𝑣0

𝑘+1 + 𝑚𝑘 (𝑢𝑘 ).

ii) Pure strategy: The defender chooses to stay idle whereas the adversary chooses to takeover.

We will derive the conditions under which the adversary opts to takeover. When the defender plays

idle, if the condition,

𝑘+1 ≥ 𝑣0
𝑣1

𝑘+1 + 𝑎𝑘 (𝑥),

100

holds, then the adversary always opts to takeover. The saddle-point corresponding to the pure

strategy of when the defender opts to remain idle while the adversary plays a takeover action

corresponds to the entry Ξ0

𝑘+1

(1, 2), which is given by:

𝑘 (𝑥, 𝑢𝑘 , Ξ0
𝑉 0

𝑘+1) = 𝑔𝑘 (𝑥, 0) + 𝑚𝑘 (𝑢𝑘 ) + 𝑣1

𝑘+1 − 𝑎𝑘 (𝑥).

Finally, we derive conditions under which the cost-to-go matrix Ξ0

𝑘+1 has a saddle-point in mixed

strategies.

iii) Mixed strategies: Mixed strategies are played by both players if none of the pure strategy

conditions are met, i.e., when both

𝑘+1 − 𝑣0
𝑣1

𝑘+1

> 𝑑𝑘 (𝑥),

𝑘+1 − 𝑣0
𝑣1

𝑘+1

> 𝑎𝑘 (𝑥),

hold. In this case, no single row or column dominates. A mixed strategy NE takeover for any 2 × 2

game is given by (cf. [132])

𝑦0∗
𝑘 =

(cid:20) 𝑎𝑘 (𝑥)
ˇΞ𝑘+1

1 −

(cid:21) T

𝑎𝑘 (𝑥)
ˇΞ𝑘+1

, 𝑧0∗

𝑘 =

(cid:20)

1 −

𝑑𝑘 (𝑥)
ˇΞ𝑘+1

(cid:21) T

.

𝑑𝑘 (𝑥)
ˇΞ𝑘+1

Thus, for the FlipDyn state of 𝛼𝑘 = 0, 𝑘 ∈ K, we characterize the complete NE takeover strategies

(pure and mixed strategies) for the defender and adversary (4.14) and (4.15), respectively.

The mixed saddle-point value of the 2 × 2 zero-sum matrix Ξ0

𝑘+1 is given by (cf. [132]),

𝑦0∗T
𝑘 Ξ0

𝑘+1

𝑧0∗
𝑘

:= 𝑣0

𝑘+1 + 𝑑𝑘 (𝑥) −

𝑎𝑘 (𝑥)𝑑𝑘 (𝑥)
ˇΞ𝑘+1

.

Collecting all the saddle-point values of the game corresponding to the pure and mixed strategy

NE, we obtain the saddle-point value update equation over the horizon of 𝐿 in (4.16). Notice

that 𝑔𝑘 (𝑥, 0) and 𝑚𝑘 (𝑢𝑘 ) represent the instantaneous state and control-dependent cost, and are not

part of the zero-sum matrix as shown in (4.11). The boundary conditions (4.20) imply that the

saddle-point values at 𝑘 = 𝐿 + 1 satisfy

𝐿+1(𝑥, 0𝑚, 02×2) = 𝑔0
𝑉 0

𝐿+1(𝑥, 0),

𝐿+1(𝑥, 0𝑝, 02×2) = 𝑔1
𝑉 1

𝐿+1(𝑥, 1).

101

□

For a finite cardinality of the state X, fixed player policies 𝑢𝑘 and 𝑤 𝑘 , 𝑘 ∈ K, and a finite horizon

𝐿, Theorem 4.3.2 yields an exact saddle-point value of the FlipDyn-Con game (4.7). However, the

computational and storage complexities scale undesirably with the cardinality of X, especially in

continuous state spaces. For this purpose, in the next section, we will provide a parametric form of

the saddle-point value for the case of linear dynamics with quadratic costs.

4.4 FlipDyn for LQ Problems

To address continuous state spaces arising in the FlipDyn-Con game, we restrict our attention

to a linear dynamical system with quadratic costs (LQ problems). Furthermore, we segment

our analysis into two distinct cases: a 1-dimensional (scalar) and an 𝑛-dimensional system. The

dynamics of a linear system at time instant 𝑘 ∈ K, when the defender has taken over satisfies:

𝑥𝑘+1 = 𝐹0

𝑘 (𝑥𝑘 , 𝑢𝑘 ) := 𝐸𝑘 𝑥𝑘 + 𝐵𝑘𝑢𝑘 ,

(4.22)

where 𝐸𝑘 ∈ R𝑛×𝑛 denotes the state transition matrix, while 𝐵𝑘 ∈ R𝑛×𝑚 represents the defender

control matrix. Similarly, the dynamics of the linear system if the adversary takes over satisfies:

𝑥𝑘+1 = 𝐹1

𝑘 (𝑥𝑘 , 𝑤 𝑘 ) := 𝐸𝑘 𝑥𝑘 + 𝐻𝑘 𝑤 𝑘 ,

(4.23)

where 𝐻𝑘 ∈ R𝑛×𝑝 signifies the adversary control matrix. The FlipDyn dynamics (4.4) then reduces

to

𝑥𝑘+1 = 𝐸𝑘 𝑥𝑘 + (1 − 𝛼𝑘 )𝐵𝑘𝑢𝑘 + 𝛼𝑘 𝐻𝑘 𝑤 𝑘 .

(4.24)

The stage, takeover and control quadratic costs for each player are given by:

𝑔𝑘 (𝑥, 𝛼𝑘 ) = 𝑥T𝐺𝛼𝑘

𝑘 𝑥,

𝑑𝑘 (𝑥) = 𝑥T𝐷 𝑘 𝑥,

𝑎𝑘 (𝑥) = 𝑥T 𝐴𝑘 𝑥,

𝑚𝑘 (𝑢) = 𝑢T𝑀𝑘𝑢,

𝑛𝑘 (𝑤) = 𝑤T𝑁𝑘 𝑤,

(4.25)

, 𝐷 𝑘 ∈ S𝑛×𝑛

+

, 𝐴𝑘 ∈ S𝑛×𝑛

+

, 𝑀𝑘 ∈ S𝑚×𝑚

+

and 𝑁𝑘 ∈ S𝑝×𝑝

+

are positive definite

where 𝐺𝛼𝑘

𝑘 ∈ S𝑛×𝑛

+

matrices.

102

Remark 4.4.1 The control policies for both players function in a mutually exclusive manner within

their respective FlipDyn state. Specifically, the defender control policy 𝑢𝑘 affects the dynamics

when the FlipDyn state 𝛼𝑘 = 0, while the adversary control policy 𝑤 𝑘 comes into effect when

𝛼𝑘 = 1.

If we constrain the control polices for both players to be functions of the continuous state 𝑥,

then the saddle-point value for each FlipDyn state relies solely on the continuous state 𝑥, in contrast

to being contingent on both continuous state 𝑥 and the control input for the corresponding FlipDyn

state. This restriction is formally stated in the subsequent assumption.

Assumption 4.4.2 At any time instant 𝑘 ∈ K, we define the defender and adversary control policies

to be linear state-feedback in the continuous state 𝑥, defined by

𝑢𝑘 (𝑥) := 𝐾𝑘 𝑥, 𝑤 𝑘 (𝑥) := 𝑊𝑘 𝑥,

(4.26)

where 𝐾𝑘 ∈ R𝑚×𝑛 and 𝑊𝑘 ∈ R𝑝×𝑛 are known defender and adversary control gains matrices,

respectively.

Under Assumption 4.4.2, and from saddle-point values (4.16) and (4.19), we postulate a para-

metric form for the saddle-point value in each FlipDyn state as follows:

𝑘 (𝑥, 𝑢𝑘 (𝑥), Ξ0
𝑉 0

𝑘+1) ⇒ 𝑉 0

𝑘 (𝑥) := 𝑥T𝑃0

𝑘 𝑥,

𝑘 (𝑥, 𝑤 𝑘 (𝑥), Ξ1
𝑉 1

𝑘+1) ⇒ 𝑉 1

𝑘 (𝑥) := 𝑥T𝑃1

𝑘 𝑥,

where 𝑃0

𝑘 and 𝑃1

𝑘 are 𝑛 × 𝑛 real symmetric matrices corresponding to the FlipDyn states 𝛼 = 0 and
1, respectively. We impose Assumption 4.4.2 to enable factoring out the state 𝑥 while computing

the saddle-point value update backward in time.

We redefine the dynamics of the defender and adversary under Assumption 4.4.2 given by:

𝑥𝑘+1 = (cid:101)𝐵𝑘 𝑥𝑘 := (𝐸𝑥 + 𝐵𝑘 𝐾𝑘 )𝑥𝑘 ,

𝑥𝑘+1 = (cid:101)𝑊𝑘 𝑥𝑘 := (𝐸𝑥 + 𝐻𝑘𝑊𝑘 )𝑥𝑘 .

(4.27)

103

Next, we outline the NE takeover strategies of both the players, along with the corresponding

parameters of the saddle-point values for each of the FlipDyn state, for discrete-time linear dynamics,

with known linear state-feedback control policies, and quadratic costs. We begin by analyzing the

case when 𝑥 is a scalar, for which we will compute the saddle-point value exactly, and subsequently

proceed to approximate the saddle-point value for the 𝑛- dimensional case.

4.4.1 Scalar/1-dimensional dynamical system

The quadratic costs any time 𝑘 ∈ K stated in (4.25) for a scalar dynamical system are represented

as:

𝑔𝑘 (𝑥, 𝛼𝑘 ) = 𝐺𝛼𝑘

𝑘 𝑥2,

𝑑𝑘 (𝑥) = 𝑑𝑘 𝑥2,

𝑎𝑘 (𝑥) = 𝑎𝑘 𝑥2,

𝑚𝑘 (𝑢) = 𝑀𝑘 𝐾 2

𝑘 𝑥2,

𝑛𝑘 (𝑤) = 𝑁𝑘𝑊 2

𝑘 𝑥2,

(4.28)

for non-negative values of 𝐺𝛼𝑘

𝑘 , 𝐷 𝑘 , 𝐴𝑘 , 𝑀𝑘 and 𝑁𝑘 . For a scalar dynamical system, we use the

following notation to represent the saddle-point value in each FlipDyn state. Let

𝑘 (𝑥) := p0
𝑉 0

𝑘 𝑥2, 𝑉 1

𝑘 (𝑥) := p1

𝑘 𝑥2,

where p𝛼

𝑘 ∈ R, 𝛼 ∈ {0, 1}, 𝑘 ∈ K. Building on Theorem 4.3.2, we present the following result,

which provides a closed-form expression for the NE takeover in both pure and mixed strategies of

both players, and outlines the saddle-point value update of the parameter p𝛼
𝑘 .

Corollary 1 (Case 𝛼𝑘 = 0) Under Assumption 4.4.2, the unique NE takeover strategies of the

FlipDyn game at any time 𝑘 ∈ K, subject to the dynamics (4.27) for a scalar dynamical system

104

(4.29)

(4.30)

(4.31)

with quadratic costs (4.28) and FlipDyn dynamics (4.3) are given by:

(cid:20) 𝑎𝑘
˜p𝑘+1

(cid:20)

1

1 −

(cid:21) T

,

𝑎𝑘
˜p𝑘+1

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

(cid:21) T

,

0

otherwise,

(cid:20)

(cid:20)

(cid:20)

1 −

𝑑𝑘
˜p𝑘+1

(cid:21) T

,

𝑑𝑘
˜p𝑘+1

0

1

(cid:21) T

(cid:21) T

,

,

1

0

if

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

˜p𝑘+1 ≤ 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

otherwise,










𝑦0∗
𝑘 =

𝑧0∗
𝑘 =

where

The saddle-point value parameter at time 𝑘 is given by:

˜p𝑘+1 := (cid:101)𝑊𝑘 p1

𝑘+1 − (cid:101)𝐵𝑘 p0

𝑘+1

.

if

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

˜p𝑘+1 ≤ 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

𝐺0

𝑘 + 𝑑𝑘 −

𝑑𝑘 𝑎𝑘
˜p𝑘+1

+ (cid:101)𝐵2

𝑘 p0

𝑘+1 + 𝐾 2

𝑘 𝑀𝑘 ,

𝐺0

𝑘 − 𝑎𝑘 + (cid:101)𝑊 2

𝑘 p1

𝑘+1 + 𝐾 2

𝑘 𝑀𝑘 ,

p0
𝑘 =






𝐺0

𝑘 + 𝐾 2

𝑘 𝑀𝑘 + (cid:101)𝐵2

𝑘 p0

𝑘+1

,

otherwise,

105

(Case 𝛼𝑘 = 1) The unique NE takeover strategies are given by:

(cid:20)

(cid:20)

(cid:20)

1 −

𝑎𝑘
˜p𝑘+1

(cid:21) T

,

𝑎𝑘
˜p𝑘+1

0

1

(cid:21) T

(cid:21) T

,

,

1

0

if

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 ≤ 𝑎𝑘 ,

otherwise,

(cid:20) 𝑑𝑘
ˇp𝑘+1

(cid:20)

1

1 −

(cid:21) T

,

𝑑𝑘
˜p𝑘+1

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

(cid:21) T

,

0

otherwise,










𝑦1∗
𝑘 =

𝑧1∗
𝑘 =

The saddle-point value parameter at time 𝑘 is given by,

p1
𝑘 =






𝐺1

𝑘 − 𝑎𝑘 +

𝑑𝑘 𝑎𝑘
˜p𝑘+1

+ (cid:101)𝑊 2

𝑘 p1

𝑘+1 − 𝑊 2

𝑘 𝑁𝑘 ,

𝐺1

𝑘 + 𝑑𝑘 − 𝑊 2

𝑘 𝑁𝑘 + (cid:101)𝐵2

𝑘 p0

𝑘+1

,

if

if

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 ,

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 ≤ 𝑎𝑘 ,

𝐺1

𝑘 − 𝑊 2

𝑘 𝑁𝑘 + (cid:101)𝑊 2

𝑘 p1

𝑘+1

,

otherwise.

The terminal conditions for the recursions (4.31) and (4.34) are:

𝐿+1 := 𝐺0
p0

𝐿+1

,

𝐿+1 := 𝐺1
p1

𝐿+1

(4.32)

(4.33)

(4.34)

□

Proof: We begin the proof by determining the NE takeover in both pure and mixed strategies,

and computing the corresponding saddle-point value parameter for the FlipDyn state of 𝛼 = 0. We

substitute the quadratic costs (4.28) and linear dynamics (4.27) in the term (cid:101)𝑃𝑘+1(𝑥) from (4.53) to

106

obtain:

˜𝑃𝑘+1(𝑥) :=

(cid:16)

(𝐸𝑥 + 𝐻𝑘𝑊𝑘 )2p1

𝑘+1 − (𝐸𝑥 + 𝐵𝑘 𝐾𝑘 )2p0

𝑘+1

(cid:17)

𝑥2,

= ˜p𝑘+1𝑥2

Substituting ˜p𝑘+1 and takeover costs (4.28) in (4.14) and (4.15), we obtain the NE takeover strategies

presented in (4.29) and (4.30), respectively. The NE takeover strategies for the FlipDyn state of

𝛼𝑘 = 1 can trivially be obtained by taking the complementary of (4.29) and (4.30), resulting

in (4.32) and (4.33), respectively.

To obtain a backward recursion for the parameter p0

𝑘 , we substitute the linear dynamics (4.27)

along with quadratic costs (4.28) in (4.16), which yields:

𝑘 𝑥2 =
p0






(𝐺0

𝑘 + 𝑑𝑘 )𝑥2 −

𝑑𝑘 𝑎𝑘 𝑥4
˜p𝑘+1𝑥2

+ (𝐾 2

𝑘 𝑀𝑘 + (cid:101)𝐵2

𝑘 p0

𝑘+1)𝑥2,

(𝐺0

𝑘 + (cid:101)𝑊 2

𝑘 p1

𝑘+1 − 𝑎𝑘 + 𝐾 2

𝑘 𝑀𝑘 )𝑥2,

(𝐺0

𝑘 + 𝐾 2

𝑘 𝑀𝑘 + (cid:101)𝐵2

𝑘 p0

𝑘+1)𝑥2,

if

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 ≤ 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

otherwise.

Factoring out the term 𝑥2, we arrive at (4.31). Similar substitutions for the FlipDyn state of 𝛼𝑘 = 1,

we obtain (4.34).

The state cost at the time instant 𝐿 + 1 yields the terminal condition on the parameters of

saddle-point values

𝐿+1 := 𝐺0
p0

𝐿+1

,

𝐿+1 := 𝐺1
p1

𝐿+1

□

Corollary 1 presents a closed-form solution for the FlipDyn (4.7) game for a given control

policy, with NE takeover strategies independent of state of the scalar/1-dimensional system. The

saddle-point values of the FlipDyn game for a given control policy corresponding to 𝛼1 = 0 and

𝛼1 = 1 are given by:

𝐽𝐸 (𝑥1, 0, 𝑦∗
L

, 𝑧∗
L

, 𝑢L, 𝑤L) = 𝑥T

1 p0
1

𝑥1,

𝐽𝐸 (𝑥1, 1, 𝑦∗
L

, 𝑧∗
L

, 𝑢L, 𝑤L) = 𝑥T

1 p1
1

𝑥1.

107

We can determine a minimum adversarial state cost 𝐺1∗
𝑘

that guarantees a mixed strategy NE

takeover at every time 𝑘 ∈ K. We characterize such an adversarial state cost in the following

remark.

Remark 4.4.3 Given a scalar/1-dimensional system (4.27) with quadratic costs (4.28), the mixed

strategy NE takeover and the corresponding recursion for the saddle-point value parameter, as

outlined in Corollary 1, exists for an adversary state-dependent cost 𝐺1∗

𝑘 ≤ 𝐺1

𝑘 provided

˜p𝑘+1 > 𝑑𝑘 ,

˜p𝑘+1 > 𝑎𝑘 , ∀𝑘 ∈ K,

with the parameters at the time 𝐿 + 1 satisfying

𝐿+1 = 𝐺1∗
p1
𝐿+1

,

𝐿+1 = 𝐺0
p0

𝐿+1

.

(4.35)

The parameters 𝐺1∗
𝑘

in Remark 4.4.3 can be computed using a bisection method at every time

𝑘 ∈ K. Given an arbitrary adversary control cost 𝐺1

𝑘 , we start by updating the parameter of

saddle-point value in (4.31) and (4.34) backward in time. At any time instant 𝑘 ∈ K, if any of

inequalities, ˜p𝑘+1 ≤ 𝑑𝑘 and ˜p𝑘+1 ≤ 𝑎𝑘 are not satisfied, the adversary cost 𝐺1

𝑘 is updated using the

bisection method. This process is iteratively repeated until the time instant 𝑘 = 0 and the bisection

method has converged. The determined cost 𝐺1∗
𝑘

indicates the minimal cost the adversary must

bear to achieve a mixed strategy takeover. Next, we illustrate the results of Corollary 1 through a

numerical example.

A Numerical Example (Mixed strategy NE)

In this numerical example we will only focus on the mixed strategy NE and the corresponding

saddle-point value parameters obtained in Corollary 1 on a linear time-invariant (LTI) scalar system

for a horizon length of 𝐿 = 50. The defender and adversary dynamics are given by:

𝑘 (𝑥𝑘 ) := (𝐸 + 𝐵𝐾)𝑥𝑘 , 𝐹1
𝐹0

𝑘 (𝑥𝑘 ) := 𝐸𝑥𝑘 .

In this numerical example we assume adversary cannot directly control the system. In other words,

the control matrix 𝐻𝑘 := 0𝑛×𝑝 and control gain 𝑊𝑘 := 0𝑝×𝑛, ∀𝑘 ∈ K. The quadratic costs (4.28)

108

(a)

(b)

Figure 4.2 (a) Coefficient of the parameterized value function, p0 and p1 for a 1-dimensional system
where the state is bounded (𝐹 ≤ 1) over a horizon length of 𝐿 = 50. (b) Attack and defense policy
corresponding to the value function in Figure 4.2a for the given set of costs.

(a)

(b)

Figure 4.3 (a) Coefficient of the parameterized value function, p0 and p1 for an unbounded (𝐹 ≥ 1)
1-dimensional system with a horizon length of 𝐿 = 50. (b) Policy of defense and attack for the
obtained parameterized value function indicated in Figure 4.3a.

are assumed to fixed ∀𝑘 ∈ K, given by

𝐺0

𝑘 = 𝐺0 = 1, 𝐺1

𝑘 = 𝐺1 = 1, 𝑀𝑘 = 𝑀 = 0.65,

𝑎𝑘 = 𝑎 = {0.5, 0.9},𝑑𝑘 = 𝑑 = {0.5, 0.9}.

109

102030405024680.20.40.610203040500.20.40.60.810203040502461040.20.40.60.810203040500.20.40.60.8The control matrix for the defender is:

𝐵𝑘 = Δ𝑡, ∀𝑘 ∈ K,

where Δ𝑡 = 0.1 for the numerical evaluation. We obtain the defender gain 𝐾 by solving the LQR

problem with arbitrarily weighted state and control cost. We solve for the NE takeover in the space

of mixed strategies and the saddle-point value parameters for two cases of fixed state transition

constant 𝐸𝑘 = 𝐸, ∀𝑘 ∈ K: 𝐸 = 0.99 and 1.1 corresponding to a given choice of takeover costs.

Figure 4.2a illustrates the saddle-point value parameters p𝑖

𝑘 , 𝑖 ∈ {0, 1}, 𝑘 ∈ K for 𝐸 = 0.99

corresponding to different takeover costs. We observe that the saddle-point value parameters

are bounded and reach an asymptotic value. On the contrary, from Figure 4.3a we observer for

𝐸 = 1.1, the saddle-point value parameters for the adversary increases exponentially backward

in time. Although the saddle-point value parameters of the defender are bounded and reach a

fixed value for both the cases of 𝐸 = 0.99 and 𝐸 = 1.1. Such an evolution of saddle-point value

parameters indicates that as the system moves from open-loop stable 𝐸 < 1 to unstable 𝐸 ≥ 1, there

is a large incentive for an adversary to takeover the system. Figure 4.2b shows the takeover policies

of both the players for the case of 𝐸 = 0.99 when 𝛼𝑘 = 0, ∀𝑘 ∈ K. When the defender takeover cost

is lower compared to the adversary, there is low probability of takeover by the defender compared to

the adversary, expect for last few time instants of the game. Such a takeover strategy changes when

the takeover cost of the defender is higher compared to that of the adversary, resulting in a higher

probability of takeover by both the players. Finally, Figure 4.3b illustrates the takeover policy for

the case of 𝐸 = 1.1 and 𝛼𝑘 = 0, 𝑘 ∈ K, where both the defender and adversary converge to an

asymptotic value probability of takeover and remaining idle, respectively. Next, we will extend our

derivation and analysis of the FlipDyn game with known control policies for discrete-time linear

dynamics and quadratic costs to 𝑛−dimensions.

4.4.2 n-dimensional system

Unlike the scalar case, wherein the state 𝑥 was factored out during the computation of the NE

takeover strategies and saddle-point value parameters p0

𝑘 and p1

𝑘 , that simplification does not yield

110

exact results for an 𝑛−dimensional system. The challenge for factoring out the state at any time

𝑘 ∈ K, arises from the term

𝑥T 𝐴𝑘 𝑥𝑥T𝐷 𝑘 𝑥
(cid:101)𝑃𝑘+1(𝑥)

=

𝑥T (cid:16)

(cid:101)𝑊 T

𝑘 𝑃1

𝑥T 𝐴𝑘 𝑥𝑥T𝐷 𝑘 𝑥
𝑘 𝑃0
𝑘+1 (cid:101)𝑊𝑘 − (cid:101)𝐵T

𝑘+1 (cid:101)𝐵𝑘

,

(cid:17)

𝑥

(4.36)

which arises when a mixed strategy NE takeover is played in either of the FlipDyn states. To address

the challenge in factoring out the state, we impose a particular form for the takeover costs, stated in

the following assumption. Here, and in the sequel, let I𝑛 ∈ R𝑛×𝑛 denote the identity matrix.

Assumption 4.4.4 At any time instant 𝑘 ∈ K, we define the defender and adversary costs as

follows:

𝑑𝑘 (𝑥) := 𝑑𝑘 𝑥T𝑥,

𝑎𝑘 (𝑥) := 𝑎𝑘 𝑥T𝑥,

(4.37)

where 𝑑𝑘 ∈ R and 𝑎𝑘 ∈ R are non-negative scalars.

Next, we introduce an approximation consisting of the state 𝑥 and a matrix 𝑃.

Proposition 4.4.5 Given a positive definite matrix Ψ, the term

𝑥T𝑥
𝑥TΨ𝑥,

consisting of state 𝑥 can be upper bounded by

𝑥T𝑥
𝑥TΨ𝑥,

≤

𝑥TΨ−1𝑥
𝑥T𝑥

.

Proof: Setting,

observe that the 2 × 2 matrix

Γ := Ψ1/2
𝑘+1

𝑥, Ω := Ψ−1/2
𝑘+1

𝑥,

M :=

Γ𝑇 Γ Γ𝑇 Ω





Ω𝑇 Γ Ω𝑇 Ω












(cid:104)

=

Γ Ω

(cid:105)𝑇 (cid:104)

(cid:105)

⪰ 0.

Γ Ω

111

(4.38)

□

Therefore,

det(M) = (Γ𝑇 Γ) (Ω𝑇 Ω) − (Γ𝑇 Ω)2 ≥ 0,

and thus our claim (4.38) holds.

□

Using the Assumption 4.4.4 and results from Proposition 4.4.5 we derive an approximation for

the saddle-point value parameters state in the following result.

Lemma 4.4.6 Under Assumptions 4.4.2 and 4.4.4, consider a linear dynamical system described

by (4.27) with quadratic costs (4.25) and FlipDyn dynamics (4.3) with known saddle-point value

parameters 𝑃1

𝑘+1 and 𝑃0

𝑘+1, such the following conditions are satisfied:

(𝐸𝑘 + 𝐻𝑘𝑊𝑘 )T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊𝑘 ) − (𝐸𝑘 + 𝐵𝑘 𝐾𝑘 )T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 ) ≻ 𝑑𝑘 I𝑛,

(𝐸𝑘 + 𝐻𝑘𝑊𝑘 )T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊𝑘 ) − (𝐸𝑘 + 𝐵𝑘 𝐾𝑘 )T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 ) ≻ 𝑎𝑘 I𝑛.

(4.39)

(4.40)

Then, the saddle-point value parameters at time 𝑘 ∈ K, under a mixed strategy NE takeover of

both the FlipDyn states satisfy

𝑃0

𝑘 ⪯ 𝐺0

𝑘 + 𝑑𝑘 I𝑛 + 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 + (cid:101)𝐵T

𝑘 𝑃0

𝑘+1 (cid:101)𝐵𝑘 − 𝑑𝑘 `𝑃−1
𝑘+1

𝑎𝑘 ,

𝑃1

𝑘 ⪰ 𝐺1

𝑘 − 𝑎𝑘 I𝑛 − 𝑊 T

𝑘 𝑁𝑘𝑊𝑘 + (cid:101)𝑊 T

𝑘 𝑃1

𝑘+1 (cid:101)𝑊𝑘 + 𝑑𝑘 `𝑃−1
𝑘+1

𝑎𝑘 ,

where `𝑃𝑘+1 := (cid:101)𝑊 T

𝑘 𝑃1

𝑘+1 (cid:101)𝑊𝑘 − (cid:101)𝐵T

𝑘 𝑃0

𝑘+1 (cid:101)𝐵𝑘 .

(4.41)

(4.42)

□

Proof: We will show the proof only for (4.41), as the derivation is analogous for (4.42). Under

a mixed strategy NE takeover, we substitute the linear dynamics (4.27), quadratic costs (4.25) and

takeover costs (4.37) in (4.16) to obtain:

𝑥T𝑃0

𝑘 𝑥 = 𝑥T (cid:16)

𝐺0

𝑘 + 𝑑𝑘 I𝑛 + 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 + (cid:101)𝐵T

𝑘 𝑃0

𝑘+1

(cid:17)

𝐵𝑘

𝑥 − 𝑑𝑘 𝑥T𝑥𝑎𝑘

𝑥T𝑥
𝑥T `𝑃𝑘+1𝑥

.

Using the results from (4.38) to bound the term consisting of the state 𝑥 and `𝑃 and factoring out

the state, we obtain (4.41).

□

112

The upper bound for the saddle-point value parameters derived in Lemma 4.4.6 enable us to

recursively define an approximate saddle-point value parameter of the form:

𝑘 (𝑥) := 𝑥T ˆ𝑃0
ˆ𝑉 0

𝑘 𝑥,

𝑘 (𝑥) := 𝑥T ˆ𝑃1
ˆ𝑉 1

𝑘 𝑥,

(4.43)

where ˆ𝑃1

𝑘 ∈ R𝑛×𝑛 and ˆ𝑃0

𝑘 ∈ R𝑛×𝑛.

Similar to results obtained in Corollary 1, we will use the results from Theorem 4.3.2 to provide

an approximate NE takeover pair { ˆ𝑦𝛼∗

𝑘 , ˆ𝑧𝛼∗
the corresponding approximate saddle-point value update of the parameter ˆ𝑃𝛼

𝑘 }, in both pure and mixed strategies of both players, and
𝑘 ∈ R𝑛×𝑛, 𝛼 ∈ {0, 1}.

Corollary 2 (Case 𝛼𝑘 = 0) The approximate NE takeover strategies of the FlipDyn-Con game (4.7)

with known control policies at any time 𝑘 ∈ K, subject to dynamics in (4.27), with quadratic

costs (4.25), takeover costs (4.37), and FlipDyn dynamics (4.3) are given by:

ˆ𝑦0∗
𝑘 =

ˆ𝑧0∗
𝑘 =










𝑎𝑘 𝑥T𝑥

𝑥T ˇ𝑃𝑘+1𝑥

1 −

𝑎𝑘 𝑥T𝑥

𝑥T ˇ𝑃𝑘+1𝑥







(cid:20)

1

0

T







(cid:21) T

, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

, otherwise,








(cid:20)

(cid:20)

1 −

𝑑𝑘 𝑥T𝑥

𝑑𝑘 𝑥T𝑥

𝑥T ˇ𝑃𝑘+1𝑥

𝑥T ˇ𝑃𝑘+1𝑥

0

1

1

0

T








(cid:21) T

(cid:21) T

, if

, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑑𝑘 𝑥T𝑥,

, otherwise,

(4.44)

(4.45)

where

and (cid:101)𝑃𝑘+1(𝑥) := 𝑥T ˇ𝑃𝑘+1𝑥.

ˇ𝑃𝑘+1 := (cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 − (cid:101)𝐵T

𝑘 ˆ𝑃0

𝑘+1 (cid:101)𝐵𝑘

113

The approximate saddle-point value parameter at time 𝑘 is given by:

ˆ𝑃0

𝑘 =






𝐺0

𝑘 + 𝑑𝑘 I𝑛 + 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 + (cid:101)𝐵T

𝑘 ˆ𝑃0

𝑘+1 (cid:101)𝐵𝑘 − 𝑑𝑘 ˇ𝑃−1
𝑘+1

𝑎𝑘

𝐺0

𝑘 − 𝑎𝑘 I𝑛 + 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 + (cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 − 𝑊 T

𝑘 𝑁𝑘𝑊𝑘 ,

𝐺0

𝑘 + 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 + (cid:101)𝐵T

𝑘 ˆ𝑃0

𝑘+1 (cid:101)𝐵𝑘 ,

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(4.46)

if

if

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑑𝑘 𝑥T𝑥,

otherwise.

(Case 𝛼𝑘 = 1) The approximate NE takeover strategies are given by:

ˆ𝑦1∗
𝑘 =




1 −




𝑎𝑘 𝑥T𝑥

𝑎𝑘 𝑥T𝑥

𝑥T ˇ𝑃𝑘+1𝑥

𝑥T ˇ𝑃𝑘+1𝑥

(cid:20)

(cid:20)

0

1

1

0

T








(cid:21) T

(cid:21) T

,

,

,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

otherwise,

(cid:20) 𝑑𝑘 𝑥T𝑥
𝑥T ˇ𝑃𝑘+1𝑥

1 −

𝑑𝑘 𝑥T𝑥
𝑥T ˇ𝑃𝑘+1𝑥

(cid:21) T

, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

ˆ𝑧1∗
𝑘 =

1

(cid:21) T

0

, otherwise.










(cid:20)

(4.47)

(4.48)

The approximate saddle-point value parameter at time 𝑘 is given by,

ˆ𝑃1

𝑘 =






𝐺1

𝑘 − 𝑎𝑘 I𝑛 − 𝑊 T

𝑘 𝑁𝑘𝑊𝑘 + (cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 + 𝑑𝑘 `𝑃−1
𝑘+1

𝐺1

𝑘 + 𝑑𝑘 I𝑛 − 𝑊 T

𝑘 𝑁𝑘𝑊𝑘 + (cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 ,

𝐺1

𝑘 − 𝑊 T

𝑘 𝑁𝑘𝑊𝑘 + (cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 ,

114

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑎𝑘 𝑥T𝑥,

(4.49)

𝑎𝑘 ,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

otherwise.

The terminal conditions for the recursions (4.46) and (4.49) are:

ˆ𝑃0
𝐿+1 := 𝐺0

𝐿+1

,

𝐿+1 := 𝐺1
ˆ𝑃1

𝐿+1

.

Proof:

[Outline] We begin the proof by determining the NE takeover in pure and mixed strategies

□

for the FlipDyn state of 𝛼 = 0. We substitute the quadratic costs (4.25), linear dynamics (4.27) in
(cid:101)𝑃𝑘+1(𝑥) with the approximate saddle-point value parameters 𝑃0

𝑘+1 and 𝑃1

𝑘+1 to obtain:

(cid:101)𝑃𝑘+1(𝑥) := ˇ𝑉 1
= 𝑥T (cid:16)

𝑘+1( (cid:101)𝑊𝑘 𝑥) − ˇ𝑉 0

𝑘+1( (cid:101)𝐵𝑘 𝑥),

(cid:101)𝑊 T

𝑘 ˆ𝑃1

𝑘+1 (cid:101)𝑊𝑘 − (cid:101)𝐵T

𝑘 ˆ𝑃0

𝑘+1 (cid:101)𝐵𝑘

(cid:17)

𝑥,

= 𝑥T ˇ𝑃𝑘+1𝑥.

We substitute the takeover cost (4.37) and 𝑥T ˇ𝑃𝑘+1𝑥 in (4.14) and (4.15), to obtain the NE takeover

policies in (4.44) and (4.45), respectively. The approximate NE takeover strategies of the FlipDyn

state 𝛼 = 1 are complementary to 𝛼 = 0, presented in (4.47) and (4.48).

To determine the approximate saddle-point value parameters under a mixed strategy NE takeover

of the FlipDyn state of 𝛼 = 0, we substitute the upper bound (4.41) from Lemma 4.4.6 and replace

𝑃0
𝑘+1 with ˇ𝑃0
and discrete-time linear dynamics (4.27) to obtain the approximate saddle-point value parameters.

𝑘+1. Under a pure strategy NE takeover, we substitute the quadratic costs (4.25)

Combining both the solutions from the mixed and pure strategy NE takeover, we obtain (4.46). We

skip the proof of the saddle-point value parameter ˆ𝑃1

𝑘+1 for brevity.

□

Recursions (4.46) and (4.49) provide an approximate solution to the FlipDyn problem (4.7) for

the 𝑛-dimensional case with known control policies.

Analogous to the scalar scar, we can determine a minimum adversarial state cost 𝐺1∗
𝑘

that

guarantees a mixed strategy NE takeover at every time 𝑘 ∈ K. We characterize such an adversarial

state cost in the following remark.

Remark 4.4.7 Given a 𝑛−dimensional linear system (4.27) with quadratic costs (4.25), the mixed

strategy NE takeover and the corresponding recursion for the approximate saddle-point value

115

parameter, as outlined in Corollary 2, exists for an adversary state-dependent cost 𝐺1∗

𝑘 ⪯ 𝐺1
𝑘

provided

ˇ𝑃𝑘+1 ≻ 𝑑𝑘 I𝑛,

ˇ𝑃𝑘+1 ≻ 𝑎𝑘 I𝑛, ∀𝑘 ∈ K,

with the parameters at the time 𝐿 + 1 given by:

𝐿+1 := 𝐺1∗
ˇ𝑃1
𝐿+1

,

𝐿+1 := 𝐺0
ˇ𝑃0

𝐿+1

.

(4.50)

Similar to the scalar/1-dimensional case, we can determine 𝐺1∗

𝑘 using a bisection method. Next,

we illustrate the results of the approximate value function on a numerical example.

A Numerical Example (Mixed strategy NE)

(a)

(b)

Figure 4.4 (a) Minimum eigenvalue of the saddle-point value parameters, 𝜆𝑛 ( ˆ𝑃0) and 𝜆𝑛 ( ˆ𝑃1),
given 𝑒 = 0.99 over a horizon length of 𝐿 = 100. (b) Takeover strategies corresponding to the
saddle-point value in Figure 4.4a for a given initial state 𝑥1 and FlipDyn state 𝛼𝑘 = 0, ∀𝑘 ∈ K.

Similar to the scalar/1-dimensional case, in this numerical example we will only focus on the

mixed strategy NE and the corresponding saddle-point value parameters obtained in Corollary 2

on a linear time-invariant (LTI) scalar system for a horizon length of 𝐿 = 100. For this example,

we use a double integrator for the defender and adversary given by:

𝑘 (𝑥𝑘 ) := (𝐸 + 𝐵𝐾)𝑥𝑘 , 𝐹1
𝐹0

𝑘 (𝑥𝑘 ) := 𝐸𝑥𝑘 ,

116

204060801002004006000.850.90.95204060801000.050.10.15(a)

(b)

Figure 4.5 (a) Minimum eigenvalue of the saddle-point value parameters, 𝜆𝑛 ( ˆ𝑃0) and 𝜆𝑛 ( ˆ𝑃1), given
𝑒 = 1.01 over the same horizon length of 𝐿 = 100. (b) Takeover strategies corresponding to the
saddle-point value in Figure 4.5a for a given initial state 𝑥1 and FlipDyn state 𝛼𝑘 = 0, ∀𝑘 ∈ K.

where

𝐸𝑘 = 𝐸 =

, 𝐵𝑘 =

, ∀𝑘 ∈ K.

𝑒 Δ𝑡

0

𝑒

















Δ𝑡

0

















Similar to the scalar/1−dimensional case, we solve for the approximate NE takeover strategies

and saddle-point value function parameters for two cases with a fixed state transition constant

𝑒𝑘 = 𝑒, ∀𝑘 ∈ K: 𝑒 = 0.99 and 1.01 given Δ𝑡 = 0.1. The system represents a second order system

with acceleration as the control input. Analogous to the scalar case, we obtain the defender’s gain

𝐾 using the LQR method. The quadratic costs (4.25) are assumed to fixed ∀𝑘 ∈ K, given by

𝐺0

𝑘 = 𝐺0 = I𝑛, 𝐺1

𝑘 = 𝐺1 =1.35I𝑛, 𝐷 𝑘 = 𝐷 = {0.5, 0.9}I𝑛,

𝐴𝑘 = 𝐴 = {0.5, 0.9}I𝑛, 𝑀𝑘 = 𝑀 = 0.65, 𝑁𝑘 = 𝑁 = 0.45

Since the saddle-point value parameters for 𝑛-dimensions are symmetric positive definite ma-

trices, we plot the minimum eigenvalues of ˆ𝑃1

𝑘 , ˆ𝑃0

𝑘 shown in Figure 4.4a and 4.5a

Akin to the scalar case, we observe a similar trend of converging coefficients when 𝑒 ≤ 1,

i.e., the system remains bounded upon lack of control, whereas the eigenvalues of saddle-point

parameters of the adversary diverges for 𝑒 > 1, indicating a large incentive for an adversary in

taking over the system backward in time. Since the player policies for the 𝑛-dimensional case

117

20406080100204060800.20.40.60.8204060801000.20.40.60.8are functions of state, and the FlipDyn state is a random variable, the attack and defense takeover

actions are averaged over 500 independent simulations for 𝑒 := 0.99 and 𝑒 := 1.01 are shown in

Figures 4.4b and 4.5b, respectively, with the initial state 𝑥0 =

(cid:105)𝑇

(cid:104)

0 1

. We observe a dynamic

policy over the horizon length for the case of 𝑒 := 0.99, and a converging pure policy for 𝑒 := 1.01

for the FlipDyn state 𝛼 = 0, respectively. The converging pure policy for 𝑒 := 1.01 is reflective of

the ever increasing value of the adversary over the horizon length.

Corollary 1 and 2 completely characterize the takeover strategies of the FlipDyn game when the

control policies are known in the space of pure and mixed strategies. The closed-form expressions

of takeover strategies provides computational efficiency and scalability for large horizons. Next,

we will derive both the control policies and takeover strategies of both the player.

4.5 FlipDyn-Con for LQ Problems

In this section we will solve the complete FlipDyn-Con game. We will state the underlying

control problem and show how it decouples from the game of takeover. We will derive conditions

under which we obtain a linear state-feedback policies and how it impacts the saddle-point value

of the game.

4.5.1 Control policy for the FlipDyn-Con LQ Problem

To determine the control policies for both players, we need to solve the following problems in

each FlipDyn state

min
𝑢𝑘 (𝑥)

max
𝑤 𝑘 (𝑥)






𝑘+1 + 𝑢T
𝑣0

𝑘 (𝑥) 𝑀𝑘𝑢𝑘 (𝑥) −

𝑥T𝐷 𝑘 𝑥𝑥T 𝐴𝑘 𝑥
(cid:101)𝑃𝑘+1(𝑥)

,

𝑘+1 + 𝑢T
𝑣1

𝑘 (𝑥) 𝑀𝑘𝑢𝑘 (𝑥) − 𝑥T

𝑘 𝐴𝑘 𝑥,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝐷 𝑘 𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T 𝐴𝑘 𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T𝐷 𝑘 𝑥,

(4.51)

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T 𝐴𝑘 𝑥,

𝑣0

𝑘+1

+ 𝑢T

𝑘 (𝑥) 𝑀𝑘𝑢𝑘 (𝑥),

otherwise,

118

and

where,

min
𝑢𝑘 (𝑥)

max
𝑤 𝑘 (𝑥)






𝑘+1 − 𝑤T
𝑣1

𝑘 (𝑥)𝑁𝑘 𝑤 𝑘 (𝑥) +

𝑥T𝐷 𝑘 𝑥𝑥T 𝐴𝑘 𝑥
(cid:101)𝑃𝑘+1(𝑥)

,

𝑣0
𝑘+1 − 𝑤T

𝑘 (𝑥)𝑁𝑘 𝑤 𝑘 (𝑥) + 𝑥T

𝑘 𝐷 𝑘 𝑥,

𝑘+1 − 𝑤T
𝑣1

𝑘 (𝑥)𝑁𝑘 𝑤 𝑘 (𝑥),

otherwise,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝐷 𝑘 𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T 𝐴𝑘 𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝐷 𝑘 𝑥,

(4.52)

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T 𝐴𝑘 𝑥,

(cid:101)𝑃𝑘+1(𝑥) := 𝑣1

𝑘+1 − 𝑣0

𝑘+1

.

(4.53)

The terms 𝑣0

𝑘+1 and 𝑣1

𝑘+1 are defined in (4.9) and (4.10), respectively. The first condition

in both (4.51) and (4.52) pertains to NE takeover in mixed strategies by both players, while

the remaining conditions correspond to playing NE takeover in pure strategies. Notably, the

problems corresponding to NE takeover in mixed strategies contain the term (cid:101)𝑃𝑘+1(𝑥), which

couples the saddle-point values in both the FlipDyn state. In the following results, we will derive

the control policies for NE takeover, both in pure and mixed strategies, for each of the FlipDyn

state. Furthermore, we observe that the min-max problem corresponding to the NE takeover in

pure strategies in each of the FlipDyn relies on the solution to the NE takeover in mixed strategies

( (cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝐷 𝑘 𝑥, (cid:101)𝑃𝑘+1(𝑥) > 𝑥T 𝐴𝑘 𝑥). Thus, we will begin by deriving the control policies for

NE takeover in mixed strategies.

We restrict the control policies to be linear state-feedback in the continuous state 𝑥, as stated in

Assumption 4.4.2. Under Assumption 4.4.2 the parameteric form of the saddle-point value in each

of the FlipDyn states still hold, given by:

𝑘 (𝑥, 𝑢𝑘 (𝑥), Ξ0
𝑉 0

𝑘+1) ⇒ 𝑉 0

𝑘 (𝑥) := 𝑥T𝑃0

𝑘 𝑥,

𝑘 (𝑥, 𝑤 𝑘 (𝑥), Ξ1
𝑉 1

𝑘+1) ⇒ 𝑉 1

𝑘 (𝑥) := 𝑥T𝑃1

𝑘 𝑥,

where 𝑃0

𝑘 and 𝑃1

𝑘 are 𝑛 × 𝑛 real symmetric matrices corresponding to the FlipDyn states 𝛼 = 0 and

1, respectively.

119

Furthermore, Assumption 4.4.4 plays an essential role in computing the saddle-point value for

the 𝑛-dimensional dynamical system (Section 4.5.3). Next, we derive conditions under which there

exists an optimal linear state-feedback control policy pair {𝑢∗

𝑘 , 𝑤∗

𝑘 }.

Theorem 4.5.1 Under Assumptions 4.4.2 and 4.4.4, consider a linear dynamical system described

by (4.24) with quadratic costs (4.25), takeover costs (4.37), and FlipDyn dynamics (4.3). Suppose

that for every 𝑘 ∈ K,

𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 ≻ 0, 𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 ≺ 0.

(4.54)

Then, an optimal linear state-feedback control policies of the form (4.26), under a mixed strategy

NE takeover for both the defender and adversary are given by:

𝑘 (𝑥) := 𝐾 ∗
𝑢∗

𝑘 (𝜂𝑘 )𝑥 = −( ˆ𝜂𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 )−1( ˆ𝜂𝐵T

𝑘 𝑃0

𝑘+1

𝐸𝑘 )𝑥,

𝑤∗

𝑘 (𝑥) := 𝑊 ∗

𝑘 (𝜂𝑘 )𝑥 = −( ˆ𝜂𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 )−1( ˆ𝜂𝐻T

𝑘 𝑃1

𝑘+1

𝐸𝑘 )𝑥,

where ˆ𝜂𝑘 := 1 − 𝜂2

𝑘 and the parameter 𝜂𝑘 satisfies the following conditions:

(4.55)

(4.56)

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )) − (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 )) ≻ 𝑑𝑘 I𝑛,

(4.57)

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )) − (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 )) ≻ 𝑎𝑘 I𝑛,

𝑥T (cid:16)

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))−

(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))

(cid:17)

𝑥 = 𝑥T𝑥

√

𝑎𝑘 𝑑𝑘
𝜂𝑘

.

(4.58)

(4.59)

□

Proof: Under Assumptions 4.4.2 and 4.4.4, if the adversary control policy 𝑤∗

𝑘 (𝑥) is known, then

the defender’s control problem reduces to

𝑥T𝑑𝑘 I𝑛𝑥𝑥T𝑎𝑘 I𝑛𝑥
− 𝑣0

𝑣1∗
𝑘+1

𝑘+1

,

(4.60)

min
𝐾𝑘

𝑘+1 + 𝑥T
𝑣0

𝑘 𝐾𝑘 𝑀𝑘 𝐾𝑘 𝑥 −

120

where 𝑣1∗

𝑘+1 := 𝑥T(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 )𝑥 and 𝑣0
adversary’s control problem for a known defender policy 𝑢∗

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 )T𝑃1

𝑘+1 is defined in (4.9). Similarly, the
𝑘 (𝑥) is given by

max
𝑊𝑘

𝑘+1 − 𝑥T
𝑣1

𝑘 𝑊𝑘 𝑁𝑘𝑊𝑘 𝑥 +

𝑥T𝑑𝑘 I𝑛𝑥𝑥T𝑎𝑘 I𝑛𝑥

𝑣1

𝑘+1

− 𝑣0∗
𝑘+1

,

(4.61)

where 𝑣0∗

𝑘+1 := 𝑥T(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 )T𝑃0

𝑘+1

(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 )𝑥, and 𝑣1

𝑘+1 is defined in (4.10).

Taking the first derivative of (4.60) and (4.61) with respect to the player control gains 𝐾𝑘 and

𝑊𝑘 , respectively, and solving the first-order optimality conditions, we obtain

𝐵T

𝑘 𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 ) + 𝑀𝑘 𝐾𝑘 −
𝑎𝑘 𝑑𝑘 (𝑥T𝑥)2𝐵T
(𝑣1∗
𝑘+1

(𝐸𝑘+𝐵𝑘 𝐾𝑘)

𝑘 𝑃0
𝑘+1
𝑘+1)2
−𝑣0

𝐻T

𝑘 𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊𝑘 ) − 𝑁𝑘𝑊𝑘 −
𝑎𝑘 𝑑𝑘 (𝑥T𝑥)2𝐻T

𝑘 𝑃1
𝑘+1
−𝑣0∗
𝑘+1

(𝐸𝑘+𝐻𝑘𝑊𝑘)
)2

(𝑣1

𝑘+1

= 0𝑚×𝑛,

= 0𝑝×𝑛,

(4.62)

(4.63)

where 0𝑖× 𝑗 ∈ R𝑖× 𝑗 is a matrix of zeros. The terms

𝑎𝑘 𝑑𝑘 (𝑥T𝑥)2𝐵T
(𝑣1∗
𝑘+1

𝑘 𝑃0
𝑘+1
𝑘+1)2
−𝑣0

(𝐸𝑘+𝐵𝑘 𝐾𝑘)

and

𝑎𝑘 𝑑𝑘 (𝑥T𝑥)2𝐻T

𝑘 𝑃1
𝑘+1
−𝑣0∗
𝑘+1

(𝐸𝑘+𝐻𝑘𝑊𝑘)
)2

,

(𝑣1

𝑘+1

introduce non-linearity in 𝐾𝑘 and 𝑊𝑘 in (4.62) and (4.63), respectively. This prevents us from

deriving an optimal linear control policy of the form (4.26). In order to address this limitation and

achieve a linear control policy, we look for scalar parameters 𝜂𝑘,0 ∈ R and 𝜂𝑘,1 ∈ R such that they

satisfy

𝑥T (cid:16)

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 )T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 ) −

(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 )T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 )

(cid:17)

𝑥 = 𝑥T𝑥

𝑥T (cid:16)

(𝐸𝑘 + 𝐻𝑘𝑊𝑘 )T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊𝑘 ) −

(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 )T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗
𝑘 )

(cid:17)

𝑥 = 𝑥T𝑥

√

𝑎𝑘 𝑑𝑘
𝜂𝑘,0

√

𝑎𝑘 𝑑𝑘
𝜂𝑘,1

,

.

(4.64)

(4.65)

Substituting (4.64) and (4.65) in (4.62) and (4.63), respectively, and solving for the parameterized

control gains we obtain:

𝐾 ∗

𝑘 = −((1 − 𝜂2

𝑘,0)𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 )−1((1 − 𝜂2

𝑘,0)𝐵T

𝑘 𝑃0

𝑘+1

𝐸𝑘 ),

(4.66)

121

𝑊 ∗

𝑘 = −((1 − 𝜂2

𝑘,1)𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 )−1((1 − 𝜂2

𝑘,1)𝐻T

𝑘 𝑃1

𝑘+1

𝐸𝑘 ),

(4.67)

∀𝑥 ∈ X. Substituting (4.66) and (4.67) in (4.64) and (4.65), respectively, yields an identical

equation. This observation implies that if there exists a common parameter 𝜂𝑘 such that 𝜂𝑘 =

𝜂𝑘,0 = 𝜂𝑘,1, we can derive the control policy pair (4.55) and (4.56), with the condition of existence

given in (4.59). The control policy pair {𝐾 ∗

the saddle-point values 𝑉 0

𝑘 (𝑥) and 𝑉 1

𝑘 , 𝑊 ∗

𝑘 } constitutes a mixed strategy NE takeover with
𝑘 (𝑥), provided the control policy pair satisfies the conditions,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥, (cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥.

Substituting the dynamics (4.24) and the parameterized optimal control policies (𝑢∗

𝑘 (𝑥), 𝑤∗

𝑘 (𝑥))

in (4.53) and factoring out the state 𝑥, we obtain the conditions (4.57) and (4.58).

Furthermore, substituting (4.64) and (4.65) in (4.62) and (4.63), respectively, and then taking

the second derivative with respect to 𝐾 ∗

𝑘 and 𝑊 ∗

𝑘 and solving for the second-order conditions, we

conclude that the controls are optimal provided

(1 − 𝜂2

𝑘 )𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 ≻ 0,

(1 − 𝜂2

𝑘 )𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 ≺ 0.

(4.68)

Given the quadratic costs (4.25), as 𝜂𝑘 → 0, the second-order optimality condition (4.68) is always

satisfied. Setting 𝜂𝑘 = 1 in (4.68), yields the limiting conditions (4.54). The obtained conditions

verify/certify strong convexity in the control gain 𝐾𝑘 and strong concavity in 𝑊𝑘 , ensuring the

existence of a unique saddle-point equilibrium.

□

Theorem 4.5.1 provides a condition under which there exists linear state-feedback control

policy pair. This characterization will enable us to compute the saddle-point value efficiently using

backward iteration. The following result further bounds the range of the parameter 𝜂𝑘 corresponding

to the mixed strategy NE takeover.

Proposition 4.5.2 The permissible range for the parameter 𝜂𝑘 , satisfying the condition (4.59)

satisfies

0 < 𝜂𝑘 <

√︄

min{𝑑𝑘,𝑎𝑘 }
max{𝑑𝑘,𝑎𝑘 }

< 1.

122

(4.69)

□

Proof: A permissible parameter 𝜂𝑘 satisfying the condition (4.59) corresponds to a control policy

𝑘 (𝑥)} that constitutes a mixed strategy NE takeover with saddle-point values 𝑉 0

𝑘 (𝑥)

pair {𝑢∗

and 𝑉 1

𝑘 (𝑥), 𝑤∗
𝑘 (𝑥). Such a control policy pair and 𝜂𝑘 must satisfy

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥, (cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥.

Since a lower bound on the term (cid:101)𝑃𝑘+1(𝑥) is equivalent to the condition (4.59), we substitute the

right-hand side of (4.59) into the prior stated conditions to obtain
√

√

𝑥T𝑥

𝑥 > 𝑑𝑘 𝑥T𝑥, 𝑥T𝑥

> 𝑎𝑘 𝑥T𝑥.

𝑎𝑘 𝑑𝑘
𝜂𝑘

𝑎𝑘 𝑑𝑘
𝜂𝑘

By eliminating the state 𝑥 and combining the terms, we arrive at (4.69).

□

This proposition enables us to reduce the search space of the permissible parameter 𝜂𝑘 . In the

subsequent sections, we will illustrate how this constrained range proves instrumental in determining

a feasible 𝜂𝑘 for both scalar and 𝑛−dimensional case. Given the control policies corresponding to

the mixed strategy NE, we will now characterize the control policies for the NE takeover in both

pure and mixed strategies.

Theorem 4.5.3 Under Assumptions 4.4.2 and 4.4.4, consider a linear dynamical system described

by (4.24) with quadratic costs (4.25), takeover costs (4.37), and FlipDyn dynamics (4.3). An optimal

linear state-feedback control policy of the form (4.26), parametrized by a scalar 𝜂𝑘 ∈ [0, 1] is given

by

𝑢∗
𝑘 (𝑥) =

𝐾 ∗

𝑘 (𝜂𝑘 )𝑥,

𝐾 ∗

𝑘 (1)𝑥,

if

if

𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) ≤ 𝑥T𝑑𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝐾 ∗

𝑘 (0)𝑥,

otherwise,






123

(4.70)

𝑤∗

𝑘 (𝑥) =

𝑊 ∗

𝑘 (𝜂𝑘 )𝑥,

𝑊 ∗

𝑘 (1)𝑥,

if

if

𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑘+1(𝑥) ≤ 𝑥T𝑎𝑘 I𝑛𝑥,
(cid:101)𝑃∗

𝑊 ∗

𝑘 (0)𝑥,

otherwise,






where

𝑘+1(𝑥) := 𝑥T (cid:16)
(cid:101)𝑃∗

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))

− (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 )

and 𝜂𝑘 , 𝑃1

𝑘+1 and 𝑃0

𝑘+1 satisfy conditions (4.57), (4.58) and (4.59).

(4.71)

(cid:17)

𝑥,

□

Proof: We will establish the proof only for the defender’s control policy, as the derivation is

analogous for the adversary’s control policy. We begin by considering the condition in both (4.70)

and (4.71), specifically:

𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥, and (cid:101)𝑃∗
(cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥.

Under these conditions and the conditions (4.57), (4.58) and (4.59), Theorem 4.5.1 yields mixed

strategy NE takeover policies. To complete the remaining part of this claim, we proceed to derive

the control policies for NE takeovers in pure strategies.

i) Pure strategy: The defender chooses to stay idle whereas the adversary chooses to takeover.

This takeover strategy is characterized by the conditions

(cid:101)𝑃∗
𝑘+1(𝑥) ≤ 𝑥T𝑑𝑘 𝐼𝑥, and (cid:101)𝑃∗

𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥.

If the optimal adversary control policy 𝑤∗

𝑘 (𝑥) for the corresponding pure strategy NE takeover are

known, the defender’s control problem simplifies to

min
𝐾𝑘

𝑣1∗
𝑘+1 + 𝑥T

𝑘 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 𝑥 − 𝑥T

𝑘 𝑎𝑘 I𝑛𝑥.

(4.72)

124

Taking the first derivative of (4.72) with respect to 𝐾𝑘 , and subsequently applying the first-order

optimality condition given 𝑀𝑘 ∈ S𝑚×𝑚

+

, we obtain

𝑀𝑘 𝐾𝑘 𝑥𝑥T = 0𝑚×𝑛, ⇒ 𝑀 −1

𝑘 𝑀𝑘 𝐾𝑘 𝑥𝑥T = 0𝑚×𝑛, ∀𝑥 ∈ X

⇒ 𝐾𝑘 = 0𝑚 = 𝐾 ∗

𝑘 (𝜂𝑘 = 1).

This means that the defender refrains from applying any control input due to a deterministic

adversarial takeover at 𝑘 + 1. Notice that this condition of zero control gain is aligned with setting

𝜂𝑘 = 1 in (4.66).

ii) Pure strategy: Both the defender and adversary choose the action of staying idle. In this

case, the takeover strategy corresponds to the conditions:

𝑘+1(𝑥) ≤ 𝑥T𝑑𝑘 I𝑛𝑥, and (cid:101)𝑃∗
(cid:101)𝑃∗

𝑘+1(𝑥) ≤ 𝑥T𝑎𝑘 I𝑛𝑥.

Given the absence of an adversary control term in determining the saddle-point value of the game,

the defender’s control problem simplifies to:

min
𝐾𝑘

𝑘+1 + 𝑥T
𝑣0

𝑘 𝐾 T

𝑘 𝑀𝑘 𝐾𝑘 𝑥.

(4.73)

Taking the first derivative of (4.73) with respect to 𝐾𝑘 , and solving for the first-order optimality

condition, we obtain:

𝐵T

𝑘 𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾𝑘 ) = −𝑀𝑘 𝐾𝑘 ,

⇒ 𝐾𝑘 = (𝑀𝑘 + 𝐵T𝑃0

𝑘+1

𝐵𝑘 )−1𝐵T

𝑘 𝑃0

𝑘+1

𝐸𝑘 := 𝐾 ∗

𝑘 (𝜂𝑘 = 0).

This control policy pertains to a single-player control problem, given that the FlipDyn state deter-

ministically remains at 𝛼𝑘+1 = 0. Furthermore, this control policy corresponds to setting 𝜂𝑘 = 0

in (4.66).

□

Theorems 4.5.1 and 4.5.3 completely characterize the control policies of both players in the

space of pure and mixed NE strategies for the takeover. This characterization enables us to compute

the saddle-point value efficiently. If we define the dynamics of the defender and adversary using a

parameter 𝜁𝑘 ∈ R, then the continuous state evolution is given by

𝑥𝑘+1 = ˇ𝐵𝑘 (𝜁𝑘 )𝑥𝑘 := (𝐸𝑥 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜁𝑘 ))𝑥𝑘 ,

𝑥𝑘+1 = ˇ𝑊𝑘 (𝜁𝑘 )𝑥𝑘 := (𝐸𝑥 + 𝐻𝑘𝑊 ∗

𝑘 (𝜁𝑘 ))𝑥𝑘 .

(4.74)

125

The parameter 𝜁𝑘 = 𝜂𝑘 when we use the derived control policies (4.55) and (4.56) under a mixed

strategy NE.

Next, we outline the NE takeover strategies of both the players, along with the corresponding

saddle-point values for each of the FlipDyn state, for discrete-time linear dynamics, linear state-

feedback control policies, and quadratic costs. We begin by analyzing the case when 𝑥 is a scalar,

for which we will compute the saddle-point value exactly, and subsequently proceed to approximate

the saddle-point value for the 𝑛- dimensional case.

4.5.2 Scalar/1-dimensional dynamical system

The quadratic costs any time 𝑘 ∈ K stated in (4.28) for a scalar dynamical system. For a

scalar dynamical system, we use the following notation to represent the saddle-point value in each

FlipDyn state. Let

𝑘 (𝑥) := p0
𝑉 0

𝑘 𝑥2, 𝑉 1

𝑘 (𝑥) := p1

𝑘 𝑥2,

where p𝛼

𝑘 ∈ R, 𝛼 ∈ {0, 1}, 𝑘 ∈ K. Building on Theorem 4.3.2, we present the following result,

which provides a closed-form expression for the NE takeover in both pure and mixed strategies of

both players, and outlines the saddle-point value update of the parameter p𝛼
𝑘 .

Corollary 3 (Case 𝛼𝑘 = 0) The unique NE takeover strategies of the FlipDyn-Con game (4.7)

at any time 𝑘 ∈ K, subject to the dynamics (4.74) for a scalar dynamical system with quadratic

126

costs (4.28), takeover costs (4.37), and FlipDyn dynamics (4.3) are given by:

𝑦0∗
𝑘 =

𝑧0∗
𝑘 =

(cid:20) 𝑎𝑘
ˇp𝑘+1

(cid:20)

1

1 −

(cid:21) T

,

𝑎𝑘
ˇp𝑘+1

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

(cid:21) T

,

0

otherwise,

(cid:20)

(cid:20)

(cid:20)

1 −

𝑑𝑘
ˇp𝑘+1

(cid:21) T

,

𝑑𝑘
ˇp𝑘+1

0

1

(cid:21) T

(cid:21) T

,

,

1

0

if

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 ≤ 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

otherwise,










where

(cid:32)

ˇp𝑘+1 :=

𝑁 2

𝑘 p1
(𝑁𝑘 − (1 − 𝜂2

𝑘+1
𝑘 )𝐻2

𝑘 p1

𝑘+1

−

)2

𝑀 2

𝑘 p0
(𝑀𝑘 + (1 − 𝜂2

𝑘+1
𝑘 )𝐵2

𝑘 p0

𝑘+1

The saddle-point value parameter at time 𝑘 is given by:

(4.75)

(4.76)

(cid:33)

𝑘 .
𝐸 2

)2

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 ≤ 𝑑𝑘 ,

(4.77)

𝑘 ,
𝐸 2

)2

if

if

ˇp𝑘+1 > 𝑎𝑘 ,

otherwise,

p0
𝑘 =






𝐺0

𝑘 + 𝑑𝑘 −

𝑑𝑘 𝑎𝑘
ˇp𝑘+1

+ 𝐾𝑘 (𝜂𝑘 )∗2𝑀𝑘 +

𝑀 2

𝑘 p0
(𝑀𝑘 + (1 − 𝜂2

𝑘+1
𝑘 )𝐵2

𝑘 p0

𝑘+1

𝐺0

𝑘 − 𝑎𝑘 + 𝐾𝑘 (1)∗2𝑀𝑘 +

𝑁 2

𝑘 p1
(𝑁𝑘 − (1 − 𝜂2

𝑘+1
𝑘 )𝐻2

𝑘 p1

𝑘+1

𝑘 ,
𝐸 2

)2

𝐺0

𝑘 +

𝑀 2

𝑘 p0
(𝑀𝑘 + 𝐵2

𝑘+1
𝑘 p0

𝑘+1

𝐸 2

𝑘 + 𝐾𝑘 (0)∗2𝑀𝑘 ,

)2

127

(Case 𝛼𝑘 = 1) The unique NE takeover strategies are given by:










𝑦1∗
𝑘 =

𝑧1∗
𝑘 =

(cid:20)

(cid:20)

(cid:20)

1 −

𝑎𝑘
ˇp𝑘+1

(cid:21) T

,

𝑎𝑘
ˇp𝑘+1

0

1

(cid:21) T

(cid:21) T

,

,

1

0

if

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 ≤ 𝑎𝑘 ,

otherwise,

(cid:20) 𝑑𝑘
ˇp𝑘+1

(cid:20)

1

1 −

(cid:21) T

,

𝑑𝑘
ˇp𝑘+1

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

(cid:21) T

,

0

otherwise,

(4.78)

(4.79)

The saddle-point value parameter at time 𝑘 is given by,

p1
𝑘 =






𝐺1

𝑘 − 𝑎𝑘 +

𝑑𝑘 𝑎𝑘
ˇp𝑘+1

− 𝑊𝑘 (𝜂𝑘 )∗2𝑁𝑘 +

𝑁 2

𝑘 p1
(𝑁𝑘 − (1 − 𝜂2

𝑘+1
𝑘 )𝐻2

𝑘 p1

𝑘+1

𝐺1

𝑘 + 𝑑𝑘 − 𝑊𝑘 (1)∗2𝑁𝑘 +

𝑀 2

𝑘 p0
(𝑀𝑘 + (1 − 𝜂2

𝑘+1
𝑘 )𝐻2

𝑘 p0

𝑘+1

𝑘 ,
𝐸 2

)2

𝐺1

𝑘 +

𝑁 2

𝑘 p1
(𝑁𝑘 − 𝐻2

𝑘+1
𝑘 p1

𝑘+1

𝐸 2

𝑘 − 𝑊𝑘 (0)∗2𝑁𝑘 ,

)2

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 > 𝑑𝑘 ,

(4.80)

𝑘 ,
𝐸 2

)2

if

if

ˇp𝑘+1 ≤ 𝑎𝑘 ,

otherwise.

The recursions (4.77) and (4.80) hold provided,

p0
𝑘+1

𝐵2

𝑘 + 𝑀𝑘 ≥ 0, p1

𝑘+1

𝐻2

𝑘 − 𝑁𝑘 ≤ 0.

(4.81)

The terminal conditions for the recursions (4.77) and (4.80) are:

𝐿+1 := 𝐺0
p0

𝐿+1

,

𝐿+1 := 𝐺1
p1

𝐿+1

□

128

Proof: We begin the proof by determining the NE takeover in both pure and mixed strategies,

and computing the corresponding saddle-point value parameter for the FlipDyn state of 𝛼 = 0.

We substitute the quadratic costs (4.28), linear dynamics (4.74), and the obtained optimal control

policies (4.70) and (4.71) in the term (cid:101)𝑃𝑘+1(𝑥) from (4.53) to obtain:

˜𝑃𝑘+1(𝑥) :=

(cid:16)

(𝐸𝑥 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))2p1

𝑘+1 − (𝐸𝑥 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂𝑘 ))2p0

𝑘+1

(cid:17)

𝑥2,

p1
𝑘+1

= (cid:169)
(cid:173)
(cid:171)
(cid:32)

=

p1
𝑘+1

(cid:32)

1 +

𝐻2

𝑘 p1
𝑁𝑘 − 𝐻2

𝑁 2

𝑘 − 𝐻2

𝑘 p1
(𝑁𝑘 − 𝐻2

𝑘+1
𝑘 p1
𝑘+1
𝑘+1 + 𝐻2
𝑘 p1
𝑘 p1
)2
𝑘+1

(cid:33) 2

(cid:32)

− p0

𝑘+1

1 −

𝑀 2

𝑘 p0
𝑀𝑘 + 𝐻2

𝑘+1
𝑘 p0

𝑘+1

− p0

𝑘+1

𝑀𝑘 + 𝐵2

𝑘 p0
𝑘+1
(𝑀𝑘 + 𝐵2

(cid:33) 2

(cid:170)
(cid:174)
(cid:172)
𝑘 p0
)2

𝑘+1
− 𝐵2
𝑘 p0

𝑘+1

𝐸 2

𝑘 𝑥2

𝑘+1

(cid:33)

𝐸 2

𝑘 𝑥2,

= ˇp𝑘+1𝑥2

Substituting ˇp𝑘+1 and takeover costs (4.37) in (4.14) and (4.15), we obtain the NE takeover strategies

presented in (4.75) and (4.76), respectively. Notably, as observed in Theorem 4.3.2, the NE takeover

strategies for the FlipDyn state of 𝛼𝑘 = 1 can be also be obtained by taking the complementary

of (4.75) and (4.76), resulting in (4.78) and (4.79), respectively.

To obtain a recurrence relation for the parameter p0

𝑘 , we substitute the linear dynamics (4.74)

along with quadratic costs (4.28), takeover costs (4.37). This yields

𝑘 𝑥2 =
p0






(𝐺0

𝑘 + 𝑑𝑘 )𝑥2 −

𝑑𝑘 𝑎𝑘 𝑥4
ˇp𝑘+1𝑥2

+ (𝐾 ∗

𝑘 (𝜂𝑘 )2𝑀𝑘 + ˇ𝐵𝑘 (𝜂𝑘 )2p0

𝑘+1)𝑥2,

(𝐺0

𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )2p1

𝑘+1 − 𝑎𝑘 )𝑥2,

(𝐺0

𝑘 + 𝐾 ∗

𝑘 (0)2𝑀𝑘 + ˇ𝐵𝑘 (0)2p0

𝑘+1)𝑥2,

if

if

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

ˇp𝑘+1 ≤ 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 ,

otherwise.

Substituting the control gains 𝐾 ∗

𝑘 (𝜂𝑘 ) (4.71) and factoring out the term 𝑥2,
we arrive at (4.77). Employing analogous substitutions for the FlipDyn state of 𝛼𝑘 = 1, we

𝑘 (𝜂𝑘 ) (4.70) and 𝑊 ∗

obtain (4.80).

Condition (4.81) corresponds to a second-order optimality condition for the policy pair 𝑢∗

𝑘 (𝑥)
𝑘 (𝑥) derived from (4.54) for a scalar dynamical system. This condition ensures that the

and 𝑤∗

129

control policies form a saddle-point equilibrium.

□

Corollary 3 presents a closed-form solution for the FlipDyn-Con (4.7) game with NE takeover

strategies independent of state of the scalar/1-dimensional system. However, it is important to note

that not all control quadratic costs (4.28) satisfy the recursion of the saddle-point value parameter

outlined in Corollary 3. The following remark presents the minimum adversary control cost, 𝑁𝑘 ,

that satisfies the parameter recursions described in (4.77) and (4.80).

Remark 4.5.4 Given a scalar/1-dimensional system (4.74) with quadratic costs (4.28), the NE

takeover strategies and the recursion for the saddle-point value parameter, as outlined in Corol-

lary 3, exist for adversary control costs 𝑁 ∗

𝑘 ≤ 𝑁𝑘 provided

− 𝑁 ∗

𝑘 + 𝐻2

𝑘 p1

𝑘+1

< 0, ∀𝑘 ∈ K.

The parameters 𝑁 ∗

𝑘 in Remark 4.5.4 can be computed using a bisection method at every time

𝑘 ∈ K. Given an arbitrary adversary control cost 𝑁𝑘 , we start by updating the parameter of saddle-

point value in (4.77) and (4.80) backward in time. At any time instant 𝑘 ∈ K, if the inequality

𝑘 p1

−𝑁𝑘 + 𝐻2

𝑘+1 ≤ 0 is not satisfied, the adversary cost 𝑁𝑘 is updated using the bisection method.
This process is iteratively repeated until reaching the time 𝑘 = 0 and the bisection method has

converged. The determined cost 𝑁 ∗

𝑘 indicates the minimal cost the adversary must bear to control

the system effectively.

Similar to the findings presented in [16], in addition to the minimum adversary control costs, we

can determine a minimum adversarial state cost 𝐺1∗
𝑘

that guarantees a mixed strategy NE takeover

at every time 𝑘 ∈ K. We characterize such an adversarial state cost in the following remark.

Remark 4.5.5 Given a scalar/1-dimensional system (4.74) with quadratic costs (4.28), the mixed

strategy NE takeover and the corresponding recursion for the saddle-point value parameter, as

outlined in Corollary 3, exists for an adversary state-dependent cost 𝐺1∗

𝑘 ≤ 𝐺1

𝑘 provided

ˇp𝑘+1 > 𝑑𝑘 ,

ˇp𝑘+1 > 𝑎𝑘 , ∀𝑘 ∈ K,

130

with the parameters at the time 𝐿 + 1 satisfying

𝐿+1 = 𝐺1∗
p1
𝐿+1

,

𝐿+1 = 𝐺0
p0

𝐿+1

.

(4.82)

The process for determining the minimum adversary state cost 𝐺1∗
𝑘

is analogous to that of 𝑁 ∗
𝑘

and involves utilizing a bisection method. To simultaneously compute both 𝐺1

𝑘 and 𝑁 ∗

𝑘 requires

a dual bisection approach, with an outer bisection loop for 𝑁 ∗

𝑘 and an inner bisection loop for
𝑘 .This iterative procedure continues until we reach the time instant 𝑘 = 0, and both bisections

𝐺1∗

have converged. Next, we illustrate the results of Corollary 3 through a numerical example.

A Numerical Example

We evaluate the NE takeover strategies and saddle-point value parameters obtained in Corol-

lary 3 on a linear time-invariant (LTI) scalar system for a horizon length of 𝐿 = 20. The quadratic

costs (4.28) are assumed to fixed ∀𝑘 ∈ K, given by

𝐺0

𝑘 = 𝐺0 = 1, 𝐺1

𝑘 = 𝐺1 = 1, 𝑑𝑘 = 𝑑 = 0.45,

𝑎𝑘 = 𝑎 = 0.25, 𝑀𝑘 = 𝑀 = 0.65.

The control matrices of both the players reduce to

𝐵𝑘 = 𝐻𝑘 = Δ𝑡, ∀𝑘 ∈ K,

where Δ𝑡 = 0.1 for the numerical evaluation. We solve for the NE takeover strategies and the

saddle-point value parameters for two cases of a fixed state transition constant 𝐸𝑘 = 𝐸, ∀𝑘 ∈ K:

𝐸 = 0.85 and 1.0. For 𝐸 = 0.85, the minimal adversary control costs are

𝑁 ∗

𝑘 = 𝑁 ∗ =





0.39,

if ˇp𝑘+1 ≥ 𝑎𝑘 , ˇp𝑘+1 ≥ 𝑑𝑘

0.25,

otherwise,

whereas, for 𝐸 = 1.0, the minimal adversary control costs are

𝑁 ∗

𝑘 = 𝑁 ∗ =





2.17,

if ˇp𝑘+1 ≥ 𝑎𝑘 , ˇp𝑘+1 ≥ 𝑑𝑘 ,

1.51,

otherwise,

131

To obtain a mixed strategy NE takeover over the entire horizon 𝐿, we solve for adversary cost 𝐺1∗
𝑘

for each case given by

𝐺1∗

𝑘 = 𝐺1∗ =

1.56, when 𝐸 = 0.85,

1.43, when 𝐸 = 1.00.





Figures 4.6a and 4.6b illustrate the saddle-point value parameters p0

𝑘 for both 𝐸 = 0.85
and 1.00. In Figure 4.6a, M-NE represents a mixed strategy NE takeover over the entire horizon

𝑘 and p1

𝐿, achieved through 𝑁 ∗

𝑘 and 𝐺1∗

𝑘 . We observe the saddle-point parameter value for the adversary

increases with increasing value of 𝐸, i.e., as the system shifts from open-loop stable 𝐸 < 1 to

unstable 𝐸 ≥ 1, there is larger incentive for the adversary to takeover the system.

Figures 4.7a and 4.7a shows the probabilities of takeover by the defender and adversary when

𝛼𝑘 = 0. For both 𝐸 = 0.85 and 1.00, the probabilities follow a monotonic decrease (resp. increase)

for the defender (resp. adversary). When the obtained takeover strategies contain both the pure

and mixed strategy NE, there exists a time instant beyond which both the players switch to a pure

strategy NE for all future time instants. This switch indicates that under the given costs, there

is no incentive for either player to takeover. Finally, the difference between 𝐸 = 0.85 and 1.00,

shows the rate at which the takeover strategies change over time. The probability of taking over

when 𝐸 = 1.00 is higher compared to 𝐸 = 0.85, and decreases rapidly at the end of the horizon.

Next, we will extend our derivation and analysis of the FlipDyn-Con game with discrete-time linear

dynamics and quadratic costs to 𝑛−dimensions.

4.5.3 n-dimensional system

Unlike the scalar case, wherein the state 𝑥 was factored out during the computation of the NE

takeover strategies and saddle-point value parameters p0

𝑘 and p1

𝑘 , that simplification does not yield

exact results for an 𝑛−dimensional system. The challenge for factoring out the state at any time

𝑘 ∈ K, arises from the term

𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥

𝑥T (cid:16)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ) − ˇ𝐵𝑘 (𝜂𝑘 )T𝑃0

ˇ𝐵𝑘 (𝜂𝑘 )
𝑘+1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:17)

(cid:124)

(cid:125)

,

𝑥

(4.83)

(cid:123)(cid:122)
(cid:101)𝑃𝑘+1 (𝑥)

132

(a)

Figure 4.6 Saddle-point value parameters p𝑖
constant (a) 𝐸 = 0.85, (b) 𝐸 = 1.0. The parameters p𝑖
saddle-point under a mixed NE takeover over the entire time horizon.

(b)
𝑘 , 𝑘 ∈ {1, 2, . . . , 𝐿}, 𝑖 ∈ {0, 1} for state transition
𝑘 ,M-NE corresponds to the parameters of the

(a)

(b)

Figure 4.7 Defender takeover strategies 𝛽𝑘 and adversary takeover strategies 𝛾𝑘 for state transition
(a) 𝐸 = 0.85 and (b) 𝐸 = 1.0. M-NE corresponds to the mixed NE policy.

which arises when a mixed strategy NE takeover is played in either of the FlipDyn states. A similar

challenge was encountered in [16], where the aforementioned term was approximated to factor out

the state 𝑥 while computing the saddle-point value parameters backward in time. Here, we propose

a more general approach by leveraging the results of Theorem 4.5.1 to address this limitation.

Recall that the parameterized control policy pair {𝑢∗

𝑘 (𝜂𝑘 ), 𝑤∗

𝑘 (𝜂𝑘 )} with a feasible parameter 𝜂𝑘

133

5101520204060801001201401618202451015205001000150020002500161820246800.5510152000.5100.5510152000.51must satisfy condition (4.59):

𝑥T (cid:16)

ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ) − ˇ𝐵𝑘 (𝜂𝑘 )T𝑃0

𝑘+1

(cid:17)

ˇ𝐵𝑘 (𝜂𝑘 )

𝑥 = 𝑥T𝑥

√

𝑎𝑘 𝑑𝑘
𝜂𝑘

.

Substituting condition (4.59) in (4.83) yields:

𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥
(cid:101)𝑃𝑘+1(𝑥)

:=

𝜂𝑘 𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥
√

𝑥T𝑥

𝑎𝑘 𝑑𝑘

= 𝜂𝑘

√︁𝑎𝑘 𝑑𝑘 𝑥T𝑥.

(4.84)

Analogous to the scalar/1−dimensional case, we will use Theorem 4.3.2 to present the following

result, which provides a closed-form expression for the NE takeover in both pure and mixed strategies
𝑘 ∈ R𝑛×𝑛, 𝛼 ∈ {0, 1}.

of both players, and outlines the saddle-point value update of the parameter 𝑃𝛼

Corollary 4 (Case 𝛼𝑘 = 0) The unique NE takeover strategies of the FlipDyn-Con game (4.7) for

every 𝑘 ∈ K, subject to the dynamics (4.74), with quadratic costs (4.25), takeover costs (4.37), and

FlipDyn dynamics (4.3) are given by

(4.85)

(4.86)

𝑦0∗
𝑘 =

𝑧0∗
𝑘 =

















(cid:20)

(cid:20)

(cid:118)(cid:117)(cid:116) 𝑎𝑘
𝑑𝑘

𝜂𝑘







(cid:20)

1

1 − 𝜂𝑘

(cid:118)(cid:117)(cid:116) 𝑎𝑘
𝑑𝑘








0

T

, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

(cid:21) T

, otherwise,

1 − 𝜂𝑘

(cid:118)(cid:117)(cid:116) 𝑑𝑘
𝑎𝑘

T

,

(cid:118)(cid:117)(cid:116) 𝑑𝑘
𝑎𝑘








𝜂𝑘

0

1

(cid:21) T

(cid:21) T

,

,

1

0

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T𝑑𝑘 I𝑛𝑥,

otherwise.

134

The saddle-point value parameter at time 𝑘 is given by:

𝑃0

𝑘 =






𝐺0

𝑘 + ˇ𝐵𝑘 (𝜂𝑘 )T𝑃0

ˇ𝐵𝑘 (𝜂𝑘 ) + 𝐾 ∗

𝑘 (𝜂𝑘 )T𝑀𝑘 𝐾 ∗

𝑘 (𝜂𝑘 )

+ 𝑑𝑘 I𝑛 − I𝑛𝜂𝑘

𝑘+1
√︁𝑎𝑘 𝑑𝑘 ,

𝐺0

𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ) − 𝑎𝑘 I𝑛,

𝐺0

𝑘 + 𝐾 ∗

𝑘 (0)T𝑀𝑘 𝐾 ∗

𝑘 (0)

+ ˇ𝐵𝑘 (0)T𝑃0

𝑘+1

ˇ𝐵𝑘 (0),

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T𝑑𝑘 I𝑛𝑥,

otherwise.

(Case 𝛼𝑘 = 1) The unique NE takeover strategies are given by:










𝑦1∗
𝑘 =

𝑧1∗
𝑘 =








(cid:20)

(cid:20)

1 − 𝜂𝑘

(cid:118)(cid:117)(cid:116) 𝑎𝑘
𝑑𝑘

T

,

(cid:118)(cid:117)(cid:116) 𝑎𝑘
𝑑𝑘








𝜂𝑘

0

1

(cid:21) T

(cid:21) T

,

,

1

0

(cid:118)(cid:117)(cid:116) 𝑑𝑘
𝑎𝑘




𝜂𝑘



(cid:20)

1

1 − 𝜂𝑘

(cid:118)(cid:117)(cid:116) 𝑑𝑘
𝑎𝑘








T

,

(cid:21) T

0

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

otherwise,

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

,

otherwise.

(4.87)

(4.88)

(4.89)

The saddle-point value parameter at time 𝑘 is given by,

𝑃1

𝑘 =






𝐺1

𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

ˇ𝑊𝑘 (𝜂𝑘 ) − 𝑊 T∗

𝑘 (𝜂𝑘 )𝑁𝑘𝑊 ∗

𝑘 (𝜂𝑘 )

− 𝑎𝑘 I𝑛 + I𝑛𝜂𝑘

𝑘+1
√︁𝑎𝑘 𝑑𝑘 ,

𝐺1

𝑘 + ˇ𝐵𝑘 (𝜂𝑘 )T𝑃0

𝑘+1

ˇ𝐵𝑘 (𝜂𝑘 ) + 𝑑𝑘 I𝑛,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑎𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑥T𝑎𝑘 I𝑛𝑥,

(4.90)

(cid:101)𝑃𝑘+1(𝑥) > 𝑥T𝑑𝑘 I𝑛𝑥,

𝐺1

𝑘 − 𝑊𝑘 (0)T∗𝑁𝑘𝑊 ∗

𝑘 (0) + ˇ𝑊𝑘 (0)T𝑃1

𝑘+1

ˇ𝑊𝑘 (0),

otherwise.

135

The recursions (4.87) and (4.90) hold provided,

𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 ≻ 0, 𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 ≺ 0.

(4.91)

The terminal conditions for the recursions (4.87) and (4.90) are:

𝑃0
𝐿+1 := 𝐺0

𝐿+1

, 𝑃1

𝐿+1 := 𝐺1

𝐿+1

.

□

Proof: We begin the proof by determining the NE takeover in pure and mixed strategies of the

FlipDyn state of 𝛼 = 0. We substitute the takeover cost (4.37) and the terms from (4.84) in (4.14)

and (4.15), to obtain the NE takeover policies in (4.85) and (4.86), respectively. Analogous to the

scalar/1−dimensional case, the NE takeover strategies in (4.88) and (4.89) for the FlipDyn state of

𝛼 = 1 are the complementary takeover strategies of the FlipDyn state 𝛼 = 0.

To determine the saddle-point value parameters for the FlipDyn state of 𝛼 = 0, we substi-

tute (4.84), discrete-time linear dynamics (4.74), quadratic costs (4.25) and takeover costs (4.37)

in (4.16) and factor out the state 𝑥 to obtain (4.87). Through similar substitutions and factorization

we can obtain (4.90) corresponding to the FlipDyn state of 𝛼 = 1.

□

Similar to the 1−dimensional case, Corollary 4 presents a closed-form solution for the FlipDyn-

Con (4.7) game with NE takeover strategies independent of state. However, this NE takeover

strategy and saddle-point value parameters are conditioned on finding a feasible parameter 𝜂𝑘 , ∀𝑘 ∈

K that satisfies (4.84). A feasible parameter 𝜂𝑘 is seldom found corresponding to the linear

dynamics (4.74), as the matrices ˇ𝐵𝑘 (𝜁𝑘 ) and ˇ𝑊𝑘 (𝜁𝑘 ) are generally non-diagonal. Therefore, there

is a need to find approximate NE takeover strategies and bounds on the saddle-point values for any

general 𝑛−dimensional case which need not satisfy (4.84). A solution addressing the limitation in

determining a parameter 𝜂𝑘 is found by re-visiting the optimal linear state-feedback control from

Theorem 4.5.1, described in the following result.

Lemma 4.5.6 Under Assumptions 4.4.2 and 4.4.4, consider a linear dynamical system described

by (4.24) with quadratic costs (4.25), takeover costs (4.37), and FlipDyn dynamics (4.3) with known

136

saddle-point value parameters 𝑃1

𝑘+1 and 𝑃0

𝑘+1. Suppose that for every 𝑘 ∈ K,

𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 ≻ 0, 𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 ≺ 0,

(4.92)

and for every 𝑥 ∈ X, there exists scalars 𝜂

𝑘

∈ R and 𝜂𝑘 ∈ R which correspond to an optimal linear

state-feedback control pair {𝐾 ∗

𝑘 (𝜂

𝑘

), 𝑊 ∗

𝑘 (𝜂𝑘 )} of the form (4.55) and (4.56), such the following

conditions are satisfied:

𝑥T𝑥

√

𝑎𝑘 𝑑𝑘
𝜂

𝑘

≤ 𝑥TP𝑘+1𝑥 ≤ 𝑥T𝑥

√

𝑎𝑘 𝑑𝑘
𝜂𝑘

,

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )) − (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )) − (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

𝑘

𝑘

(4.93)

)) ≻ 𝑑𝑘 I𝑛.

(4.94)

)) ≻ 𝑎𝑘 I𝑛.

𝑘

𝑘

(4.95)

where

P𝑘+1 =(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 ))T𝑃1

𝑘+1(𝐸𝑘 + 𝐻𝑘𝑊 ∗

𝑘 (𝜂𝑘 )) − (𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

))T𝑃0

𝑘+1(𝐸𝑘 + 𝐵𝑘 𝐾 ∗

𝑘 (𝜂

)).

𝑘

𝑘

Then, the saddle-point value parameters at time 𝑘 ∈ K, under a mixed strategy NE takeover of

both the FlipDyn states satisfy

𝑃0

𝑘 ⪯ 𝐺0

𝑘 + 𝑑𝑘 I𝑛 + 𝐾 ∗

𝑘 (𝜂

)T𝑀𝑘 𝐾 ∗

𝑘 (𝜂

𝑘

𝑘

) − I𝑛𝜂

𝑘

√︁𝑎𝑘 𝑑𝑘 + ˇ𝐵𝑘 (𝜂

)T𝑃0

𝑘+1

𝑘

ˇ𝐵𝑘 (𝜂

),

𝑘

𝑃1

𝑘 ⪰ 𝐺1

𝑘 − 𝑎𝑘 I𝑛 − 𝑊 ∗

𝑘 (𝜂𝑘 )T𝑁𝑘𝑊 ∗

𝑘 (𝜂𝑘 ) + I𝑛𝜂𝑘

√︁𝑎𝑘 𝑑𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ).

(4.96)

(4.97)

□

Proof: From (4.55), a linear defender control policy gain parameterized by a scalar 𝜂

𝑘 , is given

by:

𝐾 ∗

𝑘 (𝜂

𝑘

) = −(𝜗(𝜂

𝑘

)𝐵T

𝑘 𝑃0

𝑘+1

𝐵𝑘 + 𝑀𝑘 )−1(𝜗(𝜂

𝑘

)𝐵T

𝑘 𝑃0

𝑘+1

𝐸𝑘 ),

(4.98)

where 𝜗(𝑐) := 1 − 𝑐2. Likewise, from (4.56), a linear adversary control policy gain parameterized

by a scalar 𝜂𝑘 , is given by:

𝑊 ∗

𝑘 (𝜂𝑘 ) = −(𝜗(𝜂𝑘 )𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 )−1(𝜗(𝜂𝑘 )𝐻T

𝑘 𝑃1

𝑘+1

𝐸𝑘 ).

(4.99)

137

Upon substituting the condition (4.93) in (4.60) and (4.61) and solving for the second-order optimal-

ity condition (similar to Theorem 4.5.1) yields (4.92), which certifies a saddle-point equilibrium.

Recall that any control policy pair {𝐾𝑘 , 𝑊𝑘 } that constitutes a mixed strategy NE takeover to

both the saddle-point values 𝑉 0

𝑘 (𝑥) and 𝑉 1

𝑘 (𝑥) must satisfy the conditions:

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥, (cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥.

Thus, upon substituting the linear dynamics (4.24) and the optimal control gains {𝐾 ∗

𝑘 (𝜂

), 𝑊 ∗

𝑘 (𝜂𝑘 )}

𝑘

in (4.53) and factoring out the state 𝑥, we obtain the conditions (4.94) and (4.95).

Next, we will only establish (4.96), as the derivation for (4.97) is analogous. Under a mixed

strategy NE takeover, we substitute the quadratic costs (4.25), discrete-time linear dynamics (4.74)

and the defender control (4.98) in (4.16) to obtain:

𝑥T𝑃0

𝑘 𝑥 =𝑥T (cid:16)

𝐺0

𝑘 + 𝑑𝑘 I𝑛 + 𝐾 ∗

𝑘

)T𝑀𝑘 𝐾 ∗

𝑘 (𝜂
𝑘 (𝜂
𝑘
𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥

(cid:17)

)

𝑥 + 𝑥T (cid:16)

ˇ𝐵𝑘 (𝜂

)T𝑃0

𝑘+1

𝑘

ˇ𝐵𝑘 (𝜂

)

𝑘

(cid:17)

𝑥−

𝑥T (cid:16)

ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:124)

ˇ𝑊𝑘 (𝜂𝑘 ) − ˇ𝐵𝑘 (𝜂
(cid:123)(cid:122)
P𝑘+1

𝑘+1

)T𝑃0
)
𝑘
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

ˇ𝐵𝑘 (𝜂

𝑘+1

𝑘

.

(cid:17)

𝑥

(cid:125)

Using condition (4.93), we bound the term containing P𝑘+1 by

𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥
𝑥TP𝑘+1𝑥

≤ 𝜂

≤ 𝜂

𝑘

𝑘

𝑥T𝑎𝑘 I𝑛𝑥𝑥T𝑑𝑘 I𝑛𝑥
√

𝑎𝑘 𝑑𝑘

𝑥T𝑥
𝑥T𝑥√︁𝑎𝑘 𝑑𝑘 .

,

Substituting this bound in 𝑥T𝑃0

𝑘 𝑥 and factoring out the state 𝑥, we obtain (4.96).

□

Lemma 4.5.6 provides a linear state-feedback control for the defender (resp. adversary) and

enables us to compute bounds on the saddle-point values independent of the state 𝑥 backward

in time. More importantly, condition (4.93) serves as a relaxation for (4.84). Such a relaxation

enables us to determine an upper and lower bound in a semi-definite sense, for the saddle-point value

parameters using the scalars 𝜂𝑘 and 𝜂

𝑘 , which can be used to approximately compute the saddle-

point value parameters recursively. Therefore, following the same methodology from [16], for

138

the 𝑛−dimensional case, we will solve for an approximate NE takeover strategies and saddle-point

values using the parameterization,

𝑘 (𝑥) := 𝑥T𝑃0
𝑉 0

𝑘 𝑥, 𝑉 1

𝑘 (𝑥) := 𝑥T𝑃1

𝑘 𝑥,

(4.100)

where 𝑃1

𝑘 ∈ R𝑛×𝑛 and 𝑃0

𝑘 ∈ R𝑛×𝑛.

As shown in Corollary 4, we will use the results from Theorem 4.3.2 to provide an approximate

NE takeover pair {𝑦𝛼∗

𝑘 , 𝑧𝛼∗

𝑘 }, in both pure and mixed strategies of both players, and the corresponding

approximate saddle-point value update of the parameter 𝑃

𝛼
𝑘 ∈ R𝑛×𝑛, 𝛼 ∈ {0, 1}.

Corollary 5 (Case 𝛼𝑘 = 0) The approximate NE takeover strategies of the FlipDyn-Con game (4.7)

at any time 𝑘 ∈ K, subject to dynamics in (4.74), with quadratic costs (4.25), takeover costs (4.37),

and FlipDyn dynamics (4.3) are given by:

𝑦0∗
𝑘 =

𝑧0∗
𝑘 =










1




1 −




(cid:20)

(cid:20)

𝑎𝑘 𝑥T𝑥

𝑥TP𝑘+1𝑥

1 −

𝑎𝑘 𝑥T𝑥

𝑥TP𝑘+1𝑥







(cid:20)

T








, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:21) T

, otherwise,

0

𝑑𝑘 𝑥T𝑥

𝑑𝑘 𝑥T𝑥

𝑥TP𝑘+1𝑥

𝑥TP𝑘+1𝑥

, if

T








(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

0

1

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:21) T

, if

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑑𝑘 𝑥T𝑥,

(cid:21) T

, otherwise,

1

0

(4.101)

(4.102)

where

P𝑘+1 := ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ) − ˇ𝐵𝑘 (𝜂

)T𝑃0

𝑘+1

𝑘

ˇ𝐵𝑘 (𝜂

),

𝑘

and (cid:101)𝑃𝑘+1(𝑥) := 𝑥TP𝑘+1𝑥.

139

The approximate saddle-point value parameter at time 𝑘 is given by:

𝑃0

𝑘 =






𝐺0

𝑘 + ˇ𝐵𝑘 (𝜂

𝑘
+ 𝑑𝑘 I𝑛 − I𝑛𝜂

)T𝑃0

𝑘+1
√︁𝑎𝑘 𝑑𝑘 ,

𝑘

ˇ𝐵𝑘 (𝜂

) + 𝐾 ∗

𝑘 (𝜂

𝑘

𝑘

)T𝑀𝑘 𝐾 ∗

𝑘 (𝜂

)

𝑘

𝐺0

𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

𝑘+1

ˇ𝑊𝑘 (𝜂𝑘 ) − 𝑎𝑘 I𝑛,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(4.103)

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑑𝑘 𝑥T𝑥,

𝐺0

𝑘 + 𝐾 ∗

𝑘 (0)T𝑀𝑘 𝐾 ∗

𝑘 (0) + ˇ𝐵𝑘 (0)T𝑃0

𝑘+1

ˇ𝐵𝑘 (0),

𝑑otherwise.

(Case 𝛼𝑘 = 1) The approximate NE takeover strategies are given by:

𝑦1∗
𝑘 =




1 −




𝑎𝑘 𝑥T𝑥

𝑎𝑘 𝑥T𝑥

𝑥TP𝑘+1𝑥

𝑥TP𝑘+1𝑥

, if

T








(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

0

1

(cid:21) T

, if

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:21) T

, otherwise,

1

0

𝑧1∗
𝑘 =

(cid:34) 𝑑𝑘 𝑥T𝑥
𝑥TP𝑘+1𝑥

1 −

𝑑𝑘 𝑥T𝑥
𝑥TP𝑘+1𝑥

(cid:35) T

, if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:20)

1

(cid:21) T

0

, otherwise.

(cid:20)

(cid:20)










The approximate saddle-point value parameter at time 𝑘 is given by,

(4.104)

(4.105)

𝑃1

𝑘 =






𝐺1

𝑘 + ˇ𝑊𝑘 (𝜂𝑘 )T𝑃1

ˇ𝑊𝑘 (𝜂𝑘 ) − 𝑊 ∗

𝑘 (𝜂𝑘 )T𝑁𝑘𝑊 ∗

𝑘 (𝜂𝑘 )

− 𝑎𝑘 I𝑛 + I𝑛𝜂𝑘

𝑘+1
√︁𝑎𝑘 𝑑𝑘 ,

𝐺1

𝑘 + ˇ𝐵𝑘 (𝜂

)T𝑃0

𝑘+1

𝑘

ˇ𝐵𝑘 (𝜂

𝑘

) + 𝑑𝑘 I𝑛,

if

if

(cid:101)𝑃𝑘+1(𝑥) > 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) ≤ 𝑎𝑘 𝑥T𝑥,

(cid:101)𝑃𝑘+1(𝑥) > 𝑑𝑘 𝑥T𝑥,

𝐺1

𝑘 + 𝑊𝑘 (0)T∗𝑁𝑘𝑊 ∗

𝑘 (0) + ˇ𝑊𝑘 (0)T𝑃1

𝑘+1

ˇ𝑊𝑘 (0),

otherwise.

(4.106)

140

The recursions (4.103) and (4.106) hold provided,

𝑘 𝑃0
𝐵T

𝑘+1

𝐵𝑘 + 𝑀𝑘 ≻ 0, 𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 − 𝑁𝑘 ≺ 0.

(4.107)

The terminal conditions for the recursions (4.103) and (4.106) are:

𝑃0
𝐿+1 := 𝐺0

𝐿+1

, 𝑃1

𝐿+1 := 𝐺1

𝐿+1

.

□

Proof:

[Outline] Similar to the proofs in the prior sections, we begin the proof by determining

the NE takeover in pure and mixed strategies for the FlipDyn state of 𝛼 = 0. We substitute the

quadratic costs (4.25), linear dynamics (4.74), and linear control gains (4.98) and (4.99) in the term
(cid:101)𝑃𝑘+1(𝑥) with the approximate saddle-point value parameters 𝑃0

𝑘+1 from (4.53) to obtain:

𝑘+1 and 𝑃1

˜𝑃𝑘+1(𝑥) := 𝑉 1
= 𝑥T (cid:16)

𝑘+1( ˇ𝑊𝑘 (𝜂𝑘 )𝑥) − 𝑉 0
ˇ𝑊𝑘 (𝜂𝑘 )𝑃1

ˇ𝑊𝑘 (𝜂𝑘 )

𝑘+1

𝑘+1( ˇ𝐵𝑘 (𝜂

)𝑥),

𝑘

− ˇ𝐵𝑘 (𝜂

)𝑃0

𝑘+1

𝑘

ˇ𝐵𝑘 (𝜂

)

𝑘

(cid:17)

𝑥,

= 𝑥TP𝑘+1𝑥.

We substitute the takeover cost (4.37) and 𝑥TP𝑘+1𝑥 in (4.14) and (4.15), to obtain the NE takeover

policies in (4.101) and (4.102), respectively. The approximate NE takeover strategies of the FlipDyn

state 𝛼 = 1 are complementary to 𝛼 = 0, presented in (4.104) and (4.105).

To determine the approximate saddle-point value parameters under a mixed strategy NE takeover

of the FlipDyn state of 𝛼 = 0, we substitute the upper bound (4.96) from Lemma 4.5.6 and replace
𝑘+1 with 𝑃0
𝑃0
discrete-time linear dynamics (4.74) and the adversary linear state-feedback control

𝑘+1. Under a pure strategy NE takeover, we substitute the quadratic costs (4.25),

(4.99) to

obtain the exact saddle-point value parameters. Combining both the solutions from the mixed and

pure strategy NE takeover, we obtain (4.103).

□

Recursions (4.103) and (4.106) provide an approximate solution to the FlipDynCon prob-

lem (4.7) for the 𝑛-dimensional case with corresponding takeover and control policies. Similar to

141

the range of the parameter 𝜂𝑘 presented in Lemma 4.5.6, the parameters 𝜂𝑘 and 𝜂

𝑘 under a mixed

strategy NE takeover can be bounded using condition (4.59), indicated in the following remark.

Remark 4.5.7 The permissible range for the parameters 𝜂𝑘 and 𝜂

𝑘 satisfying condition (4.93)

corresponding to a mixed strategy NE are given by:

0 < 𝜂𝑘 ≤ 𝜂𝑘 ≤ 𝜂

<

𝑘

√︄

min{𝑑𝑘,𝑎𝑘 }
max{𝑑𝑘,𝑎𝑘 }

< 1.

(4.108)

Remark 4.5.7 is a direct consequence of Lemma 4.5.6. Similar to the scalar/1−dimensional case,

not all control costs (4.25) satisfy the approximate saddle-point recursion. The following remark

provides the minimum adversarial control cost required to satisfy the recursions (4.103) and (4.106).

Remark 4.5.8 Given an 𝑛−dimensional system (4.74) with quadratic costs (4.28), the NE takeover

strategies and the recursion for the approximate saddle-point value parameter, as outlined in

Corollary 5, exists for an adversary control costs 𝑁 ∗

𝑘 ≺𝑁𝑘 provided

− 𝑁 ∗

𝑘 + 𝐻T

𝑘 𝑃1

𝑘+1

𝐻𝑘 ≺ 0, ∀𝑘 ∈ K.

Analogous to the scalar/1−dimensional system, the parameter 𝑁 ∗

𝑘 can be found using a bisection

method at every stage 𝑘 ∈ K. A candidate initial value for all 𝑘 ∈ K can be set to 𝑁𝐿 := 𝜈R>0
such that 𝜈I𝑝 ≻ 𝐻T

𝐻𝐿. Similarly, we can also determine a minimum adversarial state cost

I𝑝

𝐿 𝑃1

𝐿+1

𝐺1∗
𝑘

to guarantee a mixed strategy NE takeover at every time 𝑘 for the 𝑛-dimensional system. The

following remark summarizes such an adversarial cost.

Remark 4.5.9 Given a 𝑛−dimensional system (4.74) with quadratic costs (4.28), the mixed strategy

NE takeover and the corresponding recursion for the approximate saddle-point value parameter,

as outlined in Corollary 5, exists for an adversary state-dependent cost 𝐺1∗

𝑘 ⪯ 𝐺1

𝑘 provided

P𝑘+1 ≻ 𝑑𝑘 I𝑛, P𝑘+1 ≻ 𝑎𝑘 I𝑛, ∀𝑘 ∈ K,

with the parameters at the time 𝐿 + 1 given by:

𝑃1
𝐿+1 := 𝐺1∗
𝐿+1

, 𝑃0

𝐿+1 := 𝐺0

𝐿+1

.

(4.109)

142

As indicated in the scalar/1−dimensional case, we can determine 𝐺1∗

𝑘 using a bisection method,

and simultaneously determine 𝐺1∗

𝑘 and 𝑁 ∗

𝑘 using a double bisection method. Next, we illustrate the

results of the approximate value function on a numerical example.

(a)

(b)

Figure 4.8 Maximum eigenvalues (𝜆1(𝑃
1}, 𝛼 ∈ {0, 1} for state transition constant (a) 𝑒 = 0.85, (b) 𝑒 = 1.0.

𝛼
𝑘 )) of saddle-point value parameters 𝑃

𝛼
𝑘 , 𝑘 ∈ {0, 1, . . . , 𝐿 +

(a)

(b)

Figure 4.9 The parameters 𝑃𝑖
𝑘 ,M-NE corresponds to saddle-point value parameter recursion under
a mixed NE takeover over the entire time horizon. Defender takeover strategy 𝛽𝑘 and adversary
takeover strategy 𝛾𝑘 for state transition (a) 𝑒 = 0.85 and (b) 𝑒 = 1.0. M-NE corresponds to the
mixed NE policy.

143

51015202040608010016182024651015205001000150020002500300016182051000.51510152000.5100.51510152000.51A Numerical Example

We now evaluate the results of the approximate NE takeover and the corresponding saddle-point

value parameters presented in Corollary 5, on a discrete-time two-dimensional linear time-invariant

system (LTI) for a horizon length of 𝐿 = 20. The quadratic costs (4.25) are assumed to be fixed

∀𝑘 ∈ K, and are given by:

𝐺0

𝑘 = 𝐺0 = I𝑛, 𝐺1

𝑘 = 𝐺1 =1.35I𝑛, 𝐷 𝑘 = 𝐷 = 0.45I𝑛,

𝐴𝑘 = 𝐴 = 0.25I𝑛, 𝑀𝑘 = 𝑀 = 0.65.

The system transition matrix 𝐸𝑘 = 𝐸 and control matrices for the defender and adversary are given

by:

𝐸𝑘 = 𝐸 =

𝑒 Δ𝑡

0

𝑒


















Δ𝑡







0









, 𝐵𝑘 = 𝐻𝑘 =

, ∀𝑘 ∈ {0, 1, . . . , 𝐿},

where Δ𝑡 = 0.1 for the numerical example. Similar to the scalar/1−dimensional case, we solve for

the approximate NE takeover strategies and saddle-point value function parameters for two cases

with a fixed state transition constant 𝑒𝑘 = 𝑒, ∀𝑘 ∈ K: 𝑒 = 0.85 and 1.0. Since the saddle-point

value parameters for 𝑛-dimensions are symmetric positive definite matrices, we plot the maximum
𝑘 , 𝑃0
eigenvalues of the value function matrices 𝑃1

𝑘 shown in Figure 4.8a and 4.8b, with M-NE

indicating a mixed strategy NE takeover over the entire horizon L, achieved through 𝑁 ∗ and 𝐺1∗.

We obtain the adversary control costs 𝑁 ∗

𝑘 , ∀𝑘 ∈ K for the case of 𝑒 = 0.85 given by:

𝑁 ∗

𝑘 = 𝑁 ∗ =

and for the case of 𝑒 = 1.0 as:

𝑁 ∗

𝑘 = 𝑁 ∗ =











0.42,

if

𝑃𝑘+1(𝑥) ≥ 𝑥T𝑎𝑘 𝑥,

𝑃𝑘+1(𝑥) ≥ 𝑥T𝑑𝑘 𝑥,

0.45,

otherwise ,

3.73,

if

𝑃𝑘+1(𝑥) ≥ 𝑥T𝑎𝑘 𝑥,

𝑃𝑘+1(𝑥) ≥ 𝑥T𝑑𝑘 𝑥,

3.40,

otherwise.

144

Similarly, we determine the minimum adversary cost 𝐺1∗

𝑘 for each case of 𝑒, which corresponds to

a mixed strategy NE takeover over the entire time horizon 𝐿 given by:

𝐺1∗

𝑘 = 𝐺1∗ =

1.67I𝑛, when 𝑒 = 0.85,

1.48I𝑛, when 𝑒 = 1.00,





Similar to the scalar/1−dimensional case, we observe that the eigenvalues of the saddle-point

value parameters are significantly lower when 𝑒 = 0.85 compared to 𝑒 = 1.0. This corresponds to

lower incentives for a takeover when the system is open-loop stable 𝑒 < 1 as opposed to unstable
condition of 𝑒 ≥ 1. However, the value function parameter 𝑃0

𝑘 always reaches a steady-state value

for either values 𝑒, implying that the system will remain stable under a defender’s control.

For the 𝑛-dimensional case, the takeover policy is a function of the state 𝑥. Therefore, we

simulate the system for a total of 100 iterations with the initial state 𝑥0 =

(cid:105) T

(cid:104)

1 0

and show the

average takeover policies in Figure 4.9a and 4.9b. For the case of playing mixed NE takeover

(M-NE), we observe for both 𝑒 = 0.85 and 𝑒 = 1.0, given 𝛼 = 0, i.e., when the defender is in

control, the probability of takeover for the defender (resp. adversary) increase (resp. decreases)

backward in time. This takeover policy indicates that the defender retains control of the system

while the adversary remains idle. For the case of playing between pure and mixed NE takeover, we

observe that for both 𝑒 = 0.85 and 𝑒 = 1.0 given 𝛼 = 0, both players alternate between pure and

mixed NE over the horizon.

This numerical example illustrates the use of the approximate saddle-point value parameters

in determining the takeover strategies for each player. Additionally, it provides insight into the

system’s behavior for the given costs and the system’s stability properties. These insights are useful

while designing the costs, which further impact the control and takeover policies.

4.6 Summary

This chapter introduced FlipDyn-Con, a finite-horizon, zero-sum game of resource takeovers

involving a discrete-time dynamical system. Our contributions are distilled into four key facets.

First, we presented analytical expressions for the saddle-point value of the FlipDyn-Con game,

alongside the corresponding NE takeover in both the pure and mixed strategies. Second, we

145

derived optimal control policies for linear dynamical systems characterized by quadratic costs.

We provided sufficient conditions under which there is a saddle-point in the space of linear state-

feedback policies. Third, for scalar/1−dimensional dynamical systems with quadratic costs, we

derived exact saddle-point value parameters and NE takeover strategies, independent of the state

of the dynamical system. Finally, for higher dimensional dynamical systems with quadratic costs,

we provided approximate NE takeover strategies and control policies. Our approach enables

computation for general linear systems, broadening its applicability. The practical implications of

our findings were showcased through a numerical study involving the control of a linear dynamical

system in the presence of an adversary. The results of the NE takeover strategies with known

control policies were demonstrated in [17]. The results containing both the control policies and

takeover strategies are under review in [21].

146

CHAPTER 5

DATA-DRIVEN ADVERSARIAL MODEL

All the prior chapters introduced various types of adversarial models and formulated corresponding

decision-making frameworks to reason about defensive strategies given the underlying costs and

model of the system. These framework are computationally efficient and scalable over long

time horizons. However, when the underlying model of the system or costs are unknown, such

frameworks are not suitable. Therefore, there necessitates a need to develop a framework which

can reason conditioned on the data available from the underlying system.

5.1

Introduction

In this chapter, we introduce a novel data-driven domain-aware, optimization-based approach to

determine an effective defense strategy for CPS in an automated fashion – by emulating a strategic

adversary in the loop that exploits system vulnerabilities, interconnection of the CPS, and the

dynamics of the physical components. Our approach builds on an adversarial decision-making

model based on a Markov Decision Process (MDP) that determines the optimal cyber (discrete)

and physical (continuous) attack actions over a CPS attack graph. The defense planning problem

is modeled as a non-zero-sum game between the adversary and defender. We use a model-free

reinforcement learning method to solve the adversary’s problem as a function of the defense strategy.

We then employ Bayesian optimization (BO) to find an approximate best-response for the defender

to harden the network against the resulting adversary policy. This process is iterated multiple times

to improve the strategy for both players.

A majority of the world’s critical infrastructures depends on Cyber-Physical Systems (CPS)

to manage essential and complex, domain-specific operational processes. Historically, CPS op-

erational risk could be attributed to human operator errors, natural disasters, and acts of physical

sabotage. However, with the rapid integration of physical and cyber-security processes and in-

creased reliance on internet-based networks, CPS is now vulnerable to sophisticated cyber attacks

that can result in significant equipment damage, service disruptions, and potential loss of life. These

attacks vary in severity and application; well-known examples include the StuxNet attack [80] on

147

supervisory control and data acquisition (SCADA) systems, the German steel mill attack [83]

caused by advanced persistent threats (APTs), the Ukrainian grid attack [120] via Denial of service

(DOS) tactics and derailment of trams [85] using basic network access methods. In each instance,

strategic threat actors used a sequence of atomic attack actions to exploit known vulnerabilities in

both the cyber and physical layers of the system. MITRE ATT&CK framework is a continuously

growing database of such atomic actions corresponding to specific goals on different platforms, pri-

marily used to characterize post-compromise adversarial behavior in cybersecurity and in Industrial

Control System (ICS) [6].

This chapter proposes a general framework for modeling and uncovering an adversary’s move-

ments using a hybrid attack graph (HAG) and relating the security status of the cyber with the

physical layer, while effectively configuring the HAG to ensure resilient operation of the CPS. The

proposed framework has two components: (a) an adversary’s model and policy and (b) a defender’s

network hardening policy. The adversary’s movement is modeled using a Markov Decision Process

(MDP) on the HAG, while the policy is determined using an ML method. The defender evaluates the

security of the CPS using partial observations of the HAG. The security of the CPS is quantified by

the adversary’s movements and disruption in some measurable services of the physical processes.

The defender uses partial observations to reason about the security of the CPS and to balance

reconfiguring the HAG via network hardening and the corresponding costs. This chapter extends

the linear parameterized ML method, introduced in our preliminary work [28], with a defender

using Bayesian optimization to achieve successful network hardening. The proposed framework

can be applied to a wide range of CPS and enhances the security of the system by preventing attacks

and ensuring resilient operation.

There is a large body of work on securing CPS from an attack prevention perspective in the

cyber layer, categorized broadly into (a) resilience-by-design and (b) resilience-by-reaction [36].

To position our work in literature, we organize the literature in appropriate categories as below.

Control-Theoretic Methods: The utilization of control theory for securing Cyber-Physical

Systems (CPS) has received substantial attention in the literature. For instance, Miehling et

148

al.[103] propose a sampling-based worst-case design approach to overcome observation challenges

and develop corresponding policies. Similarly, the work by Nguyen et al.[106] introduces a

system identification and control-theoretical framework to ensure safety-critical operations in CPS.

Comprehensive surveys of control-theoretic methods used for securing CPS are presented by Dibaji

et al.[46] and Lun et al.[154]. Recently, Miehling et al. [101] developed a model that explicitly

links the security status between the cyber and physical layers to design an intrusion response

system.However, all of these approaches require knowledge of the system model at the cyber or at

the physical level or both, making them challenging to apply in scenarios where the system model

is unknown.

Attack-graphs: Attack graphs are commonly used to model the movement of adversaries in a

cyber environment, allowing for the quantification of attack path vulnerabilities using a common

vulnerability scoring system (CVSS) [146]. Bayesian attack graph are used to determine the cyber

attack scenarios on supervisory control and data acquisition (SCADA) and energy management

system (EMS) of wind farms [159]. Petri nets, with their increased flexibility and resolution

compared to attack graphs, have been a long standing tool for a range of application, including

modeling cyber attacks [98, 38].

Adversarial identification frameworks: The MITRE Adversarial Tactics, Techniques, and

Common Knowledge (ATT&CK) [1] framework provides a knowledge database to characterize

post-compromise detection of an adversary targeting a given platform. The MITRE ATT&CK

has recently been extended for Industrial Control System (ICS) [6]. Using MITRE ATT&CK

framework, a cyber kill chain (CKC) has been developed and evaluated to determine the resiliency

of Distributed Energy Resouces (DER) [3, 111]. Similar models characterizing the security

attributes of a CPS are presented by Bakirtzis et al. [13] using a post-compromise database, like

MITRE ATT&CK.

Attack Detection via Machine Learning: ML methods have shown significant success in

enhancing the security of CPS in various applications [148]. Some of these methods include

attack detection, which have been used in [99] to detect false data injection attacks. To capture the

149

temporal and spatial structure of an anomaly, convolutional and memory based encoder-decoder

models are employed [95]. A comprehensive list of ML based attack detectors have been provided

in the survey [109] by Olowononi et al.. However, these methods are only used for detecting attacks

and lack a defense mechanism to counteract an attack on the system.

Defense Mechanism via Reinforcement Learning: Reinforcement learning (RL), a subfield

of ML, has been used to develop a variety of defense mechanism in CPS [86, 107]. For instance,

RL has been used to develop anti-jamming [39], and anti-spoofing policies, such as the use of

dynamic threshold hypothesis testing for authentic user verification in [90, 149]. Moreover, RL

methods have also been used to indentify vulnerabilities in smart grid CPS [152, 40]. However, the

described RL methods assume a fixed policy for the CPS or the adversary, and do not account for

any deviation while identifying system vulnerability or developing a defense mechanism.

Game-Theoretic Methods and Network Hardening: Game-theoretic formulations in con-

junction with Reinforcement Learning (RL) have been employed, as seen in studies such as [108],

where an adversary-defender zero-sum dynamic game is formulated to determine optimal ac-

tions for damaging (resp.

protecting) transmission lines in smart grids. Two-player games

have been utilized to model the security policies of Cyber-Physical Systems (CPS) in Vehicu-

lar ad-hoc networks (VANETs)[94], addressing vulnerabilities to jamming attacks. Game theory

has found application in modeling preemptive defender measures, such as anti-virus software

or honeypot mechanisms[49], designed to secure IT systems before granting access to poten-

tial users. Furthermore, game theory has been instrumental in analyzing advanced persistent

threats [123, 127, 125, 126], where a defender can resort to Dynamic Information Flow Track-

ing (DIFT) – a mechanism developed to dynamically track the usage of information flows during

program executions [137].

In addition to game theory, network hardening techniques have been employed to secure Cyber-

Physical Systems (CPS). However, the problem of network hardening has been shown to be NP-

hard [131], necessitating the use of heuristic solutions. Identifying system vulnerability along with

attack graph-based hardening is proposed by Saha et al.[124], offering efficient algorithms with

150

provable guarantees and exploring trade-offs between hardening cost and damages inflicted on the

system.

In this chapter, we present a novel approach to securing CPS through a non-zero-sum

game between an adversary and a defender. The adversary’s policy is dynamic in nature and is

determined using a Reinforcement Learning (RL) agent, while the defender’s policy is static and

chooses to sequentially harden the network. For a principled approach to updating the defender

actions, we resort to Bayesian optimization methods[136].

Blackbox Optimization: Blackbox (particularly, Bayesian) optimization has its roots in early

methods such as Taguchi techniques [58, 11]. Techniques for blackbox optimization can be classified

into two categories, deterministic [27, 25, 26] and stochastic. Among stochastic approaches to

blackbox optimization, a popular approach is based on the assumption that the unknown function can

be represented as a Gaussian process [136]. Recent research has applied Bayesian optimization to

compute approximate Nash equilibria of general sum games with continuous action spaces [117, 4]

or potential games [10].

This chapter presents a framework to design network hardening strategies for CPS by integrating

a learning-based adversarial attack modeling approach [28] with the defense planning process. The

contributions of this work are three-fold.

1. A game-theoretic formulation under information asymmetry and partial system observ-

ability: This work presents a game-theoretic formulation for a CPS using an HAG model

to capture probabilistic transitions of the adversary. We assume the defender does not have

direct access to the adversary’s actions (policy) and rewards during an attack, and operate

solely on a belief of the cyber layer security status and some measurable attributes in the

physical layer (e.g., temperature measurements in smart buildings). The interaction between

the adversary and the defender is modeled as a non-zero-sum game, where the goal is to

find a defense strategy solely based on appropriately modeled reward/cost functions. By

formulating the adversary’s and defender’s problems as an MDP [28] with cyber (discrete)

and physical (continuous) states, the defender’s actions correspond to hardening the network,

i.e., to impact the success probabilities on the cyber exploits. The solution concept that we

151

seek is that of a Nash equilibrium, i.e., a pair of policies from which neither player has any

incentive to deviate.

2. Data-driven adversarial network hardening: Our work starts by demonstrating that the

network hardening problem is equivalent to designing a slow absorbing Markov chain that

represents the progression of an attack in any CPS. Such a slow absorbing Markov chain

design is cast into a constrained optimization problem, which is non-convex, and hence, a

global solution is not guaranteed using standard optimization methods. To address this, we

propose a data-driven approach to compute a best response for each player iteratively, and

then find an approximate NE using the best iterated response.

Given a security policy of the defender, we adapt an Actor Critic algorithm – an RL method

to solve the adversary’s problem and extract the corresponding policy. To solve for the

defender’s best response, Bayesian optimization (BO) is used given the adversary’s policy.

Neither the Actor Critic nor the BO require explicit knowledge of the underlying dynamics of

the physical or cyber processes, making them attractive for joint attack and defense planning

(referred to as purple teaming) of any complex CPS.

3. Evaluation on a smart building case-study: We evaluate our proposed approach on a

smart building system, where the dynamics of the physical process were obtained from a

highly accurate truncated model based on real-world measurements. The cyber layer of the

CPS is modeled as a truncated version of a ransomware graph, created using an information

flow graph [126]. The simulation results demonstrate the effectiveness of our approach

in hardening the network, while also characterizing a trade-off between hardening costs

and security status of the CPS. Furthermore, we observe that the adversary and defender

objectives exhibit a diminishing marginal improvement with increasing number of iterations

of our approach, suggesting proximity to an approximate NE of the game.

Outline: The chapter is organized as follows. The model formulation of the HAG describing

the dynamics of cyber (discrete) and physical (continuous) components and their interactions are

152

presented along in Section 5.2. Solution approaches for the adversary and defender problems are

described in Section 5.3. Numerical experiments with the description of the cyber layer, physical

layer, defense layer and results of proposed approaches in a smart building case-study in presented

in Section 5.4. Finally, we conclude this chapter in Section 5.5.

5.2 Model Formulation

In this section, we present our adversarial threat model that characterizes the cross-layer coupling

between the cyber and physical vulnerabilities in a CPS using an HAG [93, 60, 67, 79, 102,

28, 8]. An HAG is a directed acyclic graph, which represent exploitable security attributes

and physical processes as nodes, and adversarial exploits (or actions) as edges. The leaf nodes

represent exploitable cyber attributes as an attack entry-point (e.g., malware download into a

local workstation), while root nodes denote an adversary’s target set of physical-layer attributes

(e.g., energy consumption, thermal comfort, traffic-lane assist). An HAG models the space of all

possible attack paths available to a strategic adversary aiming to compromise cyber and physical

components. Figure 5.1 illustrates a representative HAG used in [28] to model cross-layer sensor-

deception attacks in buildings.

Figure 5.1 A hybrid attack graph for a single-zone building with four cyber nodes (in red) and
one physical node (in blue) [28]. An adversary infiltrates the leaf node (node 1) and progressively
secures additional security attributes (nodes 2-4) before attacking the zone temperature controller
by perturbing sensor measurements at the root node 5.

The success probability of each cyber exploit along the edge of an HAG is dependent on the

defense configuration. For example, the probability of detecting adversarial activity (equivalent to

an unsuccessful attack action) is a function of the number of honeypots installed in the network [64].

The cyber exploits are represented using techniques from the MITRE ATT&CK framework for

153

ICS, as done in previous works such as [6, 41]. The authors in [41] developed an automated attack

sequence generator represented as a hidden Markov model (HMM) using the same framework,

with transition probabilities between tactics (nodes) and emission probabilities from tactics to

a techniques.

In our work, we use similar representations, i.e., the nodes can be presented as

equivalent tactics and exploitable edges as techniques. Once a root node is breached, every attack

action in the physical system is assumed to be successful with probability 1, and the adversary

earns a corresponding reward. The adversary’s objective is to progressively learn the best attack

path(s) in the HAG to reach the target root node and maximize the cumulative rewards earned over

a finite attack horizon. This learning problem is posed as a Markov Decision Process (MDP).

On the other hand, the defender’s objective is to preemptively minimize any costs incurred due to

the adversary compromising any physical attributes at the root node(s), such as any disruption of

physical processes and the cost of network hardening. This is achieved by selecting the success

probabilities on the cyber exploits appropriately. Next, we present the modeling assumptions in

our problem setup.

5.2.1 Modeling Assumptions

Assumption 5.2.1 The adversary has full knowledge of the HAG topology but has limited (no)

knowledge of the success probabilities (set by the defender) at the onset of an attack.

Assumption 5.2.2 The defender has complete knowledge of the cyber exploits (edges in the HAG)

and can allocate resources to harden the cyber network but not at the physical layer.

Assumption 5.2.3 The defender cannot observe the adversary’s sequence of actions and rewards

while the system is under attack1.

Assumption 5.2.4 The HAG exhibits the well-known monotonicity property, which states that an

adversary never willingly relinquish attributes once obtained [102]. This simplifies our analysis

by avoiding any attack paths with self-loops.

1Under full information scenario between the adversary and defender, the defender’s cost and adversary’s net

reward would be interchangeable.

154

In what follows, G = (N , E) is used to denote an HAG, where N and E are the set of nodes and

edges in G, respectively. For notational clarity, we assume that G has only one root node; however,

this assumption can be relaxed. Next, we discuss the preliminaries for the adversary’s MDP model.

5.2.2 Preliminaries

States, Actions and Rewards

We define Φ as the set of attack success probabilities over all edges of G. The success probability

of a cyber exploit 𝑒 ∈ E, conditioned on the adversary using 𝑒 is denoted by Φ𝑒 ∈ Φ, and is given

by

Φ𝑒 (cid:17) 𝛼𝑒𝑤𝑒,

where 𝛼𝑒 ∈ [𝛼, 1] is chosen by the defender, 𝛼 ∈ (0, 1) is a positive lower bound on 𝛼𝑒, and 𝑤𝑒 is

a default (nominal) value. The defender can adjust 𝛼𝑒 to control the success probability of 𝑒; as

𝛼𝑒 increases, so does the success probability of 𝑒. Note that, 𝛼 > 0 ensures that an exploit 𝑒 is not

made redundant by assigning a zero success probability. Let

𝜶 = (𝛼𝑒 : 𝑒 ∈ E)

be the tuple of all defender-assigned weights in G; henceforth, we will refer to 𝜶 as the defender’s

policy. Note that 𝜶 is set prior to the onset of an attack and is constant over the attack horizon.

Hardening an exploitable edge corresponds to improving defense mechanisms over the techniques

(MITRE ATT&CK for ICS) used by the adversary. For instance, a cyber node such as impair

process control (a tactic) can be hardened over exploitable techniques such as alarm suppression,

denial of service, and others that require corresponding costs. Let T = {1, 2, . . . , 𝑇 } be a finite

attack horizon. The security state of the CPS at time 𝑡 is denoted by a hybrid state variable

𝑠𝑡 = (𝛾𝑡, 𝑥𝑡), where (a) 𝛾𝑡 ∈ {0, 1}|N | is the discrete security state describing the current state of

compromise of each node (1 means node is compromised and 0 means otherwise), and (b) 𝑥𝑡 ∈ R𝑚

is the continuous state of the physical process at the root node. The set of available attack actions

in the cyber and physical layers at time 𝑡 is denoted by A (𝑠𝑡). Let Υ be the total number of root

nodes in G, and 𝛾𝑡 = 𝛾root,𝑖, for any 𝑖 ∈ {1, 2, . . . , Υ} represent the breach of the 𝑖th physical node.

155

Let 𝑎(𝑠𝑡) ∈ A (𝑠𝑡) denote an attack action taken in state 𝑠𝑡 for a given defense policy 𝜶. Then, we

denote the adversary’s instantaneous net reward at time 𝑡 by 𝑟 (𝑠𝑡, 𝑎(𝑠𝑡), 𝜶) ∈ R. Note that the net

reward includes the cost incurred to launch an exploit, irrespective of whether it is successful or

not.

CPS State Transitions

Suppose a non-root node 𝑛 is compromised at time 𝑡, and there are E𝑛,𝑛′ exploits available to

compromise a neighboring node 𝑛′. Assuming independence between different exploits, the proba-
bility that 𝑛′ is compromised at time 𝑡 +1 is given by 1−(cid:206)𝑒∈E𝑛,𝑛′ (1−Φ𝑒). Such transitions represent

various techniques from MITRE ATT&CK [5], and the graph nodes N represent equivalent tactics.

For instance, an entry leaf node can be represented as an initial access (tactic), connected to lateral

movement (another tactic) via cyber exploits (techniques), such as default credentials, I/O module

discovery, and so on. Thus, the success probabilities Φ (or equivalently the defender policy 𝜶)

influence the probabilistic evolution of the discrete state 𝛾𝑡; this dependence is compactly expressed

as:

𝛾𝑡+1 = 𝑔cyb(𝛾𝑡, 𝑎(𝑠𝑡), 𝜶),

(5.1)

where 𝑔cyb is an appropriate probability transition kernel. Moreover, the physical-process

dynamics at the root node is represented using a state-space model of the form:

𝑥𝑡+1 = 𝑔phy(𝑥𝑡, 𝑢𝑡, 𝑤𝑡, 𝑎(𝑠𝑡)),

𝑦𝑡 = 𝐻 (𝑥𝑡, 𝑤𝑡, 𝑎(𝑠𝑡), 𝜶),

(5.2)

(5.3)

where 𝑔phy is the state transition function, 𝑦𝑡 is the measurements, 𝐻 is the measurement function,

𝑢𝑡 is a suitably designed control, and 𝑤𝑡 is the disturbance. Note that the attack term 𝑎(𝑠𝑡) in (5.1)

and (5.2) accounts for the attack impact on the root (physical) node, only after the root node is

compromised. Combining (5.1) and (5.2), the security state 𝑠𝑡 transition can be compactly denoted

as

156

𝑠𝑡+1 = 𝑔(𝑠𝑡, 𝑎(𝑠𝑡), 𝜶),

(5.4)

where 𝑔 comprises 𝑔cyb and 𝑔phy. A detailed version of the HAG and its components are

described in [28]. Next, we formally present the adversary’s MDP model.

5.2.3 Adversary’s Learning Problem

Let 𝜋(𝑠𝑡) denote a stationary attack policy that assigns a probability to each action in the set A (𝑠𝑡)

for a given state 𝑠𝑡 and a defender’s policy 𝜶. If 𝑠𝑡 is the physical node, then 𝜋 is a distribution over

a finite set of actions on the physical dynamics. Let Π be the space of all feasible attack policies.

Starting from an initial state 𝑠0 ∈ S and for a given defender policy 𝜶, the adversary seeks a policy

𝜋∗ ∈ Π that maximizes the objective function 𝐽att comprising the cumulative net reward over the

attack horizon T ,

𝐽att(𝑠0, 𝜋, 𝜶) := E

(cid:35)

𝑟 (𝑠𝑡, 𝜋, 𝜶)

,

(cid:34)

∑︁

𝑡∈T

𝜋∗(𝑠0, 𝜶) ∈ arg max

𝜋∈Π

𝐽att(𝑠0, 𝜋, 𝜶),

(5.5)

(5.6)

where the expectation is taken with respect to the transition kernel that defines the evolution in

(5.4).

5.2.4 Defender’s Cyber Network Hardening Problem

The defender’s objective is to minimize the combined impact of cyber attacks on the CPS and

the cost of network hardening by choosing its actions 𝛼𝛼𝛼. Let 𝑐𝑑 (𝑠, 𝜋(𝑠), 𝜶) be the cost incurred

by the defender under an attack policy 𝜋(.) for a given choice of defense action 𝛼𝛼𝛼. The cost may

depend on the cyber states and/or physical layer attributes (discomfort or temperature fluctuations).

Given a tuple of non-negative weights 𝛼𝛼𝛼, the network hardening cost is computed as

ℎ(𝛼𝛼𝛼) =

𝑑𝑒

∑︁

𝑒∈E

(cid:18) 1 − 𝛼𝑒
𝛼𝑒

(cid:19)

,

(5.7)

where 𝑑𝑒 is a hardening cost factor, which will be studied in Section 5.4. If a cyber exploit 𝑒 is not

hardened, then the corresponding cost is zero, i.e., 𝛼𝑒 = 1.

157

We seek to minimize the defender’s objective 𝐽def over the attack horizon T , given any initial

state 𝑠0 ∈ S and an attack policy 𝜋 ∈ Π. The objective function is defined as follows:
(cid:34)

(cid:35)

𝐽def(𝑠0, 𝜋, 𝜶) := E

∑︁

𝑐𝑑 (𝑠𝑡, 𝜋(𝑠𝑡), 𝜶)

𝑡∈T

𝛼𝛼𝛼∗(𝑠0, 𝜋) ∈ arg min
𝛼𝛼𝛼∈[𝛼,1] | E |

𝐽def(𝑠0, 𝜋, 𝜶),

+ ℎ(𝛼𝛼𝛼),

(5.8)

(5.9)

where the expectation is taken with respect to the transition kernel in (5.4).

Using (5.4), (5.5) and (5.8) we define a non-zero-sum stochastic game being played between the

defender and the adversary. The desired solution concept is that of an open-loop Nash equilibrium

[23], , where we find a pair of attack-defense policies {𝜋∗, 𝛼𝛼𝛼∗} that are best-responses to each other,

i.e, for which (5.6) and (5.9) hold simultaneously, given any 𝑠0. We identify sufficient conditions

such as the stochastic game being zero-sum or having a specific structure (such as additive rewards

for one player while the transitions are controlled by the other [70]) that guarantee the existence of

Nash equilibrium policies. In particular, we adopt an iterative approach to find the best response of

one player by fixing the policy of the other. We formally characterize technical conditions on the

cost functions that ensures our proposed approach converges to a Nash equilibrium in a zero-sum

and non zero-sum settings.

The defender’s and adversary’s objectives are interdependent through each other’s policy,

creating a paradox for solving either of the problems. For a non-zero-sum game, the defender’s and

adversary’s objectives should be evaluated and optimized simultaneously. Since simultaneously

solving non-zero-sum games is challenging, we propose an iterative approach to tackle the joint

problem, i.e., by fixing the policy of a player first (e.g., the defender), solving for an optimal attack

policy, then optimizing over the defender’s policies. We numerically investigate the convergence

of this approach on a CPS example in Section 5.4.

5.2.5 Computational Challenges

We elaborate on the major challenges in solving both, the adversary’s and defender’s problems.

The adversary’s problem focuses on solving the MDP (5.5). Traditional dynamic programming

algorithms, such as value-iteration and policy-iteration [139], are infeasible for solving the opti-

158

mality equation in each state due to the uncountable hybrid state space S. Moreover, these methods

assume perfect knowledge of the system and transition probabilities. However, an adversary usually

has limited knowledge of the dynamics in (5.2) and the attack success probabilities.

Similarly, the defender’s objective is to solve Equation (5.9) using the HAG and adversary’s

policy. However, the defender also lacks explicit knowledge of the dynamics in the HAG and

adversary’s policy. This motivates the need for an automated purple teaming process, wherein both

players solve their respective problems sequentially, until an equilibrium is reached or a specified

number of iterations have been completed. In the next section, we discuss how an actor critic (AC)

RL algorithm is used to approximately solve the adversary’s problem (5.5), as also described in

our recent work [28]. For the defender’s problem, we propose the use of Bayesian optimization to

efficiently explore the defender’s search space and identify a potential solution.

5.3 Solution Approaches

In this section, we will begin by deriving an analytical expression for the expected time required

by the adversary to reach the physical node(s), utilizing the properties of Markov chains. The

expected time to reach the physical node(s) is a function of the cyber exploits, meaning hardening

the network results in a longer expected time to reach. However, we will see that the underlying

problem of network hardening is non-convex, which necessitates the use of efficient search methods,

such as Bayesian optimization, for the defender.

5.3.1 Markov Chain Hardening using Expected Time

The attributes of the HAG namely, (a) directed acyclic nature of the defined attack graph, (b) the

presence of leaf and root node acting as source (cyber) and sink (physical) nodes respectively, and

(c) a probabilistic distribution over the cyber exploits, make it ideal for modeling as an Absorbing

Markov chain (AMC). Using the defender’s actions 𝛼𝛼𝛼 and the adversary’s policy 𝜋, we determine

the transition probabilities of the AMC states. We elaborate the components of the AMC and how

the network hardening is posed as a constrained optimization problem.

Given N nodes and E edges in an HAG, we define a Markov chain 𝑀 with a transition probability

matrix (cid:101)𝑃 ∈ [0, 1] |N |×|N |. The Markov chain 𝑀 defined by 𝑔cyb is naturally absorbing due to the

159

presence of sink nodes (physical nodes). Let (cid:101)𝑆 ⊆ N be the set of absorbing states, and (cid:101)𝑇 ⊆ N be
the set of transient states, such that N = (cid:101)𝑆 ∪ (cid:101)𝑇. The canonical form of the transition probability (cid:101)𝑃

is given by,

(cid:101)𝑄 0
(cid:101)𝑃 = (cid:169)
(cid:170)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:101)𝑅 𝐼
(cid:171)
(cid:172)
where (cid:101)𝑄 ∈ R|(cid:101)𝑇 |×|(cid:101)𝑇 |, is the matrix corresponding to the transient states, (cid:101)𝑅 ∈ R|(cid:101)𝑆|×|(cid:101)𝑇 | is the matrix
corresponding to the absorbing states, 0 ∈ R|(cid:101)𝑇 |×|(cid:101)𝑆| zero matrix, and 𝐼 ∈ R|(cid:101)𝑆|×|(cid:101)𝑆| identity matrix

(5.10)

,

corresponding to the absorbing states.

Let 𝜁0 ∈ Γ be the initial state distribution of the Markov chain. Note that 𝜁0 only contains the
cyber state and represents a distribution over the transient states. For the transition probability (cid:101)𝑃,
the expected absorption time [51] starting at the state 𝜁0 is given by,

E[tabsorb((cid:101)P)] = 𝐽AMC( (cid:101)𝑄, 𝜁0) := 1𝑇 (𝐼 − (cid:101)𝑄)−1𝜁0.

(5.11)

The expected time governs how quickly the adversary can reach the physical node(s). The work

in [51] focuses on designing fast absorbing Markov chains, such that the absorbing state is reached

as soon as possible. However, hardening the network requires designing the matrix (cid:101)𝑄 to deter
the adversary from reaching the sink node.The optimization problem for modifying the matrix (cid:101)𝑄
through the defender actions 𝛼𝛼𝛼 is given by,

max
𝛼𝛼𝛼

𝐽AMC( (cid:101)𝑄(𝛼𝛼𝛼), 𝜁0) = 1𝑇 (𝐼 − (cid:101)𝑄(𝛼𝛼𝛼))−1𝜁0

s.t. 𝛼𝛼𝛼 ∈ [𝛼, 1] |E |,

(5.12a)

(5.12b)

where 1 > 𝛼 > 0 is a user-defined lower bound for the cyber exploit cost. The directed acyclic

structure of HAG makes the transition matrix (cid:101)𝑃 a block lower triangular, column stochastic matrix.
The elements of the fundamental matrix 𝐽FM := (𝐼 − (cid:101)𝑄(𝛼𝛼𝛼))−1 are given by,

160

𝐽FM =

(cid:205)
𝑗,∀( 𝑗,1)∈E

𝛼 𝑗,1 𝑝 𝑗,1

0

−𝛼2,1 𝑝2,1

. . .

. . .

. . .

. . .

. . .

. . .

. . .

(cid:205)
𝑗,∀( 𝑗,𝑖)∈E

𝛼 𝑗,𝑖 𝑝 𝑗,𝑖

. . .

















Equation (5.12a) can be re-expressed as,

𝐽AMC(𝛼𝛼𝛼) = 1𝑇 adj(𝐼 − (cid:101)𝑄(𝛼𝛼𝛼))
det(𝐼 − (cid:101)𝑄(𝛼𝛼𝛼))

𝜁0,

−1

.

0

. . .

. . .

. . .

















(5.13)

(5.14)

where det(𝐴) and adj( 𝐴) corresponds to the determinant and adjugate of the matrix 𝐴, respectively.

Since (cid:101)𝑄 is affine in 𝛼𝛼𝛼, we can express Equation (5.14) as the ratio of two polynomials in the entries
of 𝛼𝛼𝛼 given by,

𝐽AMC(𝛼𝛼𝛼) =

𝒫|N |−1(𝛼𝛼𝛼)
𝒫|N | (𝛼𝛼𝛼)

,

(5.15)

where 𝒫|N |−1(𝑥) is a polynomial in 𝑥 of degree at most |N | − 1. Note that the denominator has at

least one more degree than the numerator, so 𝐽AMC tends to infinity if and only if 𝛼𝑒 approaches

zero for all 𝑒 ∈ E. This leads to a solution to make all the cyber exploit weights 𝛼𝑒 set to zero.

However, setting 𝛼𝑒 to zero in practice can disconnect different components of a CPS, rendering the

problem infeasible. Moreover, the optimization problem under constraints (5.12b) is non-convex,

and hence, a global solution is not guaranteed.

Proposition 5.3.1 (Convexity of cost) Suppose all entries in 𝛼𝛼𝛼 are identical, i.e., 𝛼𝑒 = 𝛼𝑎, ∀𝑒 ∈ E.

Then,

1. 𝐽AMC(𝛼𝑎) is convex in 𝛼𝑎, ∀𝛼𝑎 ∈ [𝛼, 1];

2. the optimizer of (5.12a) lies on the constraint boundary.

161

Proof: Under the assumption of 𝛼𝑒 = 𝛼𝑎, ∀𝑒 ∈ E, (5.15) changes from a ratio of polynomial in

𝛼𝑎 to a monomial in 𝛼𝑎, given as,

𝐽AMC(𝛼𝑎) =

𝑘1𝛼|N |−1
𝑎

+ 𝑘2𝛼|N |−2
𝑎
𝑞1𝛼|N |
𝑎

+ · · · + 𝑘 |N |

,

(5.16)

where 𝑘𝑖, 𝑖 ∈ 1, 2, . . . , |N | and 𝑞1 are positive coefficients. Since 1
𝛼𝑘
𝑎

is convex for 𝑘 ≥ 1 and

𝛼𝑎 > 0, 𝐽AMC is a sum of convex functions and therefore, is convex. The second part follows from

the fact that the maximizer of a convex function always lies at the boundary of the domain. □

□

Observe that we are yet to include an additional or marginal cost for hardening the network in

the formulation. Under the assumption 𝛼𝑒 = 𝛼𝑎, ∀𝑒 ∈ E, we add an additional cost to harden the

network to obtain the hardening Markov chain objective 𝐽HMC given by,

𝐽HMC(𝛼𝑎) (cid:17) 𝐽AMC(𝛼𝑎) + ℎ(𝛼𝑎),

(5.17)

where ℎ(𝛼𝑎) is the cost of hardening. If ℎ(𝛼𝑎) is also convex, then 𝐽HMC remains convex over

𝛼𝑒. Therefore, by Proposition 5.3.1, the solution will always lie at the boundary, i.e., for a given

topology and costs, it will choose the cyber exploits 𝛼𝑒 to either completely harden or not harden

at all.

In order to model more general reward functions that also include the physical attributes, we

present the use of Bayesian optimization for efficiently searching for non-trivial solutions.However,

before we describe the approach, we briefly review the technique used to compute the optimal

attack policy.

5.3.2 Model-Free Reinforcement Learning for Adversarial Policy Learning

Actor Critic (AC) is a model-free RL approach that learns an agent’s (in this case, the adversary)

policy without explicit knowledge of the probabilistic dynamics of the system (5.2), even for hybrid

MDP state spaces. AC concurrently trains two models (called the actor and the critic) to learn a

parametric form of a policy in an interactive setting with the environment (HAG). Let 𝜃 ∈ Θ be a

vector used to represent a parameterized value function of the form

𝑉 ∗(𝑠𝑡, 𝜶) = max
𝜋∈Π

E [𝑟 (𝑠𝑡, 𝜋(𝑠𝑡), 𝜶) + 𝑉 ∗(𝑠𝑡+1, 𝜶)] ,

(5.18)

162

where 𝑉 ∗(𝑠𝑡, 𝛼𝛼𝛼) is the optimal value function for the state 𝑠𝑡 and Θ has much lower dimensions as

compared to S. The AC aims to learn 𝜃∗ ∈ Θ such that ∀𝑠 ∈ S, |𝑉 ∗(𝑠, 𝛼𝛼𝛼) − 𝐽att(𝑠, 𝛼𝛼𝛼; 𝜃∗)| < 𝜖,

where 𝐽att(𝑠, 𝛼𝛼𝛼; 𝜃) is a parametrized value function, and 𝜖 > 0 is an error tolerance. Analogous to

the parameterized value function, let 𝜋(𝑠, 𝛼𝛼𝛼; 𝜓) denote a parameterized stochastic policy by 𝜓 ∈ Ψ.

At each time step, the critic updates the value-function parameters 𝜃 using sampled actions and

successor states, while the actor updates the policy parameters 𝜓 in a direction suggested by the

critic. The parameters 𝜓 and 𝜃 are updated using a stochastic gradient scheme of the form

𝜃 ← 𝜃 + 𝛽𝜃 (𝑟𝑡 + 𝜂𝐽att(𝑠′, 𝛼𝛼𝛼; 𝜃) − 𝐽att(𝑠, 𝛼𝛼𝛼; 𝜃)) ∇𝜃,

𝜓 ← 𝜓 + 𝛽𝜓 (𝑟𝑡 + 𝜂𝐽att(𝑠′, 𝛼𝛼𝛼; 𝜃) − 𝐽att(𝑠, 𝛼𝛼𝛼; 𝜃)) ∇𝜓 ln 𝜋(𝑠, 𝛼𝛼𝛼; 𝜓),

(5.19a)

(5.19b)

where 𝛽𝜓 > 0 and 𝛽𝜃 > 0 are step-sizes for the actor and critic, respectively, that vary over the

iterations, and ∇𝜃 is the gradient of 𝐽att with respect to 𝜃 evaluated at (𝑠, 𝛼𝛼𝛼, 𝜃), 𝜂 is the discount

factor, and 𝑠′ is the next state. The process is repeated until 𝜃 converges or a prescribed number of

iterations is completed. To apply the AC algorithm in the MDP (5.5) with discrete actions, we use

an exponential softmax distribution

𝑒ℎ(𝑠,𝑎,𝜓)
(cid:205)𝑏∈A𝑡 (𝑠) 𝑒ℎ(𝑠,𝑏,𝜓)
where 𝑒 is the Euler constant. Here, the function ℎ(𝑠, 𝑎, 𝜓) denotes a real-valued parametric

, ∀𝑎 ∈ A𝑡 (𝑠),

𝜋(𝑠, 𝛼𝛼𝛼; 𝜓) =

(5.20)

preference defined for each state-action pair, which can be determined using tile coding or deep

neural networks. The complete steps of various AC algorithms are described in [139]. To implement

the AC algorithm, we use an on-policy linear function approximation [139]. We use tile coding

to represent multi-dimensional continuous state space, where the receptive fields of the features

are grouped into partitions of state space. The convergence of temporal difference (TD) (𝜆) with

probability 1 when the learning rates follow certain properties was demonstrated in [45]. Similarly,

the author in [31] proved the convergence of on-line TD(0) with probability 1 while using a linear

function approximator. [140] introduced fast convergence algorithms for both on-line and offline

policy training with linear function approximation. A comprehensive list of RL using function

approximation and its convergence were reported in [151]. We use the policy obtained from AC

163

algorithm to determine an effective sequence of attacks to eventually reach the physical node(s)

causing damage or disruption in service. Next, we present the solution to the defender’s problem

while keeping the obtained adversary policy fixed.

5.3.3 Bayesian Optimization for Network Hardening

Algorithm 3: Adversarial Network Hardening

Input: HAG, 𝑇 (Time horizon), {𝑤𝑒}, ∀𝑒 ∈ E (default success probabilities), 𝐾

(Hardening iteration)

Result: Attack policy 𝜋∗, Defender’s actions 𝛼∗𝛼∗𝛼∗
Initialize 𝛼𝛼𝛼1 (𝛼𝑒 := 1, ∀𝑒 ∈ E)
for 𝑘 ← 1 to 𝐾: do

# Actor Critic for adversary
Initialize Actor Critic weights;
# Number of episodes of the attack
for episode ← 1 to 𝑁: do
Initialize 𝑠0 ∈ S
for 𝑡 ← 1 to 𝑇: do
𝑎𝑡 ∼ 𝜋𝑘 (𝑠𝑡; 𝜓)
𝑠𝑡+1 = 𝑔(𝑠𝑡, 𝑎𝑡)
Update 𝜃 and 𝜓

end

end
# Bayesian optimization for defender
Initialize surrogate model parameters: 𝜇0(·), 𝜎0(·), 𝑘 (·, ·), 𝜌, 𝐷0 = ∅
for b ← 1 to 𝐵: do

Obtain 𝜉𝑏 = 𝐹 (𝛼𝛼𝛼𝑘,𝑏, 𝜋𝑘 )
Augment data, 𝐷 𝑏 = 𝐷 𝑏−1 ∪ {𝛼𝛼𝛼𝑘,𝑏, 𝜉𝑏}
Update the GP parameters 𝜇𝑏 (·), 𝜎𝑏 (·) using (5.22)
Choose 𝜶𝑘,𝑏+1 ∈ arg min𝜶

𝑞(𝛼𝛼𝛼|𝐷 𝑏),

end
Choose 𝛼𝛼𝛼𝑘+1 = arg min 𝑞(𝛼𝛼𝛼|𝐷 𝐵)

end
Output: 𝜋∗ = 𝜋𝐾, 𝛼𝛼𝛼∗ = 𝛼𝛼𝛼𝐾+1

Recall that our best-response based solution approach is iterative in nature: We begin with a

defender policy, compute the optimal policy for the adversary (using the AC algorithm in Section

5.3.2), update the defender policy and repeat the process. Due to lack of knowledge of the underlying

physical dynamics (5.2) along with requiring multiple evaluations (expected value), we treat the

problem as a black box and use Bayesian optimization [113] to solve the defender’s problem. To

164

account for the computational complexity of the defender’s problem using Bayesian optimization

(BO), we evaluate the expectation with limited samples, to average out any measurement noise.

We initialize the defender’s policies with 𝛼𝑒 = 1, ∀𝑒 ∈ E, and train the adversary’s policy using

AC algorithm with weights 𝜃 and 𝜓. Once we learn an attack policy, we determine the defender’s

best response with respect to each exploit using BO. The goal of a BO process is to minimize an

unknown function given by (5.7) expressed by,

𝐹 (𝛼𝛼𝛼, 𝜋) = E

𝑐𝑑 (𝑠𝑡, 𝜋(𝑠𝑡), 𝜶)

(cid:35)

+ ℎ(𝛼𝛼𝛼).

(cid:34)

∑︁

𝑡∈T

(5.21)

At each BO iteration 𝑏 we select a tuple 𝛼𝛼𝛼𝑘,𝑏 and evaluate the corresponding function value

𝐹 (𝛼𝛼𝛼𝑘, 𝑏, 𝜋𝑘 ), where 𝜋𝑘 is the attack policy for the 𝑘-th hardening epoch. The main idea behind

BO is to maintain a surrogate function of 𝐹, such as a Gaussian process2, which is updated with

noisy observations 𝜉 := [𝜉1, . . . , 𝜉𝐵]′ of 𝐹 at the set 𝐴𝐵 := {𝛼𝛼𝛼𝑘,1, . . . , 𝛼𝛼𝛼𝑘,𝐵} using an acquisition

function 𝑞(𝛼𝛼𝛼). The posterior over 𝐹 is a Gaussian distribution with mean 𝜇𝐵 (𝛼𝛼𝛼) and covariance

𝑘 𝐵 (𝛼𝛼𝛼, 𝛼𝛼𝛼′) given by

𝜇𝐵 (𝛼𝛼𝛼) = k𝐵 (𝛼𝛼𝛼)𝑇 (𝐾𝐵 + 𝜌𝐼)−1𝜉,

𝑘 𝐵 (𝛼𝛼𝛼, 𝛼𝛼𝛼′) = 𝑘 (𝛼𝛼𝛼, 𝛼𝛼𝛼′) − k𝐵 (𝛼𝛼𝛼)𝑇 (𝐾𝐵 + 𝜌𝐼)−1k𝐵 (𝛼𝛼𝛼′),

𝜎𝐵 (𝛼𝛼𝛼)2 = 𝑘 𝐵 (𝛼𝛼𝛼, 𝛼𝛼𝛼′),

(5.22)

where 𝑘 : A × A → R≥0 is the kernel function, the vector k𝐵 (𝛼𝛼𝛼) := [𝑘 (𝛼𝛼𝛼𝑘,1, 𝛼𝛼𝛼) . . . 𝑘 (𝛼𝛼𝛼𝑘,𝐵, 𝛼𝛼𝛼)]𝑇 ,
𝐾𝐵 is the positive semi-definite kernel matrix [𝑘 (𝛼𝛼𝛼, 𝛼𝛼𝛼′)]𝛼𝛼𝛼,𝛼𝛼𝛼′∈𝐴𝑛, 𝜌 ≥ 0, and 𝜎𝐵 (𝛼𝛼𝛼) is the standard

deviation of the Gaussian measurement noise for the samples 𝜉. In this work, we use the expected

improvement as the acquisition function, which is defined by (5.23). Let 𝐹′

𝐵 (𝛼𝛼𝛼) := min𝑚≤𝐵 𝐹 (𝛼𝛼𝛼𝑘,𝑚)
represent the minimal observed value of 𝐹 () at the current iterate 𝐵, then expected improvement is

defined as,

𝑞(𝛼𝛼𝛼) = EI𝐵 (𝛼𝛼𝛼) := E(cid:104) (cid:0)𝐹′

𝐵 (𝛼𝛼𝛼) − 𝐹 (𝛼𝛼𝛼)(cid:1) + (cid:12)
(cid:12)
(cid:12)

𝛼𝛼𝛼𝑘,1:𝐵, 𝜉1:𝐵

(cid:105)

,

(5.23)

2A Gaussian process is a stochastic process, i.e., random variables indexed by space and time, such that any finite

collection of those random variables has a multivariate normal distribution.

165

where 𝑥+ (cid:17) max{𝑥, 0}. To obtain theoretical guarantees on the suboptimality of 𝛼 after 𝐵 iterations,

we also use the upper confidence bound (UCB)[136], which is given by

𝑞𝐵 (𝜶) := 𝜇𝐵 (𝜶) + √︁𝛽𝐵𝜎𝐵 (𝜶),

where, for a discrete choice of 𝜶, 𝛽𝐵 := 2 ln(|𝜶|𝜉𝐵/𝜐) with an user-defined 𝜐 ∈ (0, 1) , and 𝜉𝑘 is a

sequence such that (cid:205)∞
𝑘=1

𝜉−1
𝑘 = 1.

The BO algorithm in conjuction with AC is summarized in Algorithm 3, where 𝐾 is the total

number of BO iterations, 𝑁 is the total number of episodes of the AC algorithm and 𝑇 is total time

duration for the system. At each BO iteration 𝑘, we return the updated cyber exploits which are

used to re-train the adversary’s policy with the new set of success probabilities and repeat the same

process for the defined number of iterations 𝐾. Once this process terminates, we obtain the best set

of defender’s actions (non-negative weights) 𝛼∗

𝑒, ∀𝑒 ∈ E and the corresponding adversary policy

𝜋∗.

5.3.4 Analytic properties for Zero-sum games

We provide analytical guarantees for our proposed approach, which involves analyzing Algo-

rithm 1 in a zero-sum scenario by considering a finite set of pure policies for each player. For

the zero-sum analysis, we swap the minimizer and maximizer. In particular, the adversary (min-

imizer) picks out of the set {𝜋1, 𝜋2, . . . , 𝜋𝑚} and the defender (maximizer) picks out of the set
{𝛼1𝛼1𝛼1, 𝛼2𝛼2𝛼2, . . . , 𝛼𝑛𝛼𝑛𝛼𝑛}. The cost of player policy 𝜋𝑖 against 𝛼𝛼𝛼 𝑗 equals 𝑀𝑖 𝑗 (𝑠0), where 𝑀 (𝑠0) ∈ R𝑚×𝑛 is
the cost/payoff matrix. In what follows, we will drop the explicit dependence of 𝑀 on 𝑠0 for ease

of notation.

Any Hannan consistent algorithm has properties of (i) time-average convergence to the best

response policy, and (ii) 2𝜀− approximate Nash equilibrium with 𝜀 ≥ 0 when both players update

their policy using a Hannan consistent algorithm [59]. As such, our proposed approach employs

a single-agent reinforcement learning (adversary) to determine Nash equilibria for such repeated

zero-sum games [156].

Assuming 𝐾 iterations of Algorithm 1, we will leverage the following properties :

166

Proposition 5.3.2 ( [35] Theorem 4.1 and 7.2) Given 𝐾 as the number of iterations of Algorithm 3,

and let {𝑃1, . . . , 𝑃𝐾 } and { 𝑗1, . . . , 𝑗𝐾 } be the possibly mixed adversary policies and pure defender

policies at the corresponding iterations, respectively. Then, the adversary algorithm satisfies the

following inequality

1
𝐾

𝐾
∑︁

𝑘=1

𝑃𝑇
𝑘 𝑀𝑒 𝑗𝑘 ≤

1
𝐾 min
¯𝑃∈Δ𝑚

¯𝑃𝑇

𝐾
∑︁

𝑘=1

𝑀𝑒 𝑗𝑘 + 𝛿(𝑚, 𝐾),

where 𝑒 𝑗𝑘 is the 𝑗𝑘 -th basis vector in R𝑛, Δ𝑚 is the probability simplex in 𝑚 dimensions, 𝛿(𝑚, 𝐾) ≥ 0

is an Hannan consistent regret that depends on the number of adversary actions 𝑚 and number of

iterations 𝐾, obtained using any fixed distribution ¯𝑃. 𝛿(𝑚, 𝐾) ≥ 0 corresponds to regret when the

adversary uses a Hannan consistent [59] algorithm to update its policy every iteration.

There exist many Hannan consistent algorithms, such as exponential weighted average [35] or
multiplicative weight update [56], where 𝛿(𝑚, 𝐾) = O (√︁log(𝑚)/𝐾). Before we proceed with the

defender’s analysis, we need to make the following assumption on the entries of 𝑀.

Assumption 5.3.3 Each row of 𝑀 is assumed to be drawn out of a Gaussian process with a given

mean (typically equal to zero) and prior covariance defined by a kernel matrix 𝐾𝑖 ( 𝑗, ℓ) ≥ 0, for

the 𝑖-th row.

Note that this assumption automatically implies that any linear combination of the rows is also a

sample of a Gaussian process with a mean and a linear combination of the kernel matrices.

Proposition 5.3.4 ( [136] Theorem 1 and Lemma 7.6) Suppose that Assumption 5.3.3 holds. Then,

against any attack distribution 𝑃𝑘 , Bayesian optimization yields a pure policy 𝑒 𝑗𝑘 , such that

𝑘 𝑀𝛼𝛼𝛼 ≤ 𝑃𝑇
𝑃𝑇

𝑘 𝑀𝑒 𝑗𝑘 + 𝜖𝑘 ,

max
𝜶

with probability of at least 1 − 𝜐, where

√︄

𝛾𝐵 (𝑃𝑇

𝑘 𝑀) 𝛽𝐵 (𝑛)

𝐵

.

(cid:170)
(cid:174)
(cid:172)

𝜖𝑘 ∈ 𝑂 (cid:169)
(cid:173)
(cid:171)

167

gain 𝛾𝐵 (𝑃𝑇

Recall that 𝛽𝐵 (𝑛) = 2 ln(𝑛𝜉𝐵/𝜐), where the sequence 𝜉𝑘 is such that (cid:205)∞
𝜉−1
𝑘 = 1. The information
𝑘=1
ℓ=1 log(1 + 𝜎−2𝑔ℓ𝜆ℓ), where 𝜆’s are the eigenvalues
𝑘 𝑀, and 𝜎 is the variance of the noise in obtaining the

𝑘 𝑀) := 0.5/(1 − 1/𝑒) max𝑔1,...,𝑔𝑘
of the kernel matrix of the weighted rows 𝑃𝑇

(cid:205)𝐵

payoff.

We are now ready to state and prove a convergence result for the zero-sum setting.

Proposition 5.3.5 Consider the average of the attack distributions produced by Algorithm 1, ˆ𝑃𝐾 :=

1
𝐾

(cid:205)𝐾

𝑘=1

𝑃𝑘 . This distribution satisfies

max
𝜶

ˆ𝑃𝑇
𝐾 𝑀𝛼𝛼𝛼 ≤ min

𝑃𝑇 𝑀𝛼
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

𝑃∈Δ𝑚 max
𝛼𝛼𝛼
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:124)
Value of the matrix game 𝑀

(cid:125)

+

1
𝐾

𝐾
∑︁

𝑘=1

𝜖𝑘 + 𝛿(𝑚, 𝐾),

with probability of at least 1 − 𝐾𝜐.

Proof: We start with

ˆ𝑃𝑇
𝐾 𝑀𝛼𝛼𝛼 =

max
𝛼

1
𝐾 max
𝛼𝛼𝛼

𝐾
∑︁

𝑘=1

𝑃𝑇
𝑘 𝑀𝛼 ≤

≤

≤

1
𝐾

1
𝐾

1
𝐾

𝐾
∑︁

𝑘=1

𝐾
∑︁

𝑘=1

𝐾
∑︁

𝑘=1

𝑃𝑇
𝑘 𝑀𝛼

max
𝛼𝛼𝛼

(𝑃𝑇

𝑘 𝑀𝑒 𝑗𝑘 + 𝜖𝑘 )

Using Prop. 5.3.4 with

prob. at least 1 − 𝐾𝜐,

( ¯𝑃𝑇 𝑀𝑒 𝑗𝑘 + 𝜖𝑘 ) + 𝛿(𝑚, 𝐾)

using Prop. 5.3.2,

= min
¯𝑃∈Δ𝑚

¯𝑃𝑇 1
𝐾

𝐾
∑︁

𝑘=1

≤ max
𝛼𝛼𝛼

¯𝑃𝑇 𝑀𝛼 +

𝑀𝑒 𝑗𝑘 +

1
𝐾

𝐾
∑︁

𝑘=1

𝜖𝑘 + 𝛿(𝑚, 𝐾)

1
𝐾

𝐾
∑︁

𝑘=1

𝜖𝑘 + 𝛿(𝑚, 𝐾).

Since this holds for any fixed distribution ¯𝑃, one such particular choice is a saddle-point policy for

the adversary. This completes the proof.

□

Remark 5.3.6 Proposition 5.3.5 quantifies the proximity of the outcome of Algorithm 1 to the

saddle-point value (i.e., the Nash equilibrium) of the matrix game 𝑀 with high probability, under

168

certain technical assumptions on the entries of the payoff matrix. Furthermore, the error in the

outcome depends logarithmically on the number of rows 𝑚 and columns 𝑛 of the payoff matrix 𝑀.

This means that one can use a large number of pure policies while incurring only a modest increase

in the error bound.

5.3.5 Analytic properties of the non zero-sum set-up

In this subsection, we derive analytical properties of the non-zero-sum game under some

assumptions. Consider a two-player stochastic game with a finite state space 𝑠 ∈ S, having finite

action spaces 𝜋(𝑠) and 𝜶 for the adversary and defender, respectively, in each state 𝑠. We denote

this game by

Γ = {S, 𝜋(𝑠), 𝜶, ˆ𝑟, 𝑝},

(5.24)

where ˆ𝑟 := { ˆ𝑟1, ˆ𝑟2} is a vector-valued function for the defender and adversary, respectively, in the

domain

Z = {(𝑠, 𝜋(𝑠), 𝜶); 𝑠 ∈ S, 𝜋(𝑠) ∈ Π, 𝜶 ∈ [𝛼, 1] |E |}.

In particular, ˆ𝑟 := { ˆ𝑟1 := 𝑐𝑑 (𝑠, 𝜋, 𝜶), ˆ𝑟2 := 𝑟 (𝑠, 𝜋, 𝜶)} for the described problem (5.9) and (5.6),

respectively. Lastly, the state transition probability is given by

p = {𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶); 𝑧 ∈ S, (𝑠, 𝜋(𝑠), 𝜶) ∈ Z},

where 𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) denotes the probability that the state moves from state 𝑠 to 𝑧 when the actions

𝜋(𝑠) and 𝜶 are taken in the state 𝑠. The state transition probabilities satisfy the following properties,

𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) ≥ 0,

and

𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) = 1.

∑︁

𝑧∈S

Definition 2 (Additive reward (AR) and additive transition (AT) game (ARAT game) [121])

The stochastic game Γ (5.24) possesses an additive rewards property, if for all (𝑠, 𝜋(𝑠), 𝜶) ∈ Z,

𝑐𝑑 (𝑠, 𝜋, 𝜶) = 𝑐𝑑

1 (𝑠, 𝜶) + 𝑐𝑑

2 (𝑠, 𝜋),

𝑟 (𝑠, 𝜋, 𝜶) = 𝑟1(𝑠, 𝜶) + 𝑟2(𝑠, 𝜋),

169

for appropriate functions 𝑐𝑑
1

, 𝑐𝑑
2

, 𝑟1 and 𝑟2 on the domain. The game Γ (5.24) simplifies to a

controlling game if the states can be partitioned into two sets S1 and S2 such that

∀𝑠 ∈ S1,

𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) = 𝑝1(𝑧|𝑠, 𝜶)

∀𝑠 ∈ S2,

𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) = 𝑝2(𝑧|𝑠, 𝜋(𝑠)).

The partitioning of states enables the game Γ (5.24) to possess additive transitions for all

(𝑠, 𝜋(𝑠), 𝜶) ∈ Z of the form

𝑝(𝑧|𝑠, 𝜋(𝑠), 𝜶) = 𝑝1(𝑧|𝑠, 𝜶) + 𝑝2(𝑧|𝑠, 𝜋(𝑠))

Assumption 5.3.7 ("Switching control graphs") The graph G satisfies the following properties:

1. There are no self loops,

2. the defense policy 𝜶 is such that for every cyber node with a single outgoing edge 𝑒, 𝛼𝑒 ≠ 1,

3. for every other edge, 𝛼𝑒 = 1, and

4. the game is played over an infinite horizon in a discounted setting

A line graph represents one such example. Figure 5.2 shows a non-trivial example of a switching

control graph. Then, the following is a property of the game described in Sections 5.4.2 and 5.4.3.

Figure 5.2 Switching control graph with nodes 1, 4 and 6 representing adversary control, and nodes
2, 3, 5 representing defender control.

Proposition 5.3.8 ( ARAT game with switching control graphs ) Under Assumption 5.3.7, the

stochastic game Γ defined by (5.24) is an ARAT game.

170

156324132Proof: We will verify that the cyber rewards (5.25) and physical rewards (5.27) satisfy the AR

property, and the state transitions satisfy the AT properties. Under Assumption 5.3.7, the expected

cyber rewards are partitioned as

𝑟 (𝑠𝑡, 𝜋, 𝜶) =





𝑟1(𝑠𝑡, 𝜶) := 𝛼𝑒𝑤𝑒 − 𝑐,

𝑠𝑡 ∈ S1,

𝑟2(𝑠𝑡, 𝜋) := 𝜋𝑒 (𝑠𝑡)𝑤𝑒 − 𝑐(𝜋𝑒 (𝑠𝑡)),

𝑠𝑡 ∈ S2,

where 𝑐 corresponds to the cyber cost for all the states belonging to the set S1, i.e., states under

defender’s control.

Note that when the adversary reaches the physical state 𝑠𝑡 = {𝛾root, 𝑥𝑡 } the defender’s action has

no impact on the reward obtained by the adversary. The state transitions under Assumption 5.3.7

are of the form

𝑝1(𝑧|𝑠𝑡, 𝜶) = 𝛼𝑒𝑤𝑒, ∀𝑠𝑡 ∈ S1,

𝑝2(𝑧|𝑠𝑡, 𝜶) = 𝜋𝑒 (𝑠𝑡)𝑤𝑒, ∀𝑠𝑡 ∈ S2.

Therefore, we satisfy both ARAT property for the stochastic game Γ (5.24).

□

Using Theorem 3.1 from [121], we conclude that the ARAT game Γ (5.24) admits a Nash

equilibrium in stationary strategies which uses at most two pure actions for each player in each

state. This result will allow us to significantly prune down the adversary edges of a large graph that

satisfies Assumption 5.3.7.

5.4 Numerical Experiments

We now demonstrate the effectiveness of our proposed network hardening algorithm on a smart

building case-study with a cyber layer inspired by a ransomware attack graph and the physical layer

obtained from a truncated model identified using real-world experiments.

5.4.1 Case-study: Sensor Deception Attacks on Building

In this use case, the adversary aims to maximize the occupant discomfort of a single zone in the

given building over a defined time horizon, while the defender seeks to minimize a combination of

the discomfort and the hardening cost. The building’s air-handling unit (AHU) performs standard

171

operations by reconditioning ambient air and return air to a specific supply-air temperature and

then supplying it to various building zones using a supply fan. The adversary aims to manipulate

temperature measurements from various zone-level sensors to deceive the AHU control system

and send poorly conditioned air into various zones, causing comfort-bound violations over time.

However, to gain access to the temperature sensors at various zones, the adversary has to penetrate

the sensor unit via a set of cyber exploits present on different components of a Building Automation

System (BAS), such as IoT devices (e.g., IP cameras and smart thermostats), building-management

workstations, and programmable logic controllers (PLC).

For the cyber layer, we use a pruned version of a ransomware attack graph [126] created using

information flow. The original graph represents multiple stages of an attack progression: (a) a

privilege escalation stage, (b) lateral movement over the cyber nodes and (c) reaching the goal

node. We use an HAG to represent these specific stages, as shown in Figure 5.3a. Similar attack

graphs for BAS were used in [48], where the attack paths involved executing a subset of tactics

defined in popular attack frameworks, such as MITRE’s ATT&CK [1].

The reward functions used for an adversary in the cyber and physical layer of a CPS is usually

system-specific and depends on the system’s overall security objective and specifications. For

instance, the cyber reward at a certain node in an HAG can be set equal to the loss a defender or

system administrator would incur in case an adversary were to successfully access the corresponding

node. For this case study, we set the cyber reward to a positive value that incentivizes a resource-

and/or time-constrained adversary to reach the physical node as quickly as possible. However, other

cyber-layer reward specifications can be easily integrated in our framework. On the other hand,

reward in the physical layer is generally associated with a metric that corresponds to loss in physical-

system performance due to the adversary’s actions. Examples include power, energy, efficiency or

deviation of performance beyond a specified bound. It is also important to note that probability of

transitions between different nodes in an HAG is usually determined from related attack-incident

reports in the literature (see [41] for more details). However, we use synthetic transition-probability

values in the ransomware attack graph for demonstrative purposes only. Next, we elaborate the

172

cyber and physical layer components of the proposed HAG using notation described in Section 5.2.

(a)

(b)

Figure 5.3 (a) An HAG inspired from a ransomware attack graph [126]. The source node 1 is
represented by the dashed circle and the physical node (sink node) 9 is represented by concentric
circles. (b) Trajectories of Zone 1 temperature (Zone 1) along with the outside air temperature
(Outside T) over a year with upper (T max) and lower temperature (T min) comfort bounds.

5.4.2 Cyber Layer

The HAG consists of eight cyber vertices with the associated cyber exploits also known as

tactics from MITRE ATT&CK framework. The physical node is represented via concentric blue

circles (node 9 in Figure 5.3a). Each vertex (tactic) and its corresponding edge (technique) are

shown in Table 5.1. A user can generate such attack graphs and models using the framework

in [41]. The success probability of any of the cyber exploit is independently sampled from a

uniform distribution, U ∼ [0.5, 1). For an attack action 𝑎𝑡 ∼ 𝜋(𝑠𝑡, 𝛼𝛼𝛼; 𝜓) on the cyber layer, the

adversary incurs a cost 𝑐(𝑎𝑡) of 0.1 and a nominal reward of 1 if an exploit is successful, while the

reward for doing nothing is assigned a value of 0. The reward from the cyber layer to the adversary

is given by,

𝑟 (𝑠𝑡, 𝜋, 𝜶) =





1 − 𝑐(𝜋𝑒 (𝑠𝑡)), with probability

𝛼𝑒𝑤𝑒𝜋𝑒 (𝑠𝑡),

−𝑐(𝜋𝑒 (𝑠𝑡)),

with probability

1 − 𝛼𝑒𝑤𝑒𝜋𝑒 (𝑠𝑡),

(5.25)

where 𝜋𝑒 (𝑠𝑡) denotes the adversary’s probability of choosing exploit 𝑒 while in the state 𝑠𝑡. Then,

the expected reward until the root (physical) node is not compromised is given by

E[𝑟 (𝑠𝑡, 𝜋, 𝜶)] = 𝛼𝑒𝑤𝑒𝜋𝑒 (𝑠𝑡) − 𝑐(𝜋𝑒 (𝑠𝑡)),

173

123456782345Physical processTemperaturesensorAHU UnitZone temperaturedynamics90100200300Day0102030Temperature COutside TZone 1T minT max5010022.525.0Table 5.1 Cyber exploits and their corresponding probability of success.

Node

Tactic

1

2

Initial
access

Execution

3

Persistence

4

(1,2)

Edge Transition Probability Node
0.82 (Internet
Accessible
Device)
0.63 (Execution
through
API)
0.88 (Man in
the middle)
0.89 (Module
Firmware)

(3,4)

(2,3)

(2,4)

6

5

Tactic

Edge Transition Probability Node

Tactic

Edge Transition Probability

Evasion

(4,5)

0.56 (Utilize/Change
operating module)

(4,6)

0.94 (Rootkit)

Discovery

(5,6)

Lateral
movement

(6,7)

0.97 (Control Device
Identification)
0.59 (External Remote
Services)

6

7

8

Lateral
movement

Inhibit response
function
Impair process
control

(6,8)

(6,9)

(7,8)

(8,9)

0.87 (Remote File
Copy)

0.78 (Program
Organization Units)

0.87 (Block serial
COM)
0.50 (Change Program
State)

subject to the dynamics in (5.4). Note that each exploit has a positive expected net reward, which

incentivizes the adversary to reach the root node as quickly as possible.

5.4.3 Physical Layer

We consider a multi-zone residential building with a single floor as our representative building,

which is based on the setup described in [116]. The building has 6 conditioning zones and a central

Air Handling Unit (AHU) that sends thermally conditioned air to each zone using a supply-air fan.

The AHU unit uses an absorption chiller for conventional cooling and a backup boiler for emergency

heating during very low ambient temperatures. Conventional heating is provided by Variable Air

Volume (VAV) terminal units with reheat coils that regulate the temperature and flow-rate of the

air entering each zone.

To accurately model the building dynamics, a linearized, time-invariant, discrete-time, reduced-

order state-space model (SSM) can be used, as discussed in [116]. We use the RenoLight SSM

as part of the Python Systems Library (PSL) [118] to simulate the dynamics of our representative

building. The RenoLight model comprises of 250 states (building envelope variables), 6 control

inputs (amount of heating or cooling for each zone) and 6 observations (zone temperatures). The

sampling frequency of the model is set to 15 minutes. Notation and description of the different

components of the SSM are reported in Table 5.2. We use a rule-based controller to provide

Table 5.2 Description of the variables in the building model.

Variable Description

𝑥𝑡
𝑦𝑡
𝑢𝑡
𝑤𝑡

Building envelope states
Zone temperature measurements
Amount of heating or cooling (control inputs)
Ambient temperature (disturbance)

Unit
◦C
◦C
◦C kg s−1
◦C

174

occupant thermal comfort by maintaining zone temperature in each zone within specified comfort

bounds. Specifically, the amount of heating or cooling at time 𝑡 in zone 𝑖 was set according to

−𝑢max min

−𝑢max min

(cid:110) 𝑦𝑖

𝑡 −𝑦max+𝛿
˜𝜖

(cid:111)

,

, 1

(cid:110) −𝑦𝑖

𝑡 +𝑦min+𝛿
˜𝜖

(cid:111)

,

, 1

if 𝑦𝑖

𝑡 > 𝑦max − 𝑜,

if 𝑦𝑖

𝑡 ≤ 𝑦min + 𝑜,

0,

otherwise,

𝑢𝑖
𝑡 =






where 𝑦min and 𝑦max are the prescribed lower and upper comfort bounds, 𝑜 is hysteresis parameter,

˜𝜖 is proportional gain and 𝑢max is the maximum heating or cooling capacity of the controller. For

our experiments, we set 𝑦min = 23◦C and 𝑦max = 25◦C, respectively. Figure 5.3b shows the nominal

annual performance of the rule-based controller (under no attacks), which clearly shows that the

zone temperatures stay within the comfort bounds with high probability.

On acquiring access to a zone temperature sensor, the adversary can perturb the sensor mea-

surements to cause occupant discomfort in that zone. With a slight abuse of notation, let 𝑎𝑡 be the

adversarial temperature perturbation at time 𝑡. For demonstrative purposes, only the temperatures

in zone 1 are allowed to be perturbed; henceforth, we drop the zone superscripts. The perturbed

zone temperature measurement at time 𝑡 changes to 𝑦𝑡 = 𝑥𝑡 + 𝑎𝑡. The adversary’s reward for

executing the action 𝑎𝑡 when the physical state is 𝑠𝑡 = {𝛾𝑡, 𝑥𝑡 } = {𝛾𝑟𝑜𝑜𝑡, 𝑥𝑡 }, denoted by 𝑟 (𝑠𝑡, 𝑎𝑡),

equals

𝑟 (𝑠𝑡, 𝑎𝑡) = (𝑦min − 𝑦𝑡)+ + (𝑦𝑡 − 𝑦max)+ − 𝑐𝑎2
𝑡 .

(5.26)

For 𝑢 ∈ R, the first (resp. second) term is the thermal discomfort caused by temperature deviation

from the lower (resp. upper) comfort bound. The cost for executing an action 𝑎𝑡 is scaled by a

proportional term 𝑐. Since 𝑎𝑡 takes values in a discrete set, the expected reward is given by

E[𝑟 (𝑠𝑡, 𝜋𝑡)] =

∑︁

(cid:16)

𝜋𝑎𝑡 (𝑠𝑡)

(𝑦min − 𝑦𝑡)+ + (𝑦𝑡 − 𝑦max)+ − 𝑐𝑎2
𝑡

(cid:17)

.

𝑎𝑡 ∈A (𝑠𝑡 )

Note that based on the action and the state, the adversary will either observe the cyber or the

physical reward.

Once the root node is compromised, the defender can only measure the discomfort caused by

the adversary’s perturbation in any zone. The expected return incurred under a set of defenses and

175

adversary policy in state 𝑠𝑡 equals

𝑐𝑑 (𝑠𝑡, 𝜋) = −E [𝑟 (𝑠𝑡, 𝜋)] .

(5.27)

Since the defender’s actions are purely on the cyber layer, once an adversary reaches a root node,

the return 𝑐𝑑 is invariant of the defender policy 𝜶.

Network Hardening

(a)

(b)
Figure 5.4 (a) Defender’s (𝐽def) and Attacker’s (𝐽att) objective with a hardening cost factor 𝑑𝑒 := 0.1,
where min(𝑖) is defined as the 𝑖𝑡ℎ argument minimum of 𝐽def/att. (b) Defender’s (𝐽def) and Attacker’s
(𝐽att) objective with a hardening cost factor of 𝑑𝑒 := 0.5. (c) Defender’s (𝐽def) and Attacker’s (𝐽att)
objective with a hardening cost factor 𝑑𝑒 := 1.

(c)

(a)

(b)

(c)

Figure 5.5 (a) Average time steps required to reach the physical node for the adversary for the
hardening factor of 𝑑𝑒 := 0.1. (b) Average time steps to reach the physical node with 𝑑𝑒 := 0.5 (c)
Average time steps to reach the physical node with 𝑑𝑒 := 1.0.

We numerically demonstrate the outcome of Algorithm 3 with the following parameters, (a)

time horizon 𝑇 = 48, (b) hardening iteration 𝐾 = 100, (c) AC episodes 𝑁 = 30000 and (d) lower

176

020406080100Hardening Epoch10.0020.00Jdefmin 0min 1020406080100Hardening Epoch10.0020.00Jattmin0min10255075100Hardening Epoch20.0040.00Jdefmin 0min 10255075100Hardening Epoch20.0030.00Jattmin0min10255075100Hardening Epoch25.0050.0075.00Jdefmin 0min 10255075100Hardening Epoch0.0020.00Jattmin0min10255075100Hardening Epoch15.0020.0025.0030.0035.00[Time]argmin Jdef0255075100Hardening Epoch10.0015.0020.0025.0030.00[Time]argmin Jdef0255075100Hardening Epoch10.0020.0030.00[Time]argmin Jdefbound for hardening 𝛼 = 0.1. Figure 5.4 illustrates the defender’s and adversary’s objectives at

the end of the hardening iteration for different values of the hardening cost factor 𝑑𝑒 = 0.1, 0.5,

and 1. As shown in Figures 5.4a, 5.4b and 5.4c, increasing values of 𝑑𝑒 lead to higher objectives

for both the adversary and defender. The adversary’s and defender’s objective show diminishing

marginal improvement with increasing number of iterations of the approach, suggesting proximity

to an approximate NE of the game. But since this is a non-zero-sum game, characterizing additional

properties such as the price of anarchy and convergence to an NE will require additional assumptions

on the structure of the players’ objectives, and is a topic of future investigation.

We quantify the effectiveness of Algorithm 3, by measuring the average time taken by the

adversary to reach the physical node during the BO process for different values of 𝑑𝑒, as shown in

Figures 5.5a, 5.5b and 5.5c. We observe that as 𝑑𝑒 increases, the average time taken to reach the

physical node decreases. Furthermore, we compare the distribution of the time required to reach

the physical node for the corresponding values of 𝑑𝑒 against the expected absorption time in (5.11)

as shown in Figure 5.6b. We observe that the expected absorption time 𝐽AMC is greater than the

median value of the empirically determined time. This result justifies the use of the proposed

approach over standard optimization methods for optimizing 𝐽AMC

Next, we visualize the defender policy 𝛼𝛼𝛼 for the three values of 𝑑𝑒 shown in Figure 5.6a. We

observe that a majority of the weights are hardened for a smaller values of 𝑑𝑒, indicating the

effectiveness of our approach in balancing between the cost of hardening and the cost of securing

the CPS. We demonstrate a sample node trajectory for the corresponding values of 𝑑𝑒 shown in

Figures 5.7a, 5.7b and 5.7c. As expected, the adversary takes significantly longer to reach the

physical node with 𝑑𝑒 = 0.1 as compared to 𝑑𝑒 = 1.0. Finally, as the defender can only observe

the discomfort in HAG, we evaluated the same for the prior defined values of 𝑑𝑒 using the obtained

policies of {𝜋∗, 𝜶∗} shown in Figure 5.8a, 5.8b and 5.8c. The results show a decrease in discomfort

for the lowest value of 𝑑𝑒 := 0.1.

Our approach to adversarial network hardening provides a principled defense planning solution

in the presence of an adversary. Despite the defender’s limited knowledge of the adversary’s

177

movements and only being able to measure physical attributes, our approach prevents the adversary

from gaining privileges in the HAG. Our framework optimizes network hardening and adversary

cost simultaneously, resulting in robust policies for both players, leading to an approximate best-

response pair for the non-zero-sum game. This approach offers a promising defense mechanism

against adversarial attacks.

(a)

(b)

Figure 5.6 (a) Cyber exploits weights obtained from the result of Algorithm 3 with a cyber cost
factor of 𝑑𝑒 = 0.1, 0.5 and 1.0. (b) Time to reach the physical node 9 for varying hardening cost
factor and compared with the expected time to reach (𝐽AMC) obtained from (5.11).

(a)

(b)

(c)

Figure 5.7 Sample node trajectory obtained from an attack policy with a hardening cost factor of
(a) 𝑑𝑒 = 1.0, (b)𝑑𝑒 = 0.5, and (c) 𝑑𝑒 = 0.1, where with null action corresponding to no action taken
by the adversary.

178

0.00.51.0123456789101112Parametersde=0.1de=0.5de=1.00.10.51.0Hardening cost factor de1020304050Time to reach node 9JAMC0255075Time (hour)123456789NodesTransitionPhysical node0255075Time (hour)123456789NodesTransitionPhysical node0255075Time (hour)123456789NodesTransitionNullPhysical node(a)

(b)
Figure 5.8 Discomfort corresponding under the optimal policy {𝜋∗, 𝜶∗} for the hardening cost
factor (a) 𝑑𝑒 := 1.0, (b) 𝑑𝑒 := 0.5, and (c) 𝑑𝑒 := 0.1.

(c)

5.5 Summary

This chapter developed a domain-aware framework for automated adversarial defense planning,

accounting for cross-layer interaction between the cyber and physical components of a CPS. Our

approach leveraged an MDP with a hybrid state representing the cyber (discrete) and physical

(continuous) state of the system to capture the adversary’s progression over the HAG. We formulated

the automated defense planning as a non-zero-sum game between an adversary and a defender. We

used Actor Critic, a RL method and Bayesian optimization to iteratively solve the adversary’s

and defender’s problem, respectively. Finally, we demonstrated the effectiveness of our proposed

framework on a ransomware inspired graph in conjunction with smart building dynamics. The

obtained results show a hardened network for varying hardening costs along with diminishing

marginal improvement for both players. A preliminary set of results with an adversary emulation

were presented in [28]. The results presented in this chapter were recently published in [22].

179

0255075Time (hour)0.00.10.20.30.4Discomfort0255075Time (hour)0.00.20.4Discomfort0255075Time (hour)0.00.10.20.30.4DiscomfortCHAPTER 6

FUTURE DIRECTION

In this thesis, we developed a diverse range of adversarial models commonly encountered in various

CPS. For each type of adversarial model, we developed a corresponding game-theoretic framework

to reason about the possible defensive actions.

The first part of the thesis consisted of state-independent adversarial models. First, we presented

a deterministic adversary and developed corresponding defensive strategies, demonstrating the use

of such a framework in path planning applications. Next, we extended the adversarial model

to a stochastic adversary, which includes both benign and adversarial actions. Furthermore, we

incorporated the concept of budget and its variants. The budget enables accounting for false

positives and prevents the defender from choosing overly conservative policies. We also extended

the analysis to more than two actions per player and characterized the Nash equilibrium policies

for both players. We demonstrated the use of this stochastic adversarial game-theoretic framework

in motion planning problems and resilient estimation.

In the second part of the thesis, we focused on state-based adversarial models, accounting for

state and control-dependent costs. We presented a game of resource takeovers in dynamical systems.

In such games, the adversary can completely take over the system and drive it to undesirable states.

We characterized the Nash equilibrium takeover strategies for both players and derived conditions

under which a linear state-feedback control law exists for both players. We applied this game-

theoretic framework of resource takeovers to linear dynamical systems.

Finally, we presented a data-driven domain-aware approach to safeguard CPS. We created an

automated approach to emulate an adversary in a high-fidelity model and determine the corre-

sponding optimal adversary policy using reinforcement learning. For the optimal adversary policy,

we determined defensive strategies using Bayesian optimization. We demonstrated the application

of this framework in a smart building system. We present future directions for each part of the

thesis as follows:

• Deterministic Adversary: Future directions encompass non-zero-sum formulations, which

180

could model different objectives for both the adversary and defender. We also aim to relax the

constraint of single-edge attacks and defenses over the graph to include attacks over multiple

edges. Formulations involving multiple vehicles are also a topic of future investigation.

• Stochastic Adversary: Future directions include a non-zero-sum formulation of the M-SSG,

taking into account different objectives for the defender and the malign player. Asymmetric

or partial information in the SSG is another promising direction. Furthermore, we aim to

integrate control-oriented applications into the M-SSG framework, such as the incorporation

of adversarial multi-armed bandit systems and multi-plant control.

• Takeover Adversary: Our future efforts will focus on expanding the scope of our model.

We aim to incorporate partial state observability, wherein the discrete FlipDyn state of the

system needs to be estimated. We also plan to introduce bounded process and measurement

noise into the framework, investigating its impact on the FlipDyn-Con game. Additionally,

we plan to extend the number of FlipDyn states to more than two. Lastly, we intend to

conduct a comparative study between our established solution and a learning-based approach,

evaluating their performance across various objectives and cost functions.

• Data-Driven Adversary: Future work will focus on studying the convergence properties of

our proposed approach. Additionally, integrating an Intrusion Detection System (IDS) and

an Intrusion Response System (IRS) on the cyber layer would enable a more informed and

active defender. We also plan to extend the defender’s policy from a static network hardening

approach to an active network reconfiguration with one or multiple adversaries in the HAG.

Exploring zero-day exploits and preemptive defense mechanisms within the framework is

another area of interest. Finally, we will investigate the strategic use of backup systems

and their interaction within the CPS. These backup systems could represent hidden parts

of the HAG, and the defender may choose to activate them to improve the current system’s

performance.

181

[1] MITRE ATT&CK, https://attack.mitre.org, 2021.

BIBLIOGRAPHY

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

Gaurav Kumar Agarwal, Mohammed Karmoose, Suhas N. Diggavi, Christina Fragouli,
and Paulo Tabuada. Distorting an adversary’s view in cyber-physical systems. 2018 IEEE
Conference on Decision and Control (CDC), pages 1476–1481, 2018.

BoHyun Ahn, Taesic Kim, Jinchun Choi, Sung-won Park, Kuchan Park, and Dongjun Won.
A cyber kill chain model for distributed energy resources (der) aggregation systems. In 2021
IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT),
pages 1–5, 2021.

Abdullah Al-Dujaili, Erik Hemberg, and Una-May O’Reilly. Approximating Nash Equilibria
for Black-Box Games: A Bayesian Optimization Approach. In International Workshop on
Optimization in Multiagent Systems. AAMAS, 2018.

Rawan Al-Shaer, Jonathan M. Spring, and Eliana Christou. Learning the associations of
mitre att & ck adversarial techniques. In 2020 IEEE Conference on Communications and
Network Security (CNS), pages 1–9, 2020.

Otis Alexander, Misha Belisle, and Jacob Steele. Mitre att&ck® for industrial control
systems: Design and philosophy. Technical report, 2020.

Tansu Alpcan and Tamer Başar. Network security: A decision and game-theoretic approach.
Cambridge University Press, 2010.

Paul Ammann, Duminda Wĳesekera, and Saket Kaushik. Scalable, graph-based network
vulnerability analysis. CCS ’02, page 217–224, New York, NY, USA, 2002. Association for
Computing Machinery.

Cyrus Anderson, Ram Vasudevan, and Matthew Johnson-Roberson. A kinematic model for
trajectory prediction in general highway scenarios. IEEE Robotics and Automation Letters,
6(4):6757–6764, 2021.

[10] Anup Aprem and Stephen Roberts. A Bayesian Optimization Approach to Compute
Nash Equilibrium of Potential Games Using Bandit Feedback. The Computer Journal,
64(12):1801–1813, 12 2019.

[11] Martin Arvidsson and Ida Gremyr. Principles of robust design methodology. Quality and

Reliability Engineering International, 24(1):23–35, 2008.

[12] Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. Basic concepts and
IEEE transactions on dependable and

taxonomy of dependable and secure computing.
secure computing, 1(1):11–33, 2004.

[13] Georgios Bakirtzis, Bryan T. Carter, Carl R. Elks, and Cody H. Fleming. A model-based
approach to security analysis for cyber-physical systems. In 2018 Annual IEEE International
Systems Conference (SysCon), pages 1–8, 2018.

182

[14] Sandeep Banik and Shaunak D. Bopardikar. Secure route planning using dynamic games
with stopping states. In 2020 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 2404–2409, 2020.

[15] Sandeep Banik and Shaunak D. Bopardikar. Attack-resilient path planning using dynamic

games with stopping states. IEEE Transactions on Robotics, 38(1):25–41, 2022.

[16] Sandeep Banik and Shaunak D. Bopardikar. Flipdyn: A game of resource takeovers in
dynamical systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages
2506–2511, 2022.

[17] Sandeep Banik and Shaunak D. Bopardikar. FlipDyn: A game of resource takeovers in
dynamical systems. In to appead, 2022 IEEE Conference on Decision and Control (CDC).
IEEE, 2022.

[18] Sandeep Banik and Shaunak D. Bopardikar. Stochastic games with stopping states and
their application to adversarial motion planning problems. In 2022 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 13181–13188, 2022.

[19] Sandeep Banik and Shaunak D. Bopardikar. Stochastic games with stopping states and
their application to adversarial motion planning problems. In 2022 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 13181–13188, 2022.

[20] Sandeep Banik and Shaunak D. Bopardikar. Budget-based stochastic games with stopping

states. In European Journal of Control (Under review), 2023.

[21] Sandeep Banik and Shaunak D. Bopardikar. FlipDyn with control: Resource takeover games

with dynamics. IEEE Transactions on Automatic Control (under review), 2023.

[22] Sandeep Banik, Thiagarajan Ramachandran, Arnab Bhattacharya, and Shaunak D Bopar-
dikar. Automated adversary-in-the-loop cyber-physical defense planning. ACM Transactions
on Cyber-Physical Systems, 2023.

[23] Tamer Başar and Geert Jan Olsder. Dynamic Noncooperative Game Theory, 2nd Edition.

Society for Industrial and Applied Mathematics, 1998.

[24] Halil Bayrak and Matthew D Bailey. Shortest path network interdiction with asymmetric

information. Networks: An International Journal, 52(3):133–140, 2008.

[25] Dimitris Bertsimas and Omid Nohadani. Robust optimization with simulated annealing.

Journal of Global Optimization, 48(2):323–334, 2010.

[26] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Nonconvex robust optimization
for problems with constraints. INFORMS journal on computing, 22(1):44–58, 2010.

[27] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization for un-
constrained simulation-based problems. Operations research, 58(1):161–178, 2010.

183

[28] Arnab Bhattacharya, Thiagarajan Ramachandran, Sandeep Banik, Chase P Dowling, and
Shaunak D Bopardikar. Automated adversary emulation for cyber-physical systems via
reinforcement learning. In 2020 IEEE International Conference on Intelligence and Security
Informatics (ISI), pages 1–6. IEEE, 2020.

[29] Gianluca Bianchin, Yin-Chen Liu, and Fabio Pasqualetti. Secure navigation of robots in

adversarial environments. IEEE Control Systems Letters, 4(1):1–6, 2019.

[30] Kevin D Bowers, Marten Van Dĳk, Robert Griffin, Ari Juels, Alina Oprea, Ronald L Rivest,
and Nikos Triandopoulos. Defending against the unknown enemy: Applying flipit to system
security. In International Conference on Decision and Game Theory for Security, pages
248–263. Springer, 2012.

[31] Steven J Bradtke. Incremental dynamic programming for on-line adaptive optimal control.

PhD thesis, Citeseer, 1994.

[32] Anna L Buczak and Erhan Guven. A survey of data mining and machine learning methods for
cyber security intrusion detection. IEEE Communications surveys & tutorials, 18(2):1153–
1176, 2015.

[33] Elisa Canzani and Stefan Pickl. Cyber epidemics: Modeling attacker-defender dynamics in
critical infrastructure systems. In Advances in Human Factors in Cybersecurity: Proceedings
of the AHFE 2016 International Conference on Human Factors in Cybersecurity, July 27-31,
2016, Walt Disney World®, Florida, USA, pages 377–389. Springer, 2016.

[34] Yulong Cao, Chaowei Xiao, Benjamin Cyr, Yimeng Zhou, Won Park, Sara Rampazzi,
Qi Alfred Chen, Kevin Fu, and Zhuoqing Morley Mao. Adversarial Sensor Attack on LiDAR-
based Perception in Autonomous Driving. In Proceedings of the 26th ACM Conference on
Computer and Communications Security (CCS’19), London, UK, November 2019.

[35] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge

university press, 2006.

[36] Somali Chaterji, Parinaz Naghizadeh, Muhammad Ashraful Alam, Saurabh Bagchi, Mung
Chiang, David Corman, Brian Henz, Suman Jana, Na Li, Shaoshuai Mou, Meeko Oishi,
Chunyi Peng, Tiark Rompf, Ashutosh Sabharwal, Shreyas Sundaram, James Weimer, and
Jennifer Weller. Resilient cyberphysical systems and their application drivers: A technology
roadmap, 2019.

[37] Genshe Chen, Dan Shen, Chiman Kwan, Jose B. Cruz, and Martin Kruger. Game theoretic
approach to threat prediction and situation awareness. In 2006 9th International Conference
on Information Fusion, pages 1–8, 2006.

[38] Thomas M Chen, Juan Carlos Sanchez-Aarnoutse, and John Buford. Petri net modeling of
cyber-physical attacks on smart grid. IEEE Transactions on smart grid, 2(4):741–749, 2011.

[39] Ye Chen, Yanda Li, Dongjin Xu, and Liang Xiao. DQN-based power control for IoT
transmission against jamming. In 2018 IEEE 87th Vehicular Technology Conference (VTC
Spring), pages 1–5. IEEE, 2018.

184

[40] Ying Chen, Shaowei Huang, Feng Liu, Zhisheng Wang, and Xinwei Sun. Evaluation of
reinforcement learning-based false data injection attack to automatic voltage control. IEEE
Transactions on Smart Grid, 10(2):2158–2169, 2018.

[41] Seungoh Choi, Jeong-Han Yun, and Byung-Gil Min. Probabilistic attack sequence generation
and execution based on mitre att&ck for ics datasets. In Cyber Security Experimentation
and Test Workshop, CSET ’21, page 41–48, New York, NY, USA, 2021. Association for
Computing Machinery.

[42] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction

to algorithms. MIT press, 2009.

[43] Mathieu Dahan and Saurabh Amin. Network flow routing under strategic link disruptions.
In 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton),
pages 353–360. IEEE, 2015.

[44] Eric Dallal, Daniel Neider, and Paulo Tabuada. Synthesis of safety controllers robust to
In 2016 IEEE 55th Conference on Decision and

unmodeled intermittent disturbances.
Control (CDC), pages 7425–7430. IEEE, 2016.

[45] Peter Dayan and Terrence J Sejnowski. Td converges with probability 1. Machine Learning,

14(3):295–301, 1994.

[46] Seyed Mehran Dibaji, Mohammad Pirani, David Bezalel Flamholz, Anuradha M. An-
naswamy, Karl Henrik Johansson, and Aranya Chakrabortty. A systems and control perspec-
tive of cps security. Annual Reviews in Control, 47:394–411, 2019.

[47]

Jerry Ding, Maryam Kamgarpour, Sean Summers, Alessandro Abate, John Lygeros, and
Claire Tomlin. A stochastic games framework for verification and control of discrete time
stochastic hybrid systems. Automatica, 49(9):2665–2674, 2013.

[48] Daniel dos Santos, Clement Speybrouck, and Elisa Costante. Cybersecurity in Building

Automation Systems. Technical report, Forescout Technologies, 2019.

[49] Karel Durkota, Viliam Lisy, Branislav Bošansky, and Christopher Kiekintveld. Optimal net-
work security hardening using attack graph games. In Proceedings of the 24th International
Conference on Artificial Intelligence, ĲCAI’15, page 526–532. AAAI Press, 2015.

[50] Mahsa Emami-Taba and Ladan Tahvildari. A Bayesian game decision-making model for
uncertain adversary types. In Proceedings of the 26th Annual International Conference on
Computer Science and Software Engineering, CASCON ’16, page 39–49, USA, 2016. IBM
Corp.

[51] Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Designing fast absorbing
markov chains. Proceedings of the AAAI Conference on Artificial Intelligence, 28(1), Jun.
2014.

[52] Hamza Fawzi, Paulo Tabuada, and Suhas Diggavi. Security for control systems under sensor
and actuator attacks. In 2012 IEEE 51st IEEE conference on Decision and Control (CDC),
pages 3412–3417. IEEE, 2012.

185

[53] Carmel Fiscko, Soummya Kar, and Bruno Sinopoli. Efficient solutions for targeted control
of multi-agent mdps. In 2021 American control conference (acc), pages 690–696. IEEE,
2021.

[54] Carmel Fiscko, Soummya Kar, and Bruno Sinopoli. Cluster-based control of transition-

independent mdps. arXiv preprint arXiv:2207.05224, 2022.

[55] Carmel Fiscko, Brian Swenson, Soummya Kar, and Bruno Sinopoli. Control of parametric
games. In 2019 18th European Control Conference (ECC), pages 1036–1042. IEEE, 2019.

[56] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights.

Games and Economic Behavior, 29(1-2):79–103, 1999.

[57] Andrey Garnaev, Melike Baykal-Gursoy, and H Vincent Poor. Security games with unknown
adversarial strategies. IEEE transactions on cybernetics, 46(10):2291–2299, 2015.

[58] TN Goh. Taguchi methods: some technical, cultural and pedagogical perspectives. Quality

and Reliability Engineering International, 9(3):185–202, 1993.

[59]

James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory
of Games, 3(2):97–139, 1957.

[60] Peter J Hawrylak, Michael Haney, Mauricio Papa, and John Hale. Using hybrid attack graphs
to model cyber-physical attacks in the smart grid. In 2012 5th International Symposium on
Resilient Control Systems, pages 161–164. IEEE, 2012.

[61]

[62]

João P Hespanha. Noncooperative game theory: An introduction for engineers and computer
scientists. Princeton University Press, 2017.

João P Hespanha and Shaunak D Bopardikar. Output-feedback linear quadratic robust control
under actuation and deception attacks. In 2019 American Control Conference (ACC), pages
489–496. IEEE, 2019.

[63] Karel Horák, Quanyan Zhu, and Branislav Bošanský. Manipulating adversary’s belief: A
dynamic game approach to deception by design for proactive network security. In Stefan
Rass, Bo An, Christopher Kiekintveld, Fei Fang, and Stefan Schauer, editors, Decision and
Game Theory for Security, pages 273–294, Cham, 2017. Springer International Publishing.

[64] Ashish R. Hota, Abraham A. Clements, Saurabh Bagchi, and Shreyas Sundaram. A Game-
Theoretic Framework for Securing Interdependent Assets in Networks, pages 157–184.
Springer International Publishing, Cham, 2018.

[65] Linan Huang and Quanyan Zhu. Dynamic Bayesian Games for Adversarial and Defensive

Cyber Deception, pages 75–97. Springer International Publishing, Cham, 2019.

[66] Yunhan Huang, Zehui Xiong, and Quanyan Zhu. Cross-layer coordinated attacks on cyber-
physical systems: A lqg game framework with controlled observations. In 2021 European
Control Conference (ECC), pages 521–528. IEEE, 2021.

186

[67] Mariam Ibrahim and Ahmad Alsheikh. Automatic hybrid attack graph (ahag) generation for

complex engineering systems. Processes, 7(11), 2019.

[68] Petros A Ioannou and Cheng-Chih Chien. Autonomous intelligent cruise control. IEEE

Transactions on Vehicular technology, 42(4):657–672, 1993.

[69] Eitan Israeli and R Kevin Wood. Shortest-path network interdiction. Networks: An Interna-

tional Journal, 40(2):97–111, 2002.

[70] Anna Jaśkiewicz and Andrzej S Nowak. On pure stationary almost markov nash equilibria
in nonzero-sum arat stochastic games. Mathematical Methods of Operations Research,
81(2):169–179, 2015.

[71] A. Y. Javaid, W. Sun, V. K. Devabhaktuni, and M. Alam. Cyber security threat analysis and
modeling of an unmanned aerial vehicle system. In 2012 IEEE Conference on Technologies
for Homeland Security (HST), pages 585–590, Nov 2012.

[72] Yunhan Jia Jia, Yantao Lu, Junjie Shen, Qi Alfred Chen, Hao Chen, Zhenyu Zhong, and
Tao Wei Wei. Fooling detection alone is not enough: Adversarial attack against multiple
object tracking. In International Conference on Learning Representations (ICLR’20), 2020.

[73] Benjamin Johnson, Aron Laszka, and Jens Grossklags. Games of timing for security in
In Decision and Game Theory for Security: 6th International
dynamic environments.
Conference, GameSec 2015, London, UK, November 4-5, 2015, Proceedings 6, pages 57–
73. Springer, 2015.

[74] Nathan Koenig and Andrew Howard. Design and use paradigms for gazebo, an open-source
multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS)(IEEE Cat. No. 04CH37566), volume 3, pages 2149–2154. IEEE, 2004.

[75] Efstathios Kontouras, Anthony Tzes, and Leonidas Dritsas. Adversary control strategies
for discrete-time systems. In 2014 European Control Conference (ECC), pages 2508–2513.
IEEE, 2014.

[76] Efstathios Kontouras, Anthony Tzes, and Leonidas Dritsas. Covert attack on a discrete-time
system with limited use of the available disruption resources. In 2015 European Control
Conference (ECC), pages 812–817. IEEE, 2015.

[77] Andreas Krause, Alex Roper, and Daniel Golovin. Randomized sensing in adversarial

environments. In International Joint Conference on Artificial Intelligence, 2011.

[78] Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C Suh, Ikkyun Kim, and Kuinam J
Kim. A survey of deep learning-based network anomaly detection. Cluster Computing,
22(1):949–961, 2019.

[79] Harjinder Singh Lallie, Kurt Debattista, and Jay Bal. A review of attack graph and attack

tree visual syntax in cyber security. Computer Science Review, 35:100219, 2020.

[80] Ralph Langner. Stuxnet: Dissecting a cyberwarfare weapon. IEEE Security & Privacy,

9(3):49–51, 2011.

187

[81] Aron Laszka, Gabor Horvath, Mark Felegyhazi, and Levente Buttyán. FlipThem: Modeling
targeted attacks with FlipIt for multiple resources. In International Conference on Decision
and Game Theory for Security, pages 175–194. Springer, 2014.

[82] Chanhwa Lee, Hyungbo Shim, and Yongsoon Eun. Secure and robust state estimation under
sensor attacks, measurement noises, and process disturbances: Observer-based combina-
torial approach. In 2015 European Control Conference (ECC), pages 1872–1877. IEEE,
2015.

[83] Robert M Lee, Michael J Assante, and Tim Conway. German steel mill cyber attack.

Technical report, 2014.

[84] David Leslie, Chris Sherfield, and Nigel P Smart. Threshold flipthem: When the winner does
not need to take all. In Decision and Game Theory for Security: 6th International Conference,
GameSec 2015, London, UK, November 4-5, 2015, Proceedings 6, pages 74–92. Springer,
2015.

[85]

John Leyden. Polish teen derails tram after hacking train network, 2008.

[86] Chong Li and Meikang Qiu. Reinforcement Learning for Cyber-Physical Systems: with

Cybersecurity Case Studies. CRC Press, 2019.

[87] L. Li and J. S. Shamma. Efficient strategy computation in zero-sum asymmetric information
repeated games. IEEE Transactions on Automatic Control, 65(7):2785–2800, 2020.

[88] Li Li, Huixia Zhang, Yuanqing Xia, and Hongjiu Yang. Security estimation under denial-

of-service attack with energy constraint. Neurocomputing, 292:111–120, 2018.

[89] Yuzhe Li, Aryan Saadat Mehr, and Tongwen Chen. Multi-sensor transmission power control
for remote estimation through a sinr-based communication channel. Automatica, 101:78–86,
2019.

[90]

Jinliang Liu, Liang Xiao, Guolong Liu, and Yifeng Zhao. Active authentication with
reinforcement learning based on ambient radio signals. Multimedia Tools and Applications,
76(3):3979–3998, 2017.

[91] Yin-Chen Liu, Gianluca Bianchin, and Fabio Pasqualetti. Secure trajectory planning against

undetectable spoofing attacks. Automatica, 112:108655, 2020.

[92] Zhaoxi Liu and Lingfeng Wang. Flipit game model-based defense strategy against cyberat-
tacks on scada systems considering insider assistance. IEEE Transactions on Information
Forensics and Security, 16:2791–2804, 2021.

[93] George Louthan, Phoebe Hardwicke, Peter Hawrylak, and John Hale. Toward hybrid attack
dependency graphs. In Proceedings of the Seventh Annual Workshop on Cyber Security and
Information Intelligence Research, CSIIRW ’11, New York, NY, USA, 2011. Association
for Computing Machinery.

188

[94] Xiaozhen Lu, Dongjin Xu, Liang Xiao, Lei Wang, and Weihua Zhuang. Anti-jamming
communication game for UAV-aided VANETs. In GLOBECOM 2017 - 2017 IEEE Global
Communications Conference, pages 1–6, 2017.

[95] Mayra Macas and Wu Chunming. Enhanced cyber-physical security through deep learning
In 2019 Proceedings of the Cyber-Physical Systems PhD Workshop, pages

techniques.
72–83, 2013.

[96] Magdi S Mahmoud, Mutaz M Hamdan, and Uthman A Baroudi. Modeling and control of
cyber-physical systems subject to cyber attacks: A survey of recent advances and challenges.
Neurocomputing, 338:101–115, 2019.

[97] K. Mansfield, T. Eveleigh, T. H. Holzer, and S. Sarkani. Unmanned aerial vehicle smart
In 2013 IEEE International

device ground control station cyber security threat model.
Conference on Technologies for Homeland Security (HST), pages 722–728, Nov 2013.

[98]

J. P. McDermott. Attack net penetration testing. In Proceedings of the 2000 Workshop on
New Security Paradigms, NSPW ’00, page 15–21, New York, NY, USA, 2001. Association
for Computing Machinery.

[99] Fei Miao, Quanyan Zhu, Miroslav Pajic, and George J Pappas. Coding schemes for securing
cyber-physical systems against stealthy data injection attacks. IEEE Transactions on Control
of Network Systems, 4(1):106–117, 2016.

[100] Fei Miao, Quanyan Zhu, Miroslav Pajic, and George J Pappas. A hybrid stochastic game for

secure control of cyber-physical systems. Automatica, 93:55–63, 2018.

[101] Erik Miehling, Cedric Langbort, and Tamer Başar. Secure contingency prediction and
response for cyber-physical systems. In 2020 IEEE Conference on Control Technology and
Applications (CCTA), pages 998–1003, 2020.

[102] Erik Miehling, Mohammad Rasouli, and Demosthenis Teneketzis. Optimal defense policies
for partially observable spreading processes on bayesian attack graphs. In Proceedings of
the Second ACM Workshop on Moving Target Defense, MTD ’15, page 67–76, New York,
NY, USA, 2015. Association for Computing Machinery.

[103] Erik Miehling, Mohammad Rasouli, and Demosthenis Teneketzis. Control-Theoretic Ap-
proaches to Cyber-Security, page 12–28. Springer-Verlag, Berlin, Heidelberg, 2022.

[104] Yilin Mo, Joao Hespanha, and Bruno Sinopoli. Robust detection in the presence of integrity
attacks. In 2012 American Control Conference (ACC), pages 3541–3546. IEEE, 2012.

[105] Athira M Mohan, Nader Meskin, and Hasan Mehrjerdi. Covert attack in load frequency
control of power systems. In 2020 6th IEEE International Energy Conference (ENERGYCon),
pages 802–807. IEEE, 2020.

[106] Luan Nguyen and Vĳay Gupta. Towards a framework of enforcing resilient operation of
cyber-physical systems with unknown dynamics. IET Cyber-Physical Systems: Theory &
Applications, 6(3):125–138, 2021.

189

[107] Thanh Thi Nguyen and Vĳay Janapa Reddi. Deep reinforcement learning for cyber security.

IEEE Transactions on Neural Networks and Learning Systems, pages 1–17, 2021.

[108] Zhen Ni and Shuva Paul. A multistage game in smart grid security: A reinforcement learning
solution. IEEE transactions on neural networks and learning systems, 30(9):2684–2695,
2019.

[109] Felix O. Olowononi, Danda B Rawat, and Chunmei Liu. Resilient machine learning for
networked cyber physical systems: A survey for machine learning security to securing
IEEE Communications Surveys & Tutorials, 23(1):524–552,
machine learning for cps.
2021.

[110] Miroslav Pajic, James Weimer, Nicola Bezzo, Paulo Tabuada, Oleg Sokolsky, Insup Lee,
and George J Pappas. Robustness of attack-resilient state estimators. In 2014 ACM/IEEE
International Conference on Cyber-Physical Systems (ICCPS), pages 163–174. IEEE, 2014.

[111] Kyuchan Park, Bohyun Ahn, Jinsan Kim, Dongjun Won, Youngtae Noh, Jinchun Choi, and
Taesic Kim. An advanced persistent threat (apt)-style cyberattack testbed for distributed
energy resources (der). In 2021 IEEE Design Methodologies Conference (DMC), pages 1–5,
2021.

[112] Praveen Paruchuri, Jonathan P Pearce, Milind Tambe, Fernando Ordonez, and Sarit Kraus.
An efficient heuristic approach for security against multiple adversaries. In Proceedings of
the 6th international joint conference on Autonomous agents and multiagent systems, pages
1–8, 2007.

[113] Martin Pelikan, David E. Goldberg, and Erick Cantú-Paz. Boa: The bayesian optimization
In Proceedings of the 1st Annual Conference on Genetic and Evolutionary
algorithm.
Computation - Volume 1, GECCO’99, page 525–532, San Francisco, CA, USA, 1999.
Morgan Kaufmann Publishers Inc.

[114] Lianghong Peng, Ling Shi, Xianghui Cao, and Changyin Sun. Optimal attack energy alloca-
tion against remote state estimation. IEEE Transactions on Automatic Control, 63(7):2199–
2205, 2018.

[115] Lina Perelman and Saurabh Amin. A network interdiction model for analyzing the vulner-
ability of water distribution systems. In Proceedings of the 3rd international conference on
High confidence networked systems, pages 135–144, 2014.

[116] Damien Picard, Ján Drgoňa, Michal Kvasnica, and Lieve Helsen. Impact of the controller
model complexity on model predictive control performance for buildings. Energy and
Buildings, 152:739–751, 2017.

[117] Victor Picheny, Mickael Binois, and Abderrahmane Habbal. A Bayesian optimization

approach to find Nash equilibria. Journal of Global Optimization, 73(1):171–192, 2019.

[118] PNNL. Python systems library, 2019.

190

[119] J-P Ponssard and Sylvain Sorin. The lp formulation of finite zero-sum games with incomplete

information. International Journal of Game Theory, 9(2):99–105, 1980.

[120] Tereza Pultarova. Cyber security-ukraine grid hack is wake-up call for network operators

[news briefing]. Engineering & Technology, 11(1):12–13, 2016.

[121] Tirukkannamangai ES Raghavan, SH Tĳs, and OJ Vrieze. On stochastic games with additive
reward and transition structure. Journal of Optimization Theory and Applications, 47(4):451–
464, 1985.

[122] Nils Miro Rodday, Ricardo de O Schmidt, and Aiko Pras. Exploring security vulnerabilities
In NOMS 2016-2016 IEEE/IFIP Network Operations and

of unmanned aerial vehicles.
Management Symposium, pages 993–994. IEEE, 2016.

[123] Sudip Saha, Anil Vullikanti, and Mahantesh Halappanavar. Flipnet: Modeling covert and
persistent attacks on networked resources. In 2017 IEEE 37th International Conference on
Distributed Computing Systems (ICDCS), pages 2444–2451, 2017.

[124] Sudip Saha, Anil Kumar S. Vullikanti, Mahantesh Halappanavar, and Samrat Chatterjee.
In 2016

Identifying vulnerabilities and hardening attack graphs for networked systems.
IEEE Symposium on Technologies for Homeland Security (HST), pages 1–6, 2016.

[125] Dinuka Sahabandu, Shana Moothedath, Joey Allen, Linda Bushnell, Wenke Lee, and Radha
Poovendran. Stochastic dynamic information flow tracking game with reinforcement learn-
ing.
In Tansu Alpcan, Yevgeniy Vorobeychik, John S. Baras, and György Dán, editors,
Decision and Game Theory for Security, pages 417–438, Cham, 2019. Springer Interna-
tional Publishing.

[126] Dinuka Sahabandu, Shana Moothedath, Joey Allen, Linda Bushnell, Wenke Lee, and Radha
Poovendran. A reinforcement learning approach for dynamic information flow tracking
games for detecting advanced persistent threats, 2021.

[127] Dinuka Sahabandu, Baicen Xiao, Andrew Clark, Sangho Lee, Wenke Lee, and Radha
Poovendran. Dift games: Dynamic information flow tracking games for advanced persistent
threats. In 2018 IEEE Conference on Decision and Control (CDC), pages 1136–1143, 2018.

[128] Ahmed Salem, Xuening Liao, Yulong Shen, and Xiang Lu. Provoking the adversary by dual
detection techniques: A game theoretical framework. In 2017 International Conference on
Networking and Network Applications (NaNA), pages 326–329. IEEE, 2017.

[129] Anibal Sanjab, Walid Saad, and Tamer Basar. A game of drones: Cyber-physical security of
time-critical UAV applications with cumulative prospect theory perceptions and valuations.
CoRR, abs/1902.03506, 2019.

[130] Anibal Sanjab, Walid Saad, and Tamer Başar. Prospect theory for enhanced cyber-physical
security of drone delivery systems: A network interdiction game. 2017 IEEE International
Conference on Communications (ICC), pages 1–6, 2017.

191

[131] Aaron Schlenker, Omkar Thakoor, Haifeng Xu, Fei Fang, Milind Tambe, Long Tran-Thanh,
Phebe Vayanos, and Yevgeniy Vorobeychik. Deceiving Cyber Adversaries: A Game Theo-
retic Approach. AAMAS ’18, page 892–900, Richland, SC, 2018. International Foundation
for Autonomous Agents and Multiagent Systems.

[132] Lloyd S Shapley and RN Snow. Basic solutions of discrete games. Contributions to the

Theory of Games, 1:27–35, 1952.

[133] Devendra Shelar and Saurabh Amin. Security assessment of electricity distribution networks
under DER node compromises. IEEE Transactions on Control of Network Systems, 4(1):23–
36, 2016.

[134] Dan Simon. Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John

Wiley & Sons, 2006.

[135] Roy S Smith. Covert misappropriation of networked control systems: Presenting a feedback

structure. IEEE Control Systems Magazine, 35(1):82–92, 2015.

[136] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-
IEEE
theoretic regret bounds for Gaussian process optimization in the bandit setting.
Transactions on Information Theory, 58(5):3250–3265, 2012.

[137] G Edward Suh, Jae W Lee, David Zhang, and Srinivas Devadas. Secure program execution
via dynamic information flow tracking. ACM Sigplan Notices, 39(11):85–96, 2004.

[138] Jiachen Sun, Yulong Cao, Qi Alfred Chen, and Z Morley Mao. Towards robust lidar-
based perception in autonomous driving: General black-box adversarial sensor attack and
countermeasures. In 29th {USENIX} Security Symposium ({USENIX} Security 20), pages
877–894, 2020.

[139] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT

press, 2018.

[140] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver,
Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference
learning with linear function approximation. In Proceedings of the 26th Annual International
Conference on Machine Learning, ICML ’09, page 993–1000, New York, NY, USA, 2009.
Association for Computing Machinery.

[141] Wen Tian, Xiao-Peng Ji, Weiwei Liu, Jiangtao Zhai, Guangjie Liu, Yuewei Dai, and Shuhua
Huang. Honeypot game-theoretical model for defending against apt attacks with limited
resources in cyber-physical systems. Etri Journal, 41(5):585–598, 2019.

[142] Anastasios Tsiamis, Andreea B. Alexandru, and George J. Pappas. Motion planning with

secrecy. 2019 American Control Conference (ACC), pages 784–791, 2019.

[143] Kyriakos G Vamvoudakis, João P Hespanha, Bruno Sinopoli, and Yilin Mo. Adversarial
detection as a zero-sum game. In 2012 IEEE 51st IEEE Conference on Decision and Control
(CDC), pages 7133–7138. IEEE, 2012.

192

[144] Kyriakos G Vamvoudakis, Joao P Hespanha, Bruno Sinopoli, and Yilin Mo. Detection in
IEEE Transactions on Automatic Control, 59(12):3209–3223,

adversarial environments.
2014.

[145] Marten Van Dĳk, Ari Juels, Alina Oprea, and Ronald L Rivest. Flipit: The game of “stealthy

takeover”. Journal of Cryptology, 26(4):655–713, 2013.

[146] Huan Wang, Zhanfang Chen, Jianping Zhao, Xiaoqiang Di, and Dan Liu. A vulnerability
assessment method in industrial internet of things based on attack graph and maximum flow.
IEEE Access, 6:8599–8609, 2018.

[147] Alan Washburn and Kevin Wood. Two-person zero-sum games for network interdiction.

Operations research, 43(2):243–251, 1995.

[148] Chathurika S. Wickramasinghe, Daniel L. Marino, Kasun Amarasinghe, and Milos Manic.
Generalization of deep learning for cyber-physical system security: A survey. In IECON
2018 - 44th Annual Conference of the IEEE Industrial Electronics Society, pages 745–751,
2018.

[149] Liang Xiao, Yan Li, Guoan Han, Guolong Liu, and Weihua Zhuang. Phy-layer spoofing
detection with reinforcement learning in wireless networks. IEEE Transactions on Vehicular
Technology, 65(12):10037–10047, 2016.

[150] Wei Xing, Xudong Zhao, Tamer Başar, and Weiguo Xia. Security investment in cyber-
physical systems: Stochastic games with asymmetric information and resource-constrained
players. IEEE Transactions on Automatic Control, 67(10):5384–5391, 2021.

[151] Xin Xu, Lei Zuo, and Zhenhua Huang. Reinforcement learning algorithms with function

approximation: Recent advances and applications. Information Sciences, 261:1–31, 2014.

[152] Jun Yan, Haibo He, Xiangnan Zhong, and Yufei Tang. Q-learning-based vulnerability
analysis of smart grid against sequential topology attacks. IEEE Transactions on Information
Forensics and Security, 12(1):200–210, 2016.

[153] Dayong Ye, Tianqing Zhu, Sheng Shen, and Wanlei Zhou. A differentially private game
IEEE Transactions on Information

theoretic approach for deceiving cyber adversaries.
Forensics and Security, 16:569–584, 2020.

[154] Yuriy Zacchia Lun, Alessandro D’Innocenzo, Francesco Smarra, Ivano Malavolta, and
Maria Domenica Di Benedetto. State of the art of cyber-physical systems security: An
automatic control perspective. Journal of Systems and Software, 149:174–216, 2019.

[155] Heng Zhang, Peng Cheng, Ling Shi, and Jiming Chen. Optimal denial-of-service attack
scheduling with energy constraint. IEEE Transactions on Automatic Control, 60(11):3023–
3028, 2015.

[156] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A
selective overview of theories and algorithms, pages 321–384. Springer, Cham, 2021.

193

[157] Ming Zhang, Zizhan Zheng, and Ness B Shroff. A game theoretic model for defending against
In Decision and Game Theory for Security: 6th
stealthy attacks with limited resources.
International Conference, GameSec 2015, London, UK, November 4-5, 2015, Proceedings
6, pages 93–112. Springer, 2015.

[158] Ming Zhang, Zizhan Zheng, and Ness B Shroff. Defending against stealthy attacks on
multiple nodes with limited resources: A game-theoretic analysis. IEEE Transactions on
Control of Network Systems, 7(4):1665–1677, 2020.

[159] Yichi Zhang, Yingmeng Xiang, and Lingfeng Wang. Power system reliability assessment in-
corporating cyber attacks against wind farm energy management systems. IEEE transactions
on smart grid, 8(5):2343–2357, 2016.

[160] Lifeng Zhou, Vasileios Tzoumas, George J. Pappas, and Pratap Tokekar. Distributed
Attack-Robust Submodular Maximization for Multi-Robot Planning. arXiv e-prints, page
arXiv:1910.01208, October 2019.

[161] Quanyan Zhu and Tamer Basar. Game-theoretic methods for robustness, security, and
resilience of cyberphysical control systems: games-in-games principle for optimal cross-
layer resilient control systems. IEEE Control Systems Magazine, 35(1):46–65, 2015.

194