DEEP MULTI-AGENT REINFORCEMENT LEARNING FOR EFFICIENT AND SCALABLE NETWORKED SYSTEM CONTROL By Dong Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering—Doctor of Philosophy 2023 ABSTRACT Recently, intelligent systems, such as robots, connected automated vehicles, and smart grids have emerged as promising tools to enhance efficiency and sustainability across di- verse areas, including intelligent transportation, industrial automation, and energy manage- ment. These systems can connect with local communication networks, forming connected systems, and showing high scalability and robustness. Yet, controlling these connected sys- tems presents great challenges, mainly due to the high dimensionality of their state/action spaces and the complex interactions among their components. Traditional control methods often struggle in the real-time management of these systems, given their inherent complexity and uncertainties. Fortunately, reinforcement learning (RL), especially multi-agent reinforce- ment learning (MARL), offers an effective solution by leveraging adaptive online capabilities and their proficiency in solving intricate problems. In this thesis, three unique deep MARL algorithms are explored for safe, efficient, and scalable networked system control (NSC). The efficacy of these algorithms is validated in several practical and real-world applications, such as power grids and connected automated vehicles (CAVs). In the first algorithm, a safe, scalable, and efficient MARL framework is introduced specifically for on-ramp merging in mixed-traffic scenarios, where both human-driven vehicles and connected automated vehicles exist. By leveraging parameter sharing and local reward design, the framework fosters cooperation among agents without compromising on scalability. To mitigate the collision rates and expedite the training process, an innovative priority-based safety supervisor is developed and incorporated into the MARL framework. In addition, a gym-like simulation environment is developed and open-sourced, offering three traffic density levels. Extensive experimental results show that our proposed MARL model consistently surpasses several state-of-the-art (SOTA) benchmarks, showing its significant promise for managing CAVs in the specified on-ramp merging scenarios. In our second exploration, we propose a fully-decentralized MARL framework for Co- operative Adaptive Cruise Control (CACC). This approach differs substantially from the traditional centralized training and decentralized execution (CTDE) method. Within this framework, each agent acts based on its unique observations and rewards, eliminating the need for a central controller. In addition, we further introduce a quantization-based commu- nication protocol to enhance communication efficiency and reduce bandwidth consumption by employing randomized rounding to quantize each transmitted data piece, while only sending the non-zero components after quantization. Through the validation of two dis- tinct CACC scenarios, our method has proven to outperform SOTA models in both control precision and communication efficiency. In our third exploration, we present an efficient MARL algorithm specifically for cooper- ative control within power grids. In particular, we focus on the decentralized inverter-based secondary voltage control problem by formulating it as a cooperative MARL problem. 
Then, we introduce a novel on-policy MARL algorithm, named PowerNet, where each agent (i.e., each distributed generator (DG)) learns a control policy based on (sub-)global reward, as well as encoded communication messages from its neighbors. Additionally, a novel spatial discount factor is introduced to mitigate the effect of remote agents, expedite the training process and improve scalability. Moreover, a differentiable, learning-based communication protocol is employed to enhance collaboration among adjacent agents. To support com- prehensive training and assessment, we introduce PGSim, an open-source and cutting-edge power grid simulation platform. The evaluation across two microgrid configurations shows that PowerNet not only outperforms conventional model-based control techniques but also several SOTA MARL strategies. Copyright by DONG CHEN 2023 ACKNOWLEDGEMENTS First and foremost, I would like to extend my heartfelt gratitude to my advisor, Dr. Zhaojian Li, for his invaluable advice, unwavering encouragement, inspiring guidance, and continual support throughout my research journey and academic career at Michigan State University. He is always ready to engage in thoughtful discussions about the grand scheme of our research while providing constructive feedback on intricate technical details. Despite his own remarkable creativity and productivity, he generously allowed me the freedom to explore various problems, even those not directly aligned with his own research interests. I also want to express my sincere thanks to Drs. Vaibhav Srivastava, Shaunak D. Bopardikar, and Hamidreza Modares for serving on my thesis committee. I am deeply thankful for the opportunity to collaborate with an exceptional group of colleagues, faculty, and researchers throughout my Ph.D. program. The collaborative work presented in this dissertation would not have been possible without the valuable contributions of Dr. Kaixiang Zhang, Dr. Kaixiang Lin, Mohammad Hajidavalloo, Dr. Kaian Chen, Dr. Yue Wang, Dr. Longsheng Jiang, Dr. Tianshu Chu, Dr. Feng Qiu, Dr. Rui Yao, Dr. Yongqiang Wang, and Dr. Zhaojian Li. Beyond the scope of this thesis, I had the privilege of working alongside other outstanding researchers such as Yu Zheng, Pengyu Chu, Ramin Vahidimoghaddam, Jiajia Li, Lingxuan Hao, Dr. Fengying Dang, Dr. Yanbo Huang, Dr. Yuzhen Lu, Dr. Yichen Zhang, Dr. Shunbo Lei, Dr. Xiaobo Tan, Xinda Qi, and Qianqian Liu. Their insights have enriched my learning experience, and I am grateful for their input. My internships also presented me with wonderful learning and networking opportunities, for which I am grateful. I wish to express my appreciation to Dr. Pei Zheng, Anqi Luo, Tor Fredericks, Dr. Feng Qiu, Dr. Rui Yao, Dr. Yichen Zhang, and Dr. Bo Chen. Special thanks go to Drs. Pei Zheng and Feng Qiu for hosting my internships at T-Mobile USA in 2020 and Argonne National Lab in 2022 and 2023, respectively. Last but certainly not least, owe a profound debt of gratitude to my family, including my cat Huihui, and my friends for their unwavering support and unconditional love. I would v also like to remember and honor my godfather, Mr. Wang, who supported and loved me like a father. His passing last year brought great sorrow. I am forever grateful for his impact on my life and hope that he is at peace and know that his spirit continues to inspire and motivate me. vi TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 PRELIMINARIES OF RL AND MARL . . . . . . . . . . 
. . . . . . 5 CHAPTER 3 DEEP MARL FOR HIGHWAY ON-RAMP MERGING IN MIXED TRAFFIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 CHAPTER 4 CACC WITH FULLY DECENTRALIZED AND COMMUNICATION EFFICIENT MARL . . . . . . . . . . . . . . . 42 CHAPTER 5 DEEP MARL FOR SECONDARY VOLTAGE CONTROL . . . . . . 62 CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 vii CHAPTER 1 INTRODUCTION In this chapter, we first introduce the motivation of this thesis and the challenges for applying MARL for NSC. Then, we illustrate the specific research objectives of this thesis and provide a summary of the key contributions. 1.1 Motivation Recently, intelligent systems, such as robots, connected automated vehicles, and smart grids have emerged as promising tools to enhance efficiency and sustainability across diverse areas, including intelligent transportation, industrial automation, and energy management [142]. These systems can connect with local communication networks, forming connected systems, and showing high scalability and robustness [68]. Yet, controlling these connected systems presents great challenges, mainly due to the high dimensionality of their state/action spaces and the complex interactions among their components. Traditional control methods often struggle in the real-time management of these systems, given their inherent complexity and uncertainties [69]. Fortunately, reinforcement learning (RL) [96], especially multi-agent reinforcement learning (MARL) [129, 26], offers an effective solution by leveraging adaptive online capabilities and their proficiency in solving intricate problems. 1.2 Challenges In MARL For NSC Implementing MARL for networked system control (NSC) presents significant challenges including scalability due to the exponential growth of the state-action space with a large volume of agents [49], ensuring safety as interactions between autonomous agents can lead to unforeseen and potentially harmful behavior [39], improving communication efficiency which is essential for cooperative decision-making [26]. In addition, the lack of realistic simulators for MARL also makes it difficult to accurately predict and analyze system behavior in complex, real-world scenarios [103]. These challenges highlight the intricate trade-offs between scalability, safety, communication efficiency, and realistic simulation in 1 implementing MARL for NSC. 1.3 Research Goals The principal objective of this thesis is to develop safe, efficient, and scalable MARL algorithms for networked system control (NSC) with specific applications for connected and automated vehicles (CAVs) and smart grids. The detailed objectives are as follows: 1. The first objective is to create scalable MARL algorithms for large-scale NSC. This goal will be achieved by developing a parameter-sharing-based MARL algorithm, where agents independently make decisions based on their own local observations, yet share the same set of parameters. This approach is designed to facilitate scalability in large- scale networked environments, such as CAVs. 2. The second objective is to investigate the development of fully decentralized MARL algorithms for NSC, eliminating the need for a central controller during the train- ing process. Furthermore, a cooperative learning scheme will be proposed, aiming to enhance cooperation and coordination among the agents. 3. 
The third objective is to develop efficient communication protocols to promote com- munication efficiency among agents. The designed communication protocol should maintain the control performance of the MARL algorithms while enhancing efficiency. 4. The final objective is to implement highly realistic simulators for developing and testing MARL algorithms. The simulators should accurately model the characteristics of real- world objectives, providing robust platforms for the development and refinement of MARL techniques. 1.4 Contributions The principal contributions of this thesis are outlined as follows: 1. A safe, efficient, and scalable MARL framework is developed for on-ramp merging in mixed traffic [21]. In addition, a novel priority-based safety supervisor is incorporated into the MARL framework to significantly reduce the collision rate and expedite the training process. 2 2. A gym-like simulation environment for on-ramp merging is developed and open-sourced with three different traffic density levels (https://github.com/DongChen06/MARL_C AVs), including five SOTA MARL algorithms. 3. A fully-decentralized MARL framework is introduced for Cooperative Adaptive Cruise Control (CACC) without the need for a central controller. In addition, a quantization- based communication protocol is developed to enhance communication efficiency by applying random quantization to the messages being communicated and ensuring that critical information is transmitted with minimized bandwidth usage. 4. The adopted gym-like simulation environment for the specified CACC scenarios is open-sourced with 7 state-of-the-art MARL benchmarks (https://github.com/DongC hen06/CACC_MARL). 5. An efficient MARL algorithm is developed specifically for cooperative secondary volt- age control within power grids, where each agent (i.e., each DG) formulates a control policy based on (sub-)global reward and coded messages from its adjacent neighbors. Additionally, we introduce an innovative spatial discount factor, designed to minimize interference from distant agents, accelerate training, and bolster scalability. Moreover, a differentiable, learning-based communication protocol is developed to strengthen co- ordination among neighboring agents. 6. An open-source software, called PGSim, that offers a highly efficient and high-fidelity simulation platform for power grids is developed and open-sourced (https://github.c om/DongChen06/PGSIM). 1.5 Organization Of Thesis The subsequent chapters of this thesis are organized as follows: Chapter 2: Background This chapter provides a comprehensive introduction to the fundamentals of Reinforce- ment Learning (RL) and delves into several SOTA Multi-Agent Reinforcement Learning (MARL) algorithms, setting the stage for a deeper understanding and contextualization of 3 our proposed research. Chapter 3: Deep MARL for Highway On-ramp Merging in Mixed Traffic [21] In this chapter, we propose an efficient and scalable MARL framework for on-ramp merg- ing in mixed traffic, leveraging parameter sharing and local rewards to encourage cooperation between agents, while still maintaining impressive scalability. In addition, a novel priority- based safety supervisor is introduced to mitigate the collision rates. Chapter 4: CACC with Fully Decentralized and Communication-efficient MARL [24] In this chapter, we propose a fully-decentralized MARL framework for Cooperative Adap- tive Cruise Control (CACC). 
Furthermore, a quantization-based communication protocol is introduced to enhance communication efficiency. Chapter 5: Deep MARL for Secondary Voltage Control [20] In this chapter, we propose an efficient MARL algorithm for the decentralized inverter- based secondary voltage control problem. A novel on-policy MARL algorithm, named Pow- erNet, is introduced where each agent (i.e., each DG) learns a control policy based on (sub- )global reward, as well as encoded communication messages from its neighbors. Chapter 6: Conclusion This chapter concludes the thesis and discusses its limitations. 4 CHAPTER 2 PRELIMINARIES OF RL AND MARL In this chapter, we provide a comprehensive introduction to the fundamentals of Rein- forcement Learning (RL) and delve into several state-of-the-art Multi-Agent Reinforcement Learning (MARL) algorithms, providing the necessary context to properly position and un- derstand our proposed work. 2.1 Reinforcement Learning (RL) Reinforcement learning (RL), which is often mathematically formulated as a Markov Decision Process (MDP), has emerged as a promising data-driven method for sequential decision-making [26, 31]. The recent advances in Deep Neural Networks (DNNs) have further enhanced the capabilities of RL in handling intricate tasks. Notable examples of these algorithms include the deep Q-network (DQN [97]), deep deterministic policy gradient (DDPG [79]), and advantage actor-critic (A2C [95]). For instance, AlphaGo, a computer program based on DQN, made history as the first of its kind to defeat a professional hu- man player in Go, even going on to beat a world champion in the game [97]. Moreover, in [73], researchers successfully employ the Trust Region Policy Optimization (TRPO, [117]) to navigate Quadrupedal robots across arduous terrains, including challenging surfaces like mud and snow, and through dynamic footholds. In an RL framework (see Figure 2.1), the learner (i.e., agent) navigates the environment by adopting a trial-and-error methodology. The agent makes decisions, performs the action to the environment, and in return, receives a reward signal accompanied by a new state. This reward, provided by the environment, provides feedback to the agent, indicating whether the impact of its actions is positive or negative. Mathematically, we can formulate the RL problem as a Markov Decision Process(MDP). The MDP M = (S, A, P, R) is defined as follows: 1. State space S: a set of states that includes the comprehensive description of the en- vironment provided by the environment itself, which outlines the position or other 5 conditions of the agents at a specific time step t. An observation o is a partial de- scription of a state and may not include all information. If an agent can access the whole state of the environment, then the environment is fully observed. However, if the agent is only able to acquire a partial observation, the environment is considered as partially observed. 2. Action space A: a set of all valid actions within a specific environment. Some environ- ments, like Atari and Go, have discrete action spaces, where the agent has a finite number of available moves. In contrast, environments like those controlling a robot in a physical world possess continuous action spaces. In continuous action spaces, actions take the form of real-valued vectors. 3. Transition Probability Pss′ (St+1 = s′ |St = s): the transition probability describes the likelihood of an agent moving from one state to another. 4. 
Reward R(st, at, st+1): the reward is returned by the environment once the action at is executed at state st. The value of the reward signal can be positive or negative, contingent on the actions of the agent.

Figure 2.1 Illustration of reinforcement learning (RL).

As shown in Figure 2.1, at time step t, the agent observes the state st ∈ S ⊆ R^n from the environment and executes an action at ∈ A ⊆ R^m. The RL agent selects actions guided by a learned policy π(at|st). In the context of deep RL, the policy is frequently parameterized by a function approximator, denoted as πθ(·|st), where θ represents the learnable parameters of the approximator, such as a DNN or a Q-table. The environment then evolves to the new state st+1 based on the transition dynamics p(·|st, at) and returns an immediate reward rt = r(st, at, st+1) to the agent. The objective of an RL agent is to learn an optimal policy π*: S → A that maps states to actions and maximizes the accumulated reward:

R_t = \sum_{k=0}^{T} \gamma^k r_{t+k},    (2.1)

where rt+k is the reward at time step t + k, and γ ∈ (0, 1] and T represent the discount factor and the episode length, respectively. The state-action value function is defined as:

Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[ R_t \mid s_t = s, a_t = a \right],    (2.2)

where τ = (s0, a0, s1, a1, ..., sT, aT) represents a trajectory containing a sequence of states and actions. The state-action value function represents the expected return when starting from state st, taking an immediate action at, and following policy π afterward. The optimal Q-function determines the optimal greedy policy π*(at|st) and is defined as:

Q^*(s_t, a_t) = \max_{\pi} Q^\pi(s_t, a_t).    (2.3)

The state value function V^π(st) represents the expected return when starting from st and immediately following policy π:

V^\pi(s_t) = \mathbb{E}_{\tau \sim \pi}\left[ R_t \mid s_t = s \right].    (2.4)

The relationship between the action-value function Q^π(st, at) and the state-value function V^π(st) can be expressed as:

V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s, a) \right].    (2.5)

In the subsequent subsections, we delve into three prevalent RL algorithms: deep Q-learning, policy gradient, and the actor-critic network.

2.1.1 Deep Q-Learning

In Q-learning, the Q-function, denoted as Qθ, is parameterized by a set of parameters θ. This can be achieved with function approximators ranging from Q-tables [141] and linear regression (LR) [128] to the more powerful deep neural networks (DNNs) [97]. The temporal difference, defined as (T Qθ⁻ − Qθ)(st, at), provides the basis for updating θ, where T and Qθ⁻ denote the dynamic programming (DP) operator and a recently frozen model θ⁻ [28, 122], respectively. To reduce the variance in estimating Q-values and improve exploration, techniques such as ϵ-greedy exploration and experience replay are commonly integrated into deep Q-learning [128]. The optimal action a*(s) is obtained from Q*(st = s, at = a) as:

a^*(s) = \arg\max_{a} Q^*(s_t = s, a_t = a).    (2.6)

Widely recognized deep Q-learning algorithms include DQN [96], DDQN [133], C51 [9], HER [5], and HR-DQN [30].
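To make these definitions concrete, the following minimal tabular Q-learning sketch shows the discounted return of Eq. (2.1) being bootstrapped through a temporal-difference target, together with the ϵ-greedy exploration and the greedy action selection of Eq. (2.6). The environment interface follows the classic gym convention; the env handle and the state/action sizes are placeholders rather than any specific simulator used in this thesis.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative only)."""
    Q = np.zeros((n_states, n_actions))              # Q_theta stored as a Q-table
    for _ in range(episodes):
        s = env.reset()                              # classic gym API assumed
        done = False
        while not done:
            if np.random.rand() < epsilon:           # epsilon-greedy exploration
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))             # greedy action a*(s), Eq. (2.6)
            s_next, r, done, _ = env.step(a)
            # TD target r + gamma * max_a' Q(s', a'): a bootstrapped estimate of Eq. (2.1)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])    # temporal-difference update
            s = s_next
    return Q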
2.1.2 Policy Gradient

Unlike Q-learning, the policy gradient method parameterizes the policy πθ directly with a set of parameters θ. The objective of updating θ is to increase both the likelihood of the actions taken and the cumulative reward. This can be achieved with the loss function:

\nabla_\theta L(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R_t \right].    (2.7)

Following this, the policy network parameters are updated via stochastic gradient ascent:

\theta_{k+1} = \theta_k + \alpha \nabla_\theta L(\pi_\theta).    (2.8)

Compared to Q-learning, the policy gradient is robust to non-stationary transitions within each trajectory; however, it tends to exhibit high variance [29]. Renowned algorithms that utilize policy gradients include TRPO [117], PPO [119], and DDPG [79].

2.1.3 Actor-critic Network

To mitigate the high variance associated with the sampled return in the policy gradient method, actor-critic algorithms such as A2C [95] adopt the advantage function to refine the policy gradient, leveraging both policy (actor) and value (critic) functions. The advantage function is represented as:

A^\pi(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V_w(s_t).    (2.9)

Here, the parameters θ are updated via the policy loss function:

\nabla_\theta L = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) A_t \right],    (2.10)

while the value function is updated with:

L = \min_{w} \mathbb{E}_{D}\left[ \left( R_t + \gamma V_{w^-}(s_{t+1}) - V_w(s_t) \right)^2 \right],    (2.11)

where D represents the experience replay buffer, which aggregates past experiences. This buffer works alongside parameters derived from previous iterations, typically used in a target network [21]. However, despite great advancements, single-agent RL often struggles with scalability, especially in real-world control scenarios with multiple agents. These challenges come from the inherent non-stationarity and the partial observability intrinsic to such systems [26].

2.2 Multi-agent Reinforcement Learning (MARL)

Multi-agent systems, often found in day-to-day applications such as drone delivery, smart grids, autonomous driving, and multi-robot assembly, consist of numerous agents interacting within a shared environment. Generally, these systems can be delineated into three distinct categories based on their team objectives:

1. Cooperative: In a cooperative setting, all agents collaboratively work towards maximizing a shared team reward. In such environments, coordination among agents becomes vital to ensure effective collaboration.

2. Competitive: In competitive settings, agents operate independently and are primarily self-interested. They frequently have independent objectives and endeavor to maximize their individual rewards.

3. Mixed: In a mixed environment, agents may be self-interested with different (but not opposing) objectives.

Real-world applications such as autonomous driving, smart grids, and traffic light control often require agents to work cooperatively, aiming for a shared goal. The agents can work independently or communicate with each other through local communication channels (see Figure 2.2). Given the importance of collaboration in such scenarios, this thesis primarily focuses on algorithms designed for cooperative settings. In the following subsections, we introduce some SOTA MARL algorithms.

Figure 2.2 Illustration of multi-agent reinforcement learning (MARL): left, framework without communication; right, framework with communication.

2.2.1 Independent MARL

To address the scalability issues common in single-agent RL, independent multi-agent RL (MARL) has been introduced. In this approach, each agent learns and adapts its unique policy based on its local observations and rewards [29].
A prominent and simplistic method- ology in this domain is Independent Q-learning (IQL) [129]. In IQL, each local Q-function predominantly relies on the local action, represented as Qi (s, a) ≈ Qi (s, ai ). Along similar lines, the Independent Advantage Actor-Critic (IA2C) serves as an actor-critic version of 10 MARL, as proposed by [28]. While both IQL and IA2C offer highly scalable solutions, they grapple with challenges posed by partial observability and non-stationary Markov Decision Processes (MDP). This is mainly attributed to their intrinsic assumption: the behaviors of all other agents are perceived as part of the environment’s dynamics. This becomes problematic considering these agents’ policies take continual updates during training [29]. 2.2.2 Cooperative MARL To address the non-stationary issues common in MARL, [152] decentralizes the critic, allowing it to take both global observations and actions, followed by consensus updates. Though this method does away with the need for a centralized controller during training, it still requires access to global information. The challenge of partial observability in MARL has led to various studies focusing on the potential of communication. For instance, FPrint [43] investigates the impact of direct communication between agents, demonstrating that shar- ing low-dimensional policy fingerprints can enhance performance. DIAL [42], on the other hand, has each DQN agent produce a communication message in tandem with action-value estimation. This message is subsequently encoded and combined with other input signals on the receiving end. A different approach, CommNet [126], presents a broader communi- cation protocol but simplistically calculates the mean of all messages instead of encoding them. The NeurComm strategy [26] introduces a learnable communication protocol, where messages are intricately encoded and concatenated to curtail information loss. Despite their innovations, these methods commonly adopt a centralized critic network during training. Additionally, their communication messages, whether raw or encoded, are often network parameters, leading to significant data transmission. This can burden com- munication channels due to the large volume of transmitted messages. Moreover, even with safety considerations embedded in their reward functions, the safety of these algorithms re- mains unassured. It’s common practice to first implement MARL algorithms in realistic simulators before transitioning to real-world deployment. Hence, the development of such 11 simulators is paramount. Bearing all these factors in mind, we will delve into our own groundbreaking MARL algorithms. 12 CHAPTER 3 DEEP MARL FOR HIGHWAY ON-RAMP MERGING IN MIXED TRAFFIC In this chapter, we introduce our first exploration of applying MARL for managing on- ramp merging in mixed traffic including both human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs), which is one of the most challenging scenarios in the realm of autonomous driving. Figure 3.1 An illustration of the on-ramp merging scenario. Connected Autonomous Vehicles (CAVs) are denoted in blue, and Human-Driven Vehicles (HDVs) are denoted in green. Both types of vehicles coexist on the ramp and through lanes. 3.1 Background Over the past decade, autonomous vehicle (AV) technologies such as Tesla’s Autopilot [2] and Baidu’s Apollo [1] have witnessed substantial advances, leading to their deployment in (semi-)autonomous vehicles navigating real-world roads. 
However, alongside this progress, there has been a noticeable increase in traffic accidents involving AVs [35, 40]. These incidents are frequently caused by the inability to adapt to dynamic driving environments, especially in mixed traffic conditions with both AVs and human-driven vehicles (HDVs) sharing the road. In these scenarios, AVs must not only respond to static and moving road obstacles but also interpret and predict HDV behaviors. Among the many challenging driving situations, highway on-ramp merging stands out as especially challenging for AVs [99, 72], which is the topic of this chapter. Figure 3.1 illustrates the on-ramp merging scenario under consideration, where we con- sider a common setting where both autonomous vehicles (AVs) and human-driven vehicles 13 (HDVs) navigate and coexist on the merge and through lanes. For a successful merging maneuver, on-ramp vehicles must efficiently merge into the through lane without causing collisions. In an ideal cooperative setting, vehicles on the through lane proactively adjust their speeds, either decelerating or accelerating, to create sufficient space for the on-ramp vehicles to merge smoothly, whereas on-ramp vehicles, in turn, should regulate their speeds and time to ensure safe merging, eliminating the risk of deadlocks situations [67, 11] at the merge point. Clearly, coordination between vehicles is crucial to facilitate safe and efficient merging. While this is relatively easy to achieve in a full-AV scenario, AV coordination in the presence of HDVs is a significantly more intricate task. Several methods have been proposed to tackle the automated merging problem, including rule-based and optimization-based approaches [113, 60, 58, 81]. Rule-based strategies utilize heuristics and predefined rules to steer autonomous vehicles (AVs) [60, 58]. Although these are effective in easy traffic scenarios, they become impractical for intricate merging scenarios [19]. In an optimal control setting, vehicle interactions between vehicles are perceived as a dynamic system where controlled vehicles’ actions serve as inputs. For instance, a model predictive control (MPC) technique has been developed to navigate an AV through a paral- lel ramp merge [19]. Despite promising results, MPC methods depend on precise dynamic merging models (including those of human drivers) and are computationally intensive due to necessary online optimizations at every time step [112]. Comprehensive reviews of model- based control strategies for on-ramp merging can be found in [104, 105, 106]. Nonethe- less, these methods were primarily developed for fully automated vehicles, making them unsuitable for mixed-traffic situations. In addition, gap acceptance theory has also been investigated for merging behavior [57, 90], emphasizing intricate modeling features for all traffic entities. However, this becomes problematic when traditional vehicles exhibit variable behaviors, especially existing numerous concurrent CAVs [38]. On the other hand, data-driven approaches, especially reinforcement learning (RL), have gained increased attention and been explored for AV highway merging [81, 89]. Notably, 14 [81] employs a multi-objective reward function centered on safety and jerk minimization for AV merging. To address the RL challenge, the Deep Deterministic Policy Gradient (DDPG) algorithm [79] is leveraged. In [89], RL and MPC are fused to enhance learning efficiency, which achieves a good balance between passenger comfort, crash rate, efficiency, and robustness. 
Nonetheless, these techniques are primarily conceptualized for individual AV, with other vehicles merely considered as part of environmental elements. In this chapter, we explore a specific scenario as illustrated in Figure 3.1, where multiple AVs adaptively engage with HDVs and work to successfully merge, aiming to optimize traffic flow while ensuring safety. This scenario naturally extends the single-agent RL to a more ex- pansive multi-agent reinforcement learning (MARL) framework. Within this paradigm, AVs collaboratively learn control policies to realize the aforementioned objectives. However, this is a challenging task due to dynamic connectivity topology, sophisticated motion dynamics involving AV-coupled behaviors, and intricate decision-making processes. This complexity is even more pronounced when human drivers are involved. While several MARL techniques have been formulated for CAVs in scenarios like car- following and lane overtaking scenarios [137, 102, 64, 53, 12, 36, 150]. However, to the best of our knowledge, no MARL algorithm has been proposed for the considered highway on- ramp merging scenario. In this work, we develop a novel decentralized MARL framework, empowering AVs to adeptly learn and execute a safe and efficient merging policy applicable to vehicles on both lanes of the highway. To enhance safety and the learning process, a priority-based safety supervisor that leverages sequential and multi-step forecasting is pro- posed. Furthermore, we explore parameter sharing and localized rewards to enhance inter- agent collaboration, ensuring optimal scalability. The main contributions and the technical advancements of this chapter are detailed below. 1. Problem Formulation & Simulation Platform: We formulate the mixed-traffic on-ramp merging scenario, where AVs and HDVs coexist on both the ramp and through lanes, into a decentralized MARL framework. Our approach is tailored to accommodate 15 dynamic environments with a dynamic connectivity topology. A corresponding gym- like simulation platform with three different traffic density levels is made publicly accessible1 . 2. MARL Algorithm & Safety Supervisor: We develop a novel, efficient, and scalable MARL algorithm, featuring an effective reward function design, parameter-sharing mechanism, and action masking. In addition, we have also integrated a priority-driven safety supervisor into the MARL framework, which significantly reduces collision rates during training, and subsequently enhances learning efficiency. 3. Curriculum Learning: By employing curriculum learning, we expedite the learn- ing process for more intricate tasks by utilizing models pre-trained on simpler traffic scenarios. 4. Experiments & Performance Metrics: Extensive experiments are conducted to evaluate our approach, showing that our method consistently outperforms several SOTA algorithms, especially in driving safety and operational efficiency metrics. The subsequent sections are structured as follows. The problem formulation and our innovative MARL framework are described in Section 3.2 whereas the priority-based safety supervisor is described in Section 3.3. A detailed exposition of our experiments, findings, and discussions is presented in Section 3.4. We conclude the chapter and discuss future works in Section 3.5. 3.2 On-ramp Merging As MARL In this section, we characterize the on-ramp merging scenario as a partially observable Markov decision process (POMDP) [55]. 
Subsequently, we introduce our actor-critic-based MARL approach, featuring a parameter-sharing mechanism, an effective reward function design, and action masking, to navigate the challenges of the devised POMDP. For clarity, this approach is referred to as the "baseline method" in Section 3.5.

1 See https://github.com/DongChen06/MARL_CAVs

Figure 3.2 Schematics of the system and simulation setup without the safety supervisor, in which actions from the MARL agents are sent directly to the low-level controller.

3.2.1 MARL Formulation

In this chapter, we conceptualize the mixed-traffic on-ramp merging environment as a model-free multi-agent network [29, 23], denoted by G = (ν, ε). Here, each agent i ∈ ν interacts with its neighboring agents, defined as Ni := {j | εij ∈ ε}, via the edge connections εij, i ≠ j. The combined state space and action space for all agents are represented as S := ×_{i∈ν} Si and A := ×_{i∈ν} Ai. The intrinsic dynamics of this system are encapsulated by the state transition distribution P, which maps S × A × S to [0, 1]. We adopt a decentralized MARL paradigm, wherein each agent i (AV i) perceives only a part of the entire environment, specifically its immediate surroundings. This is consistent with the reality that AVs can only sense or communicate with vehicles in close vicinity, making the overall dynamical system a POMDP MG. This POMDP can be comprehensively represented by the tuple ({Ai, Si, Ri}_{i⊆ν}, T ), which is described as follows:

1. Action Space: The action space of agent i, denoted by Ai, represents the set of potential high-level control decisions. These decisions include turn left, turn right, cruising, speed up, and slow down, following the designs in [77, 22]. Once high-level actions are chosen, lower-level controllers generate the relevant steering and throttle commands to guide the autonomous vehicles (AVs). The system and simulation setup is illustrated in Figure 3.2. The joint action space A of all AVs is the Cartesian product of the individual action spaces: A = A1 × A2 × · · · × AN.

2. State Space: The state of agent i, Si, is conceptualized as an N_{Ni} × W matrix, wherein N_{Ni} represents the number of vehicles observable by the agent and W refers to the attributes that describe a vehicle's state. Key attributes include: is_present, a binary variable indicating the presence of a vehicle within the sensing range of the ego vehicle; x_l, the longitudinal position of the detected vehicle relative to the ego vehicle; y, the lateral position of the detected vehicle relative to the ego vehicle; v_x, the longitudinal velocity of the observed vehicle relative to the ego vehicle; and v_y, the lateral velocity of the observed vehicle relative to the ego vehicle. Based on the proximity principle, the ego vehicle can only perceive its "neighboring vehicles", which comprise the nearest N_{Ni} vehicles within a longitudinal range of 150 meters from the ego vehicle [150]. In the considered on-ramp merging case shown in Figure 3.1, we found that N_{Ni} = 5 achieves the best performance. The overall state space of the system, S, aggregates the individual states of all agents: S = S1 × S2 × · · · × SN.

3. Reward Function: The reward function Ri is crucial to train the RL agents to follow preferred behaviors.
In the on-ramp merging context, the agent's objective is to navigate through the merging zone efficiently and safely. The reward function for an agent at a given time step t is formulated as:

r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_m r_m,    (3.1)

where wc, ws, wh, and wm are the coefficients that respectively weight the collision evaluation, stable-speed evaluation, headway time evaluation, and merging cost evaluation. Given the paramount importance of safety, we make wc significantly higher than the other coefficients. The performance under different wc values is discussed in Section 3.4.1 and Table 3.1. The four evaluation metrics are specified as follows (a short sketch of these terms is given at the end of this subsection): rc penalizes collisions; it is set to -1 in case of a collision and 0 otherwise. The speed evaluation rs is defined as:

r_s = \min\left( \frac{v_t - v_{\min}}{v_{\max} - v_{\min}},\, 1 \right),    (3.2)

where vt is the current speed of the ego vehicle. Combining the speed recommendation from the US Department of Transportation (20-30 m/s [101]) and the speed range observed in the Next Generation Simulation (NGSIM) dataset2 (minimum speeds of 6-8 m/s [131]), we set the minimum and maximum speeds of the ego vehicle as vmin = 10 m/s and vmax = 30 m/s, respectively. The time headway evaluation is defined as:

r_h = \log\frac{d_{\text{headway}}}{t_h v_t},    (3.3)

where dheadway and th are the distance headway and the predefined time headway threshold, respectively. Thus, the ego vehicle is penalized when the time headway is less than th and rewarded only when the time headway is greater than th. In this chapter, we choose th as 1.2 s, as suggested in [6]. The merging cost rm is designed to penalize the waiting time on the merge lane to avoid deadlocks [16]. Here we adopt r_m = -\exp(-(x - L)^2 / 10L), where x is the distance the ego vehicle has traveled on the ramp and L is the length of the ramp (see Figure 3.1). The merging cost function is plotted in Figure 3.3, which shows that the penalty increases as the ego vehicle moves closer to the end of the merge lane.

2 See https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm

Figure 3.3 Illustration of the designed merging reward/penalty.

4. Transition Probabilities: The transition probability T (s′|s, a) captures the system's dynamics. The simulator that we have devised uses the Intelligent Driver Model (IDM) [132] and the MOBIL model [66] for the longitudinal acceleration and lateral lane-change decisions, respectively, of human-driven vehicles (HDVs). The high-level decisions of AVs are made by the MARL algorithm and are tracked by the lower-level (PID) controller (see Figure 3.2). The system leverages a kinematic bicycle model to trace vehicle trajectories. Importantly, our MARL approach does not require prior knowledge of the transition probability.

This section offers a comprehensive framework for a decentralized MARL algorithm specified for the mixed-traffic on-ramp merging scenario. The elements detailed above jointly contribute to the MARL system's ability to make efficient and safe merging decisions in real time.
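As a concrete illustration of the reward terms above, the snippet below evaluates Eqs. (3.1)-(3.3) and the merging cost for a single agent at one time step. The coefficient values are those reported later in Section 3.4; the function signature and argument names are illustrative and not part of the released simulation code.

import numpy as np

def agent_reward(collided, v, headway, x_on_ramp, on_ramp,
                 v_min=10.0, v_max=30.0, t_h=1.2, L=100.0,
                 w_c=200.0, w_s=1.0, w_h=4.0, w_m=4.0):
    """Per-agent reward of Eq. (3.1) from its four evaluation terms (sketch)."""
    r_c = -1.0 if collided else 0.0                          # collision penalty
    r_s = min((v - v_min) / (v_max - v_min), 1.0)            # Eq. (3.2): speed term
    r_h = np.log(headway / (t_h * v))                        # Eq. (3.3): time-headway term
    # merging cost: only vehicles still on the ramp are penalized
    r_m = -np.exp(-(x_on_ramp - L) ** 2 / (10 * L)) if on_ramp else 0.0
    return w_c * r_c + w_s * r_s + w_h * r_h + w_m * r_m     # Eq. (3.1)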
3.2.2 MA2C For CAVs

In the realm of cooperative MARL, the overall goal is to optimize the cumulative reward, represented as R_{g,t} = \sum_{i=1}^{N} r_{i,t}. Ideally, each agent would be assigned the same average global reward R_t = \frac{1}{N} R_{g,t} during training, i.e., r_{1,t} = r_{2,t} = · · · = r_{N,t}. Yet, this uniform allocation does not genuinely reflect the individual contributions of each vehicle and introduces complications [144, 136]. Two key challenges emerge:

1. First, aggregating rewards on a global scale can introduce significant latency and increase communication overhead, which is impractical for real-time systems like AVs;

2. Second, relying only on a global reward leads to the notorious credit assignment problem [127], which can hamper learning efficiency and constrict the scalability needed to accommodate numerous agents.

Given these challenges, this work employs a more localized reward design. In this framework, the reward assigned to a particular agent, say the ith agent at time t, is defined as:

r_{i,t} = \frac{1}{|\nu_i|} \sum_{j \in \nu_i} r_{j,t},    (3.4)

where νi = i ∪ Ni is the set containing the ego vehicle and its neighbors, and | · | denotes set cardinality. This method, founded on local rewards, inherently focuses on the agents most related to the success or failure of a task [80, 7]. Such an approach is suitable for vehicular scenarios, in which vehicles are predominantly affected by their immediate surroundings, with distant vehicles exerting minimal influence.

Figure 3.4 Architecture overview of the proposed network: layer dimensions are indicated in parentheses. "w/o" and "w/" stand for "without" and "with", respectively.

The network backbone utilized is depicted in Figure 3.4. Both the actor and critic networks leverage the same foundational representations, consequently combining the policy loss and the value function error loss into a unified loss function [119]. Given these shared parameters, the overall loss function takes the following form:

J(\theta_i) = J^{\pi_{\theta_i}} - \beta_1 J^{V_{\phi_i}} + \beta_2 H(\pi_{\theta_i}(s_t)),    (3.5)

where β1 and β2 are the weighting coefficients corresponding to the value function loss and the entropy regularization term, respectively. The entropy term, H(\pi_{\theta_i}(s_t)) = \mathbb{E}_{\pi_{\theta_i}}[-\log(\pi_{\theta_i}(s_t))], is introduced to enhance agent exploration of new states [143, 119]. The policy gradient term can be expressed as:

\nabla_{\theta_i} J^{\pi_{\theta_i}} = \mathbb{E}_{\pi_{\theta_i}}\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}|s_{i,t})\, A_{i,t}^{\pi_{\theta_i}} \right],    (3.6)

where A_{i,t}^{\pi_{\theta_i}} = r_{i,t} + \gamma V^{\pi_{\phi_i}}(s_{i,t+1}) - V^{\pi_{\phi_i}}(s_{i,t}) is the advantage function and V^{\pi_{\phi_i}}(s_{i,t}) is the state value function. The loss for updating the state value V_{\phi_i} is:

J^{V_{\phi_i}} = \min_{\phi_i} \mathbb{E}_{D_i}\left[ \left( r_{i,t} + \gamma V_{\phi_i}(s_{i,t+1}) - V_{\phi_i}(s_{i,t}) \right)^2 \right].    (3.7)

Distinct experience replay buffers are allocated for each agent, but a unified policy network, sharing the same parameters, is updated across agents. Such an approach is effective for training a universal policy applicable to both on-ramp and through-lane AVs [64, 80]. Furthermore, minibatches of trajectory samples are utilized to refine the network parameters via Eq. 3.5, aiming to reduce variance.
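The combined objective in Eqs. (3.5)-(3.7) can be written compactly as a single loss. The following PyTorch-style sketch is a minimal illustration, assuming the bootstrapped targets r_{i,t} + γV(s_{i,t+1}) have already been computed from a rollout; the tensor names and shapes are illustrative, not the exact implementation used in our experiments.

import torch
import torch.nn.functional as F

def ma2c_loss(logits, values, actions, targets, beta1=1.0, beta2=0.01):
    """Combined actor-critic loss mirroring Eqs. (3.5)-(3.7) (illustrative sketch).

    logits:  (T, n_actions) actor outputs for one agent's rollout
    values:  (T,) critic estimates V(s_t)
    actions: (T,) sampled action indices (long tensor)
    targets: (T,) bootstrapped targets r_t + gamma * V(s_{t+1})
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    advantages = (targets - values).detach()           # A_t in Eq. (3.6)
    policy_loss = -(log_pi_a * advantages).mean()      # negative of J^pi
    value_loss = F.mse_loss(values, targets)           # Eq. (3.7)
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # H(pi), encourages exploration

    # Eq. (3.5) expressed as a quantity to minimize
    return policy_loss + beta1 * value_loss - beta2 * entropy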
3.2.3 Deep Neural Network (DNN) Configuration

Figure 3.4 describes our deep neural network's architecture. To enhance scalability and robustness, the observations si,t are categorized based on their physical units. For example, si,t is segmented into three distinct groups: s1_{i,t}, s2_{i,t}, and s3_{i,t}, corresponding to presence states, position states, and speed states, respectively. Each subset is encoded by an individual fully connected (FC) layer, and subsequently the encoded states are concatenated into a single vector. This vector is processed by a 128-neuron FC layer, whose output serves both the actor and critic networks.

Under the standard configuration, the actor network produces logits li, which are passed through a Softmax layer to output a probability distribution, denoted by πθi(si) = softmax([l1, l2, l3, l4, l5]). Actions are then sampled from this distribution: ai ∼ πθi(si). However, this approach faces challenges. Firstly, invalid or unsafe actions are also assigned non-zero probabilities. With a stochastic policy, these unsafe actions can inadvertently be sampled during training, risking system malfunctions or potentially catastrophic outcomes. Secondly, the sampling of these unsafe actions obstructs efficient policy training, since it leads to erroneous policy updates [59]; such updates are misleading because they are rooted in experiences linked with invalid actions. To mitigate these challenges, we utilize an invalid action masking technique [80] that effectively "filters out" inappropriate actions, allowing sampling exclusively from valid actions. As illustrated in Figure 3.4, an invalid action mask is obtained from the environment (e.g., based on the traffic scenario), where "0" flags an invalid action and "1" signifies a valid one. The logits corresponding to the invalid actions are substituted with substantially negative values, such as −1e8. Consequently, the Softmax layer assigns these actions probabilities near zero, ensuring they are seldom sampled. This method effectively "renormalizes the probability distribution" [130] (see the sketch following this subsection). In our research, we identify two main invalid actions:

1. The ego vehicle attempting a lane change to a non-existent lane, e.g., aiming for a left turn while already in the leftmost lane;

2. The ego vehicle adjusting its speed (acceleration or deceleration) beyond the predefined speed limits.

It is worth noting that these are foundational invalid actions. Additional unsafe actions undergo rigorous verification and are governed by the priority-based safety supervisor discussed in Section 3.3.
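A minimal sketch of the masking step is given below, assuming the five-dimensional logits and the {0, 1} mask of Figure 3.4; the helper name is illustrative rather than part of the released code.

import torch

def masked_action_distribution(logits, valid_mask, neg_inf=-1e8):
    """Push logits of invalid actions to a very negative value before the Softmax,
    so that invalid actions are sampled with probability close to zero."""
    # logits: (n_actions,), valid_mask: (n_actions,) with entries in {0, 1}
    masked_logits = torch.where(valid_mask.bool(), logits,
                                torch.full_like(logits, neg_inf))
    probs = torch.softmax(masked_logits, dim=-1)      # renormalized distribution
    return torch.distributions.Categorical(probs=probs)

# usage: action = masked_action_distribution(logits, mask).sample()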
Figure 3.5 Schematics of the system and simulation setup with the safety supervisor, in which only safe actions from the MARL agents are sent to the low-level controller.

3.3 Priority-based Safety Enhancement

While obvious invalid actions can be avoided using the rule-based action masking scheme described above, it cannot prevent inter-vehicle or vehicle-obstacle collisions. Therefore, a more comprehensive safety supervisor is needed to deal with collisions in complex, dynamic, and cluttered mixed-traffic environments. To address this, we introduce a novel safety-enhancement strategy leveraging vehicle dynamics and multi-step predictions. The strategy aims to forecast any potential collisions within a prediction horizon Tn and adjust unsafe exploratory actions. Given mixed traffic, inclusive of HDVs, a robust model is paramount to predict human driver decisions. We deploy IDM [132] to forecast HDVs' longitudinal acceleration based on the current speed and preceding distance, while the MOBIL lane-change model [66] predicts the lane-change behaviors of HDVs. The MARL agents define the high-level decisions for the AVs, as described in Section 3.2.1. The high-level acceleration and lane-change decisions are then executed via low-level PID controllers, and the vehicle trajectories are projected employing the kinematic bicycle model [110]. We label these trajectories, steered by high-level decisions, as "motion primitives". The proposed framework and the associated simulation setup are visualized in Figure 3.5.

3.3.1 Establishing Priorities

Using HDV motion models, it is feasible to predict potential collisions within the next Tn steps, considering joint motion primitives from all AVs. Naturally, one could consider using the joint action of all AVs for the safety-enhancement design. However, while it is relatively straightforward to determine potential collisions based on a joint action, it is very computationally costly to determine a joint safe action. Given the action space size of |Ai|^N (with N being the number of AVs), computational demands scale quickly, especially under real-time constraints. As such, we propose a sequential, priority-based safety enhancement mechanism that is computationally efficient and compatible with real-time application. The principle behind this approach is to sequence AVs based on urgency, particularly emphasizing those with critically small safety buffers. For instance, AVs nearing the merging lane's end or those approaching predefined safety distances, such as narrowly defined headway distances, are prioritized. More specifically, priority assignments are considered as follows:

1. AVs on the merging lane hold precedence over those on the through lane due to the pressing nature of their merging objective.

2. AVs approaching the merging lane's end are prioritized higher given their heightened collision and deadlock risks [16].

3. AVs with minimal time headways rank higher due to their increased collision susceptibility with preceding vehicles.

Based on these considerations, the priority index pi for ego vehicle i is formulated as:

p_i = \alpha_1 p_m + \alpha_2 p_d + \alpha_3 p_h + \sigma_i,    (3.8)

where α1, α2, and α3 denote positive weightings for pm (merging priority), pd (distance-to-end metric), and ph (time headway metric), respectively, and σi ∼ N(0, 0.001) introduces a small random variable to prevent identical priority indices across vehicles. The merging priority score pm is defined as:

p_m = \begin{cases} 0.5, & \text{if on the merge lane;} \\ 0, & \text{otherwise,} \end{cases}    (3.9)

which assigns priority scores to vehicles on the merge lane. The distance-to-end priority score pd is defined by:

p_d = \begin{cases} x/L, & \text{if on the merge lane;} \\ 0, & \text{otherwise,} \end{cases}    (3.10)

where x and L refer to the ego vehicle's traveled distance on the ramp and the ramp's length, respectively (refer to Figure 3.1). Lastly, the time headway priority metric ph is derived as p_h = -\log\frac{d_{\text{headway}}}{t_h v_t}, utilizing the time headway definition from Eq. 3.3.
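For illustration, the priority index of Eqs. (3.8)-(3.10) can be computed as in the following sketch; the argument names and the noise magnitude are placeholders consistent with the definitions above, not the simulator's actual interface.

import numpy as np

def priority_index(on_merge_lane, x, L, headway, v, t_h=1.2,
                   alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Priority index p_i of Eq. (3.8) for one ego vehicle (illustrative sketch)."""
    p_m = 0.5 if on_merge_lane else 0.0          # Eq. (3.9): merging priority
    p_d = (x / L) if on_merge_lane else 0.0      # Eq. (3.10): distance-to-end priority
    p_h = -np.log(headway / (t_h * v))           # time-headway priority (negated Eq. 3.3)
    sigma = np.random.normal(0.0, 0.001)         # small zero-mean noise to break ties
    return alpha1 * p_m + alpha2 * p_d + alpha3 * p_h + sigma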
3.3.2 Priority-based Safety Supervisor

In this subsection, we present the design and workings of the proposed priority-based safety supervisor. Each time step t commences with the safety supervisor predicting HDV motions and assigning priority scores to all AVs, as discussed earlier. This process yields a priority list of AVs, denoted Pt, organized in descending order so that the vehicle with the highest priority is in the top position. The AV at the top of the list, indexed by Pt[0], then undergoes a safety check. More specifically, based on the (exploratory) action generated from the action network of vehicle Pt[0], the safety supervisor examines whether the motion primitive induced by the exploratory action will conflict with its neighboring vehicles N_{Pt[0]} (both AVs and HDVs) over a pre-determined time frame Tn, where Tn is a hyper-parameter that can be tuned. HDV motions are gauged using the previously mentioned human-driver decision and vehicle kinematic models, while the remaining AVs (with lower priority scores) are assessed based on their last recorded actions. To detect potential collisions, the system checks whether any two trajectory sequences, each lasting Tn steps, come closer than a defined safety distance. If no collision is detected, vehicle Pt[0] adopts the exploratory action.

Figure 3.6 Illustration of trajectory conflict for Tn = 5 steps.

Conversely, if a potential collision is detected, as depicted in Figure 3.6, the exploratory action is labeled as unsafe and a safer alternative is generated. This safer action maximizes the safety margin, which is captured by:

a'_t = \arg\max_{a_t \in A_{\text{valid}}} \left( \min_{k \in T_n} d_{\text{sm},k} \right),    (3.11)

where Avalid is the set of valid actions at time step t and d_{sm,k} is the safety margin at prediction step k. The calculation of the safety margin differs based on the nature of the action. For instance, lane-change actions like "turn left" or "turn right" consider the shortest distance to vehicles in both the current and target lanes, whereas actions such as "speed up", "idle", or "slow down" gauge the safety margin using the minimum distance headway. These scenarios are visually represented in Figure 3.7. Once vehicle Pt[0]'s action is finalized, its trajectory is recalibrated. Vehicle Pt[0] is then removed from the list and the second-highest becomes the first, i.e., Pt[i] ← Pt[i + 1], i = 1, 2, · · ·. This sequential safety validation continues for every AV in the list until none remain. The design of this priority-based safety supervisor is further detailed in Algorithm 3.1.

Figure 3.7 Illustration of safety margin definitions. Top: safety margin if vehicle 1 turns left; Bottom: safety margin when vehicle 1 keeps straight.

Remark 3.3.1. Our design envisages the priority-based safety supervisor's real-world application via vehicle-to-infrastructure (V2I) communication [25], where a centralized infrastructure agent in proximity to a ramp can observe HDVs and interface with AVs. At each time step t, the infrastructure agent assigns priority scores grounded in real-time traffic data, collects exploratory actions from the AVs, and then applies Algorithm 3.1 to finalize safe actions. Its sequential nature ensures computational efficiency (approximately 28.13 ms for the safety supervisor with Tn = 8 to make a decision in the Hard traffic mode; see Table 3.2 in Section 3.5). Given robust computational infrastructure, it is feasible to apply this algorithm in real time, ensuring decisions are rendered within a single sampling time frame. Enhancing computational efficiency remains a focus for our future endeavors.
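The replacement action of Eq. (3.11) amounts to a simple max-min search over the valid action set. A minimal sketch is shown below; predict_margins stands in for the Tn-step trajectory roll-out against the neighboring vehicles described above and is not an actual function of our codebase.

def safest_action(valid_actions, predict_margins):
    """Return the valid action maximizing the worst-case safety margin, Eq. (3.11)."""
    best_action, best_margin = None, float("-inf")
    for a in valid_actions:
        margins = predict_margins(a)            # list of d_sm,k for k = 1..Tn
        worst_case = min(margins)               # min over the prediction horizon
        if worst_case > best_margin:            # maximize the worst-case margin
            best_action, best_margin = a, worst_case
    return best_action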
Algorithm 3.1 Priority-based Safety Supervisor
Parameters: L, α1, α2, α3, th, w, Tn.
Output: ai, i ∈ ν.
for i = 0 to N do
    compute the priority score of vehicle i according to Eq. 3.8
    rearrange the ego vehicles into the list Pt according to their priority scores
end
for j = 0 to |Pt| do
    obtain the highest-priority vehicle Pt[0] and find its neighboring vehicles N_{Pt[0]}
    predict the trajectories ζv, v ∈ Pt[0] ∪ N_{Pt[0]}, for Tn time steps
    if the trajectories overlap then
        replace the risky action with at ← a't according to Eq. 3.11
        replace the trajectory ζ_{Pt[0]} with ζ'_{Pt[0]}
    end
    remove Pt[0] from Pt and update Pt[i] ← Pt[i + 1], i = 1, 2, · · ·
end

Remark 3.3.2. The prediction horizon, denoted as Tn, plays an important role in the safety-enhancement strategy. If Tn is too small, the safety supervisor is "short-sighted", potentially leading to an infeasible solution within a few iterations. Conversely, if Tn is too large, the compounded uncertainties of the HDVs (the actual vehicle motion in the simulation has noisy perturbations relative to the human driver models used to predict the trajectories) are propagated. This can cause overly cautious decisions, aimed at ensuring safety over an extensive horizon. Through rigorous cross-validation, we ascertain Tn = 8 or 9 as optimal choices (refer to Figure 3.12 and Table 3.2 for details).

The skeleton of the proposed MARL framework complemented with the priority-based safety supervisor is depicted in Algorithm 3.2. Key hyperparameters include the time-discount factor γ, learning rate η, epoch length T, cumulative training epochs M, and the loss function coefficients (β1 and β2). Each epoch commences with agents acquiring state data and selecting actions, leveraging the action masking approach to sidestep invalid maneuvers (Lines 4-7). Actions derived from MARL subsequently undergo a safety evaluation by the supervisor, as elaborated in Algorithm 3.1 (Line 9). If an action is unsafe, the safety supervisor replaces the risky action with a safe action according to Eq. 3.11. Agents then act upon this safer directive, and the resultant experience is documented in the replay buffer (Lines 10-17). Upon ending each episode, the policy network's parameters undergo an update with experiences extracted from the on-policy experience buffer (Lines 20-26). The DONE signal is flagged either at the episode's end or in the event of a collision. Upon receiving the DONE signal, all agents are reset to their initial states to start a fresh epoch (Line 28).

Algorithm 3.2 MARL with Priority-based Safety Supervisor
Parameters: γ, η, T, M, β1, β2.
Output: θ.
initialize s0, t ← 0, D ← ∅
for j = 0 to M − 1 do
    for t = 0 to T − 1 do
        for i ∈ ν do
            observe si and sample ai,t ∼ πθi(·|si) with action masking
        end
        for i ∈ ν do
            check the action ai,t with Algorithm 3.1
            if safe then
                execute ai,t and update Di ← (si,t, ai,t, ri,t, vi,t)
            else
                update ai,t ← a'i,t, execute a'i,t, and update Di ← (si,t, a'i,t, ri,t, vi,t)
            end
        end
        update t ← t + 1
        if DONE then
            for i ∈ ν do
                update θi ← θi + η∇θi J(θi)
            end
        end
        initialize Di ← ∅, i ∈ ν
        update j ← j + 1
    end
    update s0, t ← 0
end

3.4 Numerical Experiments

This section evaluates the efficacy of the proposed MARL algorithm in terms of training efficiency and collision rate within the on-ramp merging scenario depicted in Figure 3.1. The length of the road is 520 m, including a 320 m entrance to the merge lane (segment AB) and a merging lane of length L = 100 m. There are 12 spawn points (numbered beneath the vehicles) evenly distributed on the through lane and the ramp lane from 0 m to 220 m, as shown in Figure 3.8. Vehicles that drive beyond the end of the road are removed from view, though their kinematics continue to be updated.

Figure 3.8 Simulation settings for the single through lane case (upper) and the multiple through lane case (lower). "L" represents the length of the road segments. The numbers under the vehicles are the vehicle spawn points.
3.4 Numerical Experiments
This section evaluates the efficacy of the advanced MARL algorithm through its training efficiency and collision rate within the context of the on-ramp merging paradigm depicted in Figure 3.1. The length of the road is 520 m, inclusive of a 320 m entrance to the merge lane (segment AB) and a merging lane with a length L of 100 m. There are 12 spawn points (numbered beneath the vehicles) evenly distributed on the through lane and the ramp lane from 0 m to 220 m, as shown in Figure 3.8. Vehicles that exceed the road are withdrawn from view, though their kinematics continue to be updated.

Figure 3.8 Simulation settings for the single through-lane case (upper) and multiple through-lane case (lower). “L” represents the length of the road segments. The numbers under the vehicles are the vehicle spawn points.

Three distinct traffic densities, determined by the initial count of vehicles, are defined:
1. Easy mode: comprising 1-3 AVs and 1-3 HDVs.
2. Medium mode: a mix of 2-4 AVs and 2-4 HDVs.
3. Hard mode: 4-6 AVs combined with 3-5 HDVs.
Within each training episode, a diverse set of HDVs and AVs spawn at varying points. A positional random noise (uniformly ranging between -1.5 m and 1.5 m) perturbs their starting positions. Initial speeds fluctuate between 25 and 27 m/s. The vehicle-control sampling frequency is 5 Hz, translating to AVs acting every 0.2 seconds. A 5% random noise is added to the predicted acceleration and steering angle for HDVs. The MARL algorithms are trained for 2 million steps, leveraging 3 distinct random seeds shared among agents, translating roughly to 20,000 episodes with an episode horizon T = 100 steps. We evaluate the algorithm over 3 episodes every 200 training episodes. We set γ = 0.99 and the learning rate η = 5 × 10−4. The reward function coefficients wc, ws, wh, and wm are set as 200, 1, 4, and 4, respectively. The priority coefficients α1, α2 and α3 are equally set as 1. The weighting coefficients β1 and β2 for the loss function are chosen as 1 and 0.01, respectively. For comparison, we label the MARL algorithm without the safety supervisor, introduced in Section 3.2, as the “baseline”. Our simulation environment is derived from the gym-based highway-env simulator [75] and is available for open-source exploration (see https://github.com/DongChen06/MARL_CAVs). The default IDM and MOBIL model parameters, aligned with those in the highway-env simulator [75], are used. The experiments are conducted on an Ubuntu 18.04 server with an AMD 9820X processor and 64 GB memory. A video demo of the training process can be found at https://drive.google.com/drive/folders/1437My4sDoyPFsUjrThmlu1oJjTkTkvJ7?usp=sharing.

3.4.1 Reward Function Designs
This subsection evaluates the performance of the proposed MARL framework under various reward function designs, namely, the local (baseline) vs. global rewards (baseline with global reward). Subsequently, the influence of the safety penalty weight wc within the reward function (Eq. 3.1) is evaluated. We investigate the localized reward function by comparing it with the global reward design used in [36, 64], wherein the reward for the ith agent at time step t is determined as the averaged global reward $r_{i,t} = \frac{1}{N}\sum_{j=1}^{N} r_{j,t}$. The performance contrast between our localized and the global reward mechanisms is demonstrated in Figure 3.9. As expected, our localized reward outperforms the global reward design, both in terms of rewards achieved and expedited convergence across all traffic setups. While the global reward mechanism fares well in the Easy and Medium modes due to fewer AVs, its efficacy decreases in the Hard mode, with an evaluation reward less than 0, as it suffers from the credit assignment issue [127] and the reduced correlation between average global rewards and individual agent actions as the number of agents increases.

Figure 3.9 Evaluation curves during training with distinct reward functions across varied traffic intensities, with the shaded portion standing for the standard deviation over three random seeds, smoothed over nine evaluation epochs.
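For clarity, the two reward assignments compared above can be sketched as follows; this is a minimal illustration under the assumption that the per-agent rewards have already been computed for the current step, not the project's actual reward-dispatch code.

```python
import numpy as np

def assign_rewards(local_rewards, use_global=False):
    """Return per-agent rewards under the local (ours) or averaged-global design."""
    local_rewards = np.asarray(local_rewards, dtype=float)
    if use_global:
        # Global design: every agent receives the fleet-average reward, which weakens
        # the link between an individual agent's action and its learning signal.
        return np.full_like(local_rewards, local_rewards.mean())
    return local_rewards   # local design: each agent keeps its own reward
```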
Furthermore, Table 3.1 illuminates the performance under diverse wc values in the Medium traffic configuration, holding the other reward function coefficients in Eq. 3.1 constant. Noticeably, there are no collisions when a large enough wc is selected (e.g., wc ≥ 100), and the average traffic speed decreases as wc further increases. This is because the CAVs behave conservatively if we place too much emphasis on safety. To strike a balance between safety and traffic efficiency, subsequent tests assign wc a value of 200.

Table 3.1 Performance with different wc's in terms of collision rate and average speed in the Medium traffic mode, while all other weighting coefficients remain consistent.
                  wc = 10   wc = 100   wc = 200 (we chose)   wc = 1000   wc = 10000
Collision rate    0.1       0          0                     0           0
Avg. speed        24.77     24.09      24.08                 23.69       23.62

3.4.2 Curriculum Learning
In this subsection, we leverage the concept of curriculum learning [65] to enhance both the speed and performance of learning in the Hard mode scenario. Instead of diving directly into the Hard mode, we build upon the trained model from the easier modes (i.e., Easy and Medium) and train the models to achieve higher efficiency (a brief warm-start sketch is given after Figure 3.11). This method of building on basic models is particularly valuable for applications where safety is paramount, such as autonomous driving, since starting from a decent model can greatly reduce the number of “blind” explorations that could lead to high-risk situations.
Figure 3.10 shows the training performance comparison between the baseline method (i.e., starting from scratch) and curriculum learning (baseline + curriculum learning) for the Hard traffic mode. It is obvious that learning based on the trained model from easier tasks greatly expedites the convergence and improves the final model performance. The average speed during training, as shown in Figure 3.11, indicates that the curriculum learning strategy also improves the average vehicle speed up to 22 m/s compared to the baseline method at 18 m/s, thus achieving higher traffic efficiency. Therefore, we apply curriculum learning in the following experiments for the Hard traffic mode.

Figure 3.10 Training curves with and without curriculum learning for the Hard traffic mode.

Figure 3.11 Average speed during training with and without curriculum learning for the Hard traffic mode.
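The warm start itself amounts to initializing the Hard-mode policy from an easier-mode checkpoint. The sketch below is a minimal illustration in PyTorch: the network shape and the checkpoint path are assumed placeholders, not the thesis's actual architecture or file layout.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Illustrative actor network; layer sizes here are placeholders."""
    def __init__(self, obs_dim=25, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def warm_start_hard_mode(checkpoint="checkpoints/medium_mode.pt"):
    """Initialize the Hard-mode policy from a model trained on an easier curriculum stage."""
    policy = PolicyNetwork()
    policy.load_state_dict(torch.load(checkpoint))   # warm start instead of from scratch
    return policy
```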
3.4.3 Evaluating The Priority-based Safety Supervisor's Performance
In this subsection, we probe into the efficacy of our introduced priority-based safety supervisor. Figure 3.12 shows that the proposed priority-based safety supervisor has enhanced sample efficiency, evidenced by faster convergence across all traffic densities. In addition, even in the Hard traffic mode, it achieves a higher evaluation reward. This improved performance stems from the safety supervisor's ability to replace most unsafe maneuvers with secure alternatives, particularly during initial exploration, minimizing premature terminations and paving the way for better learning efficiency.
An exploration of Figure 3.13 underlines the average vehicle speed throughout training, a metric pointing to traffic throughput. Obviously, algorithms adopting the safety supervisor consistently maintain a faster training speed compared to the baseline without safety supervision. This indicates that the proposed safety supervisor is not only beneficial for training but also leads to better traffic efficiency. It can also be seen that vehicle speeds are slower as the traffic density increases (26 m/s, 24 m/s and 22 m/s for the Easy, Medium and Hard traffic densities, respectively). This is reasonable since denser traffic naturally results in more frequent interactions, leading to reduced speeds to avoid collisions.

Figure 3.12 Training curves for the n-step priority-based safety supervisor.

Figure 3.13 Average speed during training for the n-step priority-based safety supervisor.

After training, the MARL algorithms for each traffic density are tested across 3 random seeds spanning 30 epochs. The outcomes, in terms of average collision rates, vehicle speeds, and the safety supervisor's inference time, are presented in Table 3.2. With the intervention of the safety supervisor, particularly for Tn ≥ 7, MARL runs seamlessly without collisions across all traffic modes. In contrast, the baseline method exhibits collision rates of 0.07 and 0.16 for the Medium and Hard traffic densities, respectively. It is clear that with only a short prediction horizon, e.g., Tn = 3 or 6, the MARL still fails in challenging cases. For example, the agents still have a 0.03 collision rate in the Hard traffic mode when choosing Tn = 6. The reason is that if Tn is too small, the safety supervisor is “short-sighted” and can lead to no feasible solutions after only a few steps. Conversely, if Tn (e.g., Tn = 10, 12, 14) is excessively increased, collision rates might increase due to compounded uncertainties over extended durations, as discussed in Remark 3.3.2 in Section 3.3. This might also suppress average speeds as vehicles adopt more careful actions to ensure safety. With a reasonable Tn (e.g., 7, 8), the average speed indicates that the safety supervisor leads to higher traffic efficiency. In all traffic modes, the safety supervisor leads to a higher average speed together with a lower collision rate. For instance, the best average speed for the Easy traffic mode is achieved by baseline + Tn = 8 (27.72 m/s) compared to the baseline method (23.52 m/s).
Interestingly, the baseline's average speed in the Medium mode lags behind that of the Hard mode, a pattern inconsistent with the methods incorporating the safety supervisor. Our observation is that the CAVs (in the baseline) often behave conservatively and hesitate to speed up or slow down to let the merging vehicles merge, which leads to traffic congestion and is the main cause of the traffic inefficiency. In contrast, with the designed safety supervisor, this hesitation is reduced and traffic efficiency is largely improved.

Table 3.2 Comparative analysis of the n-step safety supervisor based on the baseline (bs) method, in terms of collision rate, average speed (m/s), and inference time (ms).
Scenario       Metric           bs      bs+Tn=3  bs+Tn=6  bs+Tn=7  bs+Tn=8  bs+Tn=9  bs+Tn=10  bs+Tn=12  bs+Tn=14
Easy Mode      collision rate   0       0        0        0        0        0        0         0         0
               avg. speed       23.53   25.12    25.38    25.27    27.72    27.50    25.89     25.82     25.74
               infrn time       -       4.90     7.62     8.30     8.93     10.07    11.08     12.62     14.71
Medium Mode    collision rate   0.07    0.03     0.01     0        0        0        0         0         0.01
               avg. speed       20.30   24.22    24.61    24.13    24.08    24.19    23.74     24.35     24.13
               infrn time       -       14.75    14.64    16.78    17.55    19.40    21.29     23.22     26.89
Hard Mode      collision rate   0.16    0.14     0.03     0        0        0        0.03      0.05      0.05
               avg. speed       21.71   22.52    22.56    22.58    22.73    23.01    22.52     21.31     21.83
               infrn time       -       14.75    23.13    25.69    28.13    31.45    35.93     39.43     50.13

Figure 3.14 Training curves comparison between the proposed MARL policy (baseline (bs) + Tn = 8) and 3 state-of-the-art MARL benchmarks.

Figure 3.15 Average speed comparison between the proposed MARL policy (baseline (bs) + Tn = 8) and 3 state-of-the-art MARL benchmarks.

3.4.4 Comparison With State-of-the-art Benchmarks

Table 3.3 Testing performance comparison of collision rate and average speed between the proposed method and four SOTA benchmark techniques.
Scenario       Metric             MPC     MAA2C   MAACKTR  MAPPO   baseline + Tn = 8
Easy Mode      collision rate     0.03    0.02    0.08     0       0
               avg. speed [m/s]   22.05   21.00   24.71    25.70   25.72
Medium Mode    collision rate     0.03    0.08    0.12     0.02    0
               avg. speed [m/s]   19.67   19.33   21.94    24.00   24.08
Hard Mode      collision rate     0.40    0.52    0.18     0.34    0
               avg. speed [m/s]   21.02   19.68   18.19    22.41   22.73

In this subsection, we compare the proposed method with several SOTA MARL benchmarks, including MAA2C, MAPPO, and MAACKTR. Additionally, we examine an enhanced model predictive control (MPC) method as cited in [18, 19]. All the MARL benchmarks are structured to share parameters among agents to accommodate dynamic agent numbers, utilizing global rewards and a discrete action space. Specifically, MAA2C [80] integrates a context-aware multi-agent actor-critic methodology with a centralized critic network employing an expected update. MAACKTR [146] fine-tunes the trust region using the Kronecker-factored approximate curvature (K-FAC) [91].
Meanwhile, MAPPO [149] enhances the MARL paradigm by incorporating best practices, including Generalized Advantage Estimation (GAE) [118], advantage normalization, and value clipping.
Figure 3.14 shows evaluation metrics during the training phase for all MARL strategies. Notably, our proposed approach (baseline + Tn = 8) persistently surpasses its contemporaries across varied traffic levels. Its superiority becomes even more obvious regarding sample efficiency and training outcomes, especially in the Hard mode. Figure 3.15 indicates that our proposed method maintains superior average training speeds, leading to high training effectiveness.
After training, all algorithms are tested in each traffic density for 30 epochs. The derived average collision rates and vehicle speeds are tabulated in Table 3.3. These findings highlight the proposed method's capability to avoid collisions entirely and operate with enhanced efficiency compared to the benchmark methods. In particular, MAPPO stands out in the Easy traffic setting, achieving zero collisions, and showcases commendable performance in the Medium mode with a minor collision rate (0.02). However, it demonstrates a high collision rate (0.34) in the Hard traffic mode due to abrupt speed fluctuations, as visualized in Figures 3.14 and 3.15, leaving HDVs little room to respond. Conversely, both MAA2C and MAACKTR, lacking safety checks, fail in the on-ramp merging tasks across all traffic scenarios, leading to high collision rates. It is noteworthy that the discrepancy between the exact dynamics used in the highway simulation environment and the model used in MPC, along with the uncertainties injected into the simulator, means that even the MPC can result in collisions (0.03, 0.03, and 0.40 collision rates in the Easy, Medium, and Hard traffic modes, respectively). This becomes particularly evident as traffic complexity increases; the merging challenge amplifies due to an augmented model mismatch, making it difficult for MPC to devise a collision-averse policy. This shows the strengths of model-free approaches that do not rely on explicit models. Furthermore, the MPC's model-centric implementation relies on potent computational resources to facilitate extensive real-time calculations. This becomes even more critical in on-ramp merging scenarios, which involve nonlinear dynamics and necessitate solving a nonlinear program at each time step to calculate the control input, requiring significant onboard computation power. In contrast, our RL-based approach requires much less computational capability [82].

Figure 3.16 Frames showing the learned policy. The bottom figure shows the corresponding speeds of the AVs.

3.4.5 Policy Interpretation
This subsection interprets the behaviors exhibited by the learned AVs. Figure 3.16 provides snapshots at time steps 25, 37, and 50 and outlines the speeds of agents 2-4. At time step 25, vehicle 2 decelerates, creating a space for vehicle 3 to merge. As vehicle 3 accelerates to merge, it maintains an appropriate gap from vehicle 1. By time step 37, vehicle 3 has smoothly merged and accelerated, while vehicle 2 moderates its speed to ensure a safe distance from vehicle 3. By time step 50, vehicle 2 accelerates, maintaining a secure following distance from vehicle 3. Similar dynamics are noted with vehicle 4.
3.4.6 Multiple Through-lane Case
The efficacy of the proposed model is further highlighted in the complex multiple through-lane scenarios showcased in Figure 3.8(b), where vehicles are allowed to change lanes in the through lanes. As shown in Section 2.1, the on-ramp merging is depicted as a POMDP MG, which can be described by the tuple ({Ai, Si, Ri}i∈ν, T). Here, the action space A extends to fit multiple through-lane scenarios, with a slightly adjusted state space to account for additional surrounding vehicles. Specifically, the observation space (the number of observable neighboring vehicles) is determined by the parameter NNi. For the multiple through-lane case, we choose a larger NNi = 8 (NNi = 5 in the single through-lane case). The priority-based safety supervisor is also extended to the multi-lane case without any changes. The reward function was modified after observing that ego vehicles frequently made unnecessary lane changes, leading to unsafe driving behaviors (a demo of frequent lane changes is available at https://drive.google.com/file/d/1dO8xPCwLXVRgQFM_xwqscRazoId5ksf4/view?usp=sharing). This revised reward function is constructed with an additional metric, rl, which penalizes unwarranted and repeated lane changes, inspired by designs in [116]:

$r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_m r_m + w_l r_l,$  (3.12)

where wc, ws, wh, wm and wl are positive weighting coefficients corresponding to the collision evaluation rc, stable-speed evaluation rs, headway time evaluation rh, merging cost evaluation rm, and lane-changing evaluation rl, respectively. Here, rl is defined as (a short sketch of this reward follows at the end of this subsection):

$r_l = \begin{cases} -1, & \text{if changing lanes;} \\ 0, & \text{otherwise.} \end{cases}$  (3.13)

To further validate the flexibility and effectiveness of the proposed MARL framework, we implemented the aforementioned multiple through-lane cases in the highway environment. Figure 3.17 shows that our approach can be easily extended to the multiple through-lane cases and achieves good performance. Table 3.4 shows the evaluation performance on the multi-lane scenarios. As expected, the proposed approach achieves the best performance among the MARL benchmarks in terms of the lowest collision rate. Overall, MAPPO, MAA2C, and MAACKTR achieve better performance in terms of lower collision rates than in the single through-lane case since the extra through lane provides more operating space for the CAVs. However, nearly all MARL algorithms achieve relatively lower average speeds in the multi-lane case than in the single-lane case due to more complicated traffic scenarios with more CAVs and HDVs. It is noted that MAA2C learns a suboptimal policy in the Medium traffic mode, exhibiting the lowest average speed due to conservative operations. Since applying MPC approaches to the multi-through-lane cases is a very involved task given the complicated system modeling, we leave it for future work. The demo video and code for the multiple through-lane scenarios can be found at https://github.com/DongChen06/MARL_CAVs/tree/multi-lane.

Figure 3.17 Training curves for the n-step priority-based safety supervisor for the multiple through-lane cases.
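As promised above, the revised reward of Eqs. (3.12)-(3.13) can be sketched as below. The first four weights follow the single-lane setup reported earlier; the lane-change weight `w_l = 1.0` is an assumed placeholder, since its value is not reported in the text.

```python
def multi_lane_reward(r_c, r_s, r_h, r_m, changed_lane,
                      w_c=200.0, w_s=1.0, w_h=4.0, w_m=4.0, w_l=1.0):
    """Reward of Eq. 3.12 with the lane-change penalty r_l of Eq. 3.13."""
    r_l = -1.0 if changed_lane else 0.0          # penalize unnecessary lane changes
    return w_c * r_c + w_s * r_s + w_h * r_h + w_m * r_m + w_l * r_l
```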
Table 3.4 Testing performance comparison of collision rate and average speed between the proposed method and 3 state-of-the-art benchmarks on the multiple through-lane cases.
Scenario       Metric             MAA2C   MAACKTR  MAPPO   baseline + Tn = 8
Easy Mode      collision rate     0       0.03     0       0
               avg. speed [m/s]   19.78   22.71    23.07   23.53
Medium Mode    collision rate     0.03    0.07     0.03    0
               avg. speed [m/s]   15.16   20.07    19.59   21.05
Hard Mode      collision rate     0.40    0.27     0.10    0
               avg. speed [m/s]   22.04   21.60    19.65   20.95

3.5 Conclusions And Discussions
This chapter formulated the on-ramp merging challenge in mixed-traffic scenarios as an on-policy MARL problem, incorporating action masking, local reward design, curriculum learning, and parameter sharing, and demonstrating robust performance over several SOTA benchmarks. Furthermore, a unique priority-based safety supervisor was introduced, which significantly improved safety and enhanced learning efficiency.
In future work, we aim to bridge the gaps between simulations and their real-world applications. Initial explorations in general RL training can inadvertently result in undesired system behaviors, even leading to potential crashes upon real-world deployment. To address this, safety during exploration can be enhanced by exploiting the dynamic information of the system to limit the exploration actions within an admissible range; see, e.g., our previous work [22] as well as others [4, 46] for safe RL algorithms. It is essential that, before transitioning to real-world applications, our policy network undergoes rigorous examination in high-fidelity simulations until it achieves optimal performance. Furthermore, it needs to pass various tests before field deployment to ensure both safety and robustness. Once the policy is deployed on the ego vehicles, regular updates and maintenance should be conducted to enhance the model performance in unseen scenarios. For a comprehensive understanding of the sim-to-real paradigm in reinforcement learning, we recommend the in-depth survey in [154]. Therefore, we will develop a more realistic simulation environment by incorporating data from real-world traffic systems to better close the sim-to-real gap.

CHAPTER 4
CACC WITH FULLY DECENTRALIZED AND COMMUNICATION EFFICIENT MARL
In this chapter, we introduce a fully decentralized MARL framework, equipped with a novel quantization-based communication protocol, for Cooperative Adaptive Cruise Control (CACC).
4.1 Introduction
Recently, connected autonomous vehicles (CAVs) have gained significant attention due to their ability to create safe and sustainable transportation systems in the future [140]. One pivotal technology of CAVs, known as Cooperative Adaptive Cruise Control (CACC), has been recognized for its capacity to increase road usage efficiency, alleviate traffic congestion, and decrease both energy consumption and exhaust emissions [28, 107]. The primary objective of CACC is to adaptively coordinate a fleet of vehicles, aiming to minimize the car-following headway and speed variations, utilizing real-time vehicle-to-vehicle (V2V) communication [26]. While autonomous vehicle platooning offers great benefits, developing a robust CACC platform that tightly integrates computing, communication, and control technologies presents a considerable challenge, especially considering the constraints of limited onboard communication bandwidth and computing resources [155]. Classical control theory and optimization-based methodologies have been employed to tackle the CACC problem [92, 44, 3, 145, 50].
Specifically, some research targets the predecessor-following model [92] and string stability [138, 41], modeling CACC within the context of a two-vehicle system. In contrast, other studies formulate the challenges posed by CACC as optimal control problems [63, 44, 145]. However, these approaches frequently hinge on precise system modeling [92, 138, 41] or necessitate online optimization, which may not align with the demands of efficiency and scalability that are essential for real-time application [28].
On the other hand, platoon control has also been conceptualized as a sequential decision problem and addressed with data-driven strategies such as reinforcement learning (RL) [32, 28, 74, 62, 155, 140, 84, 76]. In particular, in [62], the Soft Actor-Critic (SAC [54]) is adopted to mitigate traffic oscillations and enhance platoon stability. Furthermore, the deep deterministic policy gradient (DDPG [79]) algorithm is employed in [140] for CACC, taking into account both time-varying leading velocity and communication delays via wireless V2V communication technology. A policy-gradient RL approach is developed in [32] to ensure a safe longitudinal distance to the front vehicle. However, these approaches primarily focus on platoons of only 2 vehicles (i.e., a leader-follower architecture). To control multiple CAVs, centralized RL approaches are frequently developed, a strategy that relies heavily on the high-bandwidth capabilities of vehicle-to-cloud (V2C) or vehicle-to-infrastructure (V2I) communication [61]. For instance, in [28], a centralized RL controller is introduced for the CACC problem in mixed-traffic scenarios via V2C communication. While these centralized control strategies have demonstrated promising results, they bear the burden of heavy communication overheads and are often plagued by a single point of failure and the curse of dimensionality [26]. These factors make them impractical for deployment in the large-scale CACC systems prevalent in the future landscape.
More recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to address the CACC control problem involving multiple AVs, owing to its capabilities for online adaptation and solving complex problems [108, 26, 111]. For instance, a MARL framework with both local and global reward designs is evaluated in [108] on two platoons of 3 and 5 AVs, concluding that the local reward design (i.e., independent MARL) surpasses the global reward design. However, our experiments will demonstrate that independent MARL achieves promising performance in straightforward CACC scenarios but falls short in more complex situations (see Sec. 4.4). In [26], a learnable communication MARL protocol is developed to reduce information loss across two CACC scenarios, and each agent (i.e., AV) learns a decentralized control policy based on local observations and messages from connected neighbors. Moreover, Blockchain is incorporated into the MARL (i.e., MADDPG [87]) framework to enhance the privacy of CACC. Despite these advances, these approaches uniformly adopt a Centralized Training and Decentralized Execution (CTDE) framework, wherein agents use additional global information to guide training centrally and make decisions based on decentralized local policies [152, 151]. However, in many real-world scenarios, such as CACC, installing a central controller (e.g., cloud facilities or roadside units) can be prohibitively expensive.
Moreover, the central controller needs to communicate with each agent to exchange information, which perpetually amplifies the communication overhead on the single controller [152].
In this chapter, we formulate CACC as a fully decentralized MARL problem, in which the agents are connected via a sparse communication network without the need for any central controller. To achieve this, we introduce a decentralized MARL algorithm based on a novel policy gradient update mechanism. Throughout the training process, at every time step, each agent takes an individual action based solely on locally available information. To stabilize training and counteract the inherent non-stationarity in MARL [151], each agent shares its estimate of the value function with its neighbors on the network, collectively aiming to maximize the average rewards of all agents across the network. Furthermore, a novel quantization-based communication scheme is further proposed, which greatly improves communication efficiency in decentralized stochastic optimization without a substantial compromise on optimization accuracy. The main contributions and the technical advancements of this chapter are summarized as follows.
1. We formulate the CACC problem as a fully decentralized MARL framework, which allows fast convergence without any central controller. A corresponding gym-like simulation platform with two CACC scenarios and six state-of-the-art MARL baseline algorithms is developed and open-sourced (see https://github.com/DongChen06/CACC_MARL).
2. We introduce an innovative, effective, and scalable MARL algorithm equipped with a quantization-based communication protocol to enhance communication efficiency. The quantization process condenses complex parameters of the critic network into discrete representations, facilitating efficient information exchange among agents.
3. We conduct comprehensive experiments on two CACC scenarios, and the results show that the proposed approach consistently outperforms several state-of-the-art MARL algorithms.
The structure of this chapter is as follows: In Section 4.2, we introduce the CACC problem that we are addressing. The problem formulation and the proposed MARL framework are introduced in Sec. 4.3, whereas experiments, results, and discussions are presented in Sec. 4.4. Lastly, in Section 4.5, we conclude the chapter by summarizing our contributions and suggesting potential insights for future research.

Figure 4.1 Framework of the CACC system.

4.2 Cooperative Adaptive Cruise Control (CACC)
In this section, we introduce the system model for vehicle platooning along with the behavior model employed within the platoon. Furthermore, we present an introduction to the two CACC scenarios used in this chapter.
4.2.1 Vehicle Dynamics
As shown in Figure 4.1, we consider a platoon, comprised of V + 1 CAVs, navigating along a horizontal road. For simplicity, we assume that all vehicles in the system share identical characteristics. The platooning system is guided by a platoon leader vehicle (PL, 0th vehicle), while the platoon member vehicles (PMs, i ∈ {1, ..., V}) travel behind the PL. Each PM i maintains a desired inter-vehicle distance (IVD) hi and velocity vi relative to its preceding vehicle i − 1, based on its unique spacing policy [155].
The one-dimensional dynamics of vehicle i can be expressed as follows:

$\dot h_i = v_{i-1} - v_i,$  (4.1a)
$\dot v_i = u_i,$  (4.1b)

where $v_{i-1}$ and $u_i$ symbolize the velocity of its preceding vehicle and the acceleration of vehicle i, respectively. As per the design outlined in [28], the discretized vehicle dynamics, given a sampling time ∆t, can be denoted as:

$h_{i,t+1} = h_{i,t} + \int_{t}^{t+\Delta t} (v_{i-1,\tau} - v_{i,\tau})\, d\tau,$  (4.2a)
$v_{i,t+1} = v_{i,t} + u_{i,t}\,\Delta t.$  (4.2b)

In order to guarantee both comfort and safety, each vehicle must satisfy the following constraints [28]:

$h_{i,t} \ge h_{\min},$  (4.3a)
$0 \le v_{i,t} \le v_{\max},$  (4.3b)
$u_{\min} \le u_{i,t} \le u_{\max},$  (4.3c)

where $h_{\min} = 1$ m, $v_{\max} = 30$ m/s, $u_{\min} = -2.5$ m/s² < 0 and $u_{\max} = 2.5$ m/s² > 0 represent the minimum safe headway, maximum speed, and the deceleration and acceleration limits, respectively.
4.2.2 Vehicle Behavior
The behavior of vehicles in the platoon is simulated using the Optimal Velocity Model (OVM [8]). The OVM has been widely used in traffic flow modeling due to its ability to capture realistic human driving behaviors [26]. The principal equation of the OVM for the ith vehicle is defined as follows:

$u_{i,t} = \alpha_i \left( v^{\circ}(h_{i,t}; h_s, h_g) - v_{i,t} \right) + \beta_i \left( v_{i-1,t} - v_{i,t} \right),$  (4.4)

where $\alpha_i$ and $\beta_i$ are the headway gain and relative velocity gain, respectively. These parameters serve as representations of human driver behavior, encapsulating the influence of both spacing and relative speed in determining vehicle acceleration. Here, $h_s = 5$ m and $h_g = 35$ m denote the stop headway and full-speed headway, both of which are key to understanding traffic dynamics at different vehicle densities. Furthermore, $v^{\circ}$ represents the headway-based velocity policy, which is defined as:

$v^{\circ}(h) \triangleq \begin{cases} 0, & \text{if } h < h_s; \\ \frac{1}{2} v_{\max} \left( 1 - \cos\!\left( \pi \frac{h - h_s}{h_g - h_s} \right) \right), & \text{if } h_s \le h \le h_g; \\ v_{\max}, & \text{if } h > h_g. \end{cases}$  (4.5)

This policy function serves as an optimal velocity strategy for each vehicle based on the current headway to the preceding vehicle. At small headways, less than or equal to $h_s$, the optimal velocity is zero, highlighting the need for vehicle stopping to prevent potential collisions. For headways within the range $h_s$ to $h_g$, the optimal velocity gradually increases following a cosine curve until reaching the maximum velocity. For larger headways, greater than or equal to $h_g$, the optimal velocity is capped at the vehicle's maximum speed, ensuring both safety and efficiency in the traffic flow. This strategy significantly contributes to maintaining fluidity in vehicular traffic under various density conditions.
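For concreteness, a minimal Python sketch of the OVM longitudinal update (Eqs. 4.2, 4.4, 4.5) is given below, using the constants stated in the text; the simple Euler approximation of the headway integral is an assumption of this sketch rather than the simulator's exact integration scheme.

```python
import math

# Constants from the text: stop/full-speed headway, speed and acceleration limits.
H_S, H_G, V_MAX = 5.0, 35.0, 30.0
U_MIN, U_MAX = -2.5, 2.5

def v_opt(h):
    """Headway-based optimal velocity policy, Eq. 4.5."""
    if h < H_S:
        return 0.0
    if h > H_G:
        return V_MAX
    return 0.5 * V_MAX * (1.0 - math.cos(math.pi * (h - H_S) / (H_G - H_S)))

def ovm_step(h_i, v_i, v_lead, alpha, beta, dt=0.1):
    """One discretized OVM update for follower i (Eqs. 4.2 and 4.4)."""
    u_i = alpha * (v_opt(h_i) - v_i) + beta * (v_lead - v_i)   # Eq. 4.4
    u_i = max(U_MIN, min(U_MAX, u_i))                          # comfort/safety limits, Eq. 4.3
    h_next = h_i + (v_lead - v_i) * dt                         # Euler form of Eq. 4.2a (assumed)
    v_next = max(0.0, min(V_MAX, v_i + u_i * dt))              # Eq. 4.2b with speed limits
    return h_next, v_next, u_i
```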
4.2.3 Two CACC Scenarios
In this chapter, the objective of CACC is to adaptively control a fleet of CAVs in order to reduce the car-following headway to h∗ = 20 m and achieve a target velocity of v∗ = 15 m/s, leveraging real-time V2V communication. Two different CACC scenarios are investigated, as presented in [26]: “Catchup” and “Slowdown”. For the “Catchup” scenario, the CAVs (i = 1, ..., V) are initialized with states vi,0 = v∗t and hi,0 = h∗t, while the platoon leader (PL) is initialized with states v0,0 = v∗t and h0,0 = a · h∗t m, where a is a random variable uniformly distributed between 3 and 4. In contrast, during the “Slowdown” scenario, all vehicles (i = 0, 1, ..., V) have initial velocities vi,0 = b · v∗t and hi,0 = h∗t, where b is uniformly distributed between 1.5 and 2.5. Here, v∗t linearly decreases to 15 m/s within the first 30 seconds and then remains constant. The “Slowdown” scenario poses a more complex and challenging task than the “Catchup” scenario due to the necessity for all vehicles to precisely coordinate their deceleration rates and maintain safe inter-vehicle distances, thereby requiring more precise control strategies. An example of the headway and speed profiles of the CAVs in these scenarios is illustrated in Figure 4.4 and Figure 4.5.
4.3 CACC As MARL
In this section, we first formulate the considered CACC problem as a partially observable Markov decision process (POMDP). Subsequently, we present our fully decentralized actor-critic-based MARL algorithm, which represents our primary strategy for addressing the challenges presented in the CACC problem. Then, we introduce the quantization-based communication protocol to enhance the efficiency of agent communication within the MARL framework.
4.3.1 Problem Formulation
In this chapter, we model the CACC problem as a model-free multi-agent network [26], where each agent (i.e., AV) is capable of communicating with the vehicles ahead of and behind it via V2V communication channels. We denote the global state space and action space as S := ×i∈ν Si and A := ×i∈ν Ai, respectively. The intrinsic dynamics of the system can be characterized by the state transition distribution P: S × A × S → [0, 1]. We propose a fully decentralized MARL framework where each agent i (equivalently, AV i) has a partial view of the environment, specifically the surrounding vehicles, which accurately reflects the practical scenario where AVs are limited to sensing or communicating with proximal vehicles, thereby rendering the overall dynamical system a partially observable Markov decision process (POMDP). This POMDP, MG, can be delineated by the tuple MG = ({Ai, Si, Ri}i∈ν, T) (a compact sketch of the resulting agent interface follows this list):
1. Action Space: In the considered CACC problem, the action at ∈ Ai is straightforwardly related to longitudinal control. However, due to the data-driven nature of RL, formulating a safe and robust longitudinal control strategy poses a significant challenge [28]. To address this, we adopt the OVM (see Sec. 4.2.2, [8]) to carry out the longitudinal vehicle control. The OVM control behavior is affected by several hyperparameters: the headway gain α, relative velocity gain β, stop headway hs, and full-speed headway hg. Usually, (α, β) represents the driving behavior of a human driver. However, in this work, we leverage MARL to propose suitable values of (α, β) for each OVM controller. These recommended values are selected from a set of four different levels: {(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)}. Subsequently, the longitudinal action can be computed using Eq. 4.4 and Eq. 4.5.
2. State Space: The state space represents the description of the environment. The state of agent i, Si, is defined as [v, vdiff, vh, h, u], where v = (vi,t − vi,0)/vi,0 denotes the current normalized vehicle speed and vdiff = clip((vi−1,t − vi,t)/5, −2, 2) represents the clipped speed difference with its leading vehicle. Furthermore, vh = clip((v°(h) − vi,t)/5, −2, 2), h = (hi,t + (vi−1,t − vi,t)∆t − h∗)/h∗, and u = ui,t/umax are the headway-based velocity defined in Eq. 4.5, the normalized headway distance, and the normalized acceleration, respectively.
3. Reward Function: The reward function ri,t is pivotal for training the RL agents to exhibit the desired behaviors. With our objective being to train the agents to achieve a predefined car-following headway h∗ = 20 m and velocity v∗ = 15 m/s, the reward assigned to the ith agent at each time step t is formulated as follows:

$r_{i,t} = w_1 (h_{i,t} - h^*)^2 + w_2 (v_{i,t} - v^*)^2 + w_3 u_{i,t}^2 + w_4 (2h_s - h_{i,t})_+^2,$  (4.6)

where wi, i ∈ {1, 2, 3, 4}, are the weighting coefficients. In this equation, the first two terms, (hi,t − h∗)² and (vi,t − v∗)², penalize deviations from the desired headway and velocity, encouraging the agent to achieve these targets closely. The third term, u²i,t, is included to minimize abrupt accelerations, thereby promoting smoother and more comfortable rides for passengers. Lastly, the term (2hs − hi,t)²₊ functions as a safety constraint, penalizing the agent heavily if the inter-vehicle distance is less than twice the stop headway hs, which is critical for preventing collisions and ensuring the safety of the vehicle platoon. This comprehensive reward design serves to balance performance, comfort, and safety considerations in the CACC system. Upon a collision, i.e., if the inter-vehicle distance hi,t ≤ 1 m, each agent is subjected to a substantial penalty of 1000, immediately terminating the training episode.
4. Transition Probabilities: The transition probability T(s′|s, a) describes the dynamics of the system. Given that our approach is a model-free MARL framework, we do not assume any prior knowledge of this transition probability while developing our MARL algorithm.
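As referenced above, the following is a minimal Python sketch of the per-agent interface implied by this formulation. The constants follow the text, `v_opt` refers to the OVM sketch in Sec. 4.2.2, the weights (1.0, 1.0, 0.1, 5.0) are those reported later in Sec. 4.4.1, and the negative sign on the returned reward is an assumption chosen so that larger deviations yield lower returns.

```python
import numpy as np

H_STAR, V_STAR, H_S, U_MAX, DT = 20.0, 15.0, 5.0, 2.5, 0.1
ALPHA_BETA_LEVELS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]

def decode_action(action_index):
    """Map a discrete MARL action to the OVM gains (alpha, beta)."""
    return ALPHA_BETA_LEVELS[action_index]

def observe(v_i, v_i0, v_lead, h_i, u_i, v_opt):
    """Normalized local state [v, v_diff, vh, h, u] of agent i."""
    return np.array([
        (v_i - v_i0) / v_i0,
        np.clip((v_lead - v_i) / 5.0, -2.0, 2.0),
        np.clip((v_opt(h_i) - v_i) / 5.0, -2.0, 2.0),
        (h_i + (v_lead - v_i) * DT - H_STAR) / H_STAR,
        u_i / U_MAX,
    ])

def reward(h_i, v_i, u_i, w=(1.0, 1.0, 0.1, 5.0)):
    """Per-agent reward built from Eq. 4.6 plus the collision penalty (sign assumed)."""
    penalty = (w[0] * (h_i - H_STAR) ** 2 + w[1] * (v_i - V_STAR) ** 2
               + w[2] * u_i ** 2 + w[3] * max(2.0 * H_S - h_i, 0.0) ** 2)
    if h_i <= 1.0:                 # collision: large penalty, episode terminates
        penalty += 1000.0
    return -penalty
```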
4.3.2 Fully Decentralized MARL
In this chapter, we formulate CACC as a fully decentralized MARL scenario, where each agent (i.e., an autonomous vehicle) independently decides its action based solely on its local observation. Importantly, this structure lacks a centralized controller, meaning that each agent possesses its own individual policy networks. During the learning phase, agents rely on locally received rewards to train and update these networks. In this chapter, we adopt the actor-critic MARL framework [29], and the policy gradient for agent i is given as:

$\nabla_{\theta} L(\pi_{\theta_i}) = \mathbb{E}_{\pi_{\theta_i}}\!\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta_i}(a_{i,t} \mid s_{i,t})\, A^{\pi_{\theta_i}}_{i,t} \right],$  (4.7)

where $A^{\pi_{\theta_i}}_{i,t} = r_{i,t} + \gamma V^{\pi_{\phi_i}}(s_{i,t+1}) - V^{\pi_{\phi_i}}(s_{i,t})$ is the advantage function and $V^{\pi_{\phi_i}}(s_{i,t})$ is the state value function, which is updated following the loss function:

$L^{V}_{\phi_i} = \min_{\phi_i} \mathbb{E}_{D_i}\!\left[ r_{i,t} + \gamma V_{\phi_i}(s_{i,t+1}) - V_{\phi_i}(s_{i,t}) \right]^2.$  (4.8)

Despite each agent learning independently, the overall goal of the cooperative MARL framework is to optimize the average global reward $r_{g,t} = \frac{1}{V+1}\sum_{i=0}^{V} r_{i,t}$. To address the non-stationarity, in [152] the update of the policy network is executed independently by each agent, eliminating the need for inferring other agents' policies. However, when it comes to updating the critic network, a collaborative approach is adopted, in which each agent shares its estimate of the value function xi with its neighboring agents within the network through a “mean” operation, i.e., $x_i^{k+1} = \frac{1}{|N_i|}\sum_{j \in N_i} x_j^{k}$. This allows for the joint evolution and continuous improvement of the system's overall performance. However, this approach is based on the assumption that all agents are homogeneous, sharing the same characteristics. While this simplifies the problem structure, it does not adequately represent the intrinsic diversity of individual agents, which is particularly relevant for the CACC scenario, where diverse strategies are needed based on vehicles' positions, speeds, and proximities.
To address this concern, we propose a novel update strategy that fosters a balance between individual learning and collaborative influence from neighboring agents as follows:

$x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\,(x_j^{k} - x_i^{k}) - \lambda g_i^{k},$  (4.9)

where the scaling factor ϵ = 1.0 × 10−4 modulates the impact or collaborative influence from neighboring agents, while the learning rate λ = 5.0 × 10−4 adjusts the influence of the gradient on the update process. This novel update strategy fosters collaboration among the agents while preserving the individual learning capabilities of each, thereby striking a balance between global performance optimization and localized adaptivity. The update strategy of the proposed fully decentralized MARL for CACC (abbreviated as MACACC) is given in Algorithm 4.1.

Algorithm 4.1 MACACC for CACC
Public parameters: W, ϵk, λk, x0i for all i, and the total number of iterations K.
for the ith agent do
    determine the local gradient gik for the critic network
    send states to all neighboring agents j ∈ Ni
    after receiving xkj from all j ∈ Ni, update the network parameters as
        $x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\,(x_j^{k} - x_i^{k}) - \lambda g_i^{k}$
end

4.3.3 Quantization-based Communication Protocol
To enhance communication efficiency among agents in our MARL framework, we propose a strategy of transmitting quantized parameters, rather than the raw parameters of the critic network. This approach is especially important for autonomous driving applications that are often subject to limitations in communication bandwidth. By transmitting compact, quantized parameters instead of raw data, we ensure optimal use of available bandwidth, thereby fostering efficient and effective communication among the vehicles in the network.
Let us denote the parameters of the critic network as x = [x1, x2, ..., xd]ᵀ, with d representing the dimension of the parameter vector. We then apply a quantization function Q(x) to these parameters, yielding a quantized parameter vector [q1, q2, ..., qd]ᵀ. The quantization rule is defined as:

$q_i = r\,\mathrm{sign}(x_i)\, b_i,$  (4.10)

Here, r is a non-negative real number that is at least the maximum absolute value of the entries in x (denoted as ∥x∥∞), while sign(·) stands for the sign function, which returns the sign of any given real number. The factor bi is a random variable following a designed distribution determined by the magnitude of the corresponding parameter xi. Let n be the resolution of the quantization, so that overall 2n + 1 discrete points will be generated. We assume |xi| belongs to the interval $[\frac{k}{n} r, \frac{k+1}{n} r]$; then the probability distribution of bi can be determined as:

$P\!\left(b_i = \tfrac{k+1}{n} \,\middle|\, x\right) = \frac{n |x_i|}{r} - k,$  (4.11a)
$P\!\left(b_i = \tfrac{k}{n} \,\middle|\, x\right) = 1 - \left(\frac{n |x_i|}{r} - k\right),$  (4.11b)
$P\!\left(b_i = \tfrac{l}{n} \,\middle|\, x\right) = 0, \quad l \in \{0, 1, ..., k-1, k+2, ..., n\}.$  (4.11c)

If the magnitude of xi is closer to $\frac{k+1}{n} r$, then the higher the probability that bi will be $\frac{k+1}{n}$, and vice versa. An illustration of the proposed quantization-based communication scheme is presented in Figure 4.2. We denote the quantization-based MACACC algorithm as QMACACC (n). An extremely condensed version of QMACACC is QMACACC (1), in which only three discrete numbers {−1, 0, 1} are used to represent each parameter, and bi is defined as:

$P(b_i = 1 \mid x) = \frac{|x_i|}{r},$  (4.12a)
$P(b_i = 0 \mid x) = 1 - \frac{|x_i|}{r}.$  (4.12b)

Figure 4.2 Schematics of the proposed quantization scheme.
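A minimal Python sketch of this randomized-rounding quantizer (Eqs. 4.10-4.11) is given below; it is an illustrative, unbiased-in-expectation implementation under the stated assumptions, not the project's released code.

```python
import numpy as np

def quantize(x, n):
    """Randomized quantization of a parameter vector x to resolution n (Eqs. 4.10-4.11)."""
    x = np.asarray(x, dtype=float)
    r = np.max(np.abs(x))                      # r >= ||x||_inf, here taken as the max magnitude
    if r == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) * n / r                 # lies in [0, n]
    k = np.floor(scaled)
    prob_up = scaled - k                       # P(b_i = (k+1)/n | x), Eq. 4.11a
    b = (k + (np.random.random(x.shape) < prob_up)) / n
    return r * np.sign(x) * b                  # Eq. 4.10; with n = 1 this yields {-r, 0, r}
```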
Remark 4.3.1. The quantization resolution n is a significant hyperparameter in the quantization scheme. If n is too small, the coarse granularity of the quantization could lead to excessive loss of information, negatively impacting the performance of the MARL framework. Conversely, if n is too large, the computational and communication overheads could increase due to the larger number of potential quantized values. Therefore, choosing an appropriate value of n is crucial for balancing communication efficiency and the performance of the MARL framework. An empirical evaluation of different n values is conducted in the experiments section.
The update strategy of the proposed quantization-based MARL (i.e., QMACACC (n)) for CACC is given in Algorithm 4.2.

Algorithm 4.2 QMACACC (n) for CACC
Public parameters: W, ϵk, λk, x0i for all i, and the total number of iterations K.
for the ith agent do
    determine the local gradient gik for the critic network
    quantize the states according to Eq. 4.11 and send them to all neighboring agents j ∈ Ni
    after receiving Q(xkj) from all j ∈ Ni, update the state as
        $x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\left(Q(x_j^{k}) - Q(x_i^{k})\right) - \lambda g_i^{k}$
end
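The listing above can be sketched in Python as a single synchronous update over all agents; this is a minimal illustration assuming a `quantize` function as in the previous sketch, fixed mixing weights ω_ij, and locally computed critic gradients.

```python
def qmacacc_update(params, gradients, neighbors, weights, quantize, n=1,
                   eps=1.0e-4, lam=5.0e-4):
    """One QMACACC step (Algorithm 4.2): consensus on quantized critic parameters
    plus a local gradient step, following Eq. 4.9."""
    quantized = {i: quantize(params[i], n) for i in params}    # each agent broadcasts Q(x_i)
    updated = {}
    for i in params:
        # Weighted disagreement with neighbors, computed on the quantized parameters.
        consensus = sum(weights[i][j] * (quantized[j] - quantized[i]) for j in neighbors[i])
        updated[i] = params[i] + eps * consensus - lam * gradients[i]
    return updated
```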
4.4 Experimental Results & Discussions
In this section, we evaluate our MARL framework in the two CACC scenarios detailed in Section 4.2.3. Firstly, we benchmark our approach against several state-of-the-art MARL strategies. Then, we demonstrate the effectiveness of our quantization-based communication protocol.
4.4.1 General Setups
To demonstrate the efficiency and robustness of MACACC, we compare it to several state-of-the-art MARL benchmark controllers. IA2C performs independent learning, while ConseNet [152] takes the “mean” operation during updates of the critic networks, and FPrint [43] incorporates the neighbors' policies into the inputs. DIAL [42], CommNet [126], and NeurComm [26] are implementations with learnable communication protocols, incorporating more messages from the neighbors, e.g., neighboring states or policy information, and thus relying on higher communication bandwidth. All algorithms use the same DNN structures: one fully-connected layer for input state encoding, followed by one LSTM layer for message extraction. All hidden layers have 64 units. During training, the network is initialized with the state-of-the-art orthogonal initializer [115]. We train each model over 1M steps, with γ = 0.99, an actor learning rate of 5.0 × 10−4, and a critic learning rate of 2.5 × 10−4. Also, each algorithm is trained three times with different random seeds for generalization purposes. Each training takes about 12 hours on an Ubuntu 18.04 server with an AMD 9820X processor and 64 GB memory. The hyperparameters wi, i ∈ {1, 2, 3, 4}, in the reward function Eq. 4.6 are set to 1.0, 1.0, 0.1, and 5.0, respectively, with a significant emphasis on penalizing situations where the safety headway distance is insufficient. Considering a simulated traffic environment over a period of T = 60 seconds, we define ∆t = 0.1 seconds as the interaction period between RL agents and the traffic environment, so that the environment is simulated for ∆t seconds after each MDP step. In the following experiments, we assume the platoon size to be V + 1 = 8, implying that there are a total of 8 CAVs in the platoon. The impact of different platoon sizes on our model's performance will be studied and presented in Sec. 4.4.4.

Figure 4.3 Training curves comparison between the proposed MARL policy (MACACC) and 6 state-of-the-art MARL benchmarks.

Table 4.1 Execution performance comparison over trained MARL policies. The best values are in bold.
Scenario     IA2C       FPrint     ConseNet   NeurComm   CommNet    DIAL       MACACC
Catch-up     -241.38    -198.93    -94.67     -301.41    -397.55    -227.68    -50.44
Slow-down    -2103.38   -1470.41   -1746.43   -1912.23   -2590.72   -1933.27   -492.30

Table 4.2 Performance of MARL controllers in CACC environments: “Catchup” (above) and “Slowdown” (below). The best values are in bold.
Temporal Average Metrics      IA2C    FPrint   ConseNet  NeurComm  CommNet  DIAL    MACACC
avg vehicle headway [m]       19.43   20.02    20.28     21.77     22.38    21.86   19.91
avg vehicle velocity [m/s]    15.00   15.34    15.30     15.04     15.01    15.01   15.32
collision number              0       0        0         0         0        0       0
avg vehicle headway [m]       -       15.16    9.23      11.45     4.90     9.71    20.44
avg vehicle velocity [m/s]    -       13.10    8.08      10.32     4.18     8.91    16.61
collision number              50      14       29        22        38       26      0

4.4.2 Comparison With State-of-the-Art Benchmarks
Figure 4.3 shows the performance comparison in terms of the learning curves between the proposed approach MACACC and several state-of-the-art MARL benchmarks. As expected, the proposed approach achieves the best performance, evidenced by higher training rewards in both CACC scenarios. In the more challenging “Slowdown” environment, the proposed approach shows its greater advantage in sample efficiency, as seen from the fastest convergence speed and best training reward compared to the other algorithms.
After training, we evaluate each algorithm 50 times with different initial conditions. Table 4.1 shows the evaluation performance comparison over the trained MARL policies. The proposed method consistently outperforms the benchmarks in all CACC scenarios in terms of the evaluation reward, which reflects the overall evaluation metrics, including vehicle headway, velocity, acceleration, and safety, as described in Eq. 4.6.
Table 4.2 shows the key evaluation metrics in CACC. The best headway and velocity averages are the ones closest to h∗ = 20 m and v∗ = 15 m/s. Note that the averages are only computed from safe execution episodes, and we use another metric, “collision number”, to count the number of episodes in which a collision happens within the horizon. Ideally, “collision-free” is the top priority. It is clear that our approach achieves promising performance in the “Catchup” environment, and the best performance in the harder “Slowdown” environment. All algorithms achieve relatively good performance in the “Catchup” environment with zero collisions. It is surprising that IA2C achieves excellent average vehicle velocity at v∗. However, it demonstrates a high collision number (i.e., 50) in the “Slowdown” scenario due to non-stationarity issues, since there is no communication between agents. FPrint yields the best average vehicle headway in the “Catchup” environment, while it has 14 out of 50 collisions during testing. On the other hand, NeurComm and CommNet show great average vehicle velocity in the “Catchup” environment; however, they fail to track the optimal headway, resulting in high average headways of 21.77 m and 22.38 m, respectively.
It is noted that ConseNet achieves promising performance in the “Catchup” environment, with zero collisions and an average vehicle headway (20.28 m) and velocity (15.30 m/s) close to the optimal values. However, it yields high collision numbers (29 out of 50) in the “Slowdown” scenario, as it simply encourages all agents to behave similarly via the “average” operations during training, which is especially impractical for complex scenarios, such as “Slowdown”, where agents need to react differently to the speed and headway changes.
Figure 4.4 and Figure 4.5 show the corresponding headway and velocity profiles of the selected controllers for the two CACC scenarios. In the “Catchup” scenario, as expected, the MACACC controller is able to achieve the steady-state v∗ and h∗ for the first and last vehicles of the platoon, whereas the ConseNet controller still has difficulty eliminating the perturbation through the platoon. In the harder “Slowdown” environment, MACACC is still able to achieve the optimal headway at about 60 seconds and reach the optimal velocity quickly. However, FPrint fails the control task with a collision that happens at about 35 seconds. This may be because simply incorporating neighboring agents' policies might not be sophisticated enough to accurately model and adapt to the intricacies among agents.

Figure 4.4 Headway and velocity profiles in the “Catchup” environment of the first and last vehicles of the platoon, controlled by the proposed approach (MACACC) and the top baseline policy (ConseNet). (a) Headway profiles; (b) velocity profiles.

Figure 4.5 Headway and velocity profiles in the “Slowdown” environment of the first and last vehicles of the platoon, controlled by the proposed approach (MACACC) and the top baseline policy (FPrint). (a) Headway profiles; (b) velocity profiles.

4.4.3 Performance Of The Quantization-based MACACC
In this subsection, we evaluate the effectiveness of the proposed quantization-based communication protocol with different quantization resolutions. As shown in Figure 4.6, in the less complex “Catchup” scenario, minor quantization appears to improve control performance. This could be attributed to the fact that the quantization process introduces a level of randomness during the training phase, thereby fostering improved exploration by the agents, as discussed in [109]. Conversely, in the more challenging “Slowdown” scenario, quantization results in a more significant performance degradation relative to the “Catchup” scenario. Nonetheless, even with extremely quantized communication, such as QMACACC (1), our proposed approach continues to surpass the performance of the robust baseline method, FPrint.
Figure 4.7 presents the number of bits required for each communicated parameter as well as the corresponding test performance at varying quantization resolutions. For better visualization, these values are normalized by their corresponding maximum values.
Within the “Catchup” scenario, QMACACC (1) manages to achieve 98.63% of the control performance achieved by the non-quantized version, i.e., QMACACC (0), while only requiring 12.5% of the communicated bits. However, in the “Slowdown” scenario, QMACACC (1) can only realize 64.64% of the control performance of the non-quantized version, i.e., QMACACC (0). This underscores a trade-off between the benefits of enhanced communication efficiency brought about by quantization and the associated diminution in control performance.

Figure 4.6 Training curves comparison between the proposed MARL policy (MACACC) and the quantization-based MACACC (QMACACC (n)).

4.4.4 Impact Of Platoon Size
In this subsection, we explore how variations in platoon size affect the performance of our model. The normalized training curves comparison among MACACC, QMACACC (1), and the top-performing baseline methods under different platoon sizes (i.e., V + 1 ∈ {2, 8, 12}) is shown in Figure 4.8. The proposed approach consistently outperforms the baseline method (i.e., FPrint) under different platoon sizes, showing the impressive scalability of our proposed approach.

Figure 4.8 Normalized training curves comparison between MACACC, QMACACC (1), and the top baseline methods with different platoon sizes in the CACC scenarios: “Catchup” (above) and “Slowdown” (below).

4.5 Conclusions And Future Work
In this chapter, we have addressed the CACC problem by formulating it as a fully decentralized MARL problem. This novel approach eliminates the need for a centralized controller, thereby enhancing the system's scalability and robustness. Additionally, we introduced an innovative quantization-based communication protocol, which significantly enhances communication efficiency among the agents. To validate our proposed approach, we undertook comprehensive experiments and compared it with several state-of-the-art MARL algorithms, showing that our approach achieves superior control performance and communication efficiency. These results underscore the potential of our fully decentralized MARL and quantization-based communication protocol as a robust and effective solution for real-world CACC problems.
In this chapter, we employed the Optimal Velocity Model (OVM), popular for its simplicity and efficacy in emulating certain traffic flow behaviors. However, it is worth noting that the OVM simplifies the intricate nature of human driving behaviors and may not be entirely precise.
As a result, future research endeavors will focus on the integration of more comprehensive human driver models to improve the accuracy of our simulation.

CHAPTER 5 DEEP MARL FOR SECONDARY VOLTAGE CONTROL

In this chapter, we propose a scalable and efficient MARL framework for secondary voltage control.

5.1 Introduction

Over recent decades, renewable energy sources, notably solar and wind, have received rising attention, driven by their capacity to reduce greenhouse gas emissions and combat global warming [125]. In a modern power grid, these green energies are integrated as distributed generators (DGs), working alongside conventional sources like fossil fuels and nuclear power plants. Specifically, microgrids are localized energy networks with the flexibility to operate either connected to or isolated from the main grid. As microgrids can still operate when the main grid is disconnected, they can strengthen grid resilience and service reliability. When a microgrid is disconnected from the main grid, there are two levels of control: primary control and secondary control [13]. The primary control refers to the basic, low-level control within a Distributed Generator (DG) aimed at maintaining a specific voltage reference. In contrast, the secondary control focuses on the collaborative generation of local references across multiple DGs to meet grid-wise control objectives [71, 51, 94, 114, 120]. Given its potential to significantly enhance grid efficiency, this chapter focuses on secondary voltage control.

Secondary control methods can be generally divided into centralized and distributed approaches. Centralized controllers aggregate data from all DGs and make collective control decisions relayed to each DG [135, 93, 96]. Despite their promising results, these centralized schemes suffer from significant communication overheads, potential single points of failure [139], and the curse of dimensionality, which make them less feasible for large-scale microgrid systems. On the other hand, distributed control approaches, inspired by cooperative control in multi-agent systems [147, 14, 120, 15, 13, 34, 88], adopt a decentralized architecture in which each DG interacts with its neighbors and makes decisions based on data shared over local communication networks. Traditional model-based approaches often simplify the intricate dynamics of microgrids for control design and then develop distributed feedback controllers for the formulated tracking synchronization problem [14, 15, 13, 70, 124, 52]. Since the underlying microgrid dynamics are subject to complex nonlinearity, system and disturbance uncertainties, and high dimensionality, model simplifications have to be made to enable model-based control designs, which inevitably degrades the control performance.

Recently, reinforcement learning (RL) has gained rising attention as a promising framework to address the centralized voltage control problem, acclaimed for its online adaptability and capability to handle intricate issues [79, 123, 95, 22, 48]. Notably, the Deep Q-network (DQN) [96] demonstrates impressive performance in autonomous grid control, especially amidst load fluctuations and topological changes [33]. Duan et al. [37] effectively utilize deep deterministic policy gradient (DDPG) [79], an off-policy RL algorithm, to maintain bus voltages within desired ranges.
Furthermore, a two-timescale voltage controller is proposed in [148], where shunt capacitors are configured to minimize voltage deviations using a deep reinforcement learning algorithm.

At the same time, multi-agent reinforcement learning (MARL) has seen great improvements, finding applications in diverse domains such as games like StarCraft and Dota 2 [134, 10], traffic light management [29], and autonomous driving [121]. There are also efforts in applying MARL to microgrid voltage control, emphasizing autonomous voltage control (AVC) [83, 17, 153, 45]. In this chapter, our focus is on secondary voltage control in isolated microgrids, aiming to maintain DG output voltages at a predefined reference value [14]. We formulate the secondary voltage control (SVC) of inverter-based microgrid systems as a partially observable Markov decision process (POMDP) and introduce an on-policy MARL algorithm, PowerNet. This decentralized MARL strategy offers stability in training and efficiency in policy learning. Our extensive experiments show that the proposed PowerNet outperforms several SOTA MARL algorithms and a model-based approach in learning efficiency, performance, and scalability.

The key contributions and advancements presented in this chapter include:

1. We formulate the secondary voltage control of inverter-based microgrids as a decentralized, cooperative MARL problem. To support this, we introduce and open-source a power grid simulation platform, PGSIM, available at https://github.com/Derekabc/PGSIM.

2. We propose PowerNet, an efficient on-policy decentralized MARL algorithm, featuring a novel spatial discount factor, a learning-based communication protocol, and an action smoothing mechanism, all aimed at effectively learning a control policy.

3. We conduct comprehensive experiments that highlight PowerNet’s superiority. It demonstrates better performance than the traditional model-based control method and six other state-of-the-art MARL algorithms, especially in sample efficiency and voltage regulation.

The remainder of the chapter is organized as follows. In Section 5.2, we formulate the secondary voltage control as a MARL problem; our developed MARL algorithm, PowerNet, is detailed in Section 5.3. Experiments, results, and discussions are presented in Section 5.4, and concluding remarks and future work are given in Section 5.5.

Figure 5.1 Schematic diagram of the decentralized control of microgrids and the inverter-based DG: (a) diagram of decentralized control of microgrids; (b) diagram of the inverter-based DG.

5.2 MARL Formulation

The voltage-controlled voltage source inverter (VCVSI) is widely used in DGs, offering expedient voltage/frequency support [14]. Figure 5.1(a) shows a typical VCVSI with a decentralized control architecture, in which each DG employs a secondary controller that coordinates with neighboring DGs to dynamically produce voltage references. The primary controller, a low-level controller, uses this reference for tracking. The overall aim is to ensure that the voltage and frequency of all DGs align with the reference values, even under power network disturbances and primary control inaccuracies [14, 15, 100]. As Figure 5.1(b) illustrates, the primary controller of each DG, labeled from i = 1 to N, takes the frequency and voltage references from the secondary controller and regulates the output voltage and frequency towards the desired reference.
This is typically achieved via the active- and reactive-power droop techniques without DG intercommunication [14, 51]. The readers can refer to [14, 15] for an in-depth understanding of the system dynamics, which is exploited to develop our power grid simulation platform, PGSIM (see https://github.com/Derekabc/PGSIM). The objective of the SVC is to coordinate with other DGs and generate reference signals Vni to synchronize the voltage magnitude of DG i to the reference value, in the presence of power disturbances and primary control imperfections.

While there are existing model-based secondary control strategies [13, 14, 15], they tend to underperform due to simplifications made in tackling nonlinearity and uncertain disturbances. Thus, in this chapter, we develop a model-free approach using MARL. Here, the microgrid is formulated as a multi-agent network, denoted as G = (ν, ε). Each agent, represented by i ∈ ν, communicates only with its adjacent nodes Ni := {j | εij ∈ ε}. We define the global state and action spaces as S := ×i∈ν Si and A := ×i∈ν Ai, respectively. These symbolize the collective state data and combined controls of all DGs. The microgrid’s underlying dynamics can be characterized by the state transition distribution P: S × A × S → [0, 1]. For scalable power grid control, we adopt a decentralized MARL framework: each DG only communicates with its neighbors and makes decisions based on these observations. As each agent i (or DG i) observes only a part of the environment, this naturally results in a POMDP [55]. At each time step t, each agent i receives an observation oi,t ∈ Oi, takes an action ai,t, and then receives the subsequent observation oi,t+1 and a reward signal ri,t : Ot × At → R. The objective is to find an optimal decentralized policy πi : Oi × Ai → [0, 1] that maximizes the expected total rewards. We tackle this challenge using MARL, defining the key POMDP elements as follows:

1. Action Space: Each DG’s control action is the secondary voltage control set point, Vni. For this work, we employ 10 uniformly spaced discrete actions between 1.00 pu and 1.14 pu. The overall action of the microgrid is the combined actions from all DGs, i.e., a = vn1 × vn2 × · · · × vnN.

2. State Space: The DG’s state is defined with nine variables to characterize the operation of the DG, denoted as si,t = (δi, Pi, Qi, iodi, ioqi, ibdi, ibqi, vbdi, vbqi), where δi is the measured reference angle frame; Pi and Qi denote the active and reactive power, respectively; iodi, ioqi, ibdi, and ibqi represent the output d-q currents of DG i and of the directly connected bus, respectively; and vbdi and vbqi are the output d-q voltages of the connected bus. The entire microgrid state is the Cartesian product of these individual states, i.e., S(t) = s1,t × · · · × sN,t.

3. Observation Space: We assume DGs observe only their local state and messages from neighbors. This observation consists of the local state and the received communication message, i.e., oi,t = si,t ∪ mi,t, where mi,t is the communication message received from neighboring agents j ∈ Ni and will be detailed in Section 5.3.

4. Transition Probabilities: The transition probability T(s′|s, a) characterizes the dynamics of the microgrid. We follow the models in [14, 15] to build our simulation platform, but we do not exploit any prior knowledge of the transition probability, as our MARL approach is model-free.
5. Reward Function: We design the following reward function to encourage the DGs to quickly converge to the reference voltage (e.g., 1 pu); a code sketch of this reward is given after Remark 1 below:

\[
r_{i,t} \triangleq
\begin{cases}
0.05 - |1 - v_i|, & v_i \in [0.95, 1.05],\\
-|1 - v_i|, & v_i \in [0.8, 0.95] \cup [1.05, 1.25],\\
-10, & \text{otherwise},
\end{cases}
\tag{5.1}
\]

where ri,t is the reward of agent i at time step t. More specifically, we divide the voltage range into three operation zones similar to [139]: the normal zone ([0.95, 1.05] pu), the violation zone ([0.8, 0.95] ∪ [1.05, 1.25] pu), and the diverged zone ([0, 0.8] ∪ [1.25, ∞) pu). With the formulated reward, DGs with diverged voltages or no power flow solution receive a large penalty, while DGs with a voltage close to 1 pu obtain positive rewards.

Remark 1. Our formulation for regulating DG voltages follows the literature such as [14, 13, 98]. Although effective for DG voltages, bus voltages might still deviate. An alternative formulation can address bus voltages with a distinct reward function, detailed further on our site (see https://github.com/Derekabc/PGSIM/tree/R2).
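As an illustration of how the zoned reward in (5.1) translates into code, the following is a minimal sketch; the function name and the explicit handling of a failed power flow solution are assumptions for exposition.

```python
def voltage_reward(v_i, power_flow_ok=True):
    """Per-agent reward of Eq. (5.1): reward voltages near 1 pu, mildly penalize
    violations, and heavily penalize diverged voltages or a failed power flow."""
    if not power_flow_ok:
        return -10.0                 # no power flow solution: large penalty
    dev = abs(1.0 - v_i)
    if 0.95 <= v_i <= 1.05:          # normal zone
        return 0.05 - dev
    if 0.8 <= v_i <= 1.25:           # violation zone
        return -dev
    return -10.0                     # diverged zone
```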
5.3 PowerNet For Secondary Voltage Control

In this section, we introduce PowerNet, a new decentralized MARL algorithm devised to address the previously stated POMDP. The proposed PowerNet extends the independent actor-critic (IA2C) to deal with multiple cooperative agents by fostering collaboration between neighboring agents, enabled by the following three novel characteristics: 1) a learning-based differentiable communication protocol, fostering agent collaboration; 2) a unique spatial discount factor, mitigating partial observability and boosting learning stability; and 3) an action smoothing mechanism, offsetting the influence of system uncertainties and on-policy learning noise. The subsequent subsections delve into these features.

5.3.1 Differentiable Communication Protocol

Figure 5.2 Overview of the proposed communication protocol.

Within our decentralized MARL framework, each agent (DG) communicates with its neighbors to exchange crucial information, such as encoded states and policies. In contrast to traditional non-communicative MARL algorithms such as IA2C, FPrint, and ConseNet, which often exhibit slow convergence, our method leverages neighbor information for more efficient learning (see Figure 5.2). At each time step t, agent i updates its hidden state hi,t as:

hi,t = fi(hi,t−1, qo(es(oi,t)), qh(hNi,t−1)), (5.2)

where hi,t−1 is the encoded hidden state from the last time step; oi,t is agent i’s observation made at time t, i.e., its internal state; hNi,t−1 is the concatenated hidden state from its neighbors; es, qo, and qh are differentiable message encoding and extracting functions, implemented as one-layer fully connected layers with 64 neurons; and fi is the encoding function for the hidden states and communication information, implemented as a Long Short-Term Memory (LSTM [56]) network with a 64-neuron hidden layer to improve observability [55] and better utilize the information in the past hidden state hi,t−1.

To improve both scalability and robustness, we group the elements of the observation oi,t according to their physical units. These categorized sub-observations are individually encoded and then concatenated. For instance, the observation oi,t is divided into four groups, o1i,t ∪ o2i,t ∪ o3i,t ∪ o4i,t, according to their units, i.e., voltage, power, reference angle frame, and current. These regrouped sub-observations are encoded independently and then concatenated as es(oi,t) = cat(e1s(o1i,t), e2s(o2i,t), e3s(o3i,t), e4s(o4i,t)), where ejs, j = 1, 2, 3, 4, are one-layer fully connected encoding layers. The received communication message mi,t for the ith agent comprises the encoded hidden states of its neighbors, i.e., mi,t = hNi,t−1, with hNi,t−1 being the hidden states of agent i’s neighbors at time t − 1. Given that the hidden state ht−1 is neurally encoded, these messages offer increased security over transmitting raw states directly. The encoded observation es(oi,t) and the neighbors’ hidden states hNi,t−1 are extracted by qo and qh, respectively. We then concatenate the encoded message as õi,t = cat(qo(es(oi,t)), qh(hNi,t−1)); the concatenation operation is shown in [27] to reduce information loss and achieve better performance than the summation operation used in DIAL and CommNet. The hidden state is then processed through the LSTM fi to encode õi,t and hi,t−1. Following this, the hidden state hi,t obtained from (5.2) is employed in the actor and critic networks to generate stochastic actions and predict the value functions, respectively, i.e., πθi(·|hi,t) and Vωi(hi,t). Inspired by MADDPG [86], we also include the neighbors’ action information in the critic network Vwi(hi,t, aNi,t) to enhance training. In this chapter, we use a discrete action space, and the action is sampled from the last Softmax layer as ai,t ∼ πθi(·|hi,t). We adopt the centralized training and decentralized execution scheme [27, 29], where each agent has its own actor and critic networks, and their policies are updated independently instead of in a consensus manner [152], which may hurt the convergence speed.
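For concreteness, the following PyTorch-style sketch shows one way the hidden-state update of Eq. (5.2) could be wired together. The 64-neuron layer sizes follow the description above, while the class and argument names, the nonlinearities, and the assumption of a fixed neighbor count are illustrative rather than the exact PowerNet implementation.

```python
import torch
import torch.nn as nn

class CommEncoder(nn.Module):
    """Hidden-state update of one agent, following Eq. (5.2)."""
    def __init__(self, group_dims, n_neighbors, hidden=64):
        super().__init__()
        # e_s: one fully connected encoder per unit group (voltage, power, angle, current).
        self.e_s = nn.ModuleList([nn.Linear(d, hidden) for d in group_dims])
        self.q_o = nn.Linear(hidden * len(group_dims), hidden)  # extracts the encoded observation
        self.q_h = nn.Linear(hidden * n_neighbors, hidden)      # extracts neighbors' hidden states
        self.f_i = nn.LSTMCell(2 * hidden, hidden)              # fuses o_tilde with h_{i,t-1}

    def forward(self, obs_groups, neighbor_h, h_prev, c_prev):
        # obs_groups: list of (1, d_j) tensors; neighbor_h: (1, hidden * n_neighbors).
        enc = torch.cat([torch.relu(e(o)) for e, o in zip(self.e_s, obs_groups)], dim=-1)
        o_tilde = torch.cat([torch.relu(self.q_o(enc)), torch.relu(self.q_h(neighbor_h))], dim=-1)
        h, c = self.f_i(o_tilde, (h_prev, c_prev))  # h is h_{i,t}; c is the LSTM cell state
        return h, c   # h_{i,t} feeds the actor pi_theta(.|h) and the critic V_omega(h, a_Ni)
```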
5.3.2 Spatial Discount Factor

In cooperative MARL, the objective is to maximize the shared global reward, \(R_{g,t} = \sum_{i\in\nu} R_{i,t}\), where \(R_{i,t} = \sum_{k=0}^{T}\gamma^{k} r_{i,t+k}\) denotes the cumulative reward of agent i. For each agent, a natural choice of the reward is the instantaneous global reward, i.e., \(\sum_{i=1}^{N} r_{i,t}\). However, this scheme can lead to several issues. First, aggregating global rewards can cause large latency and increase communication overheads. Second, a single global reward leads to the credit assignment problem [127], which significantly impedes learning efficiency and limits the number of agents to a small size. Third, as an agent is typically only slightly impacted by agents distant from it, using the global reward to train the policy of each agent can lead to slow convergence. To address these issues, we employ a novel spatial discount factor. Specifically, each agent i, i = 1, ..., N, utilizes the following reward:

\[
R_{i,t} = \sum_{k=0}^{T} \gamma^{k} \sum_{j \in \nu} \alpha(d_{i,j})\, r_{j,t+k},
\tag{5.3}
\]

where α(di,j) ∈ [0, 1] is the spatial discount function, with di,j being the distance between agents i and j. The distance di,j can be a Euclidean distance characterizing the physical distance between the two agents or the distance between two vertices in a graph (i.e., the number of edges on the shortest connecting path). Note that the new reward defined in (5.3) characterizes a whole spectrum of reward correlation, from local greedy control (when α(di,j≠i) = 0 and α(di,i) = 1) to a global reward (when α ≡ 1). Note also that there are different choices for the spatial discount function. For example, one can choose

\[
\alpha(d_{i,j}) =
\begin{cases}
1, & \text{if } d_{i,j} \le D,\\
0, & \text{otherwise},
\end{cases}
\tag{5.4}
\]

where D is a distance threshold that defines an “effective distance” around the considered agent; the threshold D can incorporate factors such as communication speed and overhead. In this chapter, we adopt α(di,j) = α^{di,j}, with α ∈ (0, 1] being a constant scalar and a hyperparameter to be tuned. As a result, the policy gradient is computed as

\[
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}\,|\,o_{\nu_i,t})\, A^{\pi_{\theta_i}}_{i,t}\right],
\tag{5.5}
\]

where \(A^{\pi_{\theta_i}}_{i,t} = Q^{\pi_{\theta_i}}(s, a) - V^{\pi_{\theta_i}}(s)\) is the advantage function, \(Q^{\pi_{\theta_i}}(s, a) = \mathbb{E}_{\pi_{\theta_i}}(R_{i,t}\,|\,s_t = s, a_{\nu_i,t} = a_{\nu_i})\) is the state-action value function, and \(V^{\pi_{\theta_i}}(s) = V^{\pi_i}(s, a_{N_i})\) is the state value function. We optimize the parameters of the critic network Vωi as follows:

\[
\min_{\omega_i}\; \mathbb{E}_{D}\Big[\sum_{j\in\nu} \alpha^{d_{i,j}} r_{j,t} + V_{\omega_i'}(o_{\nu_i,t}) - V_{\omega_i}(o_{\nu_i,t})\Big]^{2}.
\tag{5.6}
\]

Minibatches of sampled trajectories are used to update the network parameters via Eq. (5.5) and Eq. (5.6) to reduce the variance [29].
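The spatially discounted return in Eq. (5.3) with α(d) = α^d is straightforward to compute from a trajectory of per-agent rewards. The sketch below illustrates this computation; the pairwise distances are assumed to be precomputed (e.g., by breadth-first search over the communication graph), and the finite-horizon backward accumulation is one of several equivalent implementations.

```python
import numpy as np

def spatial_returns(rewards, dist, alpha, gamma):
    """Spatially discounted returns of Eq. (5.3).

    rewards: (T, N) array of per-step rewards r_{j,t} for each agent.
    dist:    (N, N) array of graph (or Euclidean) distances d_{i,j}.
    alpha:   spatial discount in (0, 1]; alpha = 1 recovers the global reward,
             while a 0/1 indicator of dist would recover the threshold in Eq. (5.4).
    gamma:   temporal discount factor.
    Returns a (T, N) array containing R_{i,t} for every agent and time step.
    """
    T, N = rewards.shape
    weights = alpha ** dist                  # alpha(d_{i,j}) = alpha^{d_{i,j}}
    mixed = rewards @ weights.T              # sum_j alpha^{d_ij} r_{j,t}, for each agent i
    returns = np.zeros_like(mixed)
    running = np.zeros(N)
    for t in reversed(range(T)):             # backward accumulation of the gamma^k terms
        running = mixed[t] + gamma * running
        returns[t] = running
    return returns
```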
5.3.3 Action Smoothing

The proposed PowerNet operates as an on-policy MARL algorithm, necessitating the sampling of stochastic actions at every time step. This occasionally results in noisy action samples with significant fluctuations, even after the algorithm converges. These action fluctuations can cause undesirable system perturbations. To address this, we introduce an action smoothing scheme that refines the sampled action ai,t ∼ πθi(·|hi,t) before its execution as follows:

\[
a_{i,t} \leftarrow
\begin{cases}
a_{i,t}, & t = 1,\\
\rho\, a_{i,t-1} + (1-\rho)\, a_{i,t}, & t > 1,
\end{cases}
\tag{5.7}
\]

where ρ ∈ [0, 1] denotes the smoothing factor and ai,t−1 represents the action from the previous time step t − 1. This scheme is also recognized as an exponential moving average [47]. When ρ is chosen as 0, no action smoothing is performed, and the agent takes actions directly from the policy πθi. For ρ = 1, the new action ai,t remains unchanged, causing the agent to rely only on the preceding action. We define an action smoothing window Tw to buffer and utilize past actions for this smoothing process. The window size Tw is crucial: an overly large Tw might retain outdated actions, causing the agent to respond slowly to sudden changes, while an exceedingly small Tw can limit the smoothing efficacy. In this chapter, both ρ and Tw are treated as hyperparameters and are optimized through cross-validation.

The comprehensive PowerNet algorithm is detailed in Algorithm 5.1. The primary hyperparameters encompass the distance discount factor α, the action smoothing factor ρ, the (time-)discount factor γ, the actor network’s learning rate ηθ, the critic network’s learning rate ηw, the action sample window Tw, the total training iterations M, the epoch duration T, and the batch size N. In the algorithm, agents repeatedly interact with the environment across multiple epochs (Lines 2–29). During Lines 4–6, each agent gathers and transmits the communication message mi,t to its neighbors. Subsequently, agents consolidate, rearrange, and encode their observations (Line 8). The agents then update their hidden states and the actor networks for action sampling (Lines 9–10). In Lines 12–15, the state value is updated and actions are smoothed before deployment. After environment interaction (Line 16), agents transition to the subsequent state and earn an immediate reward (Line 17), which is then stored in an on-policy experience buffer (Line 18). In Lines 20–26, both actor and critic network parameters are updated, leveraging trajectories from the on-policy experience buffer after the end of each episode. An episode terminates with a DONE signal either upon its completion or if there is no power flow solution. Upon receiving this DONE signal, all agents are reset to their initial states and transit to a fresh epoch (Line 28).

Algorithm 5.1 PowerNet for Secondary Voltage Control
Parameter: α, ρ, γ, ηw, ηθ, T, M, Tw.  Output: θi, wi, i ∈ ν.
1:  initialize s0, h−1, t ← 0, D ← ∅
2:  for j = 0 to M − 1 do
3:    for t = 0 to T − 1 do
4:      for i ∈ ν do
5:        send mi,t = hi,t−1
6:      end for
7:      for i ∈ ν do
8:        get õi,t = cat(qo(es(oi,t)), qh(hNi,t−1))
9:        get hi,t = fi(hi,t−1, õi,t), πi,t ← πθi(·|hi,t)
10:       update ai,t ∼ πi,t
11:     end for
12:     for i ∈ ν do
13:       update vi,t ← Vwi(hi,t, aNi,t)
14:       update ai,t according to Eq. (5.7)
15:       execute ai,t
16:     end for
17:     simulate {si,t+1, ri,t}i∈ν
18:     update D ← {(oi,t, ai,t, ri,t, vi,t)}i∈ν
19:     update t ← t + 1, j ← j + 1
20:     if DONE then
21:       for i ∈ ν do
22:         update θi ← θi + ηθi ∇θi J(θi)
23:         update wi ← wi + ηwi ∇ωi V(ωi)
24:       end for
25:     end if
26:     initialize D ← ∅
27:   end for
28:   update s0, h−1, t ← 0
29: end for

5.4 Experiment, Results, And Discussion

In this section, we deploy the PowerNet model to two distinct microgrid systems: the IEEE 34-bus test feeder equipped with 6 distributed DGs (denoted as microgrid-6, depicted in Figure 5.3(a)), and a more expansive microgrid system hosting 20 DGs (referred to as microgrid-20, showcased in Figure 5.3(b)). Our simulation platform is constructed upon the line and load specifications detailed in [78] and [98], respectively. To obtain a representation that mirrors real-world power systems, we introduce random load variations throughout the microgrid, including perturbations of ±20% from the nominal values given in [98]. Additionally, we incorporate random disturbances, amounting to ±5% of the nominal value of each load, at every simulation step to emulate disruptions characteristic of actual power grids. Each DG operates with a sampling interval of 0.05 s and is equipped with the capability to communicate with adjacent DGs via local communication edges. The primary control at the foundational level follows [14]. All our experiments were executed on an Ubuntu 18.04 server powered by an AMD 9820X processor, accompanied by 64 GB of RAM.

Figure 5.3 Schematic representations of the two microgrid simulation setups: (a) microgrid-6; and (b) microgrid-20.

Employing cross-validation, the spatial discount factor α is determined to be 1.0 for microgrid-6 and 0.7 for microgrid-20. It is reasonable to have a larger spatial discount factor in the microgrid-6 system, as agents are strongly and tightly connected in that small microgrid. Conversely, a smaller α in the larger-scale microgrid-20 system is advantageous to reduce the effect of remote agents, but a too small α will cause the agents to ignore their effect on neighboring agents. For example, if we choose α = 0, the expected reward for agent i at time step t becomes \(R_i = \sum_{t=1}^{T}\gamma^{t} r_{i,t}\), which cannot ensure a maximum global reward, as each agent updates its own policy greedily.

For the action-smoothing coefficient ρ, we adopt a value of 0.5 for microgrid-6 and 0.4 for microgrid-20. Given the intricacy of microgrid-20 compared to microgrid-6, a smaller ρ is chosen, keeping in mind that an overly large ρ might translate into delayed responses to abrupt voltage anomalies. The sample window Tw is consistently set at 2 for both microgrid configurations.
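Tying Eq. (5.7) to these choices of ρ and Tw, a minimal sketch of the smoother applied to the sampled voltage set-points is shown below; exactly how the buffer of the last Tw actions is used beyond the immediately preceding one is an assumption here.

```python
from collections import deque

class ActionSmoother:
    """Exponential moving average of Eq. (5.7) over a window of Tw past actions."""
    def __init__(self, rho=0.5, window=2):
        self.rho = rho
        self.buffer = deque(maxlen=window)   # keeps the last Tw executed set-points

    def __call__(self, action):
        if self.buffer:                      # t > 1: blend with the previous action
            action = self.rho * self.buffer[-1] + (1.0 - self.rho) * action
        self.buffer.append(action)           # t = 1: the raw sampled action is kept
        return action
```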
5.4.1 Compared To Centralized Control

To benchmark PowerNet against a centralized RL approach, we utilize PPO [119], a leading centralized actor-critic RL method. In this approach, a single centralized RL agent makes the control decisions for all DGs using the global state as its input, and we compare its performance with PowerNet. As illustrated in Figure 5.4, PowerNet outperforms the centralized method both in terms of convergence rate and overall training performance across both microgrid systems. Notably, PPO exhibits poor convergence in the larger microgrid that incorporates 20 DGs. This is not surprising, as centralized methods are conventionally challenged by the curse of dimensionality and scalability concerns. Unlike the centralized approach, where the input dimension increases with network size, the input dimension for PowerNet remains constant. Consequently, the centralized control strategy rapidly loses feasibility for expansive networks, a phenomenon detailed in Table 5.1.

Figure 5.4 MARL training curves compared with PPO for (a) microgrid-6 and (b) microgrid-20 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

Table 5.1 Performance comparison between trained PowerNet and PPO.

                       Microgrid-6                              Microgrid-20
           Converge Time   Input Size   Performance   Converge Time   Input Size   Performance
PPO        5.25 h          54           0.21          not converged   180          0.44
PowerNet   0.90 h          9            0.22          7.28 h          9            0.60

5.4.2 Compared With SOTA Benchmarks

PowerNet is compared against several leading benchmark MARL algorithms (IA2C [87], FPrint [43], ConseNet [152], DIAL [42], MADDPG [86], and CommNet [126]), as well as a conventional model-based method [85], to validate its efficiency. Each model is trained across 10,000 episodes, with parameters set as γ = 0.99, minibatch size N = 20, actor learning rate ηθ = 5 × 10−4, and critic learning rate ηw = 2.5 × 10−4. To ensure an equitable comparison, unique random seeds are generated for each episode, and the same seed is uniformly applied across the different algorithms, ensuring a consistent training/testing environment. Agents are controlled every ∆T = 0.05 seconds, and one episode encompasses T = 20 steps.
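For reference, the shared training settings above can be captured in a small configuration block; the sketch below is a hypothetical illustration of how such a configuration might be organized, and the field names are not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Shared settings used for all compared algorithms in Section 5.4.2."""
    episodes: int = 10_000          # training episodes per model
    gamma: float = 0.99             # temporal discount factor
    batch_size: int = 20            # minibatch size N
    lr_actor: float = 5e-4          # actor learning rate
    lr_critic: float = 2.5e-4       # critic learning rate
    control_interval: float = 0.05  # seconds between control actions (Delta T)
    episode_steps: int = 20         # steps T per episode

def episode_seed(base_seed: int, episode: int) -> int:
    # The same per-episode seed is reused across algorithms for a fair comparison.
    return base_seed + episode
```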
Figure 5.5 presents the training curves of all the MARL algorithms on the microgrid-6 and microgrid-20 systems. To better visualize the training processes, we only show the first 2,000 and 4,000 training episodes for the microgrid-6 and microgrid-20 systems, respectively. It is evident that PowerNet, in both the microgrid-6 and microgrid-20 settings, exceeds all benchmarked MARL algorithms in convergence rate. This is due to the proposed communication protocol structure and a suitable spatial discount factor, which enhance sample efficiency and accelerate learning. Especially in the intricate microgrid-20 setting, PowerNet’s superior sample efficiency becomes obvious, marked by the fastest convergence rate and the highest average episode reward among the contenders (see Figure 5.5b).

Figure 5.5 MARL training curves for (a) microgrid-6 and (b) microgrid-20 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

To better illustrate the voltage control capabilities, we compare the voltage control outcomes of the primary-controller-only configuration and the configuration with the secondary controller, under substantial load fluctuations (25%). These comparisons are visually represented in Figure 5.6 (for microgrid-6) and Figure 5.7 (for microgrid-20). The primary control objective is to align all DG voltages with the reference benchmark of 1 pu. A black cross mark represents a voltage violation at the DG, whereas a black dot means the DG’s voltage is within the normal range. Here we use r to denote the average control reward of the last 5 control steps according to Eq. (5.1). As evidenced in Figure 5.6, all evaluated algorithms consistently bring voltages within the standard range in the compact microgrid-6 setup.

Figure 5.6 Execution performance on voltage control in microgrid-6.

Figure 5.7 presents a performance evaluation for the intricate microgrid-20 scenario, characterized by its dense network interconnections and complex agent interactions. Controlling the voltage in an isolated manner, as done by IA2C, is insufficient and results in poor outcomes. While MPC delivers commendable results, its dependency on a precise model makes it computationally taxing, requiring 19 times the inference duration of PowerNet. This is primarily due to its need to solve an online nonlinear optimization problem at every timestep, necessitating significant computational resources. Algorithms like FPrint, DIAL, and CommNet also exhibit voltage violations, which shows that simply including neighbors’ policies is not sufficient for a complicated test case such as microgrid-20. The shortcomings of DIAL and CommNet suggest that an oversimplified aggregation of communication data, coupled with the computation of immediate rewards based on undiscriminating global metrics, can hinder agents from establishing efficient communication protocols in expansive, cooperative microgrid settings. ConseNet also struggles to stabilize voltages, resulting in greater deviation compared to PowerNet.

Figure 5.7 Execution performance on voltage control in microgrid-20.

5.4.3 Robustness To Load Variations And Agent Mis-connections

After training for 10,000 episodes, we assess the established policies 20 times under varying load disturbances. Each agent retains the same random seed within each episode, while different test seeds are employed for distinct episodes. Beyond the leading MARL algorithms, PowerNet’s voltage control performance is also compared against a conventional model-based method [85]. Table 5.2 consolidates the results, showcasing the average episode return across the 20 test episodes for both the MARL algorithms and the conventional method (MPC). Evidently, PowerNet consistently outperforms the other MARL methodologies in both scenarios, regardless of the load fluctuations. Although the MPC method yields comparable results, it requires approximately 19.6 times the inference time of PowerNet. IA2C’s suboptimal performance aligns with the training trajectories illustrated in Figure 5.5. While other MARL benchmarks show reasonable outcomes, PowerNet surpasses them by a notable margin.
Table 5.2 Performance comparison between trained MARL policies and the conventional model-based method under different load disturbances. The reward is the average reward over 20 evaluation episodes. Best values are bolded.

Load Disturbance   Network        PowerNet   MPC [85]   IA2C    FPrint   ConseNet   CommNet   DIAL    MADDPG
10%                Microgrid-6    0.240      0.141      0.206   0.236    0.219      0.223     0.222   0.221
                   Microgrid-20   0.781      0.642      0.447   0.711    0.705      0.700     0.706   0.436
15%                Microgrid-6    0.238      0.139      0.206   0.236    0.221      0.220     0.220   0.220
                   Microgrid-20   0.771      0.632      0.474   0.702    0.692      0.690     0.697   0.422
25%                Microgrid-6    0.236      0.134      0.205   0.233    0.223      0.220     0.221   0.220
                   Microgrid-20   0.740      0.598      0.458   0.660    0.645      0.649     0.663   0.438

Furthermore, Figure 5.8 presents a comparative analysis of training curves between training from scratch and training from the pre-trained model when disconnecting one DG unit (microgrid-19) or adding a new DG unit (microgrid-21). The results suggest that the model trained on microgrid-20 can significantly expedite policy adaptation when a DG unit is either removed or integrated. Since our algorithm is decentralized and on-policy, we do not need to re-train the whole network from scratch; we can simply modify the DNN layers related to the topology change. For example, if a new DG unit is added, we just initialize and insert a new policy layer into the existing policy network and then link it to its neighboring layers; the other policy networks are loaded from the existing trained weights, and the whole system is then adapted. As evident from Figure 5.8, the pre-trained model not only accelerates adaptation but also delivers superior performance compared to starting the training process from scratch.

Figure 5.8 MARL training curves comparing training from scratch and adapting from a pre-trained microgrid-20 model for (a) microgrid-19 and (b) microgrid-21 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

5.4.4 Scalability To Larger Powergrid

In this subsection, we further investigate PowerNet’s scalability by extending the number of DG units to 40. As evident in Figure 5.9, PowerNet maintains its efficacy and continually outperforms the other SOTA MARL benchmarks. Notably, CommNet, which initially showed promise, fails to converge, performing even worse than IA2C. This might be attributed to its approach of averaging communication messages rather than encoding them.

Figure 5.9 MARL training curves comparison between trained MARL policies for the microgrid-40 system. The lines show the average reward per training episode, smoothed over the past 100 episodes.

Table 5.3 shows the inference time required for an agent to generate an action across the microgrid-6, microgrid-20, and microgrid-40 configurations. Even in a large-scale setting with 40 agents, PowerNet is remarkably efficient, necessitating a mere 35.8 ms to generate an action, a time frame that is practically feasible for meeting real-time needs. Given PowerNet’s decentralized design, the ratio of inference time to network size remains roughly constant, underscoring its impressive scalability.

Table 5.3 Inference time and time/size ratio of PowerNet for different microgrid networks.

                          Microgrid-6   Microgrid-20   Microgrid-40
Inference time            7.9 ms        16.8 ms        35.8 ms
Time/size (ms per DG)     1.32          0.84           0.90

5.5 Conclusions And Future Work

In this chapter, we modeled the secondary voltage control in inverter-based microgrid systems as a MARL problem. We introduced PowerNet, a novel on-policy and cooperative MARL algorithm, integrating a differentiable, learning-based communication protocol, a spatial discount factor, and an action smoothing mechanism.
Comprehensive experiments were conducted showing that the proposed PowerNet outperforms other state-of-the-art approaches in both convergence speed and voltage control proficiency. In our future work, we aim to develop a more realistic simulation environment by collecting data from real-world power systems. Furthermore, we plan to explore severe system disturbances and delve into schemes that guarantee safety during the learning process.

CHAPTER 6 CONCLUSION

In this thesis, we considered the problem of networked system control (NSC), aiming to explore safe, effective, and scalable MARL algorithms, with specific applications in connected autonomous vehicles and smart grids.

First, an efficient and scalable MARL framework was proposed for on-ramp merging in mixed traffic [21]. Furthermore, a novel priority-based safety supervisor was incorporated into the framework to significantly reduce the collision rate and expedite the training process. A gym-like simulation environment for on-ramp merging was also developed and open-sourced with three different traffic density levels.

Secondly, a fully decentralized MARL framework was introduced for Cooperative Adaptive Cruise Control (CACC) without the need for a central controller. To enhance communication efficiency, a quantization-based communication protocol was developed by applying random quantization to the communicated messages, ensuring that critical information is transmitted with minimized bandwidth usage. Furthermore, a gym-like simulation environment for cooperative adaptive cruise control was adopted and open-sourced with 7 state-of-the-art MARL benchmarks.

Lastly, an efficient MARL algorithm was proposed specifically for cooperative control within power grids, where each agent (i.e., each DG) learns a control policy based on a (sub-)global reward and encoded communication messages from its neighbors. Furthermore, a novel spatial discount factor was developed to mitigate the effect of remote agents, expedite the training process, and improve scalability. Moreover, a differentiable, learning-based communication protocol was developed to strengthen collaboration among neighboring agents. An open-source software package, called PGSim, providing a highly efficient, high-fidelity power grid simulation platform was also developed and released.

Despite the many challenges of applying MARL to NSC in real-world applications, such as scalability, safety, efficiency, and the lack of realistic simulators, this thesis has made considerable contributions toward addressing these issues. The development of scalable MARL algorithms, fully decentralized MARL frameworks, efficient communication protocols, and highly accurate simulators has effectively paved the way for further advancements in this field. However, the journey does not end here. These accomplishments set the stage for future research efforts to build upon these foundations, continually pushing the boundaries of what is possible in NSC using MARL. Future work may focus on improving the efficiency and robustness of these techniques and exploring their applicability to other complex networked systems.

BIBLIOGRAPHY

[1] Apollo open platform. https://apollo.auto/developer.html. Accessed: 2021-03-31.

[2] Future of driving. https://www.tesla.com/autopilot. Accessed: 2021-03-31.

[3] Ahmed MH Al-Jhayyish and Klaus Werner Schmidt. Feedforward strategies for cooperative adaptive cruise control in heterogeneous vehicle strings.
IEEE Transactions on Intelligent Transportation Systems, 19(1):113–122, 2017. [4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [5] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. [6] TJ Ayres, L Li, D Schleuning, and D Young. Preferred time-headway of highway drivers. In ITSC 2001. 2001 IEEE Intelligent Transportation Systems. Proceedings (Cat. No. 01TH8585), pages 826–829. IEEE, 2001. [7] Drew Bagnell and Andrew Ng. On local rewards and scaling distributed reinforcement learning. Advances in Neural Information Processing Systems, 18:91–98, 2005. [8] Masako Bando, Katsuya Hasebe, Akihiro Nakayama, Akihiro Shibata, and Yuki Sugiyama. Dynamical model of traffic congestion and numerical simulation. Phys- ical review E, 51(2):1035, 1995. [9] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449– 458. PMLR, 2017. [10] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019. [11] David Bevly, Xiaolong Cao, Mikhail Gordon, Guchan Ozbilgin, David Kari, Brently Nelson, Jonathan Woodruff, Matthew Barth, Chase Murray, Arda Kurt, et al. Lane change and merge maneuvers for connected and automated vehicles: A survey. IEEE Transactions on Intelligent Vehicles, 1(1):105–120, 2016. [12] Sushrut Bhalla, Sriram Ganapathi Subramanian, and Mark Crowley. Deep multi agent 84 reinforcement learning for autonomous driving. In Canadian Conference on Artificial Intelligence, pages 67–78. Springer, 2020. [13] Ali Bidram, Ali Davoudi, and Frank L Lewis. A multiobjective distributed control framework for islanded ac microgrids. IEEE Transactions on industrial informatics, 10(3):1785–1798, 2014. [14] Ali Bidram, Ali Davoudi, Frank L Lewis, and Josep M Guerrero. Distributed cooper- ative secondary control of microgrids using feedback linearization. IEEE Transactions on Power Systems, 28(3):3462–3470, 2013. [15] Ali Bidram, Ali Davoudi, Frank L Lewis, and Zhihua Qu. Secondary control of micro- grids based on distributed cooperative control of multi-agent systems. IET Generation, Transmission & Distribution, 7(8):822–831, 2013. [16] Maxime Bouton, Alireza Nakhaei, Kikuo Fujimura, and Mykel J Kochenderfer. Cooperation-aware reinforcement learning for merging in dense traffic. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 3441–3447. IEEE, 2019. [17] Di Cao, Weihao Hu, Junbo Zhao, Qi Huang, Zhe Chen, and Frede Blaabjerg. A multi-agent deep reinforcement learning based voltage regulation using coordinated pv inverters. IEEE Transactions on Power Systems, 35(5):4120–4123, 2020. [18] Wenjing Cao, Masakazu Mukai, and Taketoshi Kawabe. Two-dimensional merging path generation using model predictive control. Artificial Life and Robotics, 17(3- 4):350–356, 2013. [19] Wenjing Cao, Masakazu Mukai, Taketoshi Kawabe, Hikaru Nishira, and Noriaki Fujiki. 
Cooperative vehicle path generation during merging using model predictive control with real-time optimization. Control Engineering Practice, 34:98–105, 2015. [20] Dong Chen, Kaian Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. Powernet: Multi-agent deep reinforcement learning for scalable powergrid control. IEEE Transactions on Power Systems, 37(2):1007–1017, 2021. [21] Dong Chen, Mohammad R Hajidavalloo, Zhaojian Li, Kaian Chen, Yongqiang Wang, Longsheng Jiang, and Yue Wang. Deep multi-agent reinforcement learning for highway on-ramp merging in mixed traffic. IEEE Transactions on Intelligent Transportation Systems, 2023. [22] Dong Chen, Longsheng Jiang, Yue Wang, and Zhaojian Li. Autonomous driving using safe reinforcement learning by incorporating a regret-based human lane-changing de- cision model. In 2020 American Control Conference (ACC), pages 4355–4361. IEEE, 2020. 85 [23] Dong Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. Pow- ernet: Multi-agent deep reinforcement learning for scalable powergrid control. arXiv preprint arXiv:2011.12354, 2020. [24] Dong Chen, Kaixiang Zhang, Yongqiang Wang, Xunyuan Yin, and Zhaojian Li. Communication-efficient decentralized multi-agent reinforcement learning for cooper- ative adaptive cruise control, 2023. [25] Chien-Ming Chou, Chen-Yuan Li, Wei-Min Chien, and Kun-chan Lan. A feasibility study on vehicle-to-infrastructure communication: Wifi vs. wimax. In 2009 tenth in- ternational conference on mobile data management: systems, services and middleware, pages 397–398. IEEE, 2009. [26] Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent reinforcement learning for networked system control. In International Conference on Learning Representa- tions, 2020. [27] Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent reinforcement learning for networked system control. arXiv preprint arXiv:2004.01339, 2020. [28] Tianshu Chu and Uroš Kalabić. Model-based deep reinforcement learning for cacc in mixed-autonomy vehicle platoon. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4079–4084. IEEE, 2019. [29] Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforce- ment learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3):1086–1095, 2019. [30] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional rein- forcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [31] Fengying Dang, Dong Chen, Jun Chen, and Zhaojian Li. Event-triggered model predic- tive control with deep reinforcement learning for autonomous driving. arXiv preprint arXiv:2208.10302, 2022. [32] Charles Desjardins and Brahim Chaib-Draa. Cooperative adaptive cruise control: A reinforcement learning approach. IEEE Transactions on intelligent transportation sys- tems, 12(4):1248–1260, 2011. [33] Ruisheng Diao, Zhiwei Wang, Di Shi, Qianyun Chang, Jiajun Duan, and Xiaohu Zhang. Autonomous voltage control for grid operation using deep reinforcement learn- ing. In 2019 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5. IEEE, 2019. 86 [34] Lei Ding, Qing-Long Han, and Xian-Ming Zhang. Distributed secondary control for active power sharing and frequency regulation in islanded microgrids using an event- triggered communication mechanism. IEEE Transactions on Industrial Informatics, 15(7):3910–3922, 2018. [35] Vinayak V Dixit, Sai Chand, and Divya J Nair. 
Autonomous vehicles: disengagements, accidents and reaction times. PLoS one, 11(12):e0168054, 2016. [36] Jiqian Dong, Sikai Chen, Paul Young Joun Ha, Yujie Li, and Samuel Labi. A drl-based multiagent cooperative control framework for cav networks: a graphic convolution q network. arXiv preprint arXiv:2010.05437, 2020. [37] Jiajun Duan, Di Shi, Ruisheng Diao, Haifeng Li, Zhiwei Wang, Bei Zhang, Desong Bian, and Zhehan Yi. Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Transactions on Power Systems, 35(1):814–817, 2019. [38] Zine el abidine Kherroubi, Samir Aknine, and Rebiha Bacha. Novel decision-making strategy for connected and autonomous vehicles in highway on-ramp merging. IEEE Transactions on Intelligent Transportation Systems, 23(8):12490–12502, 2021. [39] Ingy ElSayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu, and Lu Feng. Safe multi-agent reinforcement learning via shielding. arXiv preprint arXiv:2101.11196, 2021. [40] Francesca M Favarò, Nazanin Nader, Sky O Eurich, Michelle Tripp, and Naresh Varadaraju. Examining accident reports involving autonomous vehicles in california. PLoS one, 12(9):e0184952, 2017. [41] Shuo Feng, Yi Zhang, Shengbo Eben Li, Zhong Cao, Henry X Liu, and Li Li. String stability for vehicular platoon control: Definitions and analysis methods. Annual Re- views in Control, 47:81–97, 2019. [42] Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29, 2016. [43] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International conference on machine learning, pages 1146–1155. PMLR, 2017. [44] Weinan Gao, Zhong-Ping Jiang, and Kaan Ozbay. Data-driven adaptive optimal con- trol of connected vehicles. IEEE Transactions on Intelligent Transportation Systems, 18(5):1122–1133, 2016. 87 [45] Yuanqi Gao, Wei Wang, and Nanpeng Yu. Consensus multi-agent reinforcement learn- ing for volt-var control in power distribution networks. IEEE Transactions on Smart Grid, 2021. [46] Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015. [47] Everette S Gardner Jr. Exponential smoothing: The state of the art. Journal of forecasting, 4(1):1–28, 1985. [48] Mevludin Glavic. (deep) reinforcement learning for electric power system control and related problems: A short review and perspectives. Annual Reviews in Control, 48:22– 35, 2019. [49] Kailash Gogineni, Peng Wei, Tian Lan, and Guru Venkataramani. Scalability bottle- necks in multi-agent reinforcement learning systems. arXiv preprint arXiv:2302.05007, 2023. [50] Siyuan Gong, Anye Zhou, and Srinivas Peeta. Cooperative adaptive cruise control for a platoon of connected and autonomous vehicles considering dynamic information flow topology. Transportation research record, 2673(10):185–198, 2019. [51] Josep M Guerrero, Juan C Vasquez, José Matas, Luis García De Vicuña, and Miguel Castilla. Hierarchical control of droop-controlled ac and dc microgrids—a gen- eral approach toward standardization. IEEE Transactions on industrial electronics, 58(1):158–172, 2010. [52] Fanghong Guo, Changyun Wen, Jianfeng Mao, and Yong-Duan Song. 
Distributed secondary voltage and frequency restoration control of droop-controlled inverter-based microgrids. IEEE Transactions on industrial Electronics, 62(7):4355–4364, 2014. [53] Paul Young Joun Ha, Sikai Chen, Jiqian Dong, Runjia Du, Yujie Li, and Samuel Labi. Leveraging the capabilities of connected and autonomous vehicles and multi- agent reinforcement learning to mitigate highway bottleneck congestion. arXiv preprint arXiv:2010.05436, 2020. [54] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. [55] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observ- able mdps. arXiv preprint arXiv:1507.06527, 2015. [56] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural compu- tation, 9(8):1735–1780, 1997. 88 [57] Yi Hou, Praveen Edara, and Carlos Sun. Modeling mandatory lane changing using bayes classifier and decision trees. IEEE Transactions on Intelligent Transportation Systems, 15(2):647–655, 2013. [58] John Hourdakis and Panos G Michalopoulos. Evaluation of ramp control effectiveness in two twin cities freeways. Transportation Research Record, 1811(1):21–29, 2002. [59] Shengyi Huang and Santiago Ontañón. A closer look at invalid action masking in policy gradient algorithms. arXiv preprint arXiv:2006.14171, 2020. [60] Leslie N Jacobson, Kim C Henry, and Omar Mehyar. Real-time metering algorithm for centralized control. Number 1232. 1989. [61] Dongyao Jia and Dong Ngoduy. Enhanced cooperative car-following traffic model with the combination of v2v and v2i communication. Transportation Research Part B: Methodological, 90:172–191, 2016. [62] Liming Jiang, Yuanchang Xie, Nicholas G Evans, Xiao Wen, Tienan Li, and Dan- jue Chen. Reinforcement learning based cooperative longitudinal control for reducing traffic oscillations and improving platoon stability. Transportation Research Part C: Emerging Technologies, 141:103744, 2022. [63] I Ge Jin and Gábor Orosz. Dynamics of connected vehicle systems with delayed acceleration feedback. Transportation Research Part C: Emerging Technologies, 46:46– 64, 2014. [64] Meha Kaushik, K Madhava Krishna, et al. Parameter sharing reinforcement learning architecture for multi agent driving behaviors. arXiv preprint arXiv:1811.07214, 2018. [65] Meha Kaushik, Vignesh Prasad, K Madhava Krishna, and Balaraman Ravindran. Overtaking maneuvers in simulated highway driving using deep reinforcement learning. In 2018 ieee intelligent vehicles symposium (iv), pages 1885–1890. IEEE, 2018. [66] Arne Kesting, Martin Treiber, and Dirk Helbing. General lane-changing model mobil for car-following models. Transportation Research Record, 1999(1):86–94, 2007. [67] Zulqarnain H Khattak, Brian L Smith, Hyungjun Park, and Michael D Fontaine. Cooperative lane control application for fully connected and automated vehicles at multilane freeways. Transportation research part C: emerging technologies, 111:294– 317, 2020. [68] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013. [69] Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Reinforcement learning 89 in robotics: Applications and real-world challenges. Robotics, 2(3):122–148, 2013. [70] Jingang Lai, Xiaoqing Lu, Xinghuo Yu, Wei Yao, Jinyu Wen, and Shijie Cheng. 