DEEP MULTI-AGENT REINFORCEMENT LEARNING FOR EFFICIENT AND SCALABLE NETWORKED SYSTEM CONTROL By Dong Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering—Doctor of Philosophy 2023 ABSTRACT Recently, intelligent systems, such as robots, connected automated vehicles, and smart grids have emerged as promising tools to enhance efficiency and sustainability across di- verse areas, including intelligent transportation, industrial automation, and energy manage- ment. These systems can connect with local communication networks, forming connected systems, and showing high scalability and robustness. Yet, controlling these connected sys- tems presents great challenges, mainly due to the high dimensionality of their state/action spaces and the complex interactions among their components. Traditional control methods often struggle in the real-time management of these systems, given their inherent complexity and uncertainties. Fortunately, reinforcement learning (RL), especially multi-agent reinforce- ment learning (MARL), offers an effective solution by leveraging adaptive online capabilities and their proficiency in solving intricate problems. In this thesis, three unique deep MARL algorithms are explored for safe, efficient, and scalable networked system control (NSC). The efficacy of these algorithms is validated in several practical and real-world applications, such as power grids and connected automated vehicles (CAVs). In the first algorithm, a safe, scalable, and efficient MARL framework is introduced specifically for on-ramp merging in mixed-traffic scenarios, where both human-driven vehicles and connected automated vehicles exist. By leveraging parameter sharing and local reward design, the framework fosters cooperation among agents without compromising on scalability. To mitigate the collision rates and expedite the training process, an innovative priority-based safety supervisor is developed and incorporated into the MARL framework. In addition, a gym-like simulation environment is developed and open-sourced, offering three traffic density levels. Extensive experimental results show that our proposed MARL model consistently surpasses several state-of-the-art (SOTA) benchmarks, showing its significant promise for managing CAVs in the specified on-ramp merging scenarios. In our second exploration, we propose a fully-decentralized MARL framework for Co- operative Adaptive Cruise Control (CACC). This approach differs substantially from the traditional centralized training and decentralized execution (CTDE) method. Within this framework, each agent acts based on its unique observations and rewards, eliminating the need for a central controller. In addition, we further introduce a quantization-based commu- nication protocol to enhance communication efficiency and reduce bandwidth consumption by employing randomized rounding to quantize each transmitted data piece, while only sending the non-zero components after quantization. Through the validation of two dis- tinct CACC scenarios, our method has proven to outperform SOTA models in both control precision and communication efficiency. In our third exploration, we present an efficient MARL algorithm specifically for cooper- ative control within power grids. In particular, we focus on the decentralized inverter-based secondary voltage control problem by formulating it as a cooperative MARL problem. 
Then, we introduce a novel on-policy MARL algorithm, named PowerNet, where each agent (i.e., each distributed generator (DG)) learns a control policy based on (sub-)global reward, as well as encoded communication messages from its neighbors. Additionally, a novel spatial discount factor is introduced to mitigate the effect of remote agents, expedite the training process and improve scalability. Moreover, a differentiable, learning-based communication protocol is employed to enhance collaboration among adjacent agents. To support com- prehensive training and assessment, we introduce PGSim, an open-source and cutting-edge power grid simulation platform. The evaluation across two microgrid configurations shows that PowerNet not only outperforms conventional model-based control techniques but also several SOTA MARL strategies. Copyright by DONG CHEN 2023 ACKNOWLEDGEMENTS First and foremost, I would like to extend my heartfelt gratitude to my advisor, Dr. Zhaojian Li, for his invaluable advice, unwavering encouragement, inspiring guidance, and continual support throughout my research journey and academic career at Michigan State University. He is always ready to engage in thoughtful discussions about the grand scheme of our research while providing constructive feedback on intricate technical details. Despite his own remarkable creativity and productivity, he generously allowed me the freedom to explore various problems, even those not directly aligned with his own research interests. I also want to express my sincere thanks to Drs. Vaibhav Srivastava, Shaunak D. Bopardikar, and Hamidreza Modares for serving on my thesis committee. I am deeply thankful for the opportunity to collaborate with an exceptional group of colleagues, faculty, and researchers throughout my Ph.D. program. The collaborative work presented in this dissertation would not have been possible without the valuable contributions of Dr. Kaixiang Zhang, Dr. Kaixiang Lin, Mohammad Hajidavalloo, Dr. Kaian Chen, Dr. Yue Wang, Dr. Longsheng Jiang, Dr. Tianshu Chu, Dr. Feng Qiu, Dr. Rui Yao, Dr. Yongqiang Wang, and Dr. Zhaojian Li. Beyond the scope of this thesis, I had the privilege of working alongside other outstanding researchers such as Yu Zheng, Pengyu Chu, Ramin Vahidimoghaddam, Jiajia Li, Lingxuan Hao, Dr. Fengying Dang, Dr. Yanbo Huang, Dr. Yuzhen Lu, Dr. Yichen Zhang, Dr. Shunbo Lei, Dr. Xiaobo Tan, Xinda Qi, and Qianqian Liu. Their insights have enriched my learning experience, and I am grateful for their input. My internships also presented me with wonderful learning and networking opportunities, for which I am grateful. I wish to express my appreciation to Dr. Pei Zheng, Anqi Luo, Tor Fredericks, Dr. Feng Qiu, Dr. Rui Yao, Dr. Yichen Zhang, and Dr. Bo Chen. Special thanks go to Drs. Pei Zheng and Feng Qiu for hosting my internships at T-Mobile USA in 2020 and Argonne National Lab in 2022 and 2023, respectively. Last but certainly not least, owe a profound debt of gratitude to my family, including my cat Huihui, and my friends for their unwavering support and unconditional love. I would v also like to remember and honor my godfather, Mr. Wang, who supported and loved me like a father. His passing last year brought great sorrow. I am forever grateful for his impact on my life and hope that he is at peace and know that his spirit continues to inspire and motivate me. vi TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 PRELIMINARIES OF RL AND MARL . . . . . . . . . . 
. . . . . . 5 CHAPTER 3 DEEP MARL FOR HIGHWAY ON-RAMP MERGING IN MIXED TRAFFIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 CHAPTER 4 CACC WITH FULLY DECENTRALIZED AND COMMUNICATION EFFICIENT MARL . . . . . . . . . . . . . . . 42 CHAPTER 5 DEEP MARL FOR SECONDARY VOLTAGE CONTROL . . . . . . 62 CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 vii CHAPTER 1 INTRODUCTION In this chapter, we first introduce the motivation of this thesis and the challenges for applying MARL for NSC. Then, we illustrate the specific research objectives of this thesis and provide a summary of the key contributions. 1.1 Motivation Recently, intelligent systems, such as robots, connected automated vehicles, and smart grids have emerged as promising tools to enhance efficiency and sustainability across diverse areas, including intelligent transportation, industrial automation, and energy management [142]. These systems can connect with local communication networks, forming connected systems, and showing high scalability and robustness [68]. Yet, controlling these connected systems presents great challenges, mainly due to the high dimensionality of their state/action spaces and the complex interactions among their components. Traditional control methods often struggle in the real-time management of these systems, given their inherent complexity and uncertainties [69]. Fortunately, reinforcement learning (RL) [96], especially multi-agent reinforcement learning (MARL) [129, 26], offers an effective solution by leveraging adaptive online capabilities and their proficiency in solving intricate problems. 1.2 Challenges In MARL For NSC Implementing MARL for networked system control (NSC) presents significant challenges including scalability due to the exponential growth of the state-action space with a large volume of agents [49], ensuring safety as interactions between autonomous agents can lead to unforeseen and potentially harmful behavior [39], improving communication efficiency which is essential for cooperative decision-making [26]. In addition, the lack of realistic simulators for MARL also makes it difficult to accurately predict and analyze system behavior in complex, real-world scenarios [103]. These challenges highlight the intricate trade-offs between scalability, safety, communication efficiency, and realistic simulation in 1 implementing MARL for NSC. 1.3 Research Goals The principal objective of this thesis is to develop safe, efficient, and scalable MARL algorithms for networked system control (NSC) with specific applications for connected and automated vehicles (CAVs) and smart grids. The detailed objectives are as follows: 1. The first objective is to create scalable MARL algorithms for large-scale NSC. This goal will be achieved by developing a parameter-sharing-based MARL algorithm, where agents independently make decisions based on their own local observations, yet share the same set of parameters. This approach is designed to facilitate scalability in large- scale networked environments, such as CAVs. 2. The second objective is to investigate the development of fully decentralized MARL algorithms for NSC, eliminating the need for a central controller during the train- ing process. Furthermore, a cooperative learning scheme will be proposed, aiming to enhance cooperation and coordination among the agents. 3. 
The third objective is to develop efficient communication protocols to promote com- munication efficiency among agents. The designed communication protocol should maintain the control performance of the MARL algorithms while enhancing efficiency. 4. The final objective is to implement highly realistic simulators for developing and testing MARL algorithms. The simulators should accurately model the characteristics of real- world objectives, providing robust platforms for the development and refinement of MARL techniques. 1.4 Contributions The principal contributions of this thesis are outlined as follows: 1. A safe, efficient, and scalable MARL framework is developed for on-ramp merging in mixed traffic [21]. In addition, a novel priority-based safety supervisor is incorporated into the MARL framework to significantly reduce the collision rate and expedite the training process. 2 2. A gym-like simulation environment for on-ramp merging is developed and open-sourced with three different traffic density levels (https://github.com/DongChen06/MARL_C AVs), including five SOTA MARL algorithms. 3. A fully-decentralized MARL framework is introduced for Cooperative Adaptive Cruise Control (CACC) without the need for a central controller. In addition, a quantization- based communication protocol is developed to enhance communication efficiency by applying random quantization to the messages being communicated and ensuring that critical information is transmitted with minimized bandwidth usage. 4. The adopted gym-like simulation environment for the specified CACC scenarios is open-sourced with 7 state-of-the-art MARL benchmarks (https://github.com/DongC hen06/CACC_MARL). 5. An efficient MARL algorithm is developed specifically for cooperative secondary volt- age control within power grids, where each agent (i.e., each DG) formulates a control policy based on (sub-)global reward and coded messages from its adjacent neighbors. Additionally, we introduce an innovative spatial discount factor, designed to minimize interference from distant agents, accelerate training, and bolster scalability. Moreover, a differentiable, learning-based communication protocol is developed to strengthen co- ordination among neighboring agents. 6. An open-source software, called PGSim, that offers a highly efficient and high-fidelity simulation platform for power grids is developed and open-sourced (https://github.c om/DongChen06/PGSIM). 1.5 Organization Of Thesis The subsequent chapters of this thesis are organized as follows: Chapter 2: Background This chapter provides a comprehensive introduction to the fundamentals of Reinforce- ment Learning (RL) and delves into several SOTA Multi-Agent Reinforcement Learning (MARL) algorithms, setting the stage for a deeper understanding and contextualization of 3 our proposed research. Chapter 3: Deep MARL for Highway On-ramp Merging in Mixed Traffic [21] In this chapter, we propose an efficient and scalable MARL framework for on-ramp merg- ing in mixed traffic, leveraging parameter sharing and local rewards to encourage cooperation between agents, while still maintaining impressive scalability. In addition, a novel priority- based safety supervisor is introduced to mitigate the collision rates. Chapter 4: CACC with Fully Decentralized and Communication-efficient MARL [24] In this chapter, we propose a fully-decentralized MARL framework for Cooperative Adap- tive Cruise Control (CACC). 
Furthermore, a quantization-based communication protocol is introduced to enhance communication efficiency. Chapter 5: Deep MARL for Secondary Voltage Control [20] In this chapter, we propose an efficient MARL algorithm for the decentralized inverter- based secondary voltage control problem. A novel on-policy MARL algorithm, named Pow- erNet, is introduced where each agent (i.e., each DG) learns a control policy based on (sub- )global reward, as well as encoded communication messages from its neighbors. Chapter 6: Conclusion This chapter concludes the thesis and discusses its limitations. 4 CHAPTER 2 PRELIMINARIES OF RL AND MARL In this chapter, we provide a comprehensive introduction to the fundamentals of Rein- forcement Learning (RL) and delve into several state-of-the-art Multi-Agent Reinforcement Learning (MARL) algorithms, providing the necessary context to properly position and un- derstand our proposed work. 2.1 Reinforcement Learning (RL) Reinforcement learning (RL), which is often mathematically formulated as a Markov Decision Process (MDP), has emerged as a promising data-driven method for sequential decision-making [26, 31]. The recent advances in Deep Neural Networks (DNNs) have further enhanced the capabilities of RL in handling intricate tasks. Notable examples of these algorithms include the deep Q-network (DQN [97]), deep deterministic policy gradient (DDPG [79]), and advantage actor-critic (A2C [95]). For instance, AlphaGo, a computer program based on DQN, made history as the first of its kind to defeat a professional hu- man player in Go, even going on to beat a world champion in the game [97]. Moreover, in [73], researchers successfully employ the Trust Region Policy Optimization (TRPO, [117]) to navigate Quadrupedal robots across arduous terrains, including challenging surfaces like mud and snow, and through dynamic footholds. In an RL framework (see Figure 2.1), the learner (i.e., agent) navigates the environment by adopting a trial-and-error methodology. The agent makes decisions, performs the action to the environment, and in return, receives a reward signal accompanied by a new state. This reward, provided by the environment, provides feedback to the agent, indicating whether the impact of its actions is positive or negative. Mathematically, we can formulate the RL problem as a Markov Decision Process(MDP). The MDP M = (S, A, P, R) is defined as follows: 1. State space S: a set of states that includes the comprehensive description of the en- vironment provided by the environment itself, which outlines the position or other 5 conditions of the agents at a specific time step t. An observation o is a partial de- scription of a state and may not include all information. If an agent can access the whole state of the environment, then the environment is fully observed. However, if the agent is only able to acquire a partial observation, the environment is considered as partially observed. 2. Action space A: a set of all valid actions within a specific environment. Some environ- ments, like Atari and Go, have discrete action spaces, where the agent has a finite number of available moves. In contrast, environments like those controlling a robot in a physical world possess continuous action spaces. In continuous action spaces, actions take the form of real-valued vectors. 3. Transition Probability Pss′ (St+1 = s′ |St = s): the transition probability describes the likelihood of an agent moving from one state to another. 4. 
Reward R(st, at, st+1): the reward is returned by the environment once the action at is executed at state st. The value of the reward signal can be positive or negative, contingent on the actions of the agent.

Figure 2.1 Illustration of reinforcement learning (RL).

As shown in Figure 2.1, at time step t, the agent observes the state st ∈ S ⊆ R^n from the environment and executes an action at ∈ A ⊆ R^m. The RL agent selects actions guided by a learned policy π(at|st). In the context of deep RL, the policy is frequently parameterized by a function approximator, denoted as πθ(·|st), where θ represents the learnable parameters of the approximator, such as a DNN or a Q-table. The environment then evolves to the new state st+1 based on the transition dynamics p(·|st, at) and returns an immediate reward rt = r(st, at, st+1) to the agent. The objective of an RL agent is to learn an optimal policy π*: S → A that maps states to actions and maximizes the accumulated reward:

R_t = \sum_{k=0}^{T} \gamma^k r_{t+k},    (2.1)

where rt+k is the reward at time step t + k, and γ ∈ (0, 1] and T represent the discount factor and the episode length, respectively. The state-action value function is defined as:

Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[ R_t \mid s_t = s, a_t = a \right],    (2.2)

where τ = (s0, a0, s1, a1, ..., sT, aT) represents a trajectory containing a sequence of states and actions. The state-action value function represents the expected return when starting from state st, taking an immediate action at, and following policy π afterward. The optimal Q-function determines the optimal greedy policy π*(at|st) and is defined as:

Q^*(s_t, a_t) = \max_{\pi} Q^\pi(s_t, a_t).    (2.3)

The state value function V^π(st) represents the expected return when starting from st and immediately following policy π:

V^\pi(s_t) = \mathbb{E}_{\tau \sim \pi}\left[ R_t \mid s_t = s \right].    (2.4)

The relationship between the action-value function Q^π(st, at) and the state-value function V^π(st) can be expressed as:

V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s, a) \right].    (2.5)

In the subsequent subsections, we delve into three prevalent RL algorithms: deep Q-learning, policy gradient, and the actor-critic network.

2.1.1 Deep Q-Learning

In Q-learning, the Q-function, denoted as Qθ, is parameterized by a set of parameters θ. This can be achieved with function approximators ranging from Q-tables [141] and linear regression (LR) [128] to the more powerful deep neural networks (DNNs) [97]. The temporal difference, defined as (T Qθ⁻ − Qθ)(st, at), provides the basis for updating θ, where T and Qθ⁻ denote the dynamic programming (DP) operator and a recently frozen model θ⁻ [28, 122], respectively. To reduce the variance in estimating Q-values and improve exploration, techniques such as ϵ-greedy exploration and experience replay are commonly integrated into deep Q-learning [128]. The optimal action a*(s) is obtained from Q*(st = s, at = a) as:

a^*(s) = \arg\max_{a} Q^*(s_t = s, a_t = a).    (2.6)

Widely recognized deep Q-learning algorithms include DQN [96], DDQN [133], C51 [9], HER [5], and HR-DQN [30].
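To make these definitions concrete, the following minimal tabular Q-learning sketch shows the discounted return of Eq. (2.1) being bootstrapped through a temporal-difference target, together with the ϵ-greedy exploration and the greedy action selection of Eq. (2.6). The environment interface follows the classic gym convention; the env handle and the state/action sizes are placeholders rather than any specific simulator used in this thesis.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative only)."""
    Q = np.zeros((n_states, n_actions))              # Q_theta stored as a Q-table
    for _ in range(episodes):
        s = env.reset()                              # classic gym API assumed
        done = False
        while not done:
            if np.random.rand() < epsilon:           # epsilon-greedy exploration
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))             # greedy action a*(s), Eq. (2.6)
            s_next, r, done, _ = env.step(a)
            # TD target r + gamma * max_a' Q(s', a'): a bootstrapped estimate of Eq. (2.1)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])    # temporal-difference update
            s = s_next
    return Q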
2.1.2 Policy Gradient

Unlike Q-learning, the policy gradient method parameterizes the policy πθ directly with a set of parameters θ. The objective of updating θ is to increase both the likelihood of the actions taken and the cumulative reward. This can be achieved with the loss function:

\nabla_\theta L(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R_t \right].    (2.7)

Following this, the policy network parameters are updated via stochastic gradient ascent:

\theta_{k+1} = \theta_k + \alpha \nabla_\theta L(\pi_\theta).    (2.8)

Compared to Q-learning, the policy gradient is robust to non-stationary transitions within each trajectory; however, it tends to exhibit high variance [29]. Renowned algorithms that utilize policy gradients include TRPO [117], PPO [119], and DDPG [79].

2.1.3 Actor-critic Network

To mitigate the high variance associated with the sampled return in the policy gradient method, actor-critic algorithms such as A2C [95] adopt the advantage function to refine the policy gradient, leveraging both policy (actor) and value (critic) functions. The advantage function is represented as:

A^\pi(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V_w(s_t).    (2.9)

Here, the parameters θ are updated via the policy loss function:

\nabla_\theta L = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) A_t \right],    (2.10)

while the value function is updated with:

L = \min_{w} \mathbb{E}_{D}\left[ \left( R_t + \gamma V_{w^-}(s_{t+1}) - V_w(s_t) \right)^2 \right],    (2.11)

where D represents the experience replay buffer, which aggregates past experiences. This buffer works alongside parameters derived from previous iterations, typically used in a target network [21]. However, despite great advancements, single-agent RL often struggles with scalability, especially in real-world control scenarios with multiple agents. These challenges come from the inherent non-stationarity and the partial observability intrinsic to such systems [26].

2.2 Multi-agent Reinforcement Learning (MARL)

Multi-agent systems, often found in day-to-day applications such as drone delivery, smart grids, autonomous driving, and multi-robot assembly, consist of numerous agents interacting within a shared environment. Generally, these systems can be delineated into three distinct categories based on their team objectives:

1. Cooperative: In a cooperative setting, all agents collaboratively work towards maximizing a shared team reward. In such environments, coordination among agents becomes vital to ensure effective collaboration.

2. Competitive: In competitive settings, agents operate independently and are primarily self-interested. They frequently have independent objectives and endeavor to maximize their individual rewards.

3. Mixed: In a mixed environment, agents may be self-interested with different (but not opposing) objectives.

Real-world applications such as autonomous driving, smart grids, and traffic light control often require agents to work cooperatively, aiming for a shared goal. The agents can work independently or communicate with each other through local communication channels (see Figure 2.2). Given the importance of collaboration in such scenarios, this thesis primarily focuses on algorithms designed for cooperative settings. In the following subsections, we introduce some SOTA MARL algorithms.

Figure 2.2 Illustration of multi-agent reinforcement learning (MARL): left, framework without communication; right, framework with communication.

2.2.1 Independent MARL

To address the scalability issues common in single-agent RL, independent multi-agent RL (MARL) has been introduced. In this approach, each agent learns and adapts its unique policy based on its local observations and rewards [29].
A prominent and simplistic method- ology in this domain is Independent Q-learning (IQL) [129]. In IQL, each local Q-function predominantly relies on the local action, represented as Qi (s, a) ≈ Qi (s, ai ). Along similar lines, the Independent Advantage Actor-Critic (IA2C) serves as an actor-critic version of 10 MARL, as proposed by [28]. While both IQL and IA2C offer highly scalable solutions, they grapple with challenges posed by partial observability and non-stationary Markov Decision Processes (MDP). This is mainly attributed to their intrinsic assumption: the behaviors of all other agents are perceived as part of the environment’s dynamics. This becomes problematic considering these agents’ policies take continual updates during training [29]. 2.2.2 Cooperative MARL To address the non-stationary issues common in MARL, [152] decentralizes the critic, allowing it to take both global observations and actions, followed by consensus updates. Though this method does away with the need for a centralized controller during training, it still requires access to global information. The challenge of partial observability in MARL has led to various studies focusing on the potential of communication. For instance, FPrint [43] investigates the impact of direct communication between agents, demonstrating that shar- ing low-dimensional policy fingerprints can enhance performance. DIAL [42], on the other hand, has each DQN agent produce a communication message in tandem with action-value estimation. This message is subsequently encoded and combined with other input signals on the receiving end. A different approach, CommNet [126], presents a broader communi- cation protocol but simplistically calculates the mean of all messages instead of encoding them. The NeurComm strategy [26] introduces a learnable communication protocol, where messages are intricately encoded and concatenated to curtail information loss. Despite their innovations, these methods commonly adopt a centralized critic network during training. Additionally, their communication messages, whether raw or encoded, are often network parameters, leading to significant data transmission. This can burden com- munication channels due to the large volume of transmitted messages. Moreover, even with safety considerations embedded in their reward functions, the safety of these algorithms re- mains unassured. It’s common practice to first implement MARL algorithms in realistic simulators before transitioning to real-world deployment. Hence, the development of such 11 simulators is paramount. Bearing all these factors in mind, we will delve into our own groundbreaking MARL algorithms. 12 CHAPTER 3 DEEP MARL FOR HIGHWAY ON-RAMP MERGING IN MIXED TRAFFIC In this chapter, we introduce our first exploration of applying MARL for managing on- ramp merging in mixed traffic including both human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs), which is one of the most challenging scenarios in the realm of autonomous driving. Figure 3.1 An illustration of the on-ramp merging scenario. Connected Autonomous Vehicles (CAVs) are denoted in blue, and Human-Driven Vehicles (HDVs) are denoted in green. Both types of vehicles coexist on the ramp and through lanes. 3.1 Background Over the past decade, autonomous vehicle (AV) technologies such as Tesla’s Autopilot [2] and Baidu’s Apollo [1] have witnessed substantial advances, leading to their deployment in (semi-)autonomous vehicles navigating real-world roads. 
However, alongside this progress, there has been a noticeable increase in traffic accidents involving AVs [35, 40]. These incidents are frequently caused by the inability to adapt to dynamic driving environments, especially in mixed traffic conditions with both AVs and human-driven vehicles (HDVs) sharing the road. In these scenarios, AVs must not only respond to static and moving road obstacles but also interpret and predict HDV behaviors. Among the many challenging driving situations, highway on-ramp merging stands out as especially challenging for AVs [99, 72], which is the topic of this chapter. Figure 3.1 illustrates the on-ramp merging scenario under consideration, where we con- sider a common setting where both autonomous vehicles (AVs) and human-driven vehicles 13 (HDVs) navigate and coexist on the merge and through lanes. For a successful merging maneuver, on-ramp vehicles must efficiently merge into the through lane without causing collisions. In an ideal cooperative setting, vehicles on the through lane proactively adjust their speeds, either decelerating or accelerating, to create sufficient space for the on-ramp vehicles to merge smoothly, whereas on-ramp vehicles, in turn, should regulate their speeds and time to ensure safe merging, eliminating the risk of deadlocks situations [67, 11] at the merge point. Clearly, coordination between vehicles is crucial to facilitate safe and efficient merging. While this is relatively easy to achieve in a full-AV scenario, AV coordination in the presence of HDVs is a significantly more intricate task. Several methods have been proposed to tackle the automated merging problem, including rule-based and optimization-based approaches [113, 60, 58, 81]. Rule-based strategies utilize heuristics and predefined rules to steer autonomous vehicles (AVs) [60, 58]. Although these are effective in easy traffic scenarios, they become impractical for intricate merging scenarios [19]. In an optimal control setting, vehicle interactions between vehicles are perceived as a dynamic system where controlled vehicles’ actions serve as inputs. For instance, a model predictive control (MPC) technique has been developed to navigate an AV through a paral- lel ramp merge [19]. Despite promising results, MPC methods depend on precise dynamic merging models (including those of human drivers) and are computationally intensive due to necessary online optimizations at every time step [112]. Comprehensive reviews of model- based control strategies for on-ramp merging can be found in [104, 105, 106]. Nonethe- less, these methods were primarily developed for fully automated vehicles, making them unsuitable for mixed-traffic situations. In addition, gap acceptance theory has also been investigated for merging behavior [57, 90], emphasizing intricate modeling features for all traffic entities. However, this becomes problematic when traditional vehicles exhibit variable behaviors, especially existing numerous concurrent CAVs [38]. On the other hand, data-driven approaches, especially reinforcement learning (RL), have gained increased attention and been explored for AV highway merging [81, 89]. Notably, 14 [81] employs a multi-objective reward function centered on safety and jerk minimization for AV merging. To address the RL challenge, the Deep Deterministic Policy Gradient (DDPG) algorithm [79] is leveraged. In [89], RL and MPC are fused to enhance learning efficiency, which achieves a good balance between passenger comfort, crash rate, efficiency, and robustness. 
Nonetheless, these techniques are primarily conceptualized for individual AV, with other vehicles merely considered as part of environmental elements. In this chapter, we explore a specific scenario as illustrated in Figure 3.1, where multiple AVs adaptively engage with HDVs and work to successfully merge, aiming to optimize traffic flow while ensuring safety. This scenario naturally extends the single-agent RL to a more ex- pansive multi-agent reinforcement learning (MARL) framework. Within this paradigm, AVs collaboratively learn control policies to realize the aforementioned objectives. However, this is a challenging task due to dynamic connectivity topology, sophisticated motion dynamics involving AV-coupled behaviors, and intricate decision-making processes. This complexity is even more pronounced when human drivers are involved. While several MARL techniques have been formulated for CAVs in scenarios like car- following and lane overtaking scenarios [137, 102, 64, 53, 12, 36, 150]. However, to the best of our knowledge, no MARL algorithm has been proposed for the considered highway on- ramp merging scenario. In this work, we develop a novel decentralized MARL framework, empowering AVs to adeptly learn and execute a safe and efficient merging policy applicable to vehicles on both lanes of the highway. To enhance safety and the learning process, a priority-based safety supervisor that leverages sequential and multi-step forecasting is pro- posed. Furthermore, we explore parameter sharing and localized rewards to enhance inter- agent collaboration, ensuring optimal scalability. The main contributions and the technical advancements of this chapter are detailed below. 1. Problem Formulation & Simulation Platform: We formulate the mixed-traffic on-ramp merging scenario, where AVs and HDVs coexist on both the ramp and through lanes, into a decentralized MARL framework. Our approach is tailored to accommodate 15 dynamic environments with a dynamic connectivity topology. A corresponding gym- like simulation platform with three different traffic density levels is made publicly accessible1 . 2. MARL Algorithm & Safety Supervisor: We develop a novel, efficient, and scalable MARL algorithm, featuring an effective reward function design, parameter-sharing mechanism, and action masking. In addition, we have also integrated a priority-driven safety supervisor into the MARL framework, which significantly reduces collision rates during training, and subsequently enhances learning efficiency. 3. Curriculum Learning: By employing curriculum learning, we expedite the learn- ing process for more intricate tasks by utilizing models pre-trained on simpler traffic scenarios. 4. Experiments & Performance Metrics: Extensive experiments are conducted to evaluate our approach, showing that our method consistently outperforms several SOTA algorithms, especially in driving safety and operational efficiency metrics. The subsequent sections are structured as follows. The problem formulation and our innovative MARL framework are described in Section 3.2 whereas the priority-based safety supervisor is described in Section 3.3. A detailed exposition of our experiments, findings, and discussions is presented in Section 3.4. We conclude the chapter and discuss future works in Section 3.5. 3.2 On-ramp Merging As MARL In this section, we characterize the on-ramp merging scenario as a partially observable Markov decision process (POMDP) [55]. 
Subsequently, we introduce our actor-critic-based MARL approach, featuring a parameter-sharing mechanism, an effective reward function design, and action masking, to navigate the challenges of the devised POMDP. For clarity, this approach is referred to as the "baseline method" in Section 3.5.

1 See https://github.com/DongChen06/MARL_CAVs

Figure 3.2 Schematics of the system and simulation setup without the safety supervisor, in which actions from the MARL agents are sent directly to the low-level controller.

3.2.1 MARL Formulation

In this chapter, we conceptualize the mixed-traffic on-ramp merging environment as a model-free multi-agent network [29, 23], denoted by G = (ν, ε). Here, each agent i ∈ ν interacts with its neighboring agents, defined as Ni := {j | εij ∈ ε}, via the edge connections εij, i ≠ j. The combined state space and action space for all agents are represented as S := ×_{i∈ν} Si and A := ×_{i∈ν} Ai. The intrinsic dynamics of this system are encapsulated by the state transition distribution P, which maps S × A × S to [0, 1]. We adopt a decentralized MARL paradigm, wherein each agent i (AV i) perceives only a part of the entire environment, specifically its immediate surroundings. This is consistent with the reality that AVs can only sense or communicate with vehicles in close vicinity, making the overall dynamical system a POMDP MG. This POMDP can be comprehensively represented by the tuple ({Ai, Si, Ri}_{i⊆ν}, T ), which is described as follows:

1. Action Space: The action space of agent i, denoted by Ai, represents the set of potential high-level control decisions. These decisions include turn left, turn right, cruising, speed up, and slow down, following the designs in [77, 22]. Once high-level actions are chosen, lower-level controllers generate the relevant steering and throttle commands to guide the autonomous vehicles (AVs). The system and simulation setup is illustrated in Figure 3.2. The joint action space A of all AVs is the Cartesian product of the individual action spaces: A = A1 × A2 × · · · × AN.

2. State Space: The state of agent i, Si, is conceptualized as an N_{Ni} × W matrix, wherein N_{Ni} represents the number of vehicles observable by the agent and W refers to the attributes that describe a vehicle's state. Key attributes include: is_present, a binary variable indicating the presence of a vehicle within the sensing range of the ego vehicle; x_l, the longitudinal position of the detected vehicle relative to the ego vehicle; y, the lateral position of the detected vehicle relative to the ego vehicle; v_x, the longitudinal velocity of the observed vehicle relative to the ego vehicle; and v_y, the lateral velocity of the observed vehicle relative to the ego vehicle. Based on the proximity principle, the ego vehicle can only perceive its "neighboring vehicles", which comprise the nearest N_{Ni} vehicles within a longitudinal range of 150 meters from the ego vehicle [150]. In the considered on-ramp merging case shown in Figure 3.1, we found that N_{Ni} = 5 achieves the best performance. The overall state space of the system, S, aggregates the individual states of all agents: S = S1 × S2 × · · · × SN.

3. Reward Function: The reward function Ri is crucial to train the RL agents to follow preferred behaviors.
In the on-ramp merging context, the agent's objective is to navigate through the merging zone efficiently and safely. The reward function for an agent at a given time step t is formulated as:

r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_m r_m,    (3.1)

where wc, ws, wh, and wm are the coefficients that respectively weight the collision evaluation, stable-speed evaluation, headway time evaluation, and merging cost evaluation. Given the paramount importance of safety, we make wc significantly higher than the other coefficients. The performance under different wc values is discussed in Section 3.4.1 and Table 3.1. The four evaluation metrics are specified as follows (a short sketch of these terms is given at the end of this subsection): rc penalizes collisions; it is set to -1 in case of a collision and 0 otherwise. The speed evaluation rs is defined as:

r_s = \min\left( \frac{v_t - v_{\min}}{v_{\max} - v_{\min}},\, 1 \right),    (3.2)

where vt is the current speed of the ego vehicle. Combining the speed recommendation from the US Department of Transportation (20-30 m/s [101]) and the speed range observed in the Next Generation Simulation (NGSIM) dataset2 (minimum speeds of 6-8 m/s [131]), we set the minimum and maximum speeds of the ego vehicle as vmin = 10 m/s and vmax = 30 m/s, respectively. The time headway evaluation is defined as:

r_h = \log\frac{d_{\text{headway}}}{t_h v_t},    (3.3)

where dheadway and th are the distance headway and the predefined time headway threshold, respectively. Thus, the ego vehicle is penalized when the time headway is less than th and rewarded only when the time headway is greater than th. In this chapter, we choose th as 1.2 s, as suggested in [6]. The merging cost rm is designed to penalize the waiting time on the merge lane to avoid deadlocks [16]. Here we adopt r_m = -\exp(-(x - L)^2 / 10L), where x is the distance the ego vehicle has traveled on the ramp and L is the length of the ramp (see Figure 3.1). The merging cost function is plotted in Figure 3.3, which shows that the penalty increases as the ego vehicle moves closer to the end of the merge lane.

2 See https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm

Figure 3.3 Illustration of the designed merging reward/penalty.

4. Transition Probabilities: The transition probability T (s′|s, a) captures the system's dynamics. The simulator that we have devised uses the Intelligent Driver Model (IDM) [132] and the MOBIL model [66] for the longitudinal acceleration and lateral lane-change decisions, respectively, of human-driven vehicles (HDVs). The high-level decisions of AVs are made by the MARL algorithm and are tracked by the lower-level (PID) controller (see Figure 3.2). The system leverages a kinematic bicycle model to trace vehicle trajectories. Importantly, our MARL approach does not require prior knowledge of the transition probability.

This section offers a comprehensive framework for a decentralized MARL algorithm specified for the mixed-traffic on-ramp merging scenario. The elements detailed above jointly contribute to the MARL system's ability to make efficient and safe merging decisions in real time.
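As a concrete illustration of the reward terms above, the snippet below evaluates Eqs. (3.1)-(3.3) and the merging cost for a single agent at one time step. The coefficient values are those reported later in Section 3.4; the function signature and argument names are illustrative and not part of the released simulation code.

import numpy as np

def agent_reward(collided, v, headway, x_on_ramp, on_ramp,
                 v_min=10.0, v_max=30.0, t_h=1.2, L=100.0,
                 w_c=200.0, w_s=1.0, w_h=4.0, w_m=4.0):
    """Per-agent reward of Eq. (3.1) from its four evaluation terms (sketch)."""
    r_c = -1.0 if collided else 0.0                          # collision penalty
    r_s = min((v - v_min) / (v_max - v_min), 1.0)            # Eq. (3.2): speed term
    r_h = np.log(headway / (t_h * v))                        # Eq. (3.3): time-headway term
    # merging cost: only vehicles still on the ramp are penalized
    r_m = -np.exp(-(x_on_ramp - L) ** 2 / (10 * L)) if on_ramp else 0.0
    return w_c * r_c + w_s * r_s + w_h * r_h + w_m * r_m     # Eq. (3.1)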
3.2.2 MA2C For CAVs

In the realm of cooperative MARL, the overall goal is to optimize the cumulative reward, represented as R_{g,t} = \sum_{i=1}^{N} r_{i,t}. Ideally, each agent would be assigned the same average global reward R_t = \frac{1}{N} R_{g,t} during training, i.e., r_{1,t} = r_{2,t} = · · · = r_{N,t}. Yet, this uniform allocation does not genuinely reflect the individual contributions of each vehicle and introduces complications [144, 136]. Two key challenges emerge:

1. First, aggregating rewards on a global scale can introduce significant latency and increase communication overhead, which is impractical for real-time systems like AVs;

2. Second, relying only on a global reward leads to the notorious credit assignment problem [127], which can hamper learning efficiency and constrict the scalability needed to accommodate numerous agents.

Given these challenges, this work employs a more localized reward design. In this framework, the reward assigned to a particular agent, say the ith agent at time t, is defined as:

r_{i,t} = \frac{1}{|\nu_i|} \sum_{j \in \nu_i} r_{j,t},    (3.4)

where νi = i ∪ Ni is the set containing the ego vehicle and its neighbors, and | · | denotes set cardinality. This method, founded on local rewards, inherently focuses on the agents most related to the success or failure of a task [80, 7]. Such an approach is suitable for vehicular scenarios, in which vehicles are predominantly affected by their immediate surroundings, with distant vehicles exerting minimal influence.

Figure 3.4 Architecture overview of the proposed network: layer dimensions are indicated in parentheses. "w/o" and "w/" stand for "without" and "with", respectively.

The network backbone utilized is depicted in Figure 3.4. Both the actor and critic networks leverage the same foundational representations, consequently combining the policy loss and the value function error loss into a unified loss function [119]. Given these shared parameters, the overall loss function takes the following form:

J(\theta_i) = J^{\pi_{\theta_i}} - \beta_1 J^{V_{\phi_i}} + \beta_2 H(\pi_{\theta_i}(s_t)),    (3.5)

where β1 and β2 are the weighting coefficients corresponding to the value function loss and the entropy regularization term, respectively. The entropy term, H(\pi_{\theta_i}(s_t)) = \mathbb{E}_{\pi_{\theta_i}}[-\log(\pi_{\theta_i}(s_t))], is introduced to enhance agent exploration of new states [143, 119]. The policy gradient term can be expressed as:

\nabla_{\theta_i} J^{\pi_{\theta_i}} = \mathbb{E}_{\pi_{\theta_i}}\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}|s_{i,t})\, A_{i,t}^{\pi_{\theta_i}} \right],    (3.6)

where A_{i,t}^{\pi_{\theta_i}} = r_{i,t} + \gamma V^{\pi_{\phi_i}}(s_{i,t+1}) - V^{\pi_{\phi_i}}(s_{i,t}) is the advantage function and V^{\pi_{\phi_i}}(s_{i,t}) is the state value function. The loss for updating the state value V_{\phi_i} is:

J^{V_{\phi_i}} = \min_{\phi_i} \mathbb{E}_{D_i}\left[ \left( r_{i,t} + \gamma V_{\phi_i}(s_{i,t+1}) - V_{\phi_i}(s_{i,t}) \right)^2 \right].    (3.7)

Distinct experience replay buffers are allocated for each agent, but a unified policy network, sharing the same parameters, is updated across agents. Such an approach is effective for training a universal policy applicable to both on-ramp and through-lane AVs [64, 80]. Furthermore, minibatches of trajectory samples are utilized to refine the network parameters via Eq. 3.5, aiming to reduce variance.
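The combined objective in Eqs. (3.5)-(3.7) can be written compactly as a single loss. The following PyTorch-style sketch is a minimal illustration, assuming the bootstrapped targets r_{i,t} + γV(s_{i,t+1}) have already been computed from a rollout; the tensor names and shapes are illustrative, not the exact implementation used in our experiments.

import torch
import torch.nn.functional as F

def ma2c_loss(logits, values, actions, targets, beta1=1.0, beta2=0.01):
    """Combined actor-critic loss mirroring Eqs. (3.5)-(3.7) (illustrative sketch).

    logits:  (T, n_actions) actor outputs for one agent's rollout
    values:  (T,) critic estimates V(s_t)
    actions: (T,) sampled action indices (long tensor)
    targets: (T,) bootstrapped targets r_t + gamma * V(s_{t+1})
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    advantages = (targets - values).detach()           # A_t in Eq. (3.6)
    policy_loss = -(log_pi_a * advantages).mean()      # negative of J^pi
    value_loss = F.mse_loss(values, targets)           # Eq. (3.7)
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # H(pi), encourages exploration

    # Eq. (3.5) expressed as a quantity to minimize
    return policy_loss + beta1 * value_loss - beta2 * entropy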
3.2.3 Deep Neural Network (DNN) Configuration

Figure 3.4 describes our deep neural network's architecture. To enhance scalability and robustness, the observations si,t are categorized based on their physical units. For example, si,t is segmented into three distinct groups: s1_{i,t}, s2_{i,t}, and s3_{i,t}, corresponding to presence states, position states, and speed states, respectively. Each subset is encoded by an individual fully connected (FC) layer, and subsequently the encoded states are concatenated into a single vector. This vector is processed by a 128-neuron FC layer, whose output serves both the actor and critic networks.

Under the standard configuration, the actor network produces logits li, which are passed through a Softmax layer to output a probability distribution, denoted by πθi(si) = softmax([l1, l2, l3, l4, l5]). Actions are then sampled from this distribution: ai ∼ πθi(si). However, this approach faces challenges. Firstly, invalid or unsafe actions are also assigned non-zero probabilities. With a stochastic policy, these unsafe actions can inadvertently be sampled during training, risking system malfunctions or potentially catastrophic outcomes. Secondly, the sampling of these unsafe actions obstructs efficient policy training, since it leads to erroneous policy updates [59]; such updates are misleading because they are rooted in experiences linked with invalid actions. To mitigate these challenges, we utilize an invalid action masking technique [80] that effectively "filters out" inappropriate actions, allowing sampling exclusively from valid actions. As illustrated in Figure 3.4, an invalid action mask is obtained from the environment (e.g., based on the traffic scenario), where "0" flags an invalid action and "1" signifies a valid one. The logits corresponding to the invalid actions are substituted with substantially negative values, such as −1e8. Consequently, the Softmax layer assigns these actions probabilities near zero, ensuring they are seldom sampled. This method effectively "renormalizes the probability distribution" [130] (see the sketch following this subsection). In our research, we identify two main invalid actions:

1. The ego vehicle attempting a lane change to a non-existent lane, e.g., aiming for a left turn while already in the leftmost lane;

2. The ego vehicle adjusting its speed (acceleration or deceleration) beyond the predefined speed limits.

It is worth noting that these are foundational invalid actions. Additional unsafe actions undergo rigorous verification and are governed by the priority-based safety supervisor discussed in Section 3.3.
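A minimal sketch of the masking step is given below, assuming the five-dimensional logits and the {0, 1} mask of Figure 3.4; the helper name is illustrative rather than part of the released code.

import torch

def masked_action_distribution(logits, valid_mask, neg_inf=-1e8):
    """Push logits of invalid actions to a very negative value before the Softmax,
    so that invalid actions are sampled with probability close to zero."""
    # logits: (n_actions,), valid_mask: (n_actions,) with entries in {0, 1}
    masked_logits = torch.where(valid_mask.bool(), logits,
                                torch.full_like(logits, neg_inf))
    probs = torch.softmax(masked_logits, dim=-1)      # renormalized distribution
    return torch.distributions.Categorical(probs=probs)

# usage: action = masked_action_distribution(logits, mask).sample()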
Figure 3.5 Schematics of the system and simulation setup with the safety supervisor, in which only safe actions from the MARL agents are sent to the low-level controller.

3.3 Priority-based Safety Enhancement

While obvious invalid actions can be avoided using the rule-based action masking scheme described above, it cannot prevent inter-vehicle or vehicle-obstacle collisions. Therefore, a more comprehensive safety supervisor is needed to deal with collisions in complex, dynamic, and cluttered mixed-traffic environments. To address this, we introduce a novel safety-enhancement strategy leveraging vehicle dynamics and multi-step predictions. The strategy aims to forecast any potential collisions within a prediction horizon Tn and adjust unsafe exploratory actions. Given mixed traffic, inclusive of HDVs, a robust model is paramount to predict human driver decisions. We deploy IDM [132] to forecast HDVs' longitudinal acceleration based on the current speed and preceding distance, while the MOBIL lane-change model [66] predicts the lane-change behaviors of HDVs. The MARL agents define the high-level decisions for the AVs, as described in Section 3.2.1. The high-level acceleration and lane-change decisions are then executed via low-level PID controllers, and the vehicle trajectories are projected employing the kinematic bicycle model [110]. We label these trajectories, steered by high-level decisions, as "motion primitives". The proposed framework and the associated simulation setup are visualized in Figure 3.5.

3.3.1 Establishing Priorities

Using HDV motion models, it is feasible to predict potential collisions within the next Tn steps, considering joint motion primitives from all AVs. Naturally, one could consider using the joint action of all AVs for the safety-enhancement design. However, while it is relatively straightforward to determine potential collisions based on a joint action, it is very computationally costly to determine a joint safe action. Given the action space size of |Ai|^N (with N being the number of AVs), computational demands scale quickly, especially under real-time constraints. As such, we propose a sequential, priority-based safety enhancement mechanism that is computationally efficient and compatible with real-time application. The principle behind this approach is to sequence AVs based on urgency, particularly emphasizing those with critically small safety buffers. For instance, AVs nearing the merging lane's end or those approaching predefined safety distances, such as narrowly defined headway distances, are prioritized. More specifically, priority assignments are considered as follows:

1. AVs on the merging lane hold precedence over those on the through lane due to the pressing nature of their merging objective.

2. AVs approaching the merging lane's end are prioritized higher given their heightened collision and deadlock risks [16].

3. AVs with minimal time headways rank higher due to their increased collision susceptibility with preceding vehicles.

Based on these considerations, the priority index pi for ego vehicle i is formulated as:

p_i = \alpha_1 p_m + \alpha_2 p_d + \alpha_3 p_h + \sigma_i,    (3.8)

where α1, α2, and α3 denote positive weightings for pm (merging priority), pd (distance-to-end metric), and ph (time headway metric), respectively, and σi ∼ N(0, 0.001) introduces a small random variable to prevent identical priority indices across vehicles. The merging priority score pm is defined as:

p_m = \begin{cases} 0.5, & \text{if on the merge lane;} \\ 0, & \text{otherwise,} \end{cases}    (3.9)

which assigns priority scores to vehicles on the merge lane. The distance-to-end priority score pd is defined by:

p_d = \begin{cases} x/L, & \text{if on the merge lane;} \\ 0, & \text{otherwise,} \end{cases}    (3.10)

where x and L refer to the ego vehicle's traveled distance on the ramp and the ramp's length, respectively (refer to Figure 3.1). Lastly, the time headway priority metric ph is derived as p_h = -\log\frac{d_{\text{headway}}}{t_h v_t}, utilizing the time headway definition from Eq. 3.3.
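For illustration, the priority index of Eqs. (3.8)-(3.10) can be computed as in the following sketch; the argument names and the noise magnitude are placeholders consistent with the definitions above, not the simulator's actual interface.

import numpy as np

def priority_index(on_merge_lane, x, L, headway, v, t_h=1.2,
                   alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Priority index p_i of Eq. (3.8) for one ego vehicle (illustrative sketch)."""
    p_m = 0.5 if on_merge_lane else 0.0          # Eq. (3.9): merging priority
    p_d = (x / L) if on_merge_lane else 0.0      # Eq. (3.10): distance-to-end priority
    p_h = -np.log(headway / (t_h * v))           # time-headway priority (negated Eq. 3.3)
    sigma = np.random.normal(0.0, 0.001)         # small zero-mean noise to break ties
    return alpha1 * p_m + alpha2 * p_d + alpha3 * p_h + sigma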
3.3.2 Priority-based Safety Supervisor

In this subsection, we present the design and workings of the proposed priority-based safety supervisor. Each time step t commences with the safety supervisor predicting HDV motions and assigning priority scores to all AVs, as discussed earlier. This process yields a priority list of AVs, denoted Pt, organized in descending order so that the vehicle with the highest priority is in the top position. The AV at the top of the list, indexed by Pt[0], then undergoes a safety check. More specifically, based on the (exploratory) action generated from the action network of vehicle Pt[0], the safety supervisor examines whether the motion primitive induced by the exploratory action will conflict with its neighboring vehicles N_{Pt[0]} (both AVs and HDVs) over a pre-determined time frame Tn, where Tn is a hyper-parameter that can be tuned. HDV motions are gauged using the previously mentioned human-driver decision and vehicle kinematic models, while the remaining AVs (with lower priority scores) are assessed based on their last recorded actions. To detect potential collisions, the system checks whether any two trajectory sequences, each lasting Tn steps, come closer than a defined safety distance. If no collision is detected, vehicle Pt[0] adopts the exploratory action.

Figure 3.6 Illustration of trajectory conflict for Tn = 5 steps.

Conversely, if a potential collision is detected, as depicted in Figure 3.6, the exploratory action is labeled as unsafe and a safer alternative is generated. This safer action maximizes the safety margin, which is captured by:

a'_t = \arg\max_{a_t \in A_{\text{valid}}} \left( \min_{k \in T_n} d_{\text{sm},k} \right),    (3.11)

where Avalid is the set of valid actions at time step t and d_{sm,k} is the safety margin at prediction step k. The calculation of the safety margin differs based on the nature of the action. For instance, lane-change actions like "turn left" or "turn right" consider the shortest distance to vehicles in both the current and target lanes, whereas actions such as "speed up", "idle", or "slow down" gauge the safety margin using the minimum distance headway. These scenarios are visually represented in Figure 3.7. Once vehicle Pt[0]'s action is finalized, its trajectory is recalibrated. Vehicle Pt[0] is then removed from the list and the second-highest becomes the first, i.e., Pt[i] ← Pt[i + 1], i = 1, 2, · · ·. This sequential safety validation continues for every AV in the list until none remain. The design of this priority-based safety supervisor is further detailed in Algorithm 3.1.

Figure 3.7 Illustration of safety margin definitions. Top: safety margin if vehicle 1 turns left; Bottom: safety margin when vehicle 1 keeps straight.

Remark 3.3.1. Our design envisages the priority-based safety supervisor's real-world application via vehicle-to-infrastructure (V2I) communication [25], where a centralized infrastructure agent in proximity to a ramp can observe HDVs and interface with AVs. At each time step t, the infrastructure agent assigns priority scores grounded in real-time traffic data, collects exploratory actions from the AVs, and then applies Algorithm 3.1 to finalize safe actions. Its sequential nature ensures computational efficiency (approximately 28.13 ms for the safety supervisor with Tn = 8 to make a decision in the Hard traffic mode; see Table 3.2 in Section 3.5). Given robust computational infrastructure, it is feasible to apply this algorithm in real time, ensuring decisions are rendered within a single sampling time frame. Enhancing computational efficiency remains a focus for our future endeavors.
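The replacement action of Eq. (3.11) amounts to a simple max-min search over the valid action set. A minimal sketch is shown below; predict_margins stands in for the Tn-step trajectory roll-out against the neighboring vehicles described above and is not an actual function of our codebase.

def safest_action(valid_actions, predict_margins):
    """Return the valid action maximizing the worst-case safety margin, Eq. (3.11)."""
    best_action, best_margin = None, float("-inf")
    for a in valid_actions:
        margins = predict_margins(a)            # list of d_sm,k for k = 1..Tn
        worst_case = min(margins)               # min over the prediction horizon
        if worst_case > best_margin:            # maximize the worst-case margin
            best_action, best_margin = a, worst_case
    return best_action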
Algorithm 3.1 Priority-based Safety Supervisor
Parameters: L, α1, α2, α3, th, w, Tn.
Output: ai, i ∈ ν.
for i = 0 to N do
    compute the priority score of vehicle i according to Eq. 3.8
    rearrange the ego vehicles into the list Pt according to their priority scores
end
for j = 0 to |Pt| do
    obtain the highest-priority vehicle Pt[0] and find its neighboring vehicles N_{Pt[0]}
    predict the trajectories ζv, v ∈ Pt[0] ∪ N_{Pt[0]}, for Tn time steps
    if the trajectories overlap then
        replace the risky action with at ← a't according to Eq. 3.11
        replace the trajectory ζ_{Pt[0]} with ζ'_{Pt[0]}
    end
    remove Pt[0] from Pt and update Pt[i] ← Pt[i + 1], i = 1, 2, · · ·
end

Remark 3.3.2. The prediction horizon, denoted as Tn, plays an important role in the safety-enhancement strategy. If Tn is too small, the safety supervisor is "short-sighted", potentially leading to an infeasible solution within a few iterations. Conversely, if Tn is too large, the compounded uncertainties of the HDVs (the actual vehicle motion in the simulation has noisy perturbations relative to the human driver models used to predict the trajectories) are propagated. This can cause overly cautious decisions, aimed at ensuring safety over an extensive horizon. Through rigorous cross-validation, we ascertain Tn = 8 or 9 as optimal choices (refer to Figure 3.12 and Table 3.2 for details).

The skeleton of the proposed MARL framework complemented with the priority-based safety supervisor is depicted in Algorithm 3.2. Key hyperparameters include the time-discount factor γ, learning rate η, epoch length T, cumulative training epochs M, and the loss function coefficients (β1 and β2). Each epoch commences with agents acquiring state data and selecting actions, leveraging the action masking approach to sidestep invalid maneuvers (Lines 4-7). Actions derived from MARL subsequently undergo a safety evaluation by the supervisor, as elaborated in Algorithm 3.1 (Line 9). If an action is unsafe, the safety supervisor replaces the risky action with a safe action according to Eq. 3.11. Agents then act upon this safer directive, and the resultant experience is documented in the replay buffer (Lines 10-17). Upon ending each episode, the policy network's parameters undergo an update with experiences extracted from the on-policy experience buffer (Lines 20-26). The DONE signal is flagged either at the episode's end or in the event of a collision. Upon receiving the DONE signal, all agents are reset to their initial states to start a fresh epoch (Line 28).

Algorithm 3.2 MARL with Priority-based Safety Supervisor
Parameters: γ, η, T, M, β1, β2.
Output: θ.
initialize s0, t ← 0, D ← ∅
for j = 0 to M − 1 do
    for t = 0 to T − 1 do
        for i ∈ ν do
            observe si and sample ai,t ∼ πθi(·|si) with action masking
        end
        for i ∈ ν do
            check the action ai,t with Algorithm 3.1
            if safe then
                execute ai,t and update Di ← (si,t, ai,t, ri,t, vi,t)
            else
                update ai,t ← a'i,t, execute a'i,t, and update Di ← (si,t, a'i,t, ri,t, vi,t)
            end
        end
        update t ← t + 1
        if DONE then
            for i ∈ ν do
                update θi ← θi + η∇θi J(θi)
            end
        end
        initialize Di ← ∅, i ∈ ν
        update j ← j + 1
    end
    update s0, t ← 0
end

3.4 Numerical Experiments

This section evaluates the efficacy of the proposed MARL algorithm in terms of training efficiency and collision rate within the on-ramp merging scenario depicted in Figure 3.1. The length of the road is 520 m, including a 320 m entrance to the merge lane (segment AB) and a merging lane of length L = 100 m. There are 12 spawn points (numbered beneath the vehicles) evenly distributed on the through lane and the ramp lane from 0 m to 220 m, as shown in Figure 3.8. Vehicles that drive beyond the end of the road are removed from view, though their kinematics continue to be updated.

Figure 3.8 Simulation settings for the single through lane case (upper) and the multiple through lane case (lower). "L" represents the length of the road segments. The numbers under the vehicles are the vehicle spawn points.
3.4 Numerical Experiments
This section evaluates the efficacy of the advanced MARL algorithm through its training efficiency and collision rate within the context of the on-ramp merging paradigm depicted in Figure 3.1. The length of the road is 520 m, inclusive of a 320 m entrance to the merge lane (segment AB) and a merging lane with a length L of 100 m. There are 12 spawn points (numbered beneath the vehicles) evenly distributed on the through lane and the ramp lane from 0 m to 220 m, as shown in Figure 3.8. Vehicles that exceed the road are withdrawn from view, though their kinematics continue to be updated.

Figure 3.8 Simulation settings for the single through-lane case (upper) and multiple through-lane case (lower). “L” represents the length of the road segments. The numbers under the vehicles are the vehicle spawn points.

Three distinct traffic densities, determined by the initial count of vehicles, are defined:
1. Easy mode: comprising 1-3 AVs and 1-3 HDVs.
2. Medium mode: a mix of 2-4 AVs and 2-4 HDVs.
3. Hard mode: 4-6 AVs combined with 3-5 HDVs.
Within each training episode, a diverse set of HDVs and AVs spawn at varying points. A positional random noise (uniformly ranging between -1.5 m and 1.5 m) perturbs their starting positions. Initial speeds fluctuate between 25 and 27 m/s. The vehicle-control sampling frequency is 5 Hz, translating to AVs acting every 0.2 seconds. A 5% random noise is added to the predicted acceleration and steering angle for HDVs. The MARL algorithms are trained for 2 million steps, leveraging 3 distinct random seeds shared among agents, translating roughly to 20,000 episodes with an episode horizon T = 100 steps. We evaluate the algorithm over 3 episodes every 200 training episodes. We set γ = 0.99 and the learning rate η = 5 × 10−4. The reward function coefficients wc, ws, wh, and wm are set as 200, 1, 4, and 4, respectively. The priority coefficients α1, α2 and α3 are equally set as 1. The weighting coefficients β1 and β2 for the loss function are chosen as 1 and 0.01, respectively. For comparison, we label the MARL algorithm without the safety supervisor, introduced in Section 3.2, as the “baseline”. Our simulation environment is derived from the gym-based highway-env simulator [75] and is available for open-source exploration (see https://github.com/DongChen06/MARL_CAVs). The default IDM and MOBIL model parameters, aligned with those in the highway-env simulator [75], are used. The experiments are conducted on an Ubuntu 18.04 server with an AMD 9820X processor and 64 GB memory. A video demo of the training process can be found at https://drive.google.com/drive/folders/1437My4sDoyPFsUjrThmlu1oJjTkTkvJ7?usp=sharing.

3.4.1 Reward Function Designs
This subsection evaluates the performance of the proposed MARL framework under various reward function designs, namely, the local (baseline) vs. global rewards (baseline with global reward). Subsequently, the influence of the safety penalty weight wc within the reward function (Eq. 3.1) is evaluated. We investigate the localized reward function by comparing it with the global reward design used in [36, 64], wherein the reward for the ith agent at time step t is determined as the averaged global reward $r_{i,t} = \frac{1}{N}\sum_{j=1}^{N} r_{j,t}$. The performance contrast between our localized and the global reward mechanisms is demonstrated in Figure 3.9. As expected, our localized reward outperforms the global reward design, both in terms of rewards achieved and expedited convergence across all traffic setups. While the global reward mechanism fares well in the Easy and Medium modes due to fewer AVs, its efficacy decreases in the Hard mode, with an evaluation reward less than 0, as it suffers from the credit assignment issue [127] and the reduced correlation between average global rewards and individual agent actions as the number of agents increases.

Figure 3.9 Evaluation curves during training with distinct reward functions across varied traffic intensities, with the shaded portion standing for the standard deviation over three random seeds, smoothed over nine evaluation epochs.
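For clarity, the two reward assignments compared above can be sketched as follows; this is a minimal illustration under the assumption that the per-agent rewards have already been computed for the current step, not the project's actual reward-dispatch code.

```python
import numpy as np

def assign_rewards(local_rewards, use_global=False):
    """Return per-agent rewards under the local (ours) or averaged-global design."""
    local_rewards = np.asarray(local_rewards, dtype=float)
    if use_global:
        # Global design: every agent receives the fleet-average reward, which weakens
        # the link between an individual agent's action and its learning signal.
        return np.full_like(local_rewards, local_rewards.mean())
    return local_rewards   # local design: each agent keeps its own reward
```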
Furthermore, Table 3.1 illuminates the performance under diverse wc values in the Medium traffic configuration, holding the other reward function coefficients in Eq. 3.1 constant. Noticeably, there are no collisions when a large enough wc is selected (e.g., wc ≥ 100), and the average traffic speed decreases as wc further increases. This is because the CAVs behave conservatively if we place too much emphasis on safety. To strike a balance between safety and traffic efficiency, subsequent tests assign wc a value of 200.

Table 3.1 Performance with different wc's in terms of collision rate and average speed in the Medium traffic mode, while all other weighting coefficients remain consistent.
                  wc = 10   wc = 100   wc = 200 (we chose)   wc = 1000   wc = 10000
Collision rate    0.1       0          0                     0           0
Avg. speed        24.77     24.09      24.08                 23.69       23.62

3.4.2 Curriculum Learning
In this subsection, we leverage the concept of curriculum learning [65] to enhance both the speed and performance of learning in the Hard mode scenario. Instead of diving directly into the Hard mode, we build upon the trained model from the easier modes (i.e., Easy and Medium) and train the models to achieve higher efficiency (a brief warm-start sketch is given after Figure 3.11). This method of building on basic models is particularly valuable for applications where safety is paramount, such as autonomous driving, since starting from a decent model can greatly reduce the number of “blind” explorations that could lead to high-risk situations.
Figure 3.10 shows the training performance comparison between the baseline method (i.e., starting from scratch) and curriculum learning (baseline + curriculum learning) for the Hard traffic mode. It is obvious that learning based on the trained model from easier tasks greatly expedites the convergence and improves the final model performance. The average speed during training, as shown in Figure 3.11, indicates that the curriculum learning strategy also improves the average vehicle speed up to 22 m/s compared to the baseline method at 18 m/s, thus achieving higher traffic efficiency. Therefore, we apply curriculum learning in the following experiments for the Hard traffic mode.

Figure 3.10 Training curves with and without curriculum learning for the Hard traffic mode.

Figure 3.11 Average speed during training with and without curriculum learning for the Hard traffic mode.
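The warm start itself amounts to initializing the Hard-mode policy from an easier-mode checkpoint. The sketch below is a minimal illustration in PyTorch: the network shape and the checkpoint path are assumed placeholders, not the thesis's actual architecture or file layout.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Illustrative actor network; layer sizes here are placeholders."""
    def __init__(self, obs_dim=25, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def warm_start_hard_mode(checkpoint="checkpoints/medium_mode.pt"):
    """Initialize the Hard-mode policy from a model trained on an easier curriculum stage."""
    policy = PolicyNetwork()
    policy.load_state_dict(torch.load(checkpoint))   # warm start instead of from scratch
    return policy
```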
3.4.3 Evaluating The Priority-based Safety Supervisor's Performance
In this subsection, we probe into the efficacy of our introduced priority-based safety supervisor. Figure 3.12 shows that the proposed priority-based safety supervisor has enhanced sample efficiency, evidenced by faster convergence across all traffic densities. In addition, even in the Hard traffic mode, it achieves a higher evaluation reward. This improved performance stems from the safety supervisor's ability to replace most unsafe maneuvers with secure alternatives, particularly during initial exploration, minimizing premature terminations and paving the way for better learning efficiency.
An exploration of Figure 3.13 underlines the average vehicle speed throughout training, a metric pointing to traffic throughput. Obviously, algorithms adopting the safety supervisor consistently maintain a faster training speed compared to the baseline without safety supervision. This indicates that the proposed safety supervisor is not only beneficial for training but also leads to better traffic efficiency. It can also be seen that vehicle speeds are slower as the traffic density increases (26 m/s, 24 m/s and 22 m/s for the Easy, Medium and Hard traffic densities, respectively). This is reasonable since denser traffic naturally results in more frequent interactions, leading to reduced speeds to avoid collisions.

Figure 3.12 Training curves for the n-step priority-based safety supervisor.

Figure 3.13 Average speed during training for the n-step priority-based safety supervisor.

After training, the MARL algorithms for each traffic density are tested across 3 random seeds spanning 30 epochs. The outcomes, in terms of average collision rates, vehicle speeds, and the safety supervisor's inference time, are presented in Table 3.2. With the intervention of the safety supervisor, particularly for Tn ≥ 7, MARL runs seamlessly without collisions across all traffic modes. In contrast, the baseline method exhibits collision rates of 0.07 and 0.16 for the Medium and Hard traffic densities, respectively. It is clear that with only a short prediction horizon, e.g., Tn = 3 or 6, the MARL still fails in challenging cases. For example, the agents still have a 0.03 collision rate in the Hard traffic mode when choosing Tn = 6. The reason is that if Tn is too small, the safety supervisor is “short-sighted” and can lead to no feasible solutions after only a few steps. Conversely, if Tn (e.g., Tn = 10, 12, 14) is excessively increased, collision rates might increase due to compounded uncertainties over extended durations, as discussed in Remark 3.3.2 in Section 3.3. This might also suppress average speeds as vehicles adopt more careful actions to ensure safety. With a reasonable Tn (e.g., 7, 8), the average speed indicates that the safety supervisor leads to higher traffic efficiency. In all traffic modes, the safety supervisor leads to a higher average speed together with a lower collision rate. For instance, the best average speed for the Easy traffic mode is achieved by baseline + Tn = 8 (27.72 m/s) compared to the baseline method (23.52 m/s).
Interestingly, the baseline's average speed in the Medium mode lags behind that of the Hard mode, a pattern inconsistent with the methods incorporating the safety supervisor. Our observation is that the CAVs (in the baseline) often behave conservatively and hesitate to speed up or slow down to let the merging vehicles merge, which leads to traffic congestion and is the main cause of the traffic inefficiency. In contrast, with the designed safety supervisor, this hesitation is reduced and traffic efficiency is largely improved.

Table 3.2 Comparative analysis of the n-step safety supervisor based on the baseline (bs) method, in terms of collision rate, average speed (m/s), and inference time (ms).
Scenario       Metric           bs      bs+Tn=3  bs+Tn=6  bs+Tn=7  bs+Tn=8  bs+Tn=9  bs+Tn=10  bs+Tn=12  bs+Tn=14
Easy Mode      collision rate   0       0        0        0        0        0        0         0         0
               avg. speed       23.53   25.12    25.38    25.27    27.72    27.50    25.89     25.82     25.74
               infrn time       -       4.90     7.62     8.30     8.93     10.07    11.08     12.62     14.71
Medium Mode    collision rate   0.07    0.03     0.01     0        0        0        0         0         0.01
               avg. speed       20.30   24.22    24.61    24.13    24.08    24.19    23.74     24.35     24.13
               infrn time       -       14.75    14.64    16.78    17.55    19.40    21.29     23.22     26.89
Hard Mode      collision rate   0.16    0.14     0.03     0        0        0        0.03      0.05      0.05
               avg. speed       21.71   22.52    22.56    22.58    22.73    23.01    22.52     21.31     21.83
               infrn time       -       14.75    23.13    25.69    28.13    31.45    35.93     39.43     50.13

Figure 3.14 Training curves comparison between the proposed MARL policy (baseline (bs) + Tn = 8) and 3 state-of-the-art MARL benchmarks.

Figure 3.15 Average speed comparison between the proposed MARL policy (baseline (bs) + Tn = 8) and 3 state-of-the-art MARL benchmarks.

3.4.4 Comparison With State-of-the-art Benchmarks

Table 3.3 Testing performance comparison of collision rate and average speed between the proposed method and four SOTA benchmark techniques.
Scenario       Metric             MPC     MAA2C   MAACKTR  MAPPO   baseline + Tn = 8
Easy Mode      collision rate     0.03    0.02    0.08     0       0
               avg. speed [m/s]   22.05   21.00   24.71    25.70   25.72
Medium Mode    collision rate     0.03    0.08    0.12     0.02    0
               avg. speed [m/s]   19.67   19.33   21.94    24.00   24.08
Hard Mode      collision rate     0.40    0.52    0.18     0.34    0
               avg. speed [m/s]   21.02   19.68   18.19    22.41   22.73

In this subsection, we compare the proposed method with several SOTA MARL benchmarks, including MAA2C, MAPPO, and MAACKTR. Additionally, we examine an enhanced model predictive control (MPC) method as cited in [18, 19]. All the MARL benchmarks are structured to share parameters among agents to accommodate dynamic agent numbers, utilizing global rewards and a discrete action space. Specifically, MAA2C [80] integrates a context-aware multi-agent actor-critic methodology with a centralized critic network employing an expected update. MAACKTR [146] fine-tunes the trust region using the Kronecker-factored approximate curvature (K-FAC) [91].
Meanwhile, MAPPO [149] enhances the MARL paradigm by incorporating best practices, including Generalized Advantage Estimation (GAE) [118], advantage normalization, and value clipping.
Figure 3.14 shows evaluation metrics during the training phase for all MARL strategies. Notably, our proposed approach (baseline + Tn = 8) persistently surpasses its contemporaries across varied traffic levels. Its superiority becomes even more obvious regarding sample efficiency and training outcomes, especially in the Hard mode. Figure 3.15 indicates that our proposed method maintains superior average training speeds, leading to high training effectiveness.
After training, all algorithms are tested in each traffic density for 30 epochs. The derived average collision rates and vehicle speeds are tabulated in Table 3.3. These findings highlight the proposed method's capability to avoid collisions entirely and operate with enhanced efficiency compared to the benchmark methods. In particular, MAPPO stands out in the Easy traffic setting, achieving zero collisions, and showcases commendable performance in the Medium mode with a minor collision rate (0.02). However, it demonstrates a high collision rate (0.34) in the Hard traffic mode due to abrupt speed fluctuations, as visualized in Figures 3.14 and 3.15, leaving HDVs little room to respond. Conversely, both MAA2C and MAACKTR, lacking safety checks, fail in the on-ramp merging tasks across all traffic scenarios, leading to high collision rates. It is noteworthy that the discrepancy between the exact dynamics used in the highway simulation environment and the model used in MPC, along with the uncertainties injected into the simulator, means that even the MPC can result in collisions (0.03, 0.03, and 0.40 collision rates in the Easy, Medium, and Hard traffic modes, respectively). This becomes particularly evident as traffic complexity increases; the merging challenge amplifies due to an augmented model mismatch, making it difficult for MPC to devise a collision-averse policy. This shows the strengths of model-free approaches that do not rely on explicit models. Furthermore, the MPC's model-centric implementation relies on potent computational resources to facilitate extensive real-time calculations. This becomes even more critical in on-ramp merging scenarios, which involve nonlinear dynamics and necessitate solving a nonlinear program at each time step to calculate the control input, requiring significant onboard computation power. In contrast, our RL-based approach requires much less computational capability [82].

Figure 3.16 Frames showing the learned policy. The bottom figure shows the corresponding speeds of the AVs.

3.4.5 Policy Interpretation
This subsection interprets the behaviors exhibited by the learned AVs. Figure 3.16 provides snapshots at time steps 25, 37, and 50 and outlines the speeds of agents 2-4. At time step 25, vehicle 2 decelerates, creating a space for vehicle 3 to merge. As vehicle 3 accelerates to merge, it maintains an appropriate gap from vehicle 1. By time step 37, vehicle 3 has smoothly merged and accelerated, while vehicle 2 moderates its speed to ensure a safe distance from vehicle 3. By time step 50, vehicle 2 accelerates, maintaining a secure following distance from vehicle 3. Similar dynamics are noted with vehicle 4.
3.4.6 Multiple Through-lane Case
The efficacy of the proposed model is further highlighted in the complex multiple through-lane scenarios showcased in Figure 3.8(b), where vehicles are allowed to change lanes in the through lanes. As shown in Section 2.1, the on-ramp merging is depicted as a POMDP MG, which can be described by the tuple ({Ai, Si, Ri}i∈ν, T). Here, the action space A extends to fit multiple through-lane scenarios, with a slightly adjusted state space to account for additional surrounding vehicles. Specifically, the observation space (the number of observable neighboring vehicles) is determined by the parameter NNi. For the multiple through-lane case, we choose a larger NNi = 8 (NNi = 5 in the single through-lane case). The priority-based safety supervisor is also extended to the multi-lane case without any changes. The reward function was modified after observing that ego vehicles frequently made unnecessary lane changes, leading to unsafe driving behaviors (a demo of frequent lane changes is available at https://drive.google.com/file/d/1dO8xPCwLXVRgQFM_xwqscRazoId5ksf4/view?usp=sharing). This revised reward function is constructed with an additional metric, rl, which penalizes unwarranted and repeated lane changes, inspired by designs in [116]:

$r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_m r_m + w_l r_l,$  (3.12)

where wc, ws, wh, wm and wl are positive weighting coefficients corresponding to the collision evaluation rc, stable-speed evaluation rs, headway time evaluation rh, merging cost evaluation rm, and lane-changing evaluation rl, respectively. Here, rl is defined as (a short sketch of this reward follows at the end of this subsection):

$r_l = \begin{cases} -1, & \text{if changing lanes;} \\ 0, & \text{otherwise.} \end{cases}$  (3.13)

To further validate the flexibility and effectiveness of the proposed MARL framework, we implemented the aforementioned multiple through-lane cases in the highway environment. Figure 3.17 shows that our approach can be easily extended to the multiple through-lane cases and achieves good performance. Table 3.4 shows the evaluation performance on the multi-lane scenarios. As expected, the proposed approach achieves the best performance among the MARL benchmarks in terms of the lowest collision rate. Overall, MAPPO, MAA2C, and MAACKTR achieve better performance in terms of lower collision rates than in the single through-lane case since the extra through lane provides more operating space for the CAVs. However, nearly all MARL algorithms achieve relatively lower average speeds in the multi-lane case than in the single-lane case due to more complicated traffic scenarios with more CAVs and HDVs. It is noted that MAA2C learns a suboptimal policy in the Medium traffic mode, exhibiting the lowest average speed due to conservative operations. Since applying MPC approaches to the multi-through-lane cases is a very involved task given the complicated system modeling, we leave it for future work. The demo video and code for the multiple through-lane scenarios can be found at https://github.com/DongChen06/MARL_CAVs/tree/multi-lane.

Figure 3.17 Training curves for the n-step priority-based safety supervisor for the multiple through-lane cases.
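As promised above, the revised reward of Eqs. (3.12)-(3.13) can be sketched as below. The first four weights follow the single-lane setup reported earlier; the lane-change weight `w_l = 1.0` is an assumed placeholder, since its value is not reported in the text.

```python
def multi_lane_reward(r_c, r_s, r_h, r_m, changed_lane,
                      w_c=200.0, w_s=1.0, w_h=4.0, w_m=4.0, w_l=1.0):
    """Reward of Eq. 3.12 with the lane-change penalty r_l of Eq. 3.13."""
    r_l = -1.0 if changed_lane else 0.0          # penalize unnecessary lane changes
    return w_c * r_c + w_s * r_s + w_h * r_h + w_m * r_m + w_l * r_l
```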
Table 3.4 Testing performance comparison of collision rate and average speed between the proposed method and 3 state-of-the-art benchmarks on the multiple through-lane cases.
Scenario       Metric             MAA2C   MAACKTR  MAPPO   baseline + Tn = 8
Easy Mode      collision rate     0       0.03     0       0
               avg. speed [m/s]   19.78   22.71    23.07   23.53
Medium Mode    collision rate     0.03    0.07     0.03    0
               avg. speed [m/s]   15.16   20.07    19.59   21.05
Hard Mode      collision rate     0.40    0.27     0.10    0
               avg. speed [m/s]   22.04   21.60    19.65   20.95

3.5 Conclusions And Discussions
This chapter formulated the on-ramp merging challenge in mixed-traffic scenarios as an on-policy MARL problem, incorporating action masking, local reward design, curriculum learning, and parameter sharing, and demonstrating robust performance over several SOTA benchmarks. Furthermore, a unique priority-based safety supervisor was introduced, which significantly improved safety and enhanced learning efficiency.
In future work, we aim to bridge the gaps between simulations and their real-world applications. Initial explorations in general RL training can inadvertently result in undesired system behaviors, even leading to potential crashes upon real-world deployment. To address this, safety during exploration can be enhanced by exploiting the dynamic information of the system to limit the exploration actions within an admissible range; see, e.g., our previous work [22] as well as others [4, 46] for safe RL algorithms. It is essential that, before transitioning to real-world applications, our policy network undergoes rigorous examination in high-fidelity simulations until it achieves optimal performance. Furthermore, it needs to pass various tests before field deployment to ensure both safety and robustness. Once the policy is deployed on the ego vehicles, regular updates and maintenance should be conducted to enhance the model performance in unseen scenarios. For a comprehensive understanding of the sim-to-real paradigm in reinforcement learning, we recommend the in-depth survey in [154]. Therefore, we will develop a more realistic simulation environment by incorporating data from real-world traffic systems to better close the sim-to-real gap.

CHAPTER 4
CACC WITH FULLY DECENTRALIZED AND COMMUNICATION EFFICIENT MARL
In this chapter, we introduce a fully decentralized MARL framework, equipped with a novel quantization-based communication protocol, for Cooperative Adaptive Cruise Control (CACC).
4.1 Introduction
Recently, connected autonomous vehicles (CAVs) have gained significant attention due to their ability to create safe and sustainable transportation systems in the future [140]. One pivotal technology of CAVs, known as Cooperative Adaptive Cruise Control (CACC), has been recognized for its capacity to increase road usage efficiency, alleviate traffic congestion, and decrease both energy consumption and exhaust emissions [28, 107]. The primary objective of CACC is to adaptively coordinate a fleet of vehicles, aiming to minimize the car-following headway and speed variations, utilizing real-time vehicle-to-vehicle (V2V) communication [26]. While autonomous vehicle platooning offers great benefits, developing a robust CACC platform that tightly integrates computing, communication, and control technologies presents a considerable challenge, especially considering the constraints of limited onboard communication bandwidth and computing resources [155]. Classical control theory and optimization-based methodologies have been employed to tackle the CACC problem [92, 44, 3, 145, 50].
Specifically, some research targets the predecessor-following model [92] and string stability [138, 41], modeling CACC within the context of a two-vehicle system. In contrast, other studies formulate the challenges posed by CACC as optimal control problems [63, 44, 145]. However, these approaches frequently hinge on precise system modeling [92, 138, 41] or necessitate online optimization, which may not align with the demands of efficiency and scalability that are essential for real-time application [28].
On the other hand, platoon control has also been conceptualized as a sequential decision problem and addressed with data-driven strategies such as reinforcement learning (RL) [32, 28, 74, 62, 155, 140, 84, 76]. In particular, in [62], the Soft Actor-Critic (SAC [54]) is adopted to mitigate traffic oscillations and enhance platoon stability. Furthermore, the deep deterministic policy gradient (DDPG [79]) algorithm is employed in [140] for CACC, taking into account both time-varying leading velocity and communication delays via wireless V2V communication technology. A policy-gradient RL approach is developed in [32] to ensure a safe longitudinal distance to the front vehicle. However, these approaches primarily focus on platoons of only 2 vehicles (i.e., a leader-follower architecture). To control multiple CAVs, centralized RL approaches are frequently developed, a strategy that relies heavily on the high-bandwidth capabilities of vehicle-to-cloud (V2C) or vehicle-to-infrastructure (V2I) communication [61]. For instance, in [28], a centralized RL controller is introduced for the CACC problem in mixed-traffic scenarios via V2C communication. While these centralized control strategies have demonstrated promising results, they bear the burden of heavy communication overheads and are often plagued by a single point of failure and the curse of dimensionality [26]. These factors make them impractical for deployment in the large-scale CACC systems prevalent in the future landscape.
More recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to address the CACC control problem involving multiple AVs, owing to its capabilities for online adaptation and solving complex problems [108, 26, 111]. For instance, a MARL framework with both local and global reward designs is evaluated in [108] on two platoons of 3 and 5 AVs, concluding that the local reward design (i.e., independent MARL) surpasses the global reward design. However, our experiments will demonstrate that independent MARL achieves promising performance in straightforward CACC scenarios but falls short in more complex situations (see Sec. 4.4). In [26], a learnable communication MARL protocol is developed to reduce information loss across two CACC scenarios, and each agent (i.e., AV) learns a decentralized control policy based on local observations and messages from connected neighbors. Moreover, Blockchain is incorporated into the MARL (i.e., MADDPG [87]) framework to enhance the privacy of CACC. Despite these advances, these approaches uniformly adopt a Centralized Training and Decentralized Execution (CTDE) framework, wherein agents use additional global information to guide training centrally and make decisions based on decentralized local policies [152, 151]. However, in many real-world scenarios, such as CACC, installing a central controller (e.g., cloud facilities or roadside units) can be prohibitively expensive.
Moreover, the central controller needs to communicate with each agent to exchange information, which perpetually amplifies the communication overhead on the single controller [152].
In this chapter, we formulate CACC as a fully decentralized MARL problem, in which the agents are connected via a sparse communication network without the need for any central controller. To achieve this, we introduce a decentralized MARL algorithm based on a novel policy gradient update mechanism. Throughout the training process, at every time step, each agent takes an individual action based solely on locally available information. To stabilize training and counteract the inherent non-stationarity in MARL [151], each agent shares its estimate of the value function with its neighbors on the network, collectively aiming to maximize the average rewards of all agents across the network. Furthermore, a novel quantization-based communication scheme is further proposed, which greatly improves communication efficiency in decentralized stochastic optimization without a substantial compromise on optimization accuracy. The main contributions and the technical advancements of this chapter are summarized as follows.
1. We formulate the CACC problem as a fully decentralized MARL framework, which allows fast convergence without any central controller. A corresponding gym-like simulation platform with two CACC scenarios and six state-of-the-art MARL baseline algorithms is developed and open-sourced (see https://github.com/DongChen06/CACC_MARL).
2. We introduce an innovative, effective, and scalable MARL algorithm equipped with a quantization-based communication protocol to enhance communication efficiency. The quantization process condenses complex parameters of the critic network into discrete representations, facilitating efficient information exchange among agents.
3. We conduct comprehensive experiments on two CACC scenarios, and the results show that the proposed approach consistently outperforms several state-of-the-art MARL algorithms.
The structure of this chapter is as follows: In Section 4.2, we introduce the CACC problem that we are addressing. The problem formulation and the proposed MARL framework are introduced in Sec. 4.3, whereas experiments, results, and discussions are presented in Sec. 4.4. Lastly, in Section 4.5, we conclude the chapter by summarizing our contributions and suggesting potential insights for future research.

Figure 4.1 Framework of the CACC system.

4.2 Cooperative Adaptive Cruise Control (CACC)
In this section, we introduce the system model for vehicle platooning along with the behavior model employed within the platoon. Furthermore, we present an introduction to the two CACC scenarios used in this chapter.
4.2.1 Vehicle Dynamics
As shown in Figure 4.1, we consider a platoon, comprised of V + 1 CAVs, navigating along a horizontal road. For simplicity, we assume that all vehicles in the system share identical characteristics. The platooning system is guided by a platoon leader vehicle (PL, 0th vehicle), while the platoon member vehicles (PMs, i ∈ {1, ..., V}) travel behind the PL. Each PM i maintains a desired inter-vehicle distance (IVD) hi and velocity vi relative to its preceding vehicle i − 1, based on its unique spacing policy [155].
The one-dimensional dynamics of vehicle i can be expressed as follows:

$\dot h_i = v_{i-1} - v_i,$  (4.1a)
$\dot v_i = u_i,$  (4.1b)

where $v_{i-1}$ and $u_i$ symbolize the velocity of its preceding vehicle and the acceleration of vehicle i, respectively. As per the design outlined in [28], the discretized vehicle dynamics, given a sampling time ∆t, can be denoted as:

$h_{i,t+1} = h_{i,t} + \int_{t}^{t+\Delta t} (v_{i-1,\tau} - v_{i,\tau})\, d\tau,$  (4.2a)
$v_{i,t+1} = v_{i,t} + u_{i,t}\,\Delta t.$  (4.2b)

In order to guarantee both comfort and safety, each vehicle must satisfy the following constraints [28]:

$h_{i,t} \ge h_{\min},$  (4.3a)
$0 \le v_{i,t} \le v_{\max},$  (4.3b)
$u_{\min} \le u_{i,t} \le u_{\max},$  (4.3c)

where $h_{\min} = 1$ m, $v_{\max} = 30$ m/s, $u_{\min} = -2.5$ m/s² < 0 and $u_{\max} = 2.5$ m/s² > 0 represent the minimum safe headway, maximum speed, and the deceleration and acceleration limits, respectively.
4.2.2 Vehicle Behavior
The behavior of vehicles in the platoon is simulated using the Optimal Velocity Model (OVM [8]). The OVM has been widely used in traffic flow modeling due to its ability to capture realistic human driving behaviors [26]. The principal equation of the OVM for the ith vehicle is defined as follows:

$u_{i,t} = \alpha_i \left( v^{\circ}(h_{i,t}; h_s, h_g) - v_{i,t} \right) + \beta_i \left( v_{i-1,t} - v_{i,t} \right),$  (4.4)

where $\alpha_i$ and $\beta_i$ are the headway gain and relative velocity gain, respectively. These parameters serve as representations of human driver behavior, encapsulating the influence of both spacing and relative speed in determining vehicle acceleration. Here, $h_s = 5$ m and $h_g = 35$ m denote the stop headway and full-speed headway, both of which are key to understanding traffic dynamics at different vehicle densities. Furthermore, $v^{\circ}$ represents the headway-based velocity policy, which is defined as:

$v^{\circ}(h) \triangleq \begin{cases} 0, & \text{if } h < h_s; \\ \frac{1}{2} v_{\max} \left( 1 - \cos\!\left( \pi \frac{h - h_s}{h_g - h_s} \right) \right), & \text{if } h_s \le h \le h_g; \\ v_{\max}, & \text{if } h > h_g. \end{cases}$  (4.5)

This policy function serves as an optimal velocity strategy for each vehicle based on the current headway to the preceding vehicle. At small headways, less than or equal to $h_s$, the optimal velocity is zero, highlighting the need for vehicle stopping to prevent potential collisions. For headways within the range $h_s$ to $h_g$, the optimal velocity gradually increases following a cosine curve until reaching the maximum velocity. For larger headways, greater than or equal to $h_g$, the optimal velocity is capped at the vehicle's maximum speed, ensuring both safety and efficiency in the traffic flow. This strategy significantly contributes to maintaining fluidity in vehicular traffic under various density conditions.
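For concreteness, a minimal Python sketch of the OVM longitudinal update (Eqs. 4.2, 4.4, 4.5) is given below, using the constants stated in the text; the simple Euler approximation of the headway integral is an assumption of this sketch rather than the simulator's exact integration scheme.

```python
import math

# Constants from the text: stop/full-speed headway, speed and acceleration limits.
H_S, H_G, V_MAX = 5.0, 35.0, 30.0
U_MIN, U_MAX = -2.5, 2.5

def v_opt(h):
    """Headway-based optimal velocity policy, Eq. 4.5."""
    if h < H_S:
        return 0.0
    if h > H_G:
        return V_MAX
    return 0.5 * V_MAX * (1.0 - math.cos(math.pi * (h - H_S) / (H_G - H_S)))

def ovm_step(h_i, v_i, v_lead, alpha, beta, dt=0.1):
    """One discretized OVM update for follower i (Eqs. 4.2 and 4.4)."""
    u_i = alpha * (v_opt(h_i) - v_i) + beta * (v_lead - v_i)   # Eq. 4.4
    u_i = max(U_MIN, min(U_MAX, u_i))                          # comfort/safety limits, Eq. 4.3
    h_next = h_i + (v_lead - v_i) * dt                         # Euler form of Eq. 4.2a (assumed)
    v_next = max(0.0, min(V_MAX, v_i + u_i * dt))              # Eq. 4.2b with speed limits
    return h_next, v_next, u_i
```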
4.2.3 Two CACC Scenarios
In this chapter, the objective of CACC is to adaptively control a fleet of CAVs in order to reduce the car-following headway to h∗ = 20 m and achieve a target velocity of v∗ = 15 m/s, leveraging real-time V2V communication. Two different CACC scenarios are investigated, as presented in [26]: “Catchup” and “Slowdown”. For the “Catchup” scenario, the CAVs (i = 1, ..., V) are initialized with states vi,0 = v∗t and hi,0 = h∗t, while the platoon leader (PL) is initialized with states v0,0 = v∗t and h0,0 = a · h∗t m, where a is a random variable uniformly distributed between 3 and 4. In contrast, during the “Slowdown” scenario, all vehicles (i = 0, 1, ..., V) have initial velocities vi,0 = b · v∗t and hi,0 = h∗t, where b is uniformly distributed between 1.5 and 2.5. Here, v∗t linearly decreases to 15 m/s within the first 30 seconds and then remains constant. The “Slowdown” scenario poses a more complex and challenging task than the “Catchup” scenario due to the necessity for all vehicles to precisely coordinate their deceleration rates and maintain safe inter-vehicle distances, thereby requiring more precise control strategies. An example of the headway and speed profiles of the CAVs in these scenarios is illustrated in Figure 4.4 and Figure 4.5.
4.3 CACC As MARL
In this section, we first formulate the considered CACC problem as a partially observable Markov decision process (POMDP). Subsequently, we present our fully decentralized actor-critic-based MARL algorithm, which represents our primary strategy for addressing the challenges presented in the CACC problem. Then, we introduce the quantization-based communication protocol to enhance the efficiency of agent communication within the MARL framework.
4.3.1 Problem Formulation
In this chapter, we model the CACC problem as a model-free multi-agent network [26], where each agent (i.e., AV) is capable of communicating with the vehicles ahead of and behind it via V2V communication channels. We denote the global state space and action space as S := ×i∈ν Si and A := ×i∈ν Ai, respectively. The intrinsic dynamics of the system can be characterized by the state transition distribution P: S × A × S → [0, 1]. We propose a fully decentralized MARL framework where each agent i (equivalently, AV i) has a partial view of the environment, specifically the surrounding vehicles, which accurately reflects the practical scenario where AVs are limited to sensing or communicating with proximal vehicles, thereby rendering the overall dynamical system a partially observable Markov decision process (POMDP). This POMDP, MG, can be delineated by the tuple MG = ({Ai, Si, Ri}i∈ν, T) (a compact sketch of the resulting agent interface follows this list):
1. Action Space: In the considered CACC problem, the action at ∈ Ai is straightforwardly related to longitudinal control. However, due to the data-driven nature of RL, formulating a safe and robust longitudinal control strategy poses a significant challenge [28]. To address this, we adopt the OVM (see Sec. 4.2.2, [8]) to carry out the longitudinal vehicle control. The OVM control behavior is affected by several hyperparameters: the headway gain α, relative velocity gain β, stop headway hs, and full-speed headway hg. Usually, (α, β) represents the driving behavior of a human driver. However, in this work, we leverage MARL to propose suitable values of (α, β) for each OVM controller. These recommended values are selected from a set of four different levels: {(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)}. Subsequently, the longitudinal action can be computed using Eq. 4.4 and Eq. 4.5.
2. State Space: The state space represents the description of the environment. The state of agent i, Si, is defined as [v, vdiff, vh, h, u], where v = (vi,t − vi,0)/vi,0 denotes the current normalized vehicle speed and vdiff = clip((vi−1,t − vi,t)/5, −2, 2) represents the clipped speed difference with its leading vehicle. Furthermore, vh = clip((v°(h) − vi,t)/5, −2, 2), h = (hi,t + (vi−1,t − vi,t)∆t − h∗)/h∗, and u = ui,t/umax are the headway-based velocity defined in Eq. 4.5, the normalized headway distance, and the normalized acceleration, respectively.
3. Reward Function: The reward function ri,t is pivotal for training the RL agents to exhibit the desired behaviors. With our objective being to train the agents to achieve a predefined car-following headway h∗ = 20 m and velocity v∗ = 15 m/s, the reward assigned to the ith agent at each time step t is formulated as follows:

$r_{i,t} = w_1 (h_{i,t} - h^*)^2 + w_2 (v_{i,t} - v^*)^2 + w_3 u_{i,t}^2 + w_4 (2h_s - h_{i,t})_+^2,$  (4.6)

where wi, i ∈ {1, 2, 3, 4}, are the weighting coefficients. In this equation, the first two terms, (hi,t − h∗)² and (vi,t − v∗)², penalize deviations from the desired headway and velocity, encouraging the agent to achieve these targets closely. The third term, u²i,t, is included to minimize abrupt accelerations, thereby promoting smoother and more comfortable rides for passengers. Lastly, the term (2hs − hi,t)²₊ functions as a safety constraint, penalizing the agent heavily if the inter-vehicle distance is less than twice the stop headway hs, which is critical for preventing collisions and ensuring the safety of the vehicle platoon. This comprehensive reward design serves to balance performance, comfort, and safety considerations in the CACC system. Upon a collision, i.e., if the inter-vehicle distance hi,t ≤ 1 m, each agent is subjected to a substantial penalty of 1000, immediately terminating the training episode.
4. Transition Probabilities: The transition probability T(s′|s, a) describes the dynamics of the system. Given that our approach is a model-free MARL framework, we do not assume any prior knowledge of this transition probability while developing our MARL algorithm.
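As referenced above, the following is a minimal Python sketch of the per-agent interface implied by this formulation. The constants follow the text, `v_opt` refers to the OVM sketch in Sec. 4.2.2, the weights (1.0, 1.0, 0.1, 5.0) are those reported later in Sec. 4.4.1, and the negative sign on the returned reward is an assumption chosen so that larger deviations yield lower returns.

```python
import numpy as np

H_STAR, V_STAR, H_S, U_MAX, DT = 20.0, 15.0, 5.0, 2.5, 0.1
ALPHA_BETA_LEVELS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]

def decode_action(action_index):
    """Map a discrete MARL action to the OVM gains (alpha, beta)."""
    return ALPHA_BETA_LEVELS[action_index]

def observe(v_i, v_i0, v_lead, h_i, u_i, v_opt):
    """Normalized local state [v, v_diff, vh, h, u] of agent i."""
    return np.array([
        (v_i - v_i0) / v_i0,
        np.clip((v_lead - v_i) / 5.0, -2.0, 2.0),
        np.clip((v_opt(h_i) - v_i) / 5.0, -2.0, 2.0),
        (h_i + (v_lead - v_i) * DT - H_STAR) / H_STAR,
        u_i / U_MAX,
    ])

def reward(h_i, v_i, u_i, w=(1.0, 1.0, 0.1, 5.0)):
    """Per-agent reward built from Eq. 4.6 plus the collision penalty (sign assumed)."""
    penalty = (w[0] * (h_i - H_STAR) ** 2 + w[1] * (v_i - V_STAR) ** 2
               + w[2] * u_i ** 2 + w[3] * max(2.0 * H_S - h_i, 0.0) ** 2)
    if h_i <= 1.0:                 # collision: large penalty, episode terminates
        penalty += 1000.0
    return -penalty
```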
4.3.2 Fully Decentralized MARL
In this chapter, we formulate CACC as a fully decentralized MARL scenario, where each agent (i.e., an autonomous vehicle) independently decides its action based solely on its local observation. Importantly, this structure lacks a centralized controller, meaning that each agent possesses its own individual policy networks. During the learning phase, agents rely on locally received rewards to train and update these networks. In this chapter, we adopt the actor-critic MARL framework [29], and the policy gradient for agent i is given as:

$\nabla_{\theta} L(\pi_{\theta_i}) = \mathbb{E}_{\pi_{\theta_i}}\!\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta_i}(a_{i,t} \mid s_{i,t})\, A^{\pi_{\theta_i}}_{i,t} \right],$  (4.7)

where $A^{\pi_{\theta_i}}_{i,t} = r_{i,t} + \gamma V^{\pi_{\phi_i}}(s_{i,t+1}) - V^{\pi_{\phi_i}}(s_{i,t})$ is the advantage function and $V^{\pi_{\phi_i}}(s_{i,t})$ is the state value function, which is updated following the loss function:

$L^{V}_{\phi_i} = \min_{\phi_i} \mathbb{E}_{D_i}\!\left[ r_{i,t} + \gamma V_{\phi_i}(s_{i,t+1}) - V_{\phi_i}(s_{i,t}) \right]^2.$  (4.8)

Despite each agent learning independently, the overall goal of the cooperative MARL framework is to optimize the average global reward $r_{g,t} = \frac{1}{V+1}\sum_{i=0}^{V} r_{i,t}$. To address the non-stationarity, in [152] the update of the policy network is executed independently by each agent, eliminating the need for inferring other agents' policies. However, when it comes to updating the critic network, a collaborative approach is adopted, in which each agent shares its estimate of the value function xi with its neighboring agents within the network through a “mean” operation, i.e., $x_i^{k+1} = \frac{1}{|N_i|}\sum_{j \in N_i} x_j^{k}$. This allows for the joint evolution and continuous improvement of the system's overall performance. However, this approach is based on the assumption that all agents are homogeneous, sharing the same characteristics. While this simplifies the problem structure, it does not adequately represent the intrinsic diversity of individual agents, which is particularly relevant for the CACC scenario, where diverse strategies are needed based on vehicles' positions, speeds, and proximities.
To address this concern, we propose a novel update strategy that fosters a balance between individual learning and collaborative influence from neighboring agents as follows:

$x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\,(x_j^{k} - x_i^{k}) - \lambda g_i^{k},$  (4.9)

where the scaling factor ϵ = 1.0 × 10−4 modulates the impact or collaborative influence from neighboring agents, while the learning rate λ = 5.0 × 10−4 adjusts the influence of the gradient on the update process. This novel update strategy fosters collaboration among the agents while preserving the individual learning capabilities of each, thereby striking a balance between global performance optimization and localized adaptivity. The update strategy of the proposed fully decentralized MARL for CACC (abbreviated as MACACC) is given in Algorithm 4.1.

Algorithm 4.1 MACACC for CACC
Public parameters: W, ϵk, λk, x0i for all i, and the total number of iterations K.
for the ith agent do
    determine the local gradient gik for the critic network
    send states to all neighboring agents j ∈ Ni
    after receiving xkj from all j ∈ Ni, update the network parameters as
        $x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\,(x_j^{k} - x_i^{k}) - \lambda g_i^{k}$
end

4.3.3 Quantization-based Communication Protocol
To enhance communication efficiency among agents in our MARL framework, we propose a strategy of transmitting quantized parameters, rather than the raw parameters of the critic network. This approach is especially important for autonomous driving applications that are often subject to limitations in communication bandwidth. By transmitting compact, quantized parameters instead of raw data, we ensure optimal use of available bandwidth, thereby fostering efficient and effective communication among the vehicles in the network.
Let us denote the parameters of the critic network as x = [x1, x2, ..., xd]ᵀ, with d representing the dimension of the parameter vector. We then apply a quantization function Q(x) to these parameters, yielding a quantized parameter vector [q1, q2, ..., qd]ᵀ. The quantization rule is defined as:

$q_i = r\,\mathrm{sign}(x_i)\, b_i,$  (4.10)

Here, r is a non-negative real number that is at least the maximum absolute value of the entries in x (denoted as ∥x∥∞), while sign(·) stands for the sign function, which returns the sign of any given real number. The factor bi is a random variable following a designed distribution determined by the magnitude of the corresponding parameter xi. Let n be the resolution of the quantization, so that overall 2n + 1 discrete points will be generated. We assume |xi| belongs to the interval $[\frac{k}{n} r, \frac{k+1}{n} r]$; then the probability distribution of bi can be determined as:

$P\!\left(b_i = \tfrac{k+1}{n} \,\middle|\, x\right) = \frac{n |x_i|}{r} - k,$  (4.11a)
$P\!\left(b_i = \tfrac{k}{n} \,\middle|\, x\right) = 1 - \left(\frac{n |x_i|}{r} - k\right),$  (4.11b)
$P\!\left(b_i = \tfrac{l}{n} \,\middle|\, x\right) = 0, \quad l \in \{0, 1, ..., k-1, k+2, ..., n\}.$  (4.11c)

If the magnitude of xi is closer to $\frac{k+1}{n} r$, then the higher the probability that bi will be $\frac{k+1}{n}$, and vice versa. An illustration of the proposed quantization-based communication scheme is presented in Figure 4.2. We denote the quantization-based MACACC algorithm as QMACACC (n). An extremely condensed version of QMACACC is QMACACC (1), in which only three discrete numbers {−1, 0, 1} are used to represent each parameter, and bi is defined as:

$P(b_i = 1 \mid x) = \frac{|x_i|}{r},$  (4.12a)
$P(b_i = 0 \mid x) = 1 - \frac{|x_i|}{r}.$  (4.12b)

Figure 4.2 Schematics of the proposed quantization scheme.
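A minimal Python sketch of this randomized-rounding quantizer (Eqs. 4.10-4.11) is given below; it is an illustrative, unbiased-in-expectation implementation under the stated assumptions, not the project's released code.

```python
import numpy as np

def quantize(x, n):
    """Randomized quantization of a parameter vector x to resolution n (Eqs. 4.10-4.11)."""
    x = np.asarray(x, dtype=float)
    r = np.max(np.abs(x))                      # r >= ||x||_inf, here taken as the max magnitude
    if r == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) * n / r                 # lies in [0, n]
    k = np.floor(scaled)
    prob_up = scaled - k                       # P(b_i = (k+1)/n | x), Eq. 4.11a
    b = (k + (np.random.random(x.shape) < prob_up)) / n
    return r * np.sign(x) * b                  # Eq. 4.10; with n = 1 this yields {-r, 0, r}
```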
Remark 4.3.1. The quantization resolution n is a significant hyperparameter in the quantization scheme. If n is too small, the coarse granularity of the quantization could lead to excessive loss of information, negatively impacting the performance of the MARL framework. Conversely, if n is too large, the computational and communication overheads could increase due to the larger number of potential quantized values. Therefore, choosing an appropriate value of n is crucial for balancing communication efficiency and the performance of the MARL framework. An empirical evaluation of different n values is conducted in the experiments section.
The update strategy of the proposed quantization-based MARL (i.e., QMACACC (n)) for CACC is given in Algorithm 4.2.

Algorithm 4.2 QMACACC (n) for CACC
Public parameters: W, ϵk, λk, x0i for all i, and the total number of iterations K.
for the ith agent do
    determine the local gradient gik for the critic network
    quantize the states according to Eq. 4.11 and send them to all neighboring agents j ∈ Ni
    after receiving Q(xkj) from all j ∈ Ni, update the state as
        $x_i^{k+1} = x_i^{k} + \epsilon \sum_{j \in N_i} \omega_{ij}\left(Q(x_j^{k}) - Q(x_i^{k})\right) - \lambda g_i^{k}$
end
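The listing above can be sketched in Python as a single synchronous update over all agents; this is a minimal illustration assuming a `quantize` function as in the previous sketch, fixed mixing weights ω_ij, and locally computed critic gradients.

```python
def qmacacc_update(params, gradients, neighbors, weights, quantize, n=1,
                   eps=1.0e-4, lam=5.0e-4):
    """One QMACACC step (Algorithm 4.2): consensus on quantized critic parameters
    plus a local gradient step, following Eq. 4.9."""
    quantized = {i: quantize(params[i], n) for i in params}    # each agent broadcasts Q(x_i)
    updated = {}
    for i in params:
        # Weighted disagreement with neighbors, computed on the quantized parameters.
        consensus = sum(weights[i][j] * (quantized[j] - quantized[i]) for j in neighbors[i])
        updated[i] = params[i] + eps * consensus - lam * gradients[i]
    return updated
```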
4.4 Experimental Results & Discussions
In this section, we evaluate our MARL framework in the two CACC scenarios detailed in Section 4.2.3. Firstly, we benchmark our approach against several state-of-the-art MARL strategies. Then, we demonstrate the effectiveness of our quantization-based communication protocol.
4.4.1 General Setups
To demonstrate the efficiency and robustness of MACACC, we compare it to several state-of-the-art MARL benchmark controllers. IA2C performs independent learning, while ConseNet [152] takes the “mean” operation during updates of the critic networks, and FPrint [43] incorporates the neighbors' policies into the inputs. DIAL [42], CommNet [126], and NeurComm [26] are implementations with learnable communication protocols, incorporating more messages from the neighbors, e.g., neighboring states or policy information, and thus relying on higher communication bandwidth. All algorithms use the same DNN structures: one fully-connected layer for input state encoding, followed by one LSTM layer for message extraction. All hidden layers have 64 units. During training, the network is initialized with the state-of-the-art orthogonal initializer [115]. We train each model over 1M steps, with γ = 0.99, an actor learning rate of 5.0 × 10−4, and a critic learning rate of 2.5 × 10−4. Also, each algorithm is trained three times with different random seeds for generalization purposes. Each training takes about 12 hours on an Ubuntu 18.04 server with an AMD 9820X processor and 64 GB memory. The hyperparameters wi, i ∈ {1, 2, 3, 4}, in the reward function Eq. 4.6 are set to 1.0, 1.0, 0.1, and 5.0, respectively, with a significant emphasis on penalizing situations where the safety headway distance is insufficient. Considering a simulated traffic environment over a period of T = 60 seconds, we define ∆t = 0.1 seconds as the interaction period between RL agents and the traffic environment, so that the environment is simulated for ∆t seconds after each MDP step. In the following experiments, we assume the platoon size to be V + 1 = 8, implying that there are a total of 8 CAVs in the platoon. The impact of different platoon sizes on our model's performance will be studied and presented in Sec. 4.4.4.

Figure 4.3 Training curves comparison between the proposed MARL policy (MACACC) and 6 state-of-the-art MARL benchmarks.

Table 4.1 Execution performance comparison over trained MARL policies. The best values are in bold.
Scenario     IA2C       FPrint     ConseNet   NeurComm   CommNet    DIAL       MACACC
Catch-up     -241.38    -198.93    -94.67     -301.41    -397.55    -227.68    -50.44
Slow-down    -2103.38   -1470.41   -1746.43   -1912.23   -2590.72   -1933.27   -492.30

Table 4.2 Performance of MARL controllers in CACC environments: “Catchup” (above) and “Slowdown” (below). The best values are in bold.
Temporal Average Metrics      IA2C    FPrint   ConseNet  NeurComm  CommNet  DIAL    MACACC
avg vehicle headway [m]       19.43   20.02    20.28     21.77     22.38    21.86   19.91
avg vehicle velocity [m/s]    15.00   15.34    15.30     15.04     15.01    15.01   15.32
collision number              0       0        0         0         0        0       0
avg vehicle headway [m]       -       15.16    9.23      11.45     4.90     9.71    20.44
avg vehicle velocity [m/s]    -       13.10    8.08      10.32     4.18     8.91    16.61
collision number              50      14       29        22        38       26      0

4.4.2 Comparison With State-of-the-Art Benchmarks
Figure 4.3 shows the performance comparison in terms of the learning curves between the proposed approach MACACC and several state-of-the-art MARL benchmarks. As expected, the proposed approach achieves the best performance, evidenced by higher training rewards in both CACC scenarios. In the more challenging “Slowdown” environment, the proposed approach shows its greater advantage in sample efficiency, as seen from the fastest convergence speed and best training reward compared to the other algorithms.
After training, we evaluate each algorithm 50 times with different initial conditions. Table 4.1 shows the evaluation performance comparison over the trained MARL policies. The proposed method consistently outperforms the benchmarks in all CACC scenarios in terms of the evaluation reward, which reflects the overall evaluation metrics, including vehicle headway, velocity, acceleration, and safety, as described in Eq. 4.6.
Table 4.2 shows the key evaluation metrics in CACC. The best headway and velocity averages are the ones closest to h∗ = 20 m and v∗ = 15 m/s. Note that the averages are only computed from safe execution episodes, and we use another metric, “collision number”, to count the number of episodes in which a collision happens within the horizon. Ideally, “collision-free” is the top priority. It is clear that our approach achieves promising performance in the “Catchup” environment, and the best performance in the harder “Slowdown” environment. All algorithms achieve relatively good performance in the “Catchup” environment with zero collisions. It is surprising that IA2C achieves excellent average vehicle velocity at v∗. However, it demonstrates a high collision number (i.e., 50) in the “Slowdown” scenario due to non-stationarity issues, since there is no communication between agents. FPrint yields the best average vehicle headway in the “Catchup” environment, while it has 14 out of 50 collisions during testing. On the other hand, NeurComm and CommNet show great average vehicle velocity in the “Catchup” environment; however, they fail to track the optimal headway, resulting in high average headways of 21.77 m and 22.38 m, respectively.
It is noted that ConseNet achieves promising performance in the “Catchup” environment, with zero collisions and an average vehicle headway (20.28 m) and velocity (15.30 m/s) close to the optimal values. However, it yields high collision numbers (29 out of 50) in the “Slowdown” scenario, as it simply encourages all agents to behave similarly via the “average” operations during training, which is especially impractical for complex scenarios, such as “Slowdown”, where agents need to react differently to the speed and headway changes.
Figure 4.4 and Figure 4.5 show the corresponding headway and velocity profiles of the selected controllers for the two CACC scenarios. In the “Catchup” scenario, as expected, the MACACC controller is able to achieve the steady-state v∗ and h∗ for the first and last vehicles of the platoon, whereas the ConseNet controller still has difficulty eliminating the perturbation through the platoon. In the harder “Slowdown” environment, MACACC is still able to achieve the optimal headway at about 60 seconds and reach the optimal velocity quickly. However, FPrint fails the control task with a collision that happens at about 35 seconds. This may be because simply incorporating neighboring agents' policies might not be sophisticated enough to accurately model and adapt to the intricacies among agents.

Figure 4.4 Headway and velocity profiles in the “Catchup” environment of the first and last vehicles of the platoon, controlled by the proposed approach (MACACC) and the top baseline policy (ConseNet). (a) Headway profiles; (b) velocity profiles.

Figure 4.5 Headway and velocity profiles in the “Slowdown” environment of the first and last vehicles of the platoon, controlled by the proposed approach (MACACC) and the top baseline policy (FPrint). (a) Headway profiles; (b) velocity profiles.

4.4.3 Performance Of The Quantization-based MACACC
In this subsection, we evaluate the effectiveness of the proposed quantization-based communication protocol with different quantization resolutions. As shown in Figure 4.6, in the less complex “Catchup” scenario, minor quantization appears to improve control performance. This could be attributed to the fact that the quantization process introduces a level of randomness during the training phase, thereby fostering improved exploration by the agents, as discussed in [109]. Conversely, in the more challenging “Slowdown” scenario, quantization results in a more significant performance degradation relative to the “Catchup” scenario. Nonetheless, even with extremely quantized communication, such as QMACACC (1), our proposed approach continues to surpass the performance of the robust baseline method, FPrint.
Figure 4.7 presents the number of bits required for each communicated parameter as well as the corresponding test performance at varying quantization resolutions. For better visualization, these values are normalized by their corresponding maximum values.
Within the “Catchup” scenario, QMACACC (1) manages to achieve 98.63% of the control performance achieved by the non-quantized version, i.e., QMACACC (0), while only requiring 12.5% of the communicated bits. However, in the “Slowdown” scenario, QMACACC (1) can only realize 64.64% of the control performance of the non-quantized version, i.e., QMACACC (0). This underscores a trade-off between the benefits of enhanced communication efficiency brought about by quantization and the associated diminution in control performance.

Figure 4.6 Training curves comparison between the proposed MARL policy (MACACC) and the quantization-based MACACC (QMACACC (n)).

4.4.4 Impact Of Platoon Size
In this subsection, we explore how variations in platoon size affect the performance of our model. The normalized training curves comparison among MACACC, QMACACC (1), and the top-performing baseline methods under different platoon sizes (i.e., V + 1 ∈ {2, 8, 12}) is shown in Figure 4.8. The proposed approach consistently outperforms the baseline method (i.e., FPrint) under different platoon sizes, showing the impressive scalability of our proposed approach.

Figure 4.8 Normalized training curves comparison between MACACC, QMACACC (1), and the top baseline methods with different platoon sizes in the CACC scenarios: “Catchup” (above) and “Slowdown” (below).

4.5 Conclusions And Future Work
In this chapter, we have addressed the CACC problem by formulating it as a fully decentralized MARL problem. This novel approach eliminates the need for a centralized controller, thereby enhancing the system's scalability and robustness. Additionally, we introduced an innovative quantization-based communication protocol, which significantly enhances communication efficiency among the agents. To validate our proposed approach, we undertook comprehensive experiments and compared it with several state-of-the-art MARL algorithms, showing that our approach achieves superior control performance and communication efficiency. These results underscore the potential of our fully decentralized MARL and quantization-based communication protocol as a robust and effective solution for real-world CACC problems.
In this chapter, we employed the Optimal Velocity Model (OVM), popular for its simplicity and efficacy in emulating certain traffic flow behaviors. However, it is worth noting that the OVM simplifies the intricate nature of human driving behaviors and may not be entirely precise.
As a result, future research endeavors will focus on the integration of more comprehensive human driver models to improve the accuracy of our simulation.

CHAPTER 5 DEEP MARL FOR SECONDARY VOLTAGE CONTROL

In this chapter, we propose a scalable and efficient MARL framework for secondary voltage control.

5.1 Introduction

Over recent decades, renewable energy sources, notably solar and wind, have received rising attention, driven by their capacity to reduce greenhouse gas emissions and combat global warming [125]. In a modern power grid, these green energies are integrated as distributed generators (DGs), working alongside conventional sources like fossil fuels and nuclear power plants. Specifically, microgrids are localized energy networks with the flexibility to operate either connected to or isolated from the main grid. As microgrids can still operate when the main grid is disconnected, they can strengthen grid resilience and service reliability. When a microgrid is disconnected from the main grid, there are two levels of control: primary control and secondary control [13]. The primary control refers to the basic, low-level control within a Distributed Generator (DG) aimed at maintaining a specific voltage reference. In contrast, the secondary control focuses on the collaborative generation of local references across multiple DGs to meet grid-wise control objectives [71, 51, 94, 114, 120]. Given its potential to significantly enhance grid efficiency, this chapter focuses on secondary voltage control.

Secondary control methods can be generally divided into centralized and distributed approaches. Centralized controllers aggregate data from all DGs and make collective control decisions relayed to each DG [135, 93, 96]. Despite their promising results, these centralized schemes suffer from significant communication overheads, potential single points of failure [139], and the curse of dimensionality, which make them less feasible for large-scale microgrid systems. On the other hand, distributed control approaches, inspired by cooperative control in multi-agent systems [147, 14, 120, 15, 13, 34, 88], adopt a decentralized architecture in which each DG interacts with its neighbors and makes decisions based on data shared over local communication networks. Traditional model-based approaches often simplify the intricate dynamics of microgrids for control design and then develop distributed feedback controllers for the formulated tracking synchronization problem [14, 15, 13, 70, 124, 52]. Since the underlying microgrid dynamics are subject to complex nonlinearity, system and disturbance uncertainties, and high dimensionality, model simplifications have to be made to enable model-based control designs, which inevitably degrades the control performance.

Recently, reinforcement learning (RL) has gained rising attention as a promising framework to address the centralized voltage control problem, acclaimed for its online adaptability and capability to handle intricate issues [79, 123, 95, 22, 48]. Notably, the Deep Q-network (DQN) [96] demonstrates impressive performance in autonomous grid control, especially amidst load fluctuations and topological changes [33]. Duan et al. [37] effectively utilize deep deterministic policy gradient (DDPG) [79], an off-policy RL algorithm, to maintain bus voltages within desired ranges.
Furthermore, a two-timescale voltage controller is proposed in [148], where shunt capacitors are configured to minimize voltage deviations using a deep reinforcement learning algorithm.

At the same time, multi-agent reinforcement learning (MARL) has seen great improvements, finding applications in diverse domains such as games like StarCraft and Dota 2 [134, 10], traffic light management [29], and autonomous driving [121]. There are also efforts in applying MARL to microgrid voltage control, emphasizing autonomous voltage control (AVC) [83, 17, 153, 45]. In this chapter, our focus is on secondary voltage control in isolated microgrids, aiming to maintain DG output voltages at a predefined reference value [14]. We formulate the secondary voltage control (SVC) of inverter-based microgrid systems as a partially observable Markov decision process (POMDP) and introduce an on-policy MARL algorithm, PowerNet. This decentralized MARL strategy offers stability in training and efficiency in policy learning. Our extensive experiments show that the proposed PowerNet outperforms several SOTA MARL algorithms and a model-based approach in learning efficiency, performance, and scalability.

The key contributions and advancements presented in this chapter include:

1. We formulate the secondary voltage control of inverter-based microgrids as a decentralized, cooperative MARL problem. To support this, we introduce and open-source a power grid simulation platform, PGSIM, available at https://github.com/Derekabc/PGSIM.

2. We propose PowerNet, an efficient on-policy decentralized MARL algorithm, featuring a novel spatial discount factor, a learning-based communication protocol, and an action smoothing mechanism, all aimed at effectively learning a control policy.

3. We conduct comprehensive experiments that highlight PowerNet’s superiority. It demonstrates better performance than the traditional model-based control method and six other state-of-the-art MARL algorithms, especially in sample efficiency and voltage regulation.

The remainder of the chapter is organized as follows. In Section 5.2, we formulate the secondary voltage control as a MARL problem; our developed MARL algorithm, PowerNet, is detailed in Section 5.3. Experiments, results, and discussions are presented in Section 5.4, and concluding remarks and future work are given in Section 5.5.

Figure 5.1 Schematic diagram of the decentralized control of microgrids and the inverter-based DG: (a) diagram of decentralized control of microgrids; (b) diagram of the inverter-based DG.

5.2 MARL Formulation

The voltage-controlled voltage source inverter (VCVSI) is widely used in DGs, offering expedient voltage/frequency support [14]. Figure 5.1(a) shows a typical VCVSI with a decentralized control architecture, in which each DG employs a secondary controller that coordinates with neighboring DGs to dynamically produce voltage references. The primary controller, a low-level controller, uses this reference for tracking. The overall aim is to ensure that the voltage and frequency of all DGs align with the reference values, even under power network disturbances and primary control inaccuracies [14, 15, 100]. As Figure 5.1(b) illustrates, the primary controller of each DG, labeled from i = 1 to N, takes the frequency and voltage references from the secondary controller and regulates the output voltage and frequency towards the desired reference.
This is typically achieved via the active- and reactive-power droop techniques without DG intercommunication [14, 51]. The readers can refer to [14, 15] for an in-depth understanding of the system dynamics, which is exploited to develop our power grid simulation platform, PGSIM (see https://github.com/Derekabc/PGSIM). The objective of the SVC is to coordinate with other DGs and generate reference signals Vni to synchronize the voltage magnitude of DG i to the reference value, in the presence of power disturbances and primary control imperfections.

While there are existing model-based secondary control strategies [13, 14, 15], they tend to underperform due to simplifications made in tackling nonlinearity and uncertain disturbances. Thus, in this chapter, we develop a model-free approach using MARL. Here, the microgrid is formulated as a multi-agent network, denoted as G = (ν, ε). Each agent, represented by i ∈ ν, communicates only with its adjacent nodes Ni := {j | εij ∈ ε}. We define the global state and action spaces as S := ×i∈ν Si and A := ×i∈ν Ai, respectively. These symbolize the collective state data and combined controls of all DGs. The microgrid’s underlying dynamics can be characterized by the state transition distribution P: S × A × S → [0, 1]. For scalable power grid control, we adopt a decentralized MARL framework: each DG only communicates with its neighbors and makes decisions based on these observations. As each agent i (or DG i) observes only a part of the environment, this naturally results in a POMDP [55]. At each time step t, each agent i receives an observation oi,t ∈ Oi, takes an action ai,t, and then receives the subsequent observation oi,t+1 and a reward signal ri,t : Ot × At → R. The objective is to find an optimal decentralized policy πi : Oi × Ai → [0, 1] that maximizes the expected total rewards. We tackle this challenge using MARL, defining the key POMDP elements as follows:

1. Action Space: Each DG’s control action is the secondary voltage control set point, Vni. For this work, we employ 10 uniformly spaced discrete actions between 1.00 pu and 1.14 pu. The overall action of the microgrid is the combined actions from all DGs, i.e., a = vn1 × vn2 × · · · × vnN.

2. State Space: The DG’s state is defined with nine variables to characterize the operation of the DG, denoted as si,t = (δi, Pi, Qi, iodi, ioqi, ibdi, ibqi, vbdi, vbqi), where δi is the measured reference angle frame; Pi and Qi denote the active and reactive power, respectively; iodi, ioqi, ibdi, and ibqi represent the output d-q currents of DG i and of the directly connected bus, respectively; and vbdi and vbqi are the output d-q voltages of the connected bus. The entire microgrid state is the Cartesian product of these individual states, i.e., S(t) = s1,t × · · · × sN,t.

3. Observation Space: We assume DGs observe only their local state and messages from neighbors. This observation consists of the local state and the received communication message, i.e., oi,t = si,t ∪ mi,t, where mi,t is the communication message received from neighboring agents j ∈ Ni and will be detailed in Section 5.3.

4. Transition Probabilities: The transition probability T(s′|s, a) characterizes the dynamics of the microgrid. We follow the models in [14, 15] to build our simulation platform, but we do not exploit any prior knowledge of the transition probability, as our MARL approach is model-free.
5. Reward Function: We design the following reward function to encourage the DGs to quickly converge to the reference voltage (e.g., 1 pu); a code sketch of this reward is given after Remark 1 below:

\[
r_{i,t} \triangleq
\begin{cases}
0.05 - |1 - v_i|, & v_i \in [0.95, 1.05],\\
-|1 - v_i|, & v_i \in [0.8, 0.95] \cup [1.05, 1.25],\\
-10, & \text{otherwise},
\end{cases}
\tag{5.1}
\]

where ri,t is the reward of agent i at time step t. More specifically, we divide the voltage range into three operation zones similar to [139]: the normal zone ([0.95, 1.05] pu), the violation zone ([0.8, 0.95] ∪ [1.05, 1.25] pu), and the diverged zone ([0, 0.8] ∪ [1.25, ∞) pu). With the formulated reward, DGs with diverged voltages or no power flow solution receive a large penalty, while DGs with a voltage close to 1 pu obtain positive rewards.

Remark 1. Our formulation for regulating DG voltages follows the literature such as [14, 13, 98]. Although effective for DG voltages, bus voltages might still deviate. An alternative formulation can address bus voltages with a distinct reward function, detailed further on our site (see https://github.com/Derekabc/PGSIM/tree/R2).
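As an illustration of how the zoned reward in (5.1) translates into code, the following is a minimal sketch; the function name and the explicit handling of a failed power flow solution are assumptions for exposition.

```python
def voltage_reward(v_i, power_flow_ok=True):
    """Per-agent reward of Eq. (5.1): reward voltages near 1 pu, mildly penalize
    violations, and heavily penalize diverged voltages or a failed power flow."""
    if not power_flow_ok:
        return -10.0                 # no power flow solution: large penalty
    dev = abs(1.0 - v_i)
    if 0.95 <= v_i <= 1.05:          # normal zone
        return 0.05 - dev
    if 0.8 <= v_i <= 1.25:           # violation zone
        return -dev
    return -10.0                     # diverged zone
```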
5.3 PowerNet For Secondary Voltage Control

In this section, we introduce PowerNet, a new decentralized MARL algorithm devised to address the previously stated POMDP. The proposed PowerNet extends the independent actor-critic (IA2C) to deal with multiple cooperative agents by fostering collaboration between neighboring agents, enabled by the following three novel characteristics: 1) a learning-based differentiable communication protocol, fostering agent collaboration; 2) a unique spatial discount factor, mitigating partial observability and boosting learning stability; and 3) an action smoothing mechanism, offsetting the influence of system uncertainties and on-policy learning noise. The subsequent subsections delve into these features.

5.3.1 Differentiable Communication Protocol

Figure 5.2 Overview of the proposed communication protocol.

Within our decentralized MARL framework, each agent (DG) communicates with its neighbors to exchange crucial information, such as encoded states and policies. In contrast to traditional non-communicative MARL algorithms such as IA2C, FPrint, and ConseNet, which often exhibit slow convergence, our method leverages neighbor information for more efficient learning (see Figure 5.2). At each time step t, agent i updates its hidden state hi,t as:

hi,t = fi(hi,t−1, qo(es(oi,t)), qh(hNi,t−1)), (5.2)

where hi,t−1 is the encoded hidden state from the last time step; oi,t is agent i’s observation made at time t, i.e., its internal state; hNi,t−1 is the concatenated hidden state from its neighbors; es, qo, and qh are differentiable message encoding and extracting functions, implemented as one-layer fully connected layers with 64 neurons; and fi is the encoding function for the hidden states and communication information, implemented as a Long Short-Term Memory (LSTM [56]) network with a 64-neuron hidden layer to improve observability [55] and better utilize the information in the past hidden state hi,t−1.

To improve both scalability and robustness, we group the elements of the observation oi,t according to their physical units. These categorized sub-observations are individually encoded and then concatenated. For instance, the observation oi,t is divided into four groups, o1i,t ∪ o2i,t ∪ o3i,t ∪ o4i,t, according to their units, i.e., voltage, power, reference angle frame, and current. These regrouped sub-observations are encoded independently and then concatenated as es(oi,t) = cat(e1s(o1i,t), e2s(o2i,t), e3s(o3i,t), e4s(o4i,t)), where ejs, j = 1, 2, 3, 4, are one-layer fully connected encoding layers. The received communication message mi,t for the ith agent comprises the encoded hidden states of its neighbors, i.e., mi,t = hNi,t−1, with hNi,t−1 being the hidden states of agent i’s neighbors at time t − 1. Given that the hidden state ht−1 is neurally encoded, these messages offer increased security over transmitting raw states directly. The encoded observation es(oi,t) and the neighbors’ hidden states hNi,t−1 are extracted by qo and qh, respectively. We then concatenate the encoded message as õi,t = cat(qo(es(oi,t)), qh(hNi,t−1)); the concatenation operation is shown in [27] to reduce information loss and achieve better performance than the summation operation used in DIAL and CommNet. The hidden state is then processed through the LSTM fi to encode õi,t and hi,t−1. Following this, the hidden state hi,t obtained from (5.2) is employed in the actor and critic networks to generate stochastic actions and predict the value functions, respectively, i.e., πθi(·|hi,t) and Vωi(hi,t). Inspired by MADDPG [86], we also include the neighbors’ action information in the critic network Vwi(hi,t, aNi,t) to enhance training. In this chapter, we use a discrete action space, and the action is sampled from the last Softmax layer as ai,t ∼ πθi(·|hi,t). We adopt the centralized training and decentralized execution scheme [27, 29], where each agent has its own actor and critic networks, and their policies are updated independently instead of in a consensus manner [152], which may hurt the convergence speed.
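For concreteness, the following PyTorch-style sketch shows one way the hidden-state update of Eq. (5.2) could be wired together. The 64-neuron layer sizes follow the description above, while the class and argument names, the nonlinearities, and the assumption of a fixed neighbor count are illustrative rather than the exact PowerNet implementation.

```python
import torch
import torch.nn as nn

class CommEncoder(nn.Module):
    """Hidden-state update of one agent, following Eq. (5.2)."""
    def __init__(self, group_dims, n_neighbors, hidden=64):
        super().__init__()
        # e_s: one fully connected encoder per unit group (voltage, power, angle, current).
        self.e_s = nn.ModuleList([nn.Linear(d, hidden) for d in group_dims])
        self.q_o = nn.Linear(hidden * len(group_dims), hidden)  # extracts the encoded observation
        self.q_h = nn.Linear(hidden * n_neighbors, hidden)      # extracts neighbors' hidden states
        self.f_i = nn.LSTMCell(2 * hidden, hidden)              # fuses o_tilde with h_{i,t-1}

    def forward(self, obs_groups, neighbor_h, h_prev, c_prev):
        # obs_groups: list of (1, d_j) tensors; neighbor_h: (1, hidden * n_neighbors).
        enc = torch.cat([torch.relu(e(o)) for e, o in zip(self.e_s, obs_groups)], dim=-1)
        o_tilde = torch.cat([torch.relu(self.q_o(enc)), torch.relu(self.q_h(neighbor_h))], dim=-1)
        h, c = self.f_i(o_tilde, (h_prev, c_prev))  # h is h_{i,t}; c is the LSTM cell state
        return h, c   # h_{i,t} feeds the actor pi_theta(.|h) and the critic V_omega(h, a_Ni)
```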
5.3.2 Spatial Discount Factor

In cooperative MARL, the objective is to maximize the shared global reward, \(R_{g,t} = \sum_{i\in\nu} R_{i,t}\), where \(R_{i,t} = \sum_{k=0}^{T}\gamma^{k} r_{i,t+k}\) denotes the cumulative reward of agent i. For each agent, a natural choice of the reward is the instantaneous global reward, i.e., \(\sum_{i=1}^{N} r_{i,t}\). However, this scheme can lead to several issues. First, aggregating global rewards can cause large latency and increase communication overheads. Second, a single global reward leads to the credit assignment problem [127], which significantly impedes learning efficiency and limits the number of agents to a small size. Third, as an agent is typically only slightly impacted by agents distant from it, using the global reward to train the policy of each agent can lead to slow convergence. To address these issues, we employ a novel spatial discount factor. Specifically, each agent i, i = 1, ..., N, utilizes the following reward:

\[
R_{i,t} = \sum_{k=0}^{T} \gamma^{k} \sum_{j \in \nu} \alpha(d_{i,j})\, r_{j,t+k},
\tag{5.3}
\]

where α(di,j) ∈ [0, 1] is the spatial discount function, with di,j being the distance between agents i and j. The distance di,j can be a Euclidean distance characterizing the physical distance between the two agents or the distance between two vertices in a graph (i.e., the number of edges on the shortest connecting path). Note that the new reward defined in (5.3) characterizes a whole spectrum of reward correlation, from local greedy control (when α(di,j≠i) = 0 and α(di,i) = 1) to a global reward (when α ≡ 1). Note also that there are different choices for the spatial discount function. For example, one can choose

\[
\alpha(d_{i,j}) =
\begin{cases}
1, & \text{if } d_{i,j} \le D,\\
0, & \text{otherwise},
\end{cases}
\tag{5.4}
\]

where D is a distance threshold that defines an “effective distance” around the considered agent; the threshold D can incorporate factors such as communication speed and overhead. In this chapter, we adopt α(di,j) = α^{di,j}, with α ∈ (0, 1] being a constant scalar and a hyperparameter to be tuned. As a result, the policy gradient is computed as

\[
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}\,|\,o_{\nu_i,t})\, A^{\pi_{\theta_i}}_{i,t}\right],
\tag{5.5}
\]

where \(A^{\pi_{\theta_i}}_{i,t} = Q^{\pi_{\theta_i}}(s, a) - V^{\pi_{\theta_i}}(s)\) is the advantage function, \(Q^{\pi_{\theta_i}}(s, a) = \mathbb{E}_{\pi_{\theta_i}}(R_{i,t}\,|\,s_t = s, a_{\nu_i,t} = a_{\nu_i})\) is the state-action value function, and \(V^{\pi_{\theta_i}}(s) = V^{\pi_i}(s, a_{N_i})\) is the state value function. We optimize the parameters of the critic network Vωi as follows:

\[
\min_{\omega_i}\; \mathbb{E}_{D}\Big[\sum_{j\in\nu} \alpha^{d_{i,j}} r_{j,t} + V_{\omega_i'}(o_{\nu_i,t}) - V_{\omega_i}(o_{\nu_i,t})\Big]^{2}.
\tag{5.6}
\]

Minibatches of sampled trajectories are used to update the network parameters via Eq. (5.5) and Eq. (5.6) to reduce the variance [29].
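The spatially discounted return in Eq. (5.3) with α(d) = α^d is straightforward to compute from a trajectory of per-agent rewards. The sketch below illustrates this computation; the pairwise distances are assumed to be precomputed (e.g., by breadth-first search over the communication graph), and the finite-horizon backward accumulation is one of several equivalent implementations.

```python
import numpy as np

def spatial_returns(rewards, dist, alpha, gamma):
    """Spatially discounted returns of Eq. (5.3).

    rewards: (T, N) array of per-step rewards r_{j,t} for each agent.
    dist:    (N, N) array of graph (or Euclidean) distances d_{i,j}.
    alpha:   spatial discount in (0, 1]; alpha = 1 recovers the global reward,
             while a 0/1 indicator of dist would recover the threshold in Eq. (5.4).
    gamma:   temporal discount factor.
    Returns a (T, N) array containing R_{i,t} for every agent and time step.
    """
    T, N = rewards.shape
    weights = alpha ** dist                  # alpha(d_{i,j}) = alpha^{d_{i,j}}
    mixed = rewards @ weights.T              # sum_j alpha^{d_ij} r_{j,t}, for each agent i
    returns = np.zeros_like(mixed)
    running = np.zeros(N)
    for t in reversed(range(T)):             # backward accumulation of the gamma^k terms
        running = mixed[t] + gamma * running
        returns[t] = running
    return returns
```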
5.3.3 Action Smoothing

The proposed PowerNet operates as an on-policy MARL algorithm, necessitating the sampling of stochastic actions at every time step. This occasionally results in noisy action samples with significant fluctuations, even after the algorithm converges. These action fluctuations can cause undesirable system perturbations. To address this, we introduce an action smoothing scheme that refines the sampled action ai,t ∼ πθi(·|hi,t) before its execution as follows:

\[
a_{i,t} \leftarrow
\begin{cases}
a_{i,t}, & t = 1,\\
\rho\, a_{i,t-1} + (1-\rho)\, a_{i,t}, & t > 1,
\end{cases}
\tag{5.7}
\]

where ρ ∈ [0, 1] denotes the smoothing factor and ai,t−1 represents the action from the previous time step t − 1. This scheme is also recognized as an exponential moving average [47]. When ρ is chosen as 0, no action smoothing is performed, and the agent takes actions directly from the policy πθi. For ρ = 1, the new action ai,t remains unchanged, causing the agent to rely only on the preceding action. We define an action smoothing window Tw to buffer and utilize past actions for this smoothing process. The window size Tw is crucial: an overly large Tw might retain outdated actions, causing the agent to respond slowly to sudden changes, while an exceedingly small Tw can limit the smoothing efficacy. In this chapter, both ρ and Tw are treated as hyperparameters and are optimized through cross-validation.

The comprehensive PowerNet algorithm is detailed in Algorithm 5.1. The primary hyperparameters encompass the distance discount factor α, the action smoothing factor ρ, the (time-)discount factor γ, the actor network’s learning rate ηθ, the critic network’s learning rate ηw, the action sample window Tw, the total training iterations M, the epoch duration T, and the batch size N. In the algorithm, agents repeatedly interact with the environment across multiple epochs (Lines 2–29). During Lines 4–6, each agent gathers and transmits the communication message mi,t to its neighbors. Subsequently, agents consolidate, rearrange, and encode their observations (Line 8). The agents then update their hidden states and the actor networks for action sampling (Lines 9–10). In Lines 12–15, the state value is updated and actions are smoothed before deployment. After environment interaction (Line 16), agents transition to the subsequent state and earn an immediate reward (Line 17), which is then stored in an on-policy experience buffer (Line 18). In Lines 20–26, both actor and critic network parameters are updated, leveraging trajectories from the on-policy experience buffer after the end of each episode. An episode terminates with a DONE signal either upon its completion or if there is no power flow solution. Upon receiving this DONE signal, all agents are reset to their initial states and transit to a fresh epoch (Line 28).

Algorithm 5.1 PowerNet for Secondary Voltage Control
Parameter: α, ρ, γ, ηw, ηθ, T, M, Tw.  Output: θi, wi, i ∈ ν.
1:  initialize s0, h−1, t ← 0, D ← ∅
2:  for j = 0 to M − 1 do
3:    for t = 0 to T − 1 do
4:      for i ∈ ν do
5:        send mi,t = hi,t−1
6:      end for
7:      for i ∈ ν do
8:        get õi,t = cat(qo(es(oi,t)), qh(hNi,t−1))
9:        get hi,t = fi(hi,t−1, õi,t), πi,t ← πθi(·|hi,t)
10:       update ai,t ∼ πi,t
11:     end for
12:     for i ∈ ν do
13:       update vi,t ← Vwi(hi,t, aNi,t)
14:       update ai,t according to Eq. (5.7)
15:       execute ai,t
16:     end for
17:     simulate {si,t+1, ri,t}i∈ν
18:     update D ← {(oi,t, ai,t, ri,t, vi,t)}i∈ν
19:     update t ← t + 1, j ← j + 1
20:     if DONE then
21:       for i ∈ ν do
22:         update θi ← θi + ηθi ∇θi J(θi)
23:         update wi ← wi + ηwi ∇ωi V(ωi)
24:       end for
25:     end if
26:     initialize D ← ∅
27:   end for
28:   update s0, h−1, t ← 0
29: end for

5.4 Experiment, Results, And Discussion

In this section, we deploy the PowerNet model to two distinct microgrid systems: the IEEE 34-bus test feeder equipped with 6 distributed DGs (denoted as microgrid-6, depicted in Figure 5.3(a)), and a more expansive microgrid system hosting 20 DGs (referred to as microgrid-20, showcased in Figure 5.3(b)). Our simulation platform is constructed upon the line and load specifications detailed in [78] and [98], respectively. To obtain a representation that mirrors real-world power systems, we introduce random load variations throughout the microgrid, including perturbations of ±20% from the nominal values given in [98]. Additionally, we incorporate random disturbances, amounting to ±5% of the nominal value of each load, at every simulation step to emulate disruptions characteristic of actual power grids. Each DG operates with a sampling interval of 0.05 s and is equipped with the capability to communicate with adjacent DGs via local communication edges. The primary control at the foundational level follows [14]. All our experiments were executed on an Ubuntu 18.04 server powered by an AMD 9820X processor, accompanied by 64 GB of RAM.

Figure 5.3 Schematic representations of the two microgrid simulation setups: (a) microgrid-6; and (b) microgrid-20.

Employing cross-validation, the spatial discount factor α is determined to be 1.0 for microgrid-6 and 0.7 for microgrid-20. It is reasonable to have a larger spatial discount factor in the microgrid-6 system, as agents are strongly and tightly connected in that small microgrid. Conversely, a smaller α in the larger-scale microgrid-20 system is advantageous to reduce the effect of remote agents, but a too small α will cause the agents to ignore their effect on neighboring agents. For example, if we choose α = 0, the expected reward for agent i at time step t becomes \(R_i = \sum_{t=1}^{T}\gamma^{t} r_{i,t}\), which cannot ensure a maximum global reward, as each agent updates its own policy greedily.

For the action-smoothing coefficient ρ, we adopt a value of 0.5 for microgrid-6 and 0.4 for microgrid-20. Given the intricacy of microgrid-20 compared to microgrid-6, a smaller ρ is chosen, keeping in mind that an overly large ρ might translate into delayed responses to abrupt voltage anomalies. The sample window Tw is consistently set at 2 for both microgrid configurations.
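Tying Eq. (5.7) to these choices of ρ and Tw, a minimal sketch of the smoother applied to the sampled voltage set-points is shown below; exactly how the buffer of the last Tw actions is used beyond the immediately preceding one is an assumption here.

```python
from collections import deque

class ActionSmoother:
    """Exponential moving average of Eq. (5.7) over a window of Tw past actions."""
    def __init__(self, rho=0.5, window=2):
        self.rho = rho
        self.buffer = deque(maxlen=window)   # keeps the last Tw executed set-points

    def __call__(self, action):
        if self.buffer:                      # t > 1: blend with the previous action
            action = self.rho * self.buffer[-1] + (1.0 - self.rho) * action
        self.buffer.append(action)           # t = 1: the raw sampled action is kept
        return action
```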
5.4.1 Compared To Centralized Control

To benchmark PowerNet against a centralized RL approach, we utilize PPO [119], a leading centralized actor-critic RL method. In this approach, a single centralized RL agent makes the control decisions for all DGs using the global state as its input, and we compare its performance with PowerNet. As illustrated in Figure 5.4, PowerNet outperforms the centralized method both in terms of convergence rate and overall training performance across both microgrid systems. Notably, PPO exhibits poor convergence in the larger microgrid that incorporates 20 DGs. This is not surprising, as centralized methods are conventionally challenged by the curse of dimensionality and scalability concerns. Unlike the centralized approach, where the input dimension increases with network size, the input dimension for PowerNet remains constant. Consequently, the centralized control strategy rapidly loses feasibility for expansive networks, a phenomenon detailed in Table 5.1.

Figure 5.4 MARL training curves compared with PPO for (a) microgrid-6 and (b) microgrid-20 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

Table 5.1 Performance comparison between trained PowerNet and PPO.

                       Microgrid-6                              Microgrid-20
           Converge Time   Input Size   Performance   Converge Time   Input Size   Performance
PPO        5.25 h          54           0.21          not converged   180          0.44
PowerNet   0.90 h          9            0.22          7.28 h          9            0.60

5.4.2 Compared With SOTA Benchmarks

PowerNet is compared against several leading benchmark MARL algorithms (IA2C [87], FPrint [43], ConseNet [152], DIAL [42], MADDPG [86], and CommNet [126]), as well as a conventional model-based method [85], to validate its efficiency. Each model is trained across 10,000 episodes, with parameters set as γ = 0.99, minibatch size N = 20, actor learning rate ηθ = 5 × 10−4, and critic learning rate ηw = 2.5 × 10−4. To ensure an equitable comparison, unique random seeds are generated for each episode, and the same seed is uniformly applied across the different algorithms, ensuring a consistent training/testing environment. Agents are controlled every ∆T = 0.05 seconds, and one episode encompasses T = 20 steps.
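For reference, the shared training settings above can be captured in a small configuration block; the sketch below is a hypothetical illustration of how such a configuration might be organized, and the field names are not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Shared settings used for all compared algorithms in Section 5.4.2."""
    episodes: int = 10_000          # training episodes per model
    gamma: float = 0.99             # temporal discount factor
    batch_size: int = 20            # minibatch size N
    lr_actor: float = 5e-4          # actor learning rate
    lr_critic: float = 2.5e-4       # critic learning rate
    control_interval: float = 0.05  # seconds between control actions (Delta T)
    episode_steps: int = 20         # steps T per episode

def episode_seed(base_seed: int, episode: int) -> int:
    # The same per-episode seed is reused across algorithms for a fair comparison.
    return base_seed + episode
```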
Figure 5.5 presents the training curves of all the MARL algorithms on the microgrid-6 and microgrid-20 systems. To better visualize the training processes, we only show the first 2,000 and 4,000 training episodes for the microgrid-6 and microgrid-20 systems, respectively. It is evident that PowerNet, in both the microgrid-6 and microgrid-20 settings, exceeds all benchmarked MARL algorithms in convergence rate. This is due to the proposed communication protocol structure and a suitable spatial discount factor, which enhance sample efficiency and accelerate learning. Especially in the intricate microgrid-20 setting, PowerNet’s superior sample efficiency becomes obvious, marked by the fastest convergence rate and the highest average episode reward among the contenders (see Figure 5.5b).

Figure 5.5 MARL training curves for (a) microgrid-6 and (b) microgrid-20 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

To better illustrate the voltage control capabilities, we compare the voltage control outcomes of the primary-controller-only configuration and the configuration with the secondary controller, under substantial load fluctuations (25%). These comparisons are visually represented in Figure 5.6 (for microgrid-6) and Figure 5.7 (for microgrid-20). The primary control objective is to align all DG voltages with the reference benchmark of 1 pu. A black cross mark represents a voltage violation at the DG, whereas a black dot means the DG’s voltage is within the normal range. Here we use r to denote the average control reward of the last 5 control steps according to Eq. (5.1). As evidenced in Figure 5.6, all evaluated algorithms consistently bring voltages within the standard range in the compact microgrid-6 setup.

Figure 5.6 Execution performance on voltage control in microgrid-6.

Figure 5.7 presents a performance evaluation for the intricate microgrid-20 scenario, characterized by its dense network interconnections and complex agent interactions. Controlling the voltage in an isolated manner, as done by IA2C, is insufficient and results in poor outcomes. While MPC delivers commendable results, its dependency on a precise model makes it computationally taxing, requiring 19 times the inference duration of PowerNet. This is primarily due to its need to solve an online nonlinear optimization problem at every timestep, necessitating significant computational resources. Algorithms like FPrint, DIAL, and CommNet also exhibit voltage violations, which shows that simply including neighbors’ policies is not sufficient for a complicated test case such as microgrid-20. The shortcomings of DIAL and CommNet suggest that an oversimplified aggregation of communication data, coupled with the computation of immediate rewards based on undiscriminating global metrics, can hinder agents from establishing efficient communication protocols in expansive, cooperative microgrid settings. ConseNet also struggles to stabilize voltages, resulting in greater deviation compared to PowerNet.

Figure 5.7 Execution performance on voltage control in microgrid-20.

5.4.3 Robustness To Load Variations And Agent Mis-connections

After training for 10,000 episodes, we assess the established policies 20 times under varying load disturbances. Each agent retains the same random seed within each episode, while different test seeds are employed for distinct episodes. Beyond the leading MARL algorithms, PowerNet’s voltage control performance is also compared against a conventional model-based method [85]. Table 5.2 consolidates the results, showcasing the average episode return across the 20 test episodes for both the MARL algorithms and the conventional method (MPC). Evidently, PowerNet consistently outperforms the other MARL methodologies in both scenarios, regardless of the load fluctuations. Although the MPC method yields comparable results, it requires approximately 19.6 times the inference time of PowerNet. IA2C’s suboptimal performance aligns with the training trajectories illustrated in Figure 5.5. While other MARL benchmarks show reasonable outcomes, PowerNet surpasses them by a notable margin.
Table 5.2 Performance comparison between trained MARL policies and the conventional model-based method under different load disturbances. The reward is the average reward over 20 evaluation episodes. Best values are bolded.

Load Disturbance   Network        PowerNet   MPC [85]   IA2C    FPrint   ConseNet   CommNet   DIAL    MADDPG
10%                Microgrid-6    0.240      0.141      0.206   0.236    0.219      0.223     0.222   0.221
                   Microgrid-20   0.781      0.642      0.447   0.711    0.705      0.700     0.706   0.436
15%                Microgrid-6    0.238      0.139      0.206   0.236    0.221      0.220     0.220   0.220
                   Microgrid-20   0.771      0.632      0.474   0.702    0.692      0.690     0.697   0.422
25%                Microgrid-6    0.236      0.134      0.205   0.233    0.223      0.220     0.221   0.220
                   Microgrid-20   0.740      0.598      0.458   0.660    0.645      0.649     0.663   0.438

Furthermore, Figure 5.8 presents a comparative analysis of training curves between training from scratch and training from the pre-trained model when disconnecting one DG unit (microgrid-19) or adding a new DG unit (microgrid-21). The results suggest that the model trained on microgrid-20 can significantly expedite policy adaptation when a DG unit is either removed or integrated. Since our algorithm is decentralized and on-policy, we do not need to re-train the whole network from scratch; we can simply modify the DNN layers related to the topology change. For example, if a new DG unit is added, we just initialize and insert a new policy layer into the existing policy network and then link it to its neighboring layers; the other policy networks are loaded from the existing trained weights, and the whole system is then adapted. As evident from Figure 5.8, the pre-trained model not only accelerates adaptation but also delivers superior performance compared to starting the training process from scratch.

Figure 5.8 MARL training curves comparing training from scratch and adapting from a pre-trained microgrid-20 model for (a) microgrid-19 and (b) microgrid-21 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

5.4.4 Scalability To Larger Powergrid

In this subsection, we further investigate PowerNet’s scalability by extending the number of DG units to 40. As evident in Figure 5.9, PowerNet maintains its efficacy and continually outperforms the other SOTA MARL benchmarks. Notably, CommNet, which initially showed promise, fails to converge, performing even worse than IA2C. This might be attributed to its approach of averaging communication messages rather than encoding them.

Figure 5.9 MARL training curves comparison between trained MARL policies for the microgrid-40 system. The lines show the average reward per training episode, smoothed over the past 100 episodes.

Table 5.3 shows the inference time required for an agent to generate an action across the microgrid-6, microgrid-20, and microgrid-40 configurations. Even in a large-scale setting with 40 agents, PowerNet is remarkably efficient, necessitating a mere 35.8 ms to generate an action, a time frame that is practically feasible for meeting real-time needs. Given PowerNet’s decentralized design, the ratio of inference time to network size remains roughly constant, underscoring its impressive scalability.

Table 5.3 Inference time and time/size ratio of PowerNet for different microgrid networks.

                          Microgrid-6   Microgrid-20   Microgrid-40
Inference time            7.9 ms        16.8 ms        35.8 ms
Time/size (ms per DG)     1.32          0.84           0.90

5.5 Conclusions And Future Work

In this chapter, we modeled the secondary voltage control in inverter-based microgrid systems as a MARL problem. We introduced PowerNet, a novel on-policy and cooperative MARL algorithm, integrating a differentiable, learning-based communication protocol, a spatial discount factor, and an action smoothing mechanism.
Comprehensive experiments were conducted showing that the proposed PowerNet outperforms other state-of-the-art approaches in both convergence speed and voltage control proficiency. In our future work, we aim to develop a more realistic simulation environment by collecting data from real-world power systems. Furthermore, we plan to explore severe system disturbances and delve into schemes that guarantee safety during the learning process.

CHAPTER 6 CONCLUSION

In this thesis, we considered the problem of networked system control (NSC), aiming to explore safe, effective, and scalable MARL algorithms, with specific applications in connected autonomous vehicles and smart grids.

First, an efficient and scalable MARL framework was proposed for on-ramp merging in mixed traffic [21]. Furthermore, a novel priority-based safety supervisor was incorporated into the framework to significantly reduce the collision rate and expedite the training process. A gym-like simulation environment for on-ramp merging was also developed and open-sourced with three different traffic density levels.

Secondly, a fully decentralized MARL framework was introduced for Cooperative Adaptive Cruise Control (CACC) without the need for a central controller. To enhance communication efficiency, a quantization-based communication protocol was developed by applying random quantization to the communicated messages, ensuring that critical information is transmitted with minimized bandwidth usage. Furthermore, a gym-like simulation environment for cooperative adaptive cruise control was adopted and open-sourced with 7 state-of-the-art MARL benchmarks.

Lastly, an efficient MARL algorithm was proposed specifically for cooperative control within power grids, where each agent (i.e., each DG) learns a control policy based on a (sub-)global reward and encoded communication messages from its neighbors. Furthermore, a novel spatial discount factor was developed to mitigate the effect of remote agents, expedite the training process, and improve scalability. Moreover, a differentiable, learning-based communication protocol was developed to strengthen collaboration among neighboring agents. An open-source software package, called PGSim, providing a highly efficient, high-fidelity power grid simulation platform was also developed and released.

Despite the many challenges of applying MARL to NSC in real-world applications, such as scalability, safety, efficiency, and the lack of realistic simulators, this thesis has made considerable contributions toward addressing these issues. The development of scalable MARL algorithms, fully decentralized MARL frameworks, efficient communication protocols, and highly accurate simulators has effectively paved the way for further advancements in this field. However, the journey does not end here. These accomplishments set the stage for future research efforts to build upon these foundations, continually pushing the boundaries of what is possible in NSC using MARL. Future work may focus on improving the efficiency and robustness of these techniques and exploring their applicability to other complex networked systems.

BIBLIOGRAPHY

[1] Apollo open platform. https://apollo.auto/developer.html. Accessed: 2021-03-31.

[2] Future of driving. https://www.tesla.com/autopilot. Accessed: 2021-03-31.

[3] Ahmed MH Al-Jhayyish and Klaus Werner Schmidt. Feedforward strategies for cooperative adaptive cruise control in heterogeneous vehicle strings.
IEEE Transactions on Intelligent Transportation Systems, 19(1):113–122, 2017. [4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [5] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. [6] TJ Ayres, L Li, D Schleuning, and D Young. Preferred time-headway of highway drivers. In ITSC 2001. 2001 IEEE Intelligent Transportation Systems. Proceedings (Cat. No. 01TH8585), pages 826–829. IEEE, 2001. [7] Drew Bagnell and Andrew Ng. On local rewards and scaling distributed reinforcement learning. Advances in Neural Information Processing Systems, 18:91–98, 2005. [8] Masako Bando, Katsuya Hasebe, Akihiro Nakayama, Akihiro Shibata, and Yuki Sugiyama. Dynamical model of traffic congestion and numerical simulation. Phys- ical review E, 51(2):1035, 1995. [9] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449– 458. PMLR, 2017. [10] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019. [11] David Bevly, Xiaolong Cao, Mikhail Gordon, Guchan Ozbilgin, David Kari, Brently Nelson, Jonathan Woodruff, Matthew Barth, Chase Murray, Arda Kurt, et al. Lane change and merge maneuvers for connected and automated vehicles: A survey. IEEE Transactions on Intelligent Vehicles, 1(1):105–120, 2016. [12] Sushrut Bhalla, Sriram Ganapathi Subramanian, and Mark Crowley. Deep multi agent 84 reinforcement learning for autonomous driving. In Canadian Conference on Artificial Intelligence, pages 67–78. Springer, 2020. [13] Ali Bidram, Ali Davoudi, and Frank L Lewis. A multiobjective distributed control framework for islanded ac microgrids. IEEE Transactions on industrial informatics, 10(3):1785–1798, 2014. [14] Ali Bidram, Ali Davoudi, Frank L Lewis, and Josep M Guerrero. Distributed cooper- ative secondary control of microgrids using feedback linearization. IEEE Transactions on Power Systems, 28(3):3462–3470, 2013. [15] Ali Bidram, Ali Davoudi, Frank L Lewis, and Zhihua Qu. Secondary control of micro- grids based on distributed cooperative control of multi-agent systems. IET Generation, Transmission & Distribution, 7(8):822–831, 2013. [16] Maxime Bouton, Alireza Nakhaei, Kikuo Fujimura, and Mykel J Kochenderfer. Cooperation-aware reinforcement learning for merging in dense traffic. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 3441–3447. IEEE, 2019. [17] Di Cao, Weihao Hu, Junbo Zhao, Qi Huang, Zhe Chen, and Frede Blaabjerg. A multi-agent deep reinforcement learning based voltage regulation using coordinated pv inverters. IEEE Transactions on Power Systems, 35(5):4120–4123, 2020. [18] Wenjing Cao, Masakazu Mukai, and Taketoshi Kawabe. Two-dimensional merging path generation using model predictive control. Artificial Life and Robotics, 17(3- 4):350–356, 2013. [19] Wenjing Cao, Masakazu Mukai, Taketoshi Kawabe, Hikaru Nishira, and Noriaki Fujiki. 
Cooperative vehicle path generation during merging using model predictive control with real-time optimization. Control Engineering Practice, 34:98–105, 2015. [20] Dong Chen, Kaian Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. Powernet: Multi-agent deep reinforcement learning for scalable powergrid control. IEEE Transactions on Power Systems, 37(2):1007–1017, 2021. [21] Dong Chen, Mohammad R Hajidavalloo, Zhaojian Li, Kaian Chen, Yongqiang Wang, Longsheng Jiang, and Yue Wang. Deep multi-agent reinforcement learning for highway on-ramp merging in mixed traffic. IEEE Transactions on Intelligent Transportation Systems, 2023. [22] Dong Chen, Longsheng Jiang, Yue Wang, and Zhaojian Li. Autonomous driving using safe reinforcement learning by incorporating a regret-based human lane-changing de- cision model. In 2020 American Control Conference (ACC), pages 4355–4361. IEEE, 2020. 85 [23] Dong Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. Pow- ernet: Multi-agent deep reinforcement learning for scalable powergrid control. arXiv preprint arXiv:2011.12354, 2020. [24] Dong Chen, Kaixiang Zhang, Yongqiang Wang, Xunyuan Yin, and Zhaojian Li. Communication-efficient decentralized multi-agent reinforcement learning for cooper- ative adaptive cruise control, 2023. [25] Chien-Ming Chou, Chen-Yuan Li, Wei-Min Chien, and Kun-chan Lan. A feasibility study on vehicle-to-infrastructure communication: Wifi vs. wimax. In 2009 tenth in- ternational conference on mobile data management: systems, services and middleware, pages 397–398. IEEE, 2009. [26] Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent reinforcement learning for networked system control. In International Conference on Learning Representa- tions, 2020. [27] Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent reinforcement learning for networked system control. arXiv preprint arXiv:2004.01339, 2020. [28] Tianshu Chu and Uroš Kalabić. Model-based deep reinforcement learning for cacc in mixed-autonomy vehicle platoon. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4079–4084. IEEE, 2019. [29] Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforce- ment learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3):1086–1095, 2019. [30] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional rein- forcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [31] Fengying Dang, Dong Chen, Jun Chen, and Zhaojian Li. Event-triggered model predic- tive control with deep reinforcement learning for autonomous driving. arXiv preprint arXiv:2208.10302, 2022. [32] Charles Desjardins and Brahim Chaib-Draa. Cooperative adaptive cruise control: A reinforcement learning approach. IEEE Transactions on intelligent transportation sys- tems, 12(4):1248–1260, 2011. [33] Ruisheng Diao, Zhiwei Wang, Di Shi, Qianyun Chang, Jiajun Duan, and Xiaohu Zhang. Autonomous voltage control for grid operation using deep reinforcement learn- ing. In 2019 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5. IEEE, 2019. 86 [34] Lei Ding, Qing-Long Han, and Xian-Ming Zhang. Distributed secondary control for active power sharing and frequency regulation in islanded microgrids using an event- triggered communication mechanism. IEEE Transactions on Industrial Informatics, 15(7):3910–3922, 2018. [35] Vinayak V Dixit, Sai Chand, and Divya J Nair. 
Autonomous vehicles: disengagements, accidents and reaction times. PLoS one, 11(12):e0168054, 2016. [36] Jiqian Dong, Sikai Chen, Paul Young Joun Ha, Yujie Li, and Samuel Labi. A drl-based multiagent cooperative control framework for cav networks: a graphic convolution q network. arXiv preprint arXiv:2010.05437, 2020. [37] Jiajun Duan, Di Shi, Ruisheng Diao, Haifeng Li, Zhiwei Wang, Bei Zhang, Desong Bian, and Zhehan Yi. Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Transactions on Power Systems, 35(1):814–817, 2019. [38] Zine el abidine Kherroubi, Samir Aknine, and Rebiha Bacha. Novel decision-making strategy for connected and autonomous vehicles in highway on-ramp merging. IEEE Transactions on Intelligent Transportation Systems, 23(8):12490–12502, 2021. [39] Ingy ElSayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu, and Lu Feng. Safe multi-agent reinforcement learning via shielding. arXiv preprint arXiv:2101.11196, 2021. [40] Francesca M Favarò, Nazanin Nader, Sky O Eurich, Michelle Tripp, and Naresh Varadaraju. Examining accident reports involving autonomous vehicles in california. PLoS one, 12(9):e0184952, 2017. [41] Shuo Feng, Yi Zhang, Shengbo Eben Li, Zhong Cao, Henry X Liu, and Li Li. String stability for vehicular platoon control: Definitions and analysis methods. Annual Re- views in Control, 47:81–97, 2019. [42] Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29, 2016. [43] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International conference on machine learning, pages 1146–1155. PMLR, 2017. [44] Weinan Gao, Zhong-Ping Jiang, and Kaan Ozbay. Data-driven adaptive optimal con- trol of connected vehicles. IEEE Transactions on Intelligent Transportation Systems, 18(5):1122–1133, 2016. 87 [45] Yuanqi Gao, Wei Wang, and Nanpeng Yu. Consensus multi-agent reinforcement learn- ing for volt-var control in power distribution networks. IEEE Transactions on Smart Grid, 2021. [46] Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015. [47] Everette S Gardner Jr. Exponential smoothing: The state of the art. Journal of forecasting, 4(1):1–28, 1985. [48] Mevludin Glavic. (deep) reinforcement learning for electric power system control and related problems: A short review and perspectives. Annual Reviews in Control, 48:22– 35, 2019. [49] Kailash Gogineni, Peng Wei, Tian Lan, and Guru Venkataramani. Scalability bottle- necks in multi-agent reinforcement learning systems. arXiv preprint arXiv:2302.05007, 2023. [50] Siyuan Gong, Anye Zhou, and Srinivas Peeta. Cooperative adaptive cruise control for a platoon of connected and autonomous vehicles considering dynamic information flow topology. Transportation research record, 2673(10):185–198, 2019. [51] Josep M Guerrero, Juan C Vasquez, José Matas, Luis García De Vicuña, and Miguel Castilla. Hierarchical control of droop-controlled ac and dc microgrids—a gen- eral approach toward standardization. IEEE Transactions on industrial electronics, 58(1):158–172, 2010. [52] Fanghong Guo, Changyun Wen, Jianfeng Mao, and Yong-Duan Song. 
Distributed secondary voltage and frequency restoration control of droop-controlled inverter-based microgrids. IEEE Transactions on industrial Electronics, 62(7):4355–4364, 2014. [53] Paul Young Joun Ha, Sikai Chen, Jiqian Dong, Runjia Du, Yujie Li, and Samuel Labi. Leveraging the capabilities of connected and autonomous vehicles and multi- agent reinforcement learning to mitigate highway bottleneck congestion. arXiv preprint arXiv:2010.05436, 2020. [54] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. [55] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observ- able mdps. arXiv preprint arXiv:1507.06527, 2015. [56] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural compu- tation, 9(8):1735–1780, 1997. 88 [57] Yi Hou, Praveen Edara, and Carlos Sun. Modeling mandatory lane changing using bayes classifier and decision trees. IEEE Transactions on Intelligent Transportation Systems, 15(2):647–655, 2013. [58] John Hourdakis and Panos G Michalopoulos. Evaluation of ramp control effectiveness in two twin cities freeways. Transportation Research Record, 1811(1):21–29, 2002. [59] Shengyi Huang and Santiago Ontañón. A closer look at invalid action masking in policy gradient algorithms. arXiv preprint arXiv:2006.14171, 2020. [60] Leslie N Jacobson, Kim C Henry, and Omar Mehyar. Real-time metering algorithm for centralized control. Number 1232. 1989. [61] Dongyao Jia and Dong Ngoduy. Enhanced cooperative car-following traffic model with the combination of v2v and v2i communication. Transportation Research Part B: Methodological, 90:172–191, 2016. [62] Liming Jiang, Yuanchang Xie, Nicholas G Evans, Xiao Wen, Tienan Li, and Dan- jue Chen. Reinforcement learning based cooperative longitudinal control for reducing traffic oscillations and improving platoon stability. Transportation Research Part C: Emerging Technologies, 141:103744, 2022. [63] I Ge Jin and Gábor Orosz. Dynamics of connected vehicle systems with delayed acceleration feedback. Transportation Research Part C: Emerging Technologies, 46:46– 64, 2014. [64] Meha Kaushik, K Madhava Krishna, et al. Parameter sharing reinforcement learning architecture for multi agent driving behaviors. arXiv preprint arXiv:1811.07214, 2018. [65] Meha Kaushik, Vignesh Prasad, K Madhava Krishna, and Balaraman Ravindran. Overtaking maneuvers in simulated highway driving using deep reinforcement learning. In 2018 ieee intelligent vehicles symposium (iv), pages 1885–1890. IEEE, 2018. [66] Arne Kesting, Martin Treiber, and Dirk Helbing. General lane-changing model mobil for car-following models. Transportation Research Record, 1999(1):86–94, 2007. [67] Zulqarnain H Khattak, Brian L Smith, Hyungjun Park, and Michael D Fontaine. Cooperative lane control application for fully connected and automated vehicles at multilane freeways. Transportation research part C: emerging technologies, 111:294– 317, 2020. [68] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013. [69] Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Reinforcement learning 89 in robotics: Applications and real-world challenges. Robotics, 2(3):122–148, 2013. [70] Jingang Lai, Xiaoqing Lu, Xinghuo Yu, Wei Yao, Jinyu Wen, and Shijie Cheng. 