SAFE CONTROL DESIGN FOR UNCERTAIN SYSTEMS

By

Zahra Marvi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2021

ABSTRACT

SAFE CONTROL DESIGN FOR UNCERTAIN SYSTEMS

By

Zahra Marvi

This dissertation investigates the problem of safe control design for systems under model and environmental uncertainty. Reinforcement learning (RL) provides an interactive learning framework in which the optimal controller is sequentially derived based on instantaneous reward. Although powerful, safety considerations are a barrier to the wide deployment of RL algorithms in practice. To overcome this problem, we propose an iterative safe off-policy RL algorithm. The cost function that encodes the designer's objectives is augmented with a control barrier function (CBF) to ensure safety and optimality. The proposed formulation provides look-ahead and proactive safety planning, in which safety is planned and optimized along with the performance to minimize intervention with the optimal controller. Extensive safety and stability analysis is provided, and the proposed method is implemented using the off-policy algorithm without requiring complete knowledge about the system dynamics. This line of research is then further extended to provide safety and stability guarantees even during the data collection and exploration phases, in which random noisy inputs are applied to the system. However, satisfying the safety of actions when little is known about the system dynamics is a daunting challenge. We present a novel RL scheme that ensures the safety and stability of linear systems during the exploration and exploitation phases. This is achieved through concurrent model learning and control, in which an efficient learning scheme is employed to prescribe the learning behavior. This characteristic is then employed to apply only safe and stabilizing controllers to the system. First, the prescribed errors are employed in a novel adaptive robustified control barrier function (AR-CBF), which guarantees that the states of the system remain in the safe set even when the learning is incomplete. Therefore, the noisy input in the exploratory data collection phase and the optimal controller in the exploitation phase are minimally altered such that the AR-CBF criterion is satisfied and, therefore, safety is guaranteed in both phases. It is shown that under the proposed prescribed RL framework, the model learning error is a vanishing perturbation to the original system. Therefore, a stability guarantee is also provided even in the exploration phase, when noisy random inputs are applied to the system. Learning-enabled barrier-certified safe controllers for systems that operate in a shared and uncertain environment are then presented. A safety-aware loss function is defined and minimized to learn the uncertain and unknown behavior of external agents that affect the safety of the system. The loss function is defined based on the safe set error, instead of the system model error, and is minimized for both current samples as well as past samples stored in the memory to assure a fast and generalizable learning algorithm for approximating the safe set.
The proposed model learning and CBF are then integrated to form a learning-enabled zeroing CBF (L-ZCBF), which employs the approximated trajectory information of the external agents provided by the learned model but shrinks the safety boundary in case of an imminent safety violation using instantaneous sensory observations. It is shown that the proposed L-ZCBF assures safety during learning, even in the face of an inaccurate or simplified approximation of external agents, which is crucial in highly interactive environments. Finally, the cooperative capability of agents in a multi-agent environment is investigated for the sake of safety guarantees. CBFs and information-gap theory are integrated to obtain robust safe controllers for multi-agent systems with different levels of measurement accuracy. A cooperative framework for the construction of CBFs for every two agents is employed to maximize the horizon of uncertainty under which the safety of the overall system is satisfied. The information-gap theory is leveraged to determine the contribution and share of each agent in the construction of CBFs. This results in the highest possible robustness against measurement uncertainty. By employing the proposed approach in constructing CBFs, a higher horizon of uncertainty can be safely tolerated, and even the failure of one agent in gathering accurate local data can be compensated for by cooperation between agents. The effectiveness of the proposed methods is extensively examined in simulation results.

Copyright by
ZAHRA MARVI
2021

To researchers who work hard and honestly to make a positive impact on the world

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor, Dr. Bahare Kiumarsi, for her continuous support, time, and superb guidance throughout my Ph.D. studies. I would like to express my appreciation to the members of my Ph.D. committee, Dr. Hayder Radha, Dr. Hamidreza Modares, and Dr. Xiaobo Tan, for their time, support, and valuable suggestions. I would also like to thank all faculty and staff at Michigan State University; especially, I would like to thank Dr. Tim Hogan, Dr. Andrew Mason, Dr. Katy Luchini Colbry, and Dr. John Papapolymerou for their support. A special appreciation to my family for their love and support throughout my life, to my beloved mother for empowering me and her consistent emotional support, and to my dear father and my dear brother. A special thanks to my beloved husband, Ehsan, for his support, encouragement, and standing by my side. I praise God for all the blessings.

TABLE OF CONTENTS

LIST OF TABLES . . . . . x
LIST OF FIGURES . . . . . xi
LIST OF ALGORITHMS . . . . . xiii
Chapter 1 Introduction and Literature Review . . . . . 1
1.1 Organization of the Dissertation . . . . . 7
Chapter 2 Safe Reinforcement Learning: A Control Barrier Function Optimization Approach . . . . . 11
2.1 Introduction . . . . . 11
2.1.1 Notations . . . . . 12
2.1.2 Organization of the Chapter . . . . . 12
2.2 Preliminaries . . . . . 13
2.2.1 Problem Statement . . . . . 13
2.2.2 Barrier Function . . . . . 15
2.3 Safe Optimal Control Approach . . . . . 16
2.3.1 Safe Modified Formulation . . . . . 16
2.3.2 Safety and Performance Analysis . . . . . 20
2.3.2.1 Safety Analysis . . . . . 21
2.3.2.2 Stability and Optimality Analysis . . . . . 25
2.4 Algorithm for Safe Reinforcement Learning . . . . . 28
2.4.1 Safe Off-policy Reinforcement Learning Algorithm . . . . . 28
2.4.2 Neural Network Approximation of Safe RL Algorithm . . . . . 31
2.5 Simulation Results . . . . . 34
2.6 Conclusion . . . . . 35
Chapter 3 Reinforcement Learning based Control Design with Safety and Stability Guarantees During Exploration . . . . . 39
3.1 Introduction . . . . . 39
3.1.1 Organization of the chapter . . . . . 41
3.2 Problem Statement . . . . . 41
3.3 Background . . . . . 43
3.3.1 Control Barrier Functions . . . . . 43
3.3.2 Adaptive Optimal Control Design . . . . . 44
3.4 Robustified Safety and Stability using Experience Replay Learning . . . . . 46
3.4.1 Experience Replay System Approximation . . . . . 47
3.4.2 Stability Analysis . . . . . 50
3.4.3 Adaptive Robustified CBF . . . . . 52
3.4.4 Safe and Stable Controller . . . . . 55
3.5 Barrier-certified Off-Policy Algorithm . . . . . 55
3.6 Simulation . . . . . 59
3.6.1 Simulation Setup . . . . . 59
3.6.2 Simulation Results and Discussion . . . . . 60
3.7 Conclusion and Future Work . . . . . 61
Chapter 4 Barrier-certified Learning-enabled Safe Control Design for Systems Operating in Uncertain Environments . . . . . 65
4.1 Introduction . . . . . 65
4.1.1 Organization of the Chapter . . . . . 67
4.2 Problem Statement and Background . . . . . 67
4.2.1 Problem Statement . . . . . 68
4.2.2 Control Barrier Functions . . . . . 70
4.3 Learning-enabled ZCBF with Uncertain Sets . . . . . 72
4.3.1 Learning Safe Set Despite Uncertain Behaviors of External Agents . . . . . 73
4.3.2 External Dynamics Identifier . . . . . 80
4.4 Control Framework . . . . . 88
4.5 Case Study . . . . . 89
4.5.1 Control Scenario . . . . . 91
4.5.2 Mathematical Representation . . . . . 91
4.6 Simulation Results . . . . . 94
4.6.1 Zero Modeling Error Scenario . . . . . 95
4.6.2 First Non-zero Modeling Error Scenario . . . . . 96
4.6.3 Second Non-zero Modeling Error Scenario . . . . . 97
4.6.4 Discussion . . . . . 98
4.7 Conclusion and Future Work . . . . . 100
Chapter 5 Robust Satisficing Cooperative Control Barrier Functions for Multi-Robots Systems using Information-Gap Theory . . . . . 102
5.1 Introduction . . . . . 102
5.1.1 Organization of the Chapter . . . . . 106
5.2 Problem Overview and Background . . . . . 106
5.2.1 Problem Overview . . . . . 106
5.2.2 Background . . . . . 107
5.2.2.1 Control Barrier Functions . . . . . 107
5.2.2.2 Information-Gap Theory . . . . . 109
5.3 Robust-Satisficing Control Barrier Function . . . . . 110
5.3.1 Problem Formulation . . . . . 112
5.3.2 Robust-satisficing Distributed CBF . . . . . 114
5.3.2.1 Distributed ZCBF . . . . . 114
5.3.2.2 Robust-satisficing distributed ZCBF . . . . . 115
5.3.2.3 Discussion . . . . . 121
5.3.3 Controller Design . . . . . 122
5.4 Simulation . . . . . 124
5.5 Conclusion . . . . . 127
Chapter 6 Conclusion . . . . . 131
BIBLIOGRAPHY . . . . . 134

LIST OF TABLES

Table 2.1: Simulation Parameters . . . . . 36
Table 3.1: Simulation Parameters . . . . . 61
Table 4.1: Simulation Parameters . . . . . 94

LIST OF FIGURES

Figure 2.1: Lateral displacement with and without CBF . . . . . 36
Figure 2.2: The states of the system . . . . . 36
Figure 2.3: Actor Weights . . . . . 37
Figure 2.4: Critic Weights . . . . . 37
Figure 3.1: Overview of the proposed approach . . . . . 41
Figure 3.2: States of the system under the proposed framework . . . . . 62
Figure 3.3: States of the system with plain off-policy with manual reduced noise . . . . . 62
Figure 3.4: NN Weight error . . . . . 63
Figure 3.5: Convergence of Pk and Kk . . . . . 63
Figure 4.1: (a): Ĉ invariant, ∂C violated; (b): Cc invariant; (c): Ĉ converges to C . . . . . 74
Figure 4.2: Control scheme . . . . . 89
Figure 4.3: Control scenario . . . . . 95
Figure 4.4: Position of vehicles in 'y' coordinate (Scenario 1) . . . . . 96
Figure 4.5: (a) NN weights (Scenario 1), (b) Optimal solution without CBF . . . . . 97
Figure 4.6: NN Weights (Scenario 2) . . . . . 98
Figure 4.7: NN Weights with and without Experience replay . . . . . 98
Figure 4.8: Position of vehicles in 'y' coordinate (Scenario 2) . . . . . 99
Figure 4.9: NN Weight (Scenario 3) . . . . . 99
Figure 4.10: Position of agents in 'y' coordinate (Scenario 3) . . . . . 100
Figure 5.1: Graph Topology . . . . . 124
Figure 5.2: (a) Agents' trajectories, no measurement error (b) Corresponding ||∆pij|| . . . . . 127
Figure 5.3: (a) Trajectories, measurement error without IG (b) Corresponding ||∆pij|| . . . . . 127
Figure 5.4: (a) Trajectories, measurement error with IG (b) Corresponding ||∆pij|| . . . . . 128
Figure 5.5: Pairwise distances between agents for different values of safety distance Ds . . . . . 128
Figure 5.6: Time lapse of agents' position with measurement error using IG method . . . . . 129

LIST OF ALGORITHMS

Algorithm 1 Safe Off-policy RL . . . . . 34
Algorithm 2 Safe and Stable Off-policy RL . . . . . 58
Algorithm 3 Barrier-certified Learning-enabled Controller . . . . . 90
Algorithm 4 Safe and Robust Control Design for each Agent i . . . . . 123

Chapter 1

Introduction and Literature Review

Safety-critical systems are systems whose failure or malfunction can result in injury to people, damage to equipment, or harm to the environment [1]. That being said, most control systems face instrumental or environmental limitations and are thus considered safety-critical systems. The limitations of the system itself include state constraints, such as saturation of actuators, the limited range of motion of a joint of a robotic arm, the maximum allowable speed of a vehicle, the relative portion of materials in a chemical process, and so on. The environment in which the system is operating also imposes different types of safety constraints on the system. For example, when the operating environment is shared among different agents, such as pick-and-place robotic arms in a factory, multi-robot systems, and autonomous driving, collisions between nearby agents must be avoided. In addition, the safety of human operators and nearby facilities must be guaranteed as well. All these safety constraints need to be satisfied for safe and reliable operation. The set of states in which these safety constraints are satisfied is considered the safe set. The controller needs to be designed accordingly to achieve the desired performance within the safe set of the system and avoid safety violations.
Moreover, conflicts can always arise between safety and performance requirements, and, in a conflicting situation, safety objectives must always be prioritized over performance. For example, in the adaptive cruise control system, the performance level that can be achieved without safety violation in terms of reaching the desired speed depends on the traffic situation, and assuring a safe maneuver (maintaining a safe distance from the vehicle ahead) must be prioritized over performance. Especially with the emergence of robotics and autonomous systems, which have a high level of interaction with humans and typically operate in cluttered and uncertain environments, it is crucial to design safe and smart controllers in the face of model and environmental uncertainty, which is the goal of this dissertation.

Reinforcement learning (RL) is an emerging framework in control systems that learns the optimal controller for uncertain systems online in real time [2, 3, 4]. Although powerful, assuring its safety is one of the main challenges in paving the way to widespread deployment of RL in practice. RL algorithms typically consist of two phases of operation: exploration and exploitation. In the former phase, random noisy inputs are applied to the system to collect rich data. The collected data is used to learn improved control policies, which are then followed in the exploitation phase to gain more reward. However, under uncertainty, little or no knowledge of the system dynamics might be available, and therefore, the RL agent faces the risk of stability or safety violation. Satisfaction of these properties is very challenging since, on one hand, noisy exploratory inputs must be applied to the system, and, on the other hand, their consequences cannot be fully predicted because complete knowledge of the system dynamics is not available a priori.

Different approaches have been proposed in the literature to address the safe RL problem. Safety in the RL framework has been addressed in two general ways: one takes into account the uncertainty of the reward, and the other deals with possible risks in the exploration process [5]. In the former case, stochastic cost-to-go functions are considered and appropriate risk measure functions are applied [6], while in the latter, the learning agent is typically provided with some external knowledge or advice for safe exploration [5], [7], [8]. While applying risk measures to stochastic functions is a strong tool to deal with uncertainty, it does not take into account the constraints on the system's state and the control input. Economists have studied this risk-aware approach because the goal is to obtain the highest profit while the risk is the chance of loss, which is inherent to the concept of profit. However, for many control systems in which risk arises from state or input constraints, such as collision avoidance in multi-agent systems, forbidden states of a robotic arm, and safe autonomous vehicles, this approach cannot be directly applied. Risk in the exploration process has been addressed through learning simulators, using external advice and prior knowledge [9, 10]. However, all of these approaches need some prior knowledge about the risk or the distance to the risk. These approaches are applicable to cases such as the risky flight altitude of an airplane, but they are not constructive for applications in which information about dangerous situations is not available, which is somehow inherent to the concept of risk.
[11] employs expert demonstration in a surgical robotic system, which provides an area with a high probability of safe task completion. The forward RL problem is then solved in this region, in conjunction with an area with a return route to this region based on the task completion cost. [8] uses the idea of an escape route and backup, and therefore, in the face of a safety crisis, a backup safe path is taken. To reduce the need for prior knowledge, which might not be available, learning from data can be leveraged and combined with prior knowledge. For example, in [12], two stages of learning are considered: in the first one, a rule-based safeguard is employed. As more data become available, the rule-based safeguard is replaced with a data-driven counterpart in the second stage. [13] identifies undesirable actions in a set of previously learned tasks and uses transfer learning. In [14], a recovery RL algorithm is proposed which employs offline data to learn about unsafe zones. Then, a recovery policy is employed which acts as a backup policy in the face of imminent risk. However, safe offline data collection demands human supervision. In addition, in a hostile environment, frequent alteration of policies might prevent reaching performance objectives. A broad class of methods in safe RL are model-based and rely on information about the system/environment or prediction of the risk. This includes shielding approaches based on reachability [15], safety modules [16], or safety layers that adjust the policy to prevent violation of safety. [17] employs a risk state estimation module, which activates the safe policy search module in the face of risk. Employing a constrained Markov decision process [18, 19] and computing a safe region of attraction [20] are other methods to tackle the safe RL problem.

Reachability analysis has also been widely used to handle safety in the exploration process by finding the set of initial states for which there is a control input that keeps the state of the system within the safe feasible set despite disturbance [21]. Moreover, at the boundary of the safe set, such methods need to switch to a controller that pushes the state back into the safe region, which can cause chattering. In [22], Gaussian regression is used to learn the disturbance, and a term is added to the cost function to incorporate the risk within the learning scheme. More relevant work on safety within the context of reachability can be found in [23], [24]. [23] presented a safe control framework based on the Hamilton-Jacobi reachability method for partially unknown systems. The safety problem is then defined as a differential game between controller and disturbance. A Gaussian process is leveraged to learn about the disturbance, and Bayesian analysis computes its bound. In [25], a safeguard layer is incorporated using trajectory-parametrized reachable set analysis, which is computed offline. Although elegant, reachability-based approaches are computationally demanding. In addition, these methods are still model-based, and offline model learning results in the dependency of safety on the accuracy of learning. Moreover, with any change in the operating regime of the system, offline learning needs to be re-initiated. Control barrier functions (CBFs) are another widely used method to guarantee the safety of control systems [26, 27, 28, 29, 30, 31]. This includes the adaptive cruise control problem in [27, 28], safe control of robots [29, 32], and collision-free multi-agent systems [30, 33].
These methods generally integrate CBFs and control Lyapunov functions and solve a point-wise quadratic programming optimization problem to certify the safety and stability of a nominal controller. CBFs are conceptually similar to Lyapunov functions and are used to ensure the forward invariance of a specific set. However, these methods require complete knowledge of the system dynamics as well as the feasible set. Therefore, it is not straightforward to integrate them with an RL framework, for which knowledge of the system model is not required. In addition, for systems that operate in uncertain environments, the safe or feasible set is uncertain: safety criteria are affected by external factors with possibly uncertain or unknown behaviors which are not known a priori. For example, in autonomous vehicles, the operation platform is highly complicated and shared between autonomous, semi-autonomous, and human-driven vehicles and pedestrians. Therefore, it is necessary to design a controller that can ensure the safety of the system despite the uncertainty in the feasible set due to the existence of unknown external agents, while achieving as much performance as possible.

To account for uncertainties in designing safe controllers, several robust and adaptive approaches have been presented. In [34], the robustness of zeroing CBFs (ZCBFs) under model perturbation is investigated. It is shown that the existence of a ZCBF ensures the input-to-state stability of the safe set under perturbations. However, external agents that affect the safety of the ego system cannot be modeled as a perturbation. In [35], an adaptive CBF (aCBF) is proposed to ensure safety despite parametric uncertainty. To reduce conservatism, [36] proposed a robust aCBF (RaCBF), which guarantees forward invariance of a tightened set within the safe set. However, in both approaches, the invariance criterion needs to be satisfied for all values of the uncertain parameters, which are not always known ahead of time. In addition, the effect of external dynamics in the environment shared with the ego system can be completely modeled as neither parametric uncertainty nor disturbance. In [37], uncertainties impacting CBFs are learned to design a safe controller for a wider class of uncertainties. However, it is assumed that the CBF for the nominal system is a CBF for the uncertain system, which is not always applicable. To partially compensate for the need for full knowledge of the dynamics, [32, 38] have proposed data-driven methods which use Gaussian processes to learn about the disturbance. [32] uses learning to explore uncertain states to expand and maximize the barrier-certified safe region by updating probabilistic parameters and decreasing the variance of the Gaussian disturbance model. Then, the least-squares method is used to find the closest control input to the nominal input which ensures safety. [38] uses a similar method to form the CBF by learning about the disturbance. It also takes optimality into account by finding the optimal control input via policy-gradient RL, which is then combined with the control input obtained from the CBF, which ensures safety. In both of these works, a nominal model is needed to form the CBF constraint and the disturbance is modeled by a Gaussian process, which is not always applicable.
To ensure convergence to the original goal and to avoid conflict between safety and performance, [39] uses an iterative search algorithm based on the sum-of-squares method to find the maximum region in which safety and stabilization are compatible. In [29], sparse optimization is used to extract the dynamical structure. The model and long-term reward are adaptively estimated, and the learned model is used at each instant to provide the required information on the dynamics to ensure safety using the CBF method for non-stationary discrete-time control systems. An RL method for handling constrained states is proposed in [40], in which a non-quadratic function that becomes dominant in case of constraint violation is incorporated in the performance functional. In this method, safety is considered a soft constraint. [41] incorporates state and input constraints in an RL framework using a penalty function and a barrier function (BF)-based state transformation; however, the possible conflict between safety and stability is not considered. Model predictive control (MPC) is another suitable framework for handling state and input constraints. In some MPC approaches, a BF is incorporated into the cost function to convert a constrained optimization problem into an unconstrained optimization problem, which provides a smooth transition of states within the feasible set [42], [43]. However, they only deal with state constraints, imposing the condition that the safe set must contain the origin, while in practical applications, safety and performance/stability might be in conflict, and safety must be prioritized. Our previous work [44] has extended those approaches to a general safe set, capable of handling even complicated and nonlinear safety criteria due to the interaction of different states. MPC-based approaches are mainly model-based, and since they are short-sighted, it is hard to guarantee the stability and feasibility of the solution in the presence of uncertainty. In [45], path planning in uncertain and dense-obstacle environments is investigated, in which a reachability set estimator of dynamic obstacles is employed to predict their threat. The CBF-based method, in contrast, ensures safety without the need for finding the reachability set, which is typically computationally demanding. Inverse RL is used in [46] to learn about the reward function and, consequently, the behavior of the human agent in the control of human-robot systems. However, this line of work assumes that the human operators or external agents choose their course of action based on a perfectly rational framework that makes optimal decisions with respect to a reward function, which might not coincide with reality and is also computationally expensive.

1.1 Organization of the Dissertation

Based on the problems elaborated above, the contributions and organization of this dissertation are as follows.

1. Chapter 2 presents a learning-based barrier-certified method to learn safe optimal controllers that guarantee the operation of safety-critical systems within their safe regions while providing optimal performance. The cost function that encodes the designer's objectives is augmented with a CBF to ensure safety and optimality. A damping coefficient is incorporated into the CBF, which specifies the trade-off between safety and optimality. The proposed formulation provides look-ahead and proactive safety planning and results in a smooth transition of states within the feasible set.
That is, instead of applying an optimal controller and intervening with it only if the safety constraints are violated, safety is planned and optimized along with the performance to minimize intervention with the optimal controller. It is shown that the addition of the CBF into the cost function does not affect the stability and optimality of the designed controller within the safe region. This formulation enables us to find the optimal safe solution iteratively. An off-policy RL algorithm is then employed to find a safe optimal policy without requiring complete knowledge of the system dynamics while satisfying the safety constraints. The efficacy of the proposed safe RL control design approach is demonstrated on lane keeping as an automotive control problem.

2. Satisfaction of the safety and stability properties of RL algorithms has been a long-standing challenge. These properties must be satisfied even during learning, for which exploration is required to collect rich data. However, satisfying the safety of actions when little is known about the system dynamics is a daunting challenge. After all, predicting the consequence of RL actions requires knowing the system dynamics. Chapter 3 presents a novel RL scheme that ensures the safety and stability of linear systems during the exploration and exploitation phases. First, the system model is learned for the sake of safety. That is, the update law is designed to assure that the actual model's safety properties are preserved by the learned model. Second, a fast and efficient learning scheme is presented to ensure that the model learning error remains within a prescribed bound with a desired convergence rate. This occurs because of the efficient deployment of data from past experiences in an off-policy RL framework. Then, the presented model and its prescribed errors are employed in a novel adaptive robustified control barrier function (AR-CBF), which guarantees that the states of the system remain in the safe set even when the learning is incomplete. Therefore, the noisy input in the exploratory data collection phase and the optimal controller in the exploitation phase are minimally altered such that the AR-CBF criterion is satisfied, and therefore, safety is guaranteed in both phases. It is shown that under the proposed prescribed RL framework, the model learning error is a vanishing perturbation to the original system. Therefore, a stability guarantee is also provided even in the exploration phase, when noisy random inputs are applied to the system.

3. Chapter 4 presents learning-enabled barrier-certified safe controllers for systems that operate in a shared environment in which multiple systems with uncertain dynamics and behaviors interact. That is, safety constraints are imposed not only by the ego system's own physical limitations but also by other systems operating nearby. Since the model of the external agent is required to impose CBFs as safety constraints, a safety-aware loss function is defined and minimized to learn the uncertain and unknown behavior of external agents. More specifically, the loss function is defined based on the barrier function error, instead of the system model error, and is minimized for both current samples as well as past samples stored in the memory to assure a fast and generalizable learning algorithm for approximating the safe set.
The proposed model learning and CBF are then integrated to form a learning-enabled zeroing CBF (L-ZCBF), which employs the approximated trajectory information of the external agents provided by the learned model but shrinks the safety boundary in case of an imminent safety violation using instantaneous sensory observations. It is shown that the proposed L-ZCBF assures safety guarantees during learning, even in the face of an inaccurate or simplified approximation of external agents, which is crucial in safety-critical applications in highly interactive environments. The efficacy of the proposed method is examined in a simulation of safe maneuver control of a vehicle in an urban area.

4. Chapter 5 integrates CBFs and information-gap theory to present robust safe controllers for the collision avoidance problem in multi-agent systems with different levels of measurement accuracy. It is assumed that agents have uncertain and inaccurate measurements of the relative distance to neighboring agents. A cooperative framework for the construction of CBFs for every two agents is employed to avoid collisions and ensure the safety of the overall system. To maximize the horizon of uncertainty under which the safety of the overall system is satisfied, the information-gap theory is leveraged to determine the contribution and share of each agent in the construction of CBFs. This results in the highest possible robustness against measurement uncertainty. It is shown that the overall system can tolerate higher measurement uncertainty and operate safely if the agent that is more confident about its measurement contributes more to the construction of the CBF. By employing the proposed approach in constructing CBFs, the possible failure of one agent in gathering accurate local data can be compensated for by cooperation between agents. The effectiveness of the proposed method is demonstrated via simulations of multi-robot systems.

5. Chapter 6 summarizes and concludes the dissertation and provides future research directions.

The contributions of this dissertation are published in [44, 47, 48, 49, 50, 51, 52, 53].

Chapter 2

Safe Reinforcement Learning: A Control Barrier Function Optimization Approach

Contents of this chapter first appeared as [50] and have been reformatted to fit the requirements of this dissertation.

2.1 Introduction

In this chapter, a safe RL scheme is proposed which is based on the optimization of a cost function that is augmented with a CBF candidate. The proposed approach is capable of handling a pre-defined safe and feasible polytope set formed by state constraints and process risk. An RL algorithm is used to learn the optimal control policy that minimizes this augmented cost function without requiring complete knowledge of the system dynamics. It is shown that sequential improvement of the controller ensures safety and stability within the safe region. The main contribution is that the concept of the CBF is unified with an RL scheme to bring together the best of both worlds, i.e., to guarantee safety in a data-driven fashion. It also provides a look-ahead and proactive approach to safety planning for smooth handling of sudden danger. Although the idea of using BFs in the cost function has been used in the context of MPC and dynamic programming, its main goal there is to convert a constrained optimization problem into an unconstrained one. However, the proposed approach here differs in the following aspects:
1. It addresses the possible conflict between safety and performance, and the safe set does not necessarily contain the origin.

2. Off-policy RL is employed, which allows learning about an optimal safe policy that minimizes the augmented cost while applying a safe and possibly conservative policy to collect data during learning. This is because off-policy RL separates the target policy (the policy we learn about) from the behavior policy (the policy we apply to the system to collect data). Rigorous proofs are provided to show that sequential improvement of the control policy provides optimality and guarantees safety. That is, the safety of the optimal solution is verified.

3. To provide optimal performance, instead of using a zeroing factor, a function is considered as a CBF that rapidly damps to zero within a specific distance of the safety boundary; this facilitates treating safety as a control objective, not only as a constraint. A parameter is incorporated in the CBF which determines the relative importance of the original control objectives with respect to safety.

2.1.1 Notations

The interior of a set C is denoted as int C and ∂C stands for its boundary. Throughout the chapter, \(\|\cdot\|_M\) denotes the weighted Euclidean norm of a vector, i.e., \(\|x\|_M = \sqrt{x^T M x}\), in which M is a positive semi-definite matrix. U is the set of all admissible control inputs. \(C^1\) denotes the set of continuously differentiable functions.

2.1.2 Organization of the Chapter

Background information, preliminaries, and the problem statement are given in Section 2. The safe optimal control approach with safety and stability proofs is provided in Section 3. Section 4 employs neural networks for estimation of the optimal controller and value function using the off-policy RL algorithm. Section 5 shows the efficiency of the proposed method by providing comprehensive simulation results, and Section 6 concludes the chapter.

2.2 Preliminaries

Consider a nonlinear system described by the following differential equation
\[
\dot{x} = f(x) + g(x)u \tag{2.1}
\]
where \(x \in C \subset \mathbb{R}^n\) and \(u \in U \subset \mathbb{R}^m\) are the state of the system and the control input, respectively. C represents the set of safe feasible states, while U denotes the set of all admissible inputs. Moreover, \(f(x) \in \mathbb{R}^n\) is the drift dynamics and \(g(x) \in \mathbb{R}^{n \times m}\) is the input dynamics. \(f(x)\) is \(C^1\) and \(f(0) = 0\). It is also assumed that the system is stabilizable. Before proceeding, the problem formulation and a short background are provided as follows.

2.2.1 Problem Statement

The goal is to design a safe optimal control policy for the system (2.1). To take optimality into account, an infinite-horizon cost function is considered and is minimized along the trajectories of the system (2.1) and within a safe set. That is, the safe optimal control problem is formulated as
\[
\min_{u \in U} J(u, x) = \int_t^{\infty} r(x(\tau), u(\tau))\, d\tau
\quad \text{s.t.}\ (2.1),\ x(0) = x_0,\ x \in C \tag{2.2}
\]
where the utility function \(r(x, u)\) is defined as
\[
r(x, u) = Q(x) + u^T R u \tag{2.3}
\]
where \(Q(x)\) is a positive-definite function and \(R\) is a symmetric positive-definite matrix, \(R = R^T > 0\). The set C is called the safe set, inside which the system's state must evolve to assure safe operation. The safe set is formed by operational inequality constraints of the system, such as actuator saturation of a robotic arm or an unsafe region of exploration of a mobile robot, and it is mathematically defined as
\[
C = \{x \mid h(x) \ge 0\} \tag{2.4}
\]
where \(h(x)\) is a continuously differentiable function of \(x\). Note that \(h(x) > 0\) represents the admissible state space that respects the safety constraints. For example, if \(-1 < x < 1\), then \(h(x) = [h_1, h_2]\) where \(h_1 = 1 - x\) and \(h_2 = x + 1\).
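As a concrete illustration of how such a box constraint is encoded as the vector-valued margin h(x) in (2.4), the following is a minimal Python sketch; the function names are illustrative and not part of the dissertation.

```python
import numpy as np

def h_box(x):
    """Safety margins h(x) = [1 - x, x + 1] for the box constraint -1 < x < 1.

    Each component is the margin to one face of the constraint; the state is
    in the safe set C = {x | h(x) >= 0} when every component is nonnegative.
    """
    return np.array([1.0 - x, x + 1.0])

def in_safe_set(x):
    return bool(np.all(h_box(x) >= 0.0))

print(in_safe_set(0.3), h_box(0.3))   # True,  margins [0.7, 1.3]
print(in_safe_set(1.2), h_box(1.2))   # False, the margin to the upper face is negative
```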
In the absence of safety constraints, using (2.3) in the cost function J in (2.2), the optimal value function is defined as [54]
\[
V^*(x) = \min_{u} \int_t^{\infty} \big(Q(x) + u^T R u\big)\, d\tau \tag{2.5}
\]
Denoting the minimizer policy by \(u^*\), the Hamiltonian function is defined as
\[
H(x, u^*, \nabla V) = r(x, u^*) + (\nabla V)^T \big(f(x) + g(x)u^*\big) \tag{2.6}
\]
The right-hand side of (2.6) is the infinitesimal equivalent of (2.5), which is a nonlinear Lyapunov equation. \(H = 0\) forms the continuous-time (CT) Bellman equation and is used for obtaining the optimal solution [54]. This framework, however, cannot guarantee safety. One standard approach to design a safe control policy for system (2.1) utilizes the concept of CBFs. We discuss it briefly in the following subsection.

2.2.2 Barrier Function

A BF is a function which is positive within a set and reaches infinity at the boundary of this set. Moreover, the BF has a negative derivative in the vicinity of the boundary, and thus it never reaches infinity. In other words, if the initial state is within a set, the existence of a BF on that set guarantees its forward invariance. BFs, or barrier certificate functions (BCFs), are defined and used to certify the safety of dynamical systems, and control barrier function (CBF) is the terminology for the same concept in control systems. Under this approach, the control input is designed to satisfy the properties of a CBF candidate. The above-mentioned properties of a CBF are formally defined as follows.

Definition 2.1. Class K Function. A continuous function \(\alpha : [0, a) \to [0, \infty)\) is a class K function if it is strictly increasing and \(\alpha(0) = 0\) [55].

Definition 2.2. CBF Properties. For the control system (2.1), the \(C^1\) function \(B : C \to \mathbb{R}\) is a CBF for the set (2.4) if there exist locally Lipschitz class K functions \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) such that [27]
\[
\frac{1}{\alpha_1(h(x))} \le B(x) \le \frac{1}{\alpha_2(h(x))}, \quad \forall x \in \mathrm{int}\, C \tag{2.7}
\]
\[
\dot{B}(x) \le \alpha_3(h(x)), \quad \forall x \in \mathrm{int}\, C \tag{2.8}
\]

Remark 2.1. The condition \(\dot{B} < 0\) can also be used instead of the CBF derivative condition (2.8) in Definition 2.2. However, compared to (2.8), it could unnecessarily shrink the sublevel sets even if they are within the desired set [26]. The condition (2.8) lets \(\dot{B}\) increase when the state is far from the boundary and makes it negative only in the vicinity of the boundary.

Remark 2.2. The control input is designed by choosing a CBF candidate that satisfies (2.7), and then (2.8) is imposed as an inequality constraint on the control problem. While elegant, this framework does not consider the optimality of the solution, and complete knowledge of the system dynamics is required to check whether the condition (2.8) is satisfied, because the trajectory information \(\dot{x}\) appears in \(\dot{B} = \frac{\partial B}{\partial x}\dot{x}\). To obviate these requirements and design an optimal safe control policy, RL will be integrated with the CBF concept in the subsequent sections.

2.3 Safe Optimal Control Approach

We present a new formulation for designing a safe and optimal control input by integrating a CBF into the performance index (2.2). The proposed approach guarantees safety in case it conflicts with other control objectives and guarantees optimal performance in a safe condition. This formulation enables us to learn an optimal safe policy in a data-driven fashion using an off-policy RL algorithm.

2.3.1 Safe Modified Formulation

To ensure safety, the cost-to-go function is augmented with a CBF term \(B_\gamma(x)\) and the performance index defined in (2.2) is modified to
\[
\min_{u \in U} J(x, u) = \int_t^{\infty} \big(Q(x) + u^T R u + B_\gamma(x)\big)\, d\tau
\quad \text{s.t.}\ (2.1),\ x(0) = x_0 \tag{2.9}
\]
where \(B_\gamma(x) : C \to \mathbb{R}\) has the following properties.

Assumption 2.1. CBF Properties. \(B_\gamma\) in (2.9) is a function with the following properties:
1. \(B_\gamma(x) > 0\ \ \forall x \in C\)
2. \(B_\gamma(x) \to \infty\ \ \forall x \in \partial C\), with \(\partial C\) the boundary of the safe set C
3. \(B_\gamma(x)\) is monotonically decreasing \(\forall x \in C\).

A coefficient \(\gamma\) is included in the CBF to specify the relative dominance of the CBF over the utility function. While any CBF that satisfies Assumption 2.1 can be used, the following candidate is used in this chapter:
\[
B_\gamma(x) = -\log\Big(\frac{\gamma h(x)}{\gamma h(x) + 1}\Big) \tag{2.10}
\]
The parameter \(\gamma\) determines how rapidly \(B_\gamma(x)\) damps as the state gets further away from the safety boundary. In other words, the coefficient \(\gamma\) trades off between safety and optimality by specifying the margin within which safety dominates the other control objectives. Compared to (2.3) and (2.5), the augmented utility function and the augmented value function are defined, respectively, as
\[
r_a(x, u) = Q(x) + u^T R u + B_\gamma(x) \tag{2.11}
\]
and
\[
V^*_{\mathrm{aug}}(x) = \min_{u} \int_t^{\infty} \big(Q(x) + u^T R u + B_\gamma(x)\big)\, d\tau \tag{2.12}
\]

Remark 2.3. In contrast to the condition (2.8), the new formulation does not impose any condition on the derivative of \(B_\gamma(x)\); the reason is that \(B_\gamma(x)\) is incorporated into the cost function, and in the vicinity of the safety boundary, \(B_\gamma(x)\) becomes the dominant term in (2.12) and the optimal controller acts in a descent direction of \(B_\gamma(x)\). In other words, \(\dot{B}_\gamma\) implicitly becomes negative near the boundary in an optimal manner without imposing any inequality constraints. Moreover, numerical methods for solving unconstrained optimization problems are applicable. Finally, safety satisfaction over a long horizon plays an important role in performing anticipatory safe planning and avoiding excessive intervention with the optimal solution.
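To make the damping role of γ in (2.10) concrete, the following is a minimal numerical sketch (not part of the dissertation); the function name is illustrative, and the use of the natural logarithm is an assumption, since any logarithm base preserves the properties in Assumption 2.1.

```python
import numpy as np

def b_gamma(h, gamma):
    """CBF candidate (2.10): B_gamma = -log(gamma*h / (gamma*h + 1)).

    h is the safety margin h(x) > 0. B_gamma blows up as h -> 0 (the boundary
    of the safe set) and decays toward zero as h grows, faster for larger gamma.
    """
    gh = gamma * np.asarray(h, dtype=float)
    return -np.log(gh / (gh + 1.0))

margins = np.array([0.01, 0.1, 1.0, 10.0])
for gamma in (0.5, 2.0, 5.0):
    print(f"gamma = {gamma}:", np.round(b_gamma(margins, gamma), 3))
# Larger gamma shrinks B_gamma away from the boundary, so the original utility
# Q(x) + u'Ru dominates there; smaller gamma keeps safety dominant over a
# wider margin, yielding a more conservative controller.
```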
Before proceeding to the next section, some definitions and assumptions are given.

Definition 2.3. The set of safe inputs. The set of safe inputs for the current state x is defined as
\[
U_c = \{u \in \mathbb{R}^m \mid x_u \in \mathrm{int}\, C\} \tag{2.13}
\]
where int C is the interior of the set defined in (2.4) and \(x_u\) is the state of the system evolved by the input u.

Definition 2.4. Admissible policy. A control policy is said to be admissible for an optimal control problem if it stabilizes the system (2.1) and its associated cost is bounded.

The following proposition shows that every admissible policy for the original optimization problem (2.2) that satisfies the safety and state constraints is an admissible policy for the modified formulation (2.9).

Proposition 2.1. A control policy is admissible for the modified optimal control problem (2.9) if and only if \(u \in U \cap U_c\), where U is the set of admissible control policies for the optimal control problem (2.2) and \(U_c\) is defined in (2.13).

Proof. The cost function in (2.9) augments a utility function with a CBF. Therefore, to have an admissible policy, in addition to \(r(x, u)\), \(B_\gamma(x)\) should also remain bounded. Since \(r(x, u)\) is bounded for \(u \in U\), and \(B_\gamma\) is bounded for \(u \in U_c\) because the state remains within the safe set, a policy results in a bounded cost function for (2.9) if
\[
u \in U \cap U_c \tag{2.14}
\]
On the other hand, since \(u \in U\) stabilizes the system (2.1) by definition, \(u \in U \cap U_c\) also stabilizes the system (2.1). This completes the proof.

The set of admissible inputs for (2.9) is now defined as \(U_a = U \cap U_c\).

Assumption 2.2. Strict interiority of the initial condition. The initial condition of (2.1) belongs to the interior of C. That is, \(x_0 \in \mathrm{int}\, C\).

Assumption 2.3. Existence of an admissible control input. We assume the set of admissible inputs for (2.9) is non-empty, i.e., \(U \cap U_c \neq \emptyset\), and for any initial condition \(x_0\) satisfying Assumption 2.2, there exists a control policy \(u(x_0) \in U_a\).

Remark 2.4. Assumptions 2.1, 2.2, and 2.3 are standard assumptions in safe control design. More specifically, the function \(B_\gamma(x)\) in (2.10) actually satisfies Assumption 2.1. However, besides the CBF in (2.10), any other CBF that satisfies this assumption would also be acceptable. Assumptions 2.2 and 2.3 imply that the system must start from a safe initial condition and that a feasible control input exists to keep the system in its safe set. If these assumptions are not satisfied, then there is no hope of maintaining the system's safety using any control strategy, and thus the system itself is ill-posed. Other assumptions are also made throughout the chapter, such as Lipschitz continuity of the system or existence of value functions, which are standard in the optimal control literature; see, for example, [54].

The Hamiltonian function \(H_j\) (2.6) for the augmented utility function (2.11) and the value function \(W_j\) is given as
\[
H_j(x, u_j, \nabla W_j) = r_a(x, u_j) + (\nabla W_j)^T (f + g u_j) \tag{2.15}
\]
Then \(H_{\min_j}\), i.e., the minimum of \(H_j\), is attained by the control input
\[
u_j^* = -\tfrac{1}{2} R^{-1} g^T(x) \nabla W_j \tag{2.16}
\]
and is given by
\[
H_{\min_j} = H_j(x, u_j^*, \nabla W_j) \tag{2.17}
\]
In the following subsections, RL is employed to solve the modified safe optimal formulation (2.9), which iteratively estimates the value function and sequentially improves the control input toward the optimal minimizer while not violating the safety constraints.

2.3.2 Safety and Performance Analysis

We now present how the formulation (2.9) trades off between safety and performance. In this approach, safety is ensured while a desired performance is maintained within the safe region. In addition, to improve safety robustness and avoid taking myopic safe actions, the CBF acts as a safety measure that is optimized over time along with the other control objectives. As a result, it provides a platform for safety planning and for specifying the importance of safety compared to other objectives. All of these goals should be achieved in an iterative method, while a closed-form solution to the value function is not available.

To prove the claims, a couple of theorems are presented. First, it is proved that the proposed approach guarantees the safety of the system. Second, the concept of the safe region is introduced. Finally, stability and optimality of the solution in the safe region are shown.

2.3.2.1 Safety Analysis

First, the existence of the value function is shown; inspired by [56], the boundedness of the CBF is then demonstrated through sequential improvement of the controller, and, finally, based on these results, the main theorem on guaranteeing safety is provided.

Lemma 2.1. Consider an admissible feedback control policy \(u_1 \in U_a\). If a time-invariant positive-definite function \(W \in C^1\) exists such that
\[
\Big(\frac{\partial W}{\partial x}\Big)^{T} \big(f(x) + g(x)u_1\big) + Q(x) + B_\gamma(x) + u_1^T R u_1 = 0 \tag{2.18}
\]
\[
W(x_0, u_1) = J(x_0, u_1) \tag{2.19}
\]
then W is the value function of the system for all \(t \in [0, \infty)\), i.e., \(W(x, u) = J(x, u)\).

Proof.
Assume \(W(x, u_1) > 0\) exists; since it is a continuously differentiable function, one has
\[
W(x(t), u_1) - W(x_0, u_1) = \int_0^t \dot{W}(x(\tau), u_1)\, d\tau = \int_0^t \frac{\partial W}{\partial x}(f + g u_1)\, d\tau \tag{2.20}
\]
Considering (2.9) and (2.11), one has
\[
J(x(t), u_1) - J(x_0, u_1) = -\int_0^t r_a(x(\tau), u_1)\, d\tau \tag{2.21}
\]
Subtracting (2.20) from (2.21) yields
\[
J(x(t), u_1) - W(x(t), u_1) = \int_0^t \Big(-\frac{\partial W}{\partial x}(f + g u_1) - r_a(x(\tau), u_1)\Big)\, d\tau + J(x_0, u_1) - W(x_0, u_1) \tag{2.22}
\]
Considering (2.18) and (2.19) in (2.22) gives
\[
J(x(t), u_1) - W(x(t), u_1) = \int_0^t \big(r_a(x(\tau), u_1) - r_a(x(\tau), u_1)\big)\, d\tau = 0
\]
Therefore, one has \(J(x(t), u_1) = W(x(t), u_1)\), which completes the proof.

Lemma 2.2. Consider positive-definite value functions \(W(x, t, u_1), W(x, t, u_2), \dots, W(x, t, u_i)\), abbreviated by \(W_1, W_2, \dots, W_i\), which are associated with the sequence of admissible inputs \(u_1(x, t), u_2(x, t), \dots, u_i(x, t) \in U_a\). If the corresponding minimized Hamiltonian values defined in (2.17) satisfy
\[
H_{\min_1} \le H_{\min_2} \le \dots \le H_{\min_i} \tag{2.23}
\]
then the CBF candidate \(B_{\gamma_j}\), \(1 \le j \le i\), at each step of the sequence is bounded.

Proof. For any j and k such that \(0 \le j \le k \le i\), assume \(H_{\min_j} \le H_{\min_k}\); consider
\[
W_k = W_j + W_d \tag{2.24}
\]
where \(W_d \triangleq W_d(x(t), u_j)\). Then, by applying \(u_k^* = -\tfrac{1}{2} R^{-1} g^T \nabla W_k\), one has
\[
H_{\min_k} = Q(x) + B_\gamma(x) + \tfrac{1}{4}\nabla W_k^T g R^{-1} g^T \nabla W_k + \nabla W_k^T \big(f + g(-\tfrac{1}{2}R^{-1}g^T \nabla W_k)\big)
\]
Considering \(L(x) = Q(x) + B_\gamma(x)\), using (2.24), and doing some manipulations yield
\[
H_{\min_k} = L(x) + \nabla W_j^T f - \tfrac{1}{4}\nabla W_j^T g R^{-1} g^T \nabla W_j + \nabla W_d^T f - \tfrac{1}{4}\nabla W_d^T g R^{-1} g^T \nabla W_d - \tfrac{1}{2}\nabla W_d^T g R^{-1} g^T \nabla W_j
\]
or, equivalently,
\[
H_{\min_k} = H_{\min_j} + \nabla W_d^T (f + g u_j^*) - u_d^{*T} R u_d^*
\]
Since \(H_{\min_k} - H_{\min_j} + u_d^{*T} R u_d^* \ge 0\), one has
\[
\frac{d W_d(x, u_j)}{dt} \ge 0
\]
In addition, \(\lim_{t \to \infty} W_d(x(t)) = 0\). Therefore, \(W_d \le 0\), and as a result, \(W_k \le W_j\). Considering the sequence in (2.23) results in
\[
W(x, t, u_1) > W(x, t, u_2) > \dots > W(x, t, u_i) \tag{2.25}
\]
In other words, \(W_j < W_1\) for all \(1 \le j \le i\). From Lemma 2.1, \(J(x(t), u_j) < J(x(t), u_1)\) for all \(1 \le j \le i\). Since \(J(x(t), u_j) = \int_t^{\infty} r_a(x(\tau), u_j)\, d\tau\) is bounded and \(r_a\) is positive definite, \(r_a\), and as a result \(B_{\gamma_j}\), are bounded. This completes the proof that the CBF is bounded at each step of the sequence.

Theorem 2.1. Consider the optimization problem defined in (2.9) and let Assumptions 2.2 and 2.3 be satisfied. Then, the states of the system evolving through sequential improvement of the control input (2.16) stay within the safe set and the safety of the system is ensured for all \(t > 0\).

Proof. Lemma 2.2 shows that the performance function \(J(x, u_j)\), and consequently the barrier function \(B_{\gamma_j}\), remain bounded after each policy improvement step (2.16). On the other hand, based on Assumption 2.1, the value of the CBF \(B_{\gamma_j}\) becomes infinite only at the boundary of the safe set. Therefore, since the barrier function remains bounded after every iteration, the system states never reach the boundary of the safe set. This in turn guarantees safety.

Remark 2.5. By using Lemma 2.1, Lemma 2.2, and subsequently Theorem 2.1, it is proved that the safety of the control system is ensured for all \(t > 0\) and \(0 < \gamma < \infty\).

2.3.2.2 Stability and Optimality Analysis

Although safety is assured in Theorem 2.1, since a term is added to the cost function, the stability and performance of the system also need to be investigated. A desired safe controller should prioritize safety in case of a conflict with the desired performance. However, it still needs to ensure stability and demonstrate good performance within the safe region.
The feasible set and safe region are defined as follows, and stability and optimality proofs are then given.

Definition 2.5. Feasible set. The set int C defined in (2.4) is considered as the feasible set.

Definition 2.6. Safe region. The safe region is defined based on the feasible set as
\[
D = \{x \mid x \in \mathrm{int}\, C - \beta(x_h^*, r_0)\}, \quad x^* \in D
\]
where \(x_h^* = \{x \mid h(x) = 0\}\), \(\beta\) is the ball around the boundary with radius \(r_0\), and \(x^*\) is the equilibrium point of the system, assumed to be the origin. The damping factor \(\gamma\) is chosen such that
\[
\frac{B_\gamma(x)}{B_\gamma(x) + Q(x)} \le 0.5, \quad \forall x \in D
\]
Therefore, within the safe region, \(Q(x)\) is the dominant term in the optimization problem.

Remark 2.6. The safe region is the set containing the origin such that the CBF is not dominant compared to \(Q(x)\).

Remark 2.7. The safe set might or might not contain the origin. However, as shown in the previous section, safety is guaranteed either way. Here, the safe region is defined for the condition that safety is not in conflict with the performance. Then, it is demonstrated that under this condition, optimality is achieved and uniform stability is also guaranteed.

Lemma 2.3. Assume that \(x = 0\) is the equilibrium point of the system (2.1), and \(D \subset \mathbb{R}^n\) contains the origin. Let \(M : [0, \infty) \times D \to \mathbb{R}\) be a continuously differentiable function such that
\[
\Lambda_1(x) \le M(t, x) \le \Lambda_2(x) \tag{2.26}
\]
\[
\frac{\partial M}{\partial t} + \frac{\partial M}{\partial x}\big(f(x) + g u\big) \le 0, \quad \forall t > 0,\ \forall x \in D \tag{2.27}
\]
where \(\Lambda_1\) and \(\Lambda_2\) are continuous positive-definite functions on D. Then, the origin is uniformly stable.

Proof. See [55], Theorem 4.8, page 151.

Theorem 2.2. The sequence of control inputs \(u_j^*\) obtained by optimization over the Hamiltonian functions (2.15), associated with positive-definite value functions \(W_j\) and the augmented utility function \(r_a\), uniformly stabilizes the system within the safe region D.

Proof. Lemmas 2.1 and 2.2 prove that \(W_j(t, x)\) is positive definite and
\[
0 < W_j(t, x) < W_1(t, x), \quad \forall 1 \le j \le i \tag{2.28}
\]
where \(W_1(t, x)\) is bounded, and one can define the positive-definite function \(\Lambda\) as
\[
\Lambda(x) = \max_t W_1(t, x) \tag{2.29}
\]
Thus, condition (2.26) is satisfied. Moreover, (2.25) proves that \(W_j\) is decreasing at each step of the sequence. Therefore, using the results of Lemma 2.3, the control system (2.1) is uniformly stable.

Remark 2.8. The CBF in (2.9) is included as a safety objective to ensure safety for any value of \(0 < \gamma < \infty\). Meanwhile, the trade-off between \(Q(x)\) and \(B_\gamma(x)\) within the safe region is specified by the coefficient \(\gamma\); larger values of \(\gamma\) speed up the damping of \(B_\gamma(x)\) as the state moves further away from the safety boundary, retaining the original utility function \(r = Q(x) + u^T R u\), while smaller values of \(\gamma\) lead to more emphasis on safety and a more conservative control design. The CBF candidate \(B_\gamma(x) = -\log\big(\frac{\gamma h(x)}{\gamma h(x) + 1}\big)\) rapidly goes to zero; for example, for \(h(x) = 1\) and \(\gamma = 5\), \(B_\gamma(x) = 0.08\). In other words, one may design \(\gamma\) in the safe region such that \(B_\gamma(0)\) gets arbitrarily close to zero, which means that, depending on the application, optimality of the controller is achievable. This is proved in the following theorem.

Theorem 2.3. Assume that the equilibrium point of the system is located at the origin. Assume (2.9) has a minimizer in the safe region, denoted by \(u^*\). Then, by proper selection of \(\gamma\) within the safe region, the minimum can get arbitrarily close to zero and
\[
\lim_{\gamma h(x) \to \infty} r_a(u^*) = 0
\]

Proof. For an arbitrarily small value \(\epsilon\), define \(\gamma_1 = \frac{e^{-\epsilon}}{h\,(1 - e^{-\epsilon})}\). Then, for any \(\gamma \ge \gamma_1\), one has
\[
0 \le r(u^*) + B_\gamma(x_{u^*}) \le r(u^*) + B_{\gamma_1}(x_{u^*}) \le r(u^*) + \epsilon
\]
In other words, \(\forall \epsilon > 0\ \exists \gamma\) such that
Bγ (x) ≤ ϵ For ϵ = 0 and a finite value of h(x), γ1 → ∞. Therefore, γh(x) → ∞; which completes the proof showing that the minimum of augmented utility function converges to zero. Remark 2.9. Based on Theorem 2.3, the optimal solution is feasible if h(x) has a finite value and γ is selected properly. Remark 2.10. From the theoretical perspective, the convergence of the proposed approach to the origin within the safe region is guaranteed if γ is large enough. However, in practice, 27 the value of γ depends on the physical system and we can achieve convergence with even small values of γ for some systems. For example, in the lane changing problem in Section 5, the states of the system have reached the origin with γ1 = 0.95 and γ2 = 2. 2.4 Algorithm for Safe Reinforcement Learning In this section, an off-policy RL algorithm is presented to find a safe solution to the op- timization problem (2.9). First, the off-policy RL algorithm is presented and then, neural networks (NNs) are used to approximate its solution for systems with the lack of knowledge about their dynamics. 2.4.1 Safe Off-policy Reinforcement Learning Algorithm Off-policy RL is a policy iteration algorithm to find an optimal controller without requiring the knowledge on the system dynamics [57, 58, 59]. This method uses two different policies, called behavior policy and target policy. The behavior policy is a safe policy that is applied to the system for gathering data and the target policy is a policy that is updated toward the optimal policy using the collected data. Any available prior knowledge about the system dynamics can be used to find a safe but possibly conservative behavior policy to ensure safety during learning. The safety of optimal policy found by iterating on the target policy is also guaranteed based on Theorem 2.1. The optimal safe policy is applied to the system once the learning is finished. The details of the proposed off-policy RL algorithm is provided in the following. Considering the augmented cost function (2.9), its infinitesimal version is the Bellman equation 0 = JxT ẋ + ra + Bγ (2.30) 28 where ∂J ∂x J˙ = = JxT ẋ (2.31) ∂x ∂t In the off-policy approach, the dynamics (2.1) is rewritten to separate the behavior policy and the target policy. This yields ẋ = f (x) + g(x)ui + g(x)(u − ui ) (2.32) where ui is the target policy which is updated in the algorithm but not applied to the system; while u is the behavior policy which is applied to the system to generate data for learning. Integrating from both sides of (2.31) and considering (2.30) and (2.32) yield Z t Z t Z t i i J (x(t)) − J (x(t − T )) = − (Q(x) + Bγ (x))dτ − iT i u Ru dτ + (JxiT g(x)(u − ui ))dτ t−T t−T t−T (2.33) The control input ui is updated by optimizing over the Hamiltonian function ui+1 = −0.5R−1 g T Jxi (2.34) Substituting g T Jxi term in (2.33) using (2.34) yields the off-policy Bellman equation Z t Z t i i J (x(t)) − J (x(t − T )) = − (Q(x) + Bγ (x))dτ − uiT Rui dτ t−T t−T Z t −2 (u(i+1)T R(u − ui ))dτ (2.35) t−T In the off-policy Bellman equation (2.35), both control policy (i.e. ui+1 ) and value function (i.e. J i ) are updated simultaneously for a given target policy ui using collected data by applying the behavior policy u. 29 Remark 2.11. Compared to on-policy method that improves the same policy that is applied to the system, the off-policy RL algorithm is a data-efficient method in which the learning agent evaluates as many policies as required without even applying them to the system using only a set of collected data. 
Being able to evaluate possibly unsafe policies without even applying them to the system is of vital importance for safety-critical systems. Lemma 2.4. Off-policy Bellman equation (2.35) is equivalent to the Bellman equation (2.30) and both have the same update law (2.34). Proof. Equations (2.30) - (2.35) demonstrate that off-policy Bellman equation is obtained by manipulating Bellman equation (2.30) and update law (2.34). Interchangeably, the Bellman equation can be obtained using (2.35). By dividing both sides of (2.35) by T and taking the limit from both sides, one has J i (x(t)) − J i (x(t − T )) lim T →0 T Rt Rt Rt − t−T (Q(x) + Bγ (x))dτ − t−T uiT Rui dτ − 2 t−T (u(i+1)T R(u − ui ))dτ − lim =0 T →0 T By using L’Hopital’s rule, one has J˙i (x(t)) + (Q(x) + Bγ (x) + uiT Rui + 2ui+1 R(u − ui )) = 0 using (2.31) and (2.32), one has Jx iT (f (x) + g(x)ui + g(x)(u − ui )) + Q(x) + Bγ (x) + uiT Rui + 2ui+1 R(u − ui ) = 0 Then, using (2.34), one has Jx iT (f (x) + g(x)ui + g(x)(u − ui )) + Q(x) + Bγ (x) + uiT Rui − Jx iT g(x)(u − ui ) = 0 which is equivalent to the Bellman equation (2.30). This completes the proof. 30 2.4.2 Neural Network Approximation of Safe RL Algorithm In this section, the solution to the off-policy RL algorithm is learned using an actor-critic structure which does not require knowledge of the system dynamics. The critic network estimates the value function J i and the actor network represents the control input ui+1 as follows Jˆi (x) = Ŵ i Φ(x) (2.36) ûi+1 (x) = P̂ i Ψ(x) (2.37) where Φ = [Φ1 Φ2 ... ΦlΦ ] ∈ RlΦ and Ψ = [Ψ1 Ψ2 ... ΨlΨ ] ∈ RlΨ are the suitable activation functions for critic and actor networks with lΦ and lΨ neurons, respectively; in addition, Ŵ i ∈ RlΦ and P̂ i ∈ Rm×lΨ are the weight vectors. Note that ui+1 in (2.34) is estimated by a NN as (2.37) and no knowledge about the system dynamics is required. We define v i = [v1i ..., vm i ] = u − ui and define the error on off-policy Bellman equation (2.35) using (2.36) and (2.37) [58], Z t i iT iT e (t) = Ŵ (Φ(x(t))) − Ŵ (Φ(x(t − T ))) − (−Q(x) − Bγ (x) − uiT Rui )dτ t−T Xm Z t +2 ρj (P̂jiT Ψ(x(t))vji )dτ (2.38) j=1 t−T where ρj is the j th diagonal element of R and P̂jiT is the j th column of P̂ iT . The least squares method is used to obtain the minimum of Bellman approximation error (2.38). In doing so, (2.38) is rewritten in regression form as iT y i (t) + ei (t) = Ŵt h(t) (2.39) 31 i in which Ŵt is a matrix composed of weight vectors as iT Ŵt = [Ŵ iT , P̂1iT , . . . , P̂miT ] and hi (t) is    Φ(x(t)) − Φ(x(t − T ))   2ρ1 t (Ψ(x(t))v i )dτ   R  t−T 1 hi (t) =  (2.40)   ..    .    Rt  i 2ρm t−T (Ψ(x(t))vm )dτ and y i (t) is Z t i y (t) = (−Q(x) − Bγ (x) − uiT Rui )dτ (2.41) t−T We collect the state and input data at N points at the time interval T to solve (2.39) for iT Ŵt . Let the collected information be saved in matrices H i and Y i as H i = [hi (t1 ) , . . . , hi (tN )] Y i = [y i (t1 ) , . . . , y i (tN )]T Therefore,the least-square equation is iT Ŵt Hi = Y i (2.42) and its solution is iT Ŵt = (H i H iT )−1 H i Y i (2.43) 32 Equation (2.43) has a solution if N > l1 + ml2 (2.44) Remark 2.12. The CBF is the dominant term in the vicinity of the risky area, while it rapidly damps as gets further away from the safety boundary. As a result, for having a reliable training, one needs to collect samples from both the safe region and the region in vicinity of the safety boundary in which the CBF comes to play. Remark 2.13. 
The off-policy RL algorithm provides an optimal and safe solution to the optimization problem defined in (2.9). This is because, the off-policy RL algorithm applies a safe (possibly conservative) policy to the system while learning about an optimal and safe policy. Only the behavior policy requires partial knowledge of the dynamics and the learning process is model free. Theorem 2.4. Algorithm 1 converges to a safe optimal solution. Proof. Algorithm 1 iterates on the off-policy Bellman equation (2.35) with update law (2.34). According to Lemma 2.4, the off-policy Bellman (2.35) is equivalent to Bellman equation (2.30) with the same update law (2.34). On the other hand, it is shown in Lemma 2.2 that the value function obtained by iterating on the Bellman equation is monotonically decreasing and bounded. Therefore, Algorithm 1 converges to the optimal solution. Remark 2.14. Note that the behavior policy is assumed to be an exploratory policy and provides rich data for learning, i.e., the collected data guarantees that the least-square equa- tion (2.42) has a feasible solution. Under this condition, similar to [59], it can be shown that at each iteration of Algorithm 1, the controller’s weights converge to their desired values, which in turns results in an admissible policy that makes the system stable and the value function bounded. Boundedness of the value function guarantees safety at each iteration. 33 Algorithm 1 Safe Off-policy RL 1: Initialize actor and critic networks (2.36), (2.37). 2: procedure Data Collection 3: Employ the initial noisy stabilizing control policy u ∈ Ua as (2.14) until (2.44) is satisfied. This input must bring the system in vicinity of the risky area as well for reliable learning. 4: end procedure 5: procedure Find an optimal solution by reusing the collected data 6: For all t = t1 , ..., tN , given ui , obtain matrices hi (t) and y i (t) as (2.40), (2.41). 7: Find NNs weights using (2.43), update J i and uj+1 in (2.36), (2.37). 8: Stop if a stopping criterion is met, otherwise set i = i + 1 and go to 5. 9: end procedure 2.5 Simulation Results The efficiency of the proposed method is examined in lane keeping problem for an au- tonomous vehicle. This problem aims to keep the car centered in the lane in spite of possible curvature of the road. In addition to this regulation objective, there is a safety objective which specifies the maximum allowable lateral displacement of the car according to the width of the road. The linear tire-force model and constant longitudinal speed is considered [28]. More details on system model and its formulation can be found in [60]. The state model of the system is given as           ẏ  0 1 vl0 0  y   0  0 bCr −aCf     Cf   v̇  0 − Cf +Cr 0      M vl0 M vl0 − vl0   v   M      0  =   +  u +  d       ϕ̇  0 0 0 1  ϕ  0  −1                   bCr −aCf Cf ψ̇ 0 Iz vl0 0 0 ψ a Iz 0 where y and v are the lateral displacement and its velocity, respectively, while ymax and ymin show the maximum and the minimum allowable displacement from the center of the road. ϕ is the error yaw angle and ψ is its derivative. u is the steering angle, while d is the desired vl0 yaw rate obtained from the curvature of the road as d = Rr ; vl0 is the longitudinal speed and Rr is the road radius of curvature. M is the total mass of the car and Iz is its moment 34 of inertia with respect to the center of the mass. Cr and Cf are stiffness parameters of tire. 
Finally, a and b denote the distances of the front and rear tires to the center of mass. The values of the parameters used in the simulation are given in Table 2.1.

Table 2.1: Simulation Parameters
  M = 1650 kg                     |Rr| = 0 → 0.1
  Iz = 2315.3 kg·m^2              ymax, ymin = 0.45, −0.45 m
  v0 = 27.7 m/s                   Q = 2 × I(n×n)
  Cf = 133000 N/rad               R = 1
  Cr = 98800 N/rad                γ1, γ2 = 0.95, 2
  a = 1.11 m                      b = 1.59 m

To have a unified notation, the states of the system are denoted by x = [x1, x2, x3, x4]^T = [y, v, ϕ, ψ]^T. The modified formulation (2.9) is employed with the following utility function

ra(x, u) = x^T Q x + u^T R u − log( γ1(x1 − ymin) / (γ1(x1 − ymin) + 1) ) − log( γ2(−x1 + ymax) / (γ2(−x1 + ymax) + 1) )

where Q, R, γ1, γ2 are design parameters. The activation functions for the critic and actor networks are chosen, respectively, as

Φ(x) = [x1^2, x2^2, x3^2, x4^2, x1x2, x1x3, x1x4, x2x3, x2x4, x3x4, (x1 − ymax), x1^4]
Ψ(x) = [x1, x2, x3, x4]^T

These networks are then trained using the off-policy Algorithm 1. The critic and actor networks should be trained on safe states as well as on states close to the risky region, so that the learned networks are reliable in recognizing risk. After six iterations, the learning process is completed. The lateral displacement of the car is shown in Figure 2.1 with and without incorporation of the CBF. As can be seen, after learning, the states of the system stay within the safe region and do not exceed the limits. The trajectories of the other states of the system are given in Figure 2.2 and converge to the origin. The actor and critic weights are shown in Figures 2.3 and 2.4, respectively; graphs with different ranges are plotted separately.

Figure 2.1: Lateral displacement with and without CBF
Figure 2.2: The states of the system
Figure 2.3: Actor weights
Figure 2.4: Critic weights

2.6 Conclusion

In this chapter, a safe off-policy RL scheme is proposed which trades off safety and performance. This method guarantees and plans for safety by incorporating a CBF term into the cost function and forming an augmented value function. Using an iterative approximation of the augmented value function, the application of the CBF is extended to a data-driven approach. A rigorous proof of safety is presented. The notion of a safe region is introduced for the case of no conflict between safety and performance, and proofs of stability and optimality in the safe region are derived accordingly.

Chapter 3

Reinforcement Learning based Control Design with Safety and Stability Guarantees During Exploration

Contents of this chapter first appeared as [53] and have been reformatted to fit the requirements of this dissertation.

3.1 Introduction

Proper learning-based algorithms require satisfaction of the persistence of excitation (PE) condition. The PE condition is typically satisfied by applying noisy inputs to the system to excite all its dynamical modes.
Since this noise is random and arbitrary, it might result in violation of safety. All methods mentioned in Chapter 1 need information about the system dynamics, environment or human supervision for safe exploratory data collection. This chapter proposes a novel off-policy RL algorithm with prescribed learning perfor- mance with safety and stability guarantees during exploration and exploitation phases. To the best of our knowledge, it is the first time that safety and stability guarantees of the system during the excitation of the system in the presence of noisy input is ensured without any external knowledge about the risk, dynamics or environment. The schematic of the presented idea is depicted in Figure 3.1 as two main interconnected modules: i) a prescribed learning method with verifiable PE condition. ii) a robustified safe control design. In the first 39 module, experience replay-based safe model learning along with an off-policy RL algorithm are employed to present a framework to specify conditions under which the learning can be prescribed and how the data quality affects it. This method is capable of guaranteeing the exponential convergence of the learning error to zero with a prescribed bound that can be considered as a vanishing perturbation term to the nominal system, enabling stabilizing controller design. The outcome of the first module is then employed in designing a novel adaptive robustified control barrier function (AR-CBF). AR-CBF benefits from learning to compensate for uncertainties without being overly conservative and accounts for estimation error to guarantee safety despite learning inaccuracy. Any policy that satisfies this criterion assures the safe performance of the system. Since AR-CBF criterion is built based upon the current approximation of the dynamics, it can ensure safety during learning. The safe and stabilizing input obtained in the robustified design module is employed to collect more safe exploratory data. This collected data is then repetitively used to update the approxi- mation of the dynamics and find the optimal target policy. The relationship between these two modules is reciprocal. The proposed learning approach provides a better description about the behavior of model learning error and its bound without excess conservatism, and the robustified design module enables deriving a noisy random and yet safe and stabilizing controller for further data collection. As the learning improves, the AR-CBF converges to the nominal CBF exponentially fast and provides more room for taking safe actions. When the optimal target policy is found, it is minimally altered to respect AR-CBF and safely applied to the system. Therefore, even if the system model is not perfectly approximated, the safe and optimal target policy can be successfully found and be applied to the system. In a nutshell, the contributions of the chapter are as follows. 1. Proposing a learning-enabled safe model-free RL framework with safety and stability guarantee during data collection, exploration, and exploitation without external advice. 2. Integrating efficient RL with prescribed learning and verifiable PE condition in con- junction with a robustified formulation. 40 Figure 3.1: Overview of the proposed approach 3. Employing prescribed performance in the stability analysis based on perturbation the- ory. 4. Presenting a novel AR-CBF for safe control of uncertain systems with safety verification during learning. 
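As a preview of how the robustified design uses the prescribed learning behavior (formalized later in (3.33) and (3.34)), the short sketch below illustrates the idea: the nominal CBF condition is tightened by a margin proportional to the current model-learning error bound, and since that bound decays exponentially under the experience-replay learner, the tightened condition relaxes to the nominal one as learning progresses. The decay rate, the initial error, and the gradient norm below are placeholders chosen only for illustration.

```python
import numpy as np

def error_bound(t, e0=2.0, rate=0.5):
    """Prescribed learning-error envelope ||psi_tilde(t)|| <= e0 * exp(-rate * t)."""
    return e0 * np.exp(-rate * t)

def arcbf_margin(grad_hG_norm, t):
    """Robustifying margin subtracted from the nominal CBF condition.

    Stands in for the ||(dh/dx) G(t)|| * a term of the AR-CBF, with the
    worst-case bound a replaced by the exponentially decaying envelope."""
    return grad_hG_norm * error_bound(t)

if __name__ == "__main__":
    grad_hG_norm = 1.5    # hypothetical value of ||(dh/dx) G(t)|| at the current state
    for t in [0.0, 1.0, 2.0, 5.0, 10.0]:
        print(f"t = {t:5.1f} s   AR-CBF robustifying margin = {arcbf_margin(grad_hG_norm, t):.4f}")
```

As the margin vanishes, the set of inputs admitted by the AR-CBF grows toward the set admitted by the nominal CBF, which is the behavior exploited during safe data collection.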
3.1.1 Organization of the Chapter

The problem statement is given in Section 3.2. Background information on CBFs and RL techniques is presented in Section 3.3. The robustified safety and stability design using the experience replay method is given in Section 3.4. Section 3.5 presents the proposed barrier-certified off-policy RL algorithm. Section 3.6 presents the simulation results, and Section 3.7 concludes the chapter.

3.2 Problem Statement

Consider a continuous-time linear system

ẋ = Ax + Bu (3.1)

where x ∈ R^n is the system state and u ∈ R^m is the control input. It is assumed that the system is stabilizable.

Assumption 3.1. The dynamics and input matrices A ∈ R^{n×n} and B ∈ R^{n×m} are unknown. Moreover, their initial approximations Â0 and B̂0 can be chosen arbitrarily within the set {(Â0, B̂0) | (Â0, B̂0) is stabilizable}.

The control objective is to design u to optimize a performance function while assuring satisfaction of safety specifications. The safety objective is to ensure that, as the system's states evolve according to (3.1), they never leave a safe set C, i.e., x(t) ∈ C, ∀t ≥ 0, where the safe set is formed using a safety criterion as

C = {x | h(x) ≥ 0} (3.2)

where h(x) : R^n → R is a smooth function. The performance objective encodes the quality of the control solution in achieving a goal. For the optimal stabilization problem, the long-term cost function is typically chosen as

J = ∫_0^∞ (x^T Q x + u^T R u) dτ (3.3)

where Q = Q^T is a positive semi-definite matrix and R = R^T is a positive definite matrix. It is assumed that (A, Q^{1/2}) is observable.

Remark 3.1. Safety and performance can be in conflict, and the performance level that can be achieved safely depends on the uncertainty level. Therefore, possible conflicts between safety and performance are considered in the proposed framework. When conflicts arise, safety satisfaction is prioritized by imposing it as a hard constraint, while the performance is treated as a soft constraint.

Therefore, the controller is of the form

u = −Kx + δ (3.4)

where u* = −Kx is the optimal controller obtained by minimizing (3.3) without considering safety constraints, while δ is a safety modifier added to the optimal feedback policy to certify the safety of the system while minimally altering its actions. In case of no conflict between safety and performance, δ = 0.

Finding the optimal control policy for uncertain systems is not directly possible and demands iterative approaches that approximate the optimal controller and the value function using neural networks (NNs). This, however, does not account for the safety of the system. Safety and stability guarantees are especially challenging at the beginning, when rich data must be collected for training the NNs. This chapter presents a method with safety and stability guarantees in the data collection, exploration, and exploitation phases.

3.3 Background

In this section, background on CBFs and the off-policy RL algorithm is briefly reviewed.

3.3.1 Control Barrier Functions

CBFs provide conditions on the control input that restrict the trajectories of the system to evolve in a pre-defined safe set by ensuring forward invariance of that set. Thus, by starting initially within the safe set and designing the controller to respect the CBF conditions, the safety of the system is guaranteed. The zeroing CBF (ZCBF), as one major form of CBFs, is formally defined as follows.

Definition 3.1.
A continuous function α : (−b, a) → (−∞, ∞) with a, b > 0 is an extended class K function, if it is strictly increasing and α(0) = 0 [55, 34]. ■ 43 Definition 3.2. Considering the dynamical system (3.1) and the set C ⊂ Rn (3.2) defined using a C 1 function h(x), if there exists a locally Lipschitz extended class K function α such that ∂h ∂h sup [ A+ Bu + α(h(x))] ≥ 0, ∀x ∈ D (3.5) u∈U ∂x ∂x then, the function h(x) is a ZCBF on D with C ⊆ D ⊂ Rn [28]. ■ The set of safe control inputs for h(x) is formed accordingly as ∂h ∂h Um (x) = {u ∈ U | A+ Bu + α(h(x)) ≥ 0} ∂x ∂x Ensuring the forward invarinace of a set using ZCBFs is the result of the following theorem. Theorem 3.1. Given dynamical system (3.1) and the set C ⊆ D (3.2) defined for a C 1 function h(x), if h is a ZCBF on D, any Lipschitz continuous controller {u : D → R|u ∈ Um (x)} renders the set C forward invariant. Proof. See [28]. Remark 3.2. Note that complete knowledge of the system dynamics, i.e., A and B matrices are required to guarantee (3.5). To obviate these requirements, a novel robustified CBF is proposed, which accounts for a non-conservative bound of error as well. 3.3.2 Adaptive Optimal Control Design Having A and B known for the system (3.1), the optimal value function for the objective function (3.3) in the form of [54] V (x) = xT P x (3.6) 44 where P is the solution of well-known algebraic Riccati equation (ARE) AT P + P A + Q − P BR−1 B T P = 0 (3.7) which is quadratic in P . To sidestep the difficulty of solving quadratic equations, the Bellman equation is iteratively solved. Considering (3.3) and (3.6), the Bellman equation is formed as Z t+δt T T x(t + δt) P x(t + δt) − x(t) P x(t) = (xT Qx + uT Ru)dτ (3.8) t To iteratively solve the Bellman equation, and by having K0 ∈ Rm×n as a stabilizing feedback gain matrix, the Lyapunov equation is formed (A − BKk )T Pk + Pk (A − BKk ) + Q + Kk T RKk = 0 (3.9) where Pk = Pk T is the solution of (3.9) and is positive definite. Then, Kk is recursively defined as Kk = R−1 B T Pk−1 , k = 1, 2, ... (3.10) Then, one achieves the following properties: 1) A − BKk is Hurwitz 2)P ∗ ≤ Pk+1 ≤ Pk 3)limk→∞ Kk = K ∗ , limk→∞ Pk = P ∗ where P ∗ is the solution of ARE (3.7) and K ∗ is the optimal feedback gain. Therefore, the solution of ARE is approximated by iteratively solving (3.9) which is linear with respect to Pk . However, A, B are needed in (3.10). To overcome this issue, [57] proposed online strategy 45 to solve (3.9) when the system is fully unknown. The system (3.1) is re-written as ẋ = Ak x + B(Kk x + u) (3.11) where Ak = A − BKk . Then, using (3.9), (3.10) and (3.11), the off-policy Bellman equation is formed x(t + δt)T Pk x(t + δt) − x(t)T Pk x(t) Z t+δt = [xT (Ak T Pk + Pk Ak )x + 2(u + Kk x)T B T Pk x]dτ t Z t+δt Z t+δt T =− x Qk xdτ + 2 (u + Kk x)T RKk+1 xdτ (3.12) t t where Qk = Q + Kk T RKk . (3.12) is equivalent to the on-policy Bellman equation (3.8). However, this method does not consider safety of the system. In this chapter, a novel method to certify the safety of this algorithm is proposed. 3.4 Robustified Safety and Stability using Experience Replay Learning In an off-policy algorithm, the behavior policy is applied to the system to collect data. A NN is assigned to learn about the dynamics of the system which its weights are updated by means of replaying the past experiences. 
After applying a few initial policies, a mild rank condition is satisfied which ensures in continuation of dynamics approximation, the learning error exponentially fast converges to zero with a predefined rate. This prescribed behavior of the learning error along the current rough approximation of the system is employed in design of robustified safe and stabilizing controller which is then integrated to the off-policy learner for safe data acquisition. In this section, the experience replay approximation and the prescribed behavior of the 46 learning error is presented. It is shown that using this learning platform, the learning error is a vanishing perturbation to the system and condition for having a stabilizing controller is derived. Finally, a novel non-conservative robustified CBF is presented which ensures safety during learning. 3.4.1 Experience Replay System Approximation The system dynamics (3.1) can be written in the form of ẋ = W ϕ(x, u) (3.13) where W = [A, B] ∈ Rn×(n+m) and ϕ(x, u) = [x, u]T ∈ R(n+m)×1 . The system dynamics (3.13) is written as a compact linear form ẋ = G(t)ψ (3.14) 2 2 +nm)×1) where G(t) ≜ ϕ(x, u)T ⊗In ∈ R(n)×(mn+n ) and ψ = vec(W ) ∈ R((n . Let ψ̂ be a rough estimation of ψ and ψ̃ = ψ − ψ̂ be the estimation error. The following filters are applied to ẋ, G(t) in (3.14) and ϕ(x, u) in (3.13) in terms of σ, Ω and xs , respectively as σ̇(t) = −βσ(t) + ẋ (3.15) Ω̇(t) = −βΩ(t) + G(t) (3.16) ẋs (t) = −βxs (t) + ϕ(x, u) (3.17) where β > 0 is a design gain and Ω(0) = 0, xs (0) = 0. The filtered signal Ω in (3.16) can be written using xs in (3.17) as Ω(t) = xs T ⊗ In 47 The solution of (3.15), (3.16) and (3.17) are given, respectively as Z t −βt σ(t) = e eβτ ẋ(τ )dτ (3.18) 0 Z t −βt Ω(t) = e eβτ G(τ )dτ (3.19) 0 Z t −βt xs (t) = e eβτ ϕ(x, u)dτ (3.20) 0 2 where σ ∈ Rn , Ω ∈ Rn×(mn+n ) , and xs ∈ R(m+n) . The system dynamics (3.13) can be written using filtered signals as σ(t) = W xs (3.21) Using (3.14), (3.18) and (3.19), one has σ(t) = Ω(t)ψ (3.22) From (3.18) and using integration by part, σ can be expressed in terms of known variables x(t) and xs (t) as σ(t) = x(t) − eβt x(0) − βxs (t) (3.23) According to (3.22) and (3.23), the prediction error is defined as e(t) = σ(t) − Ω(t)ψ̂(t) (3.24) where ψ̂ is an estimation of ψ, and ψ̂ = vec(Ŵ ). In order to store and use the past data in the update law, two memory stacks {σi }i=1:p , {Ωi }i=1:p are employed, which store the values of σ(ti ) and Ω(ti ), respectively at each time instance ti . The prediction error at time 48 constant ti is defined accordingly as ei (t) = σi − Ωi ψ̂(t) (3.25) The following update law using the past stored data is then employed p ˙ X ψ̂ = βψ1 ΩT (t)e(t) + βψ2 ΩT i ei (t) (3.26) i=1 where βψ1 and βψ2 are positive scalar gains. This update law ensures exponential convergence of ψ̂ to ψ under a rank condition and in the presence of enough stored data. This result if formally represented as follows. Lemma 3.1. [61] Considering the dynamics (3.26), if there exists p∗ such that for all p ≥ p∗ , for any sequence t1 <2 < .... < tp , rank([Ω1 T , Ω2 T , ..., Ωp T ]) = mn + n2 (3.27) Then, using the update law (3.26) ψ̂ converges to ψ exponentially fast with employing the Lyapaunov function Vψ = 0.5ψ̃ T ψ̃ and there exists a positive gain βψ12 such that V̇x ≤ −2(βψ12 )Vx (3.28) Remark 3.3. In a nutshell, in experience-replay dynamics approximation, the regressor form of the dynamics is derived and an update law which incorporates past stored data is employed. 
By satisfaction of a rank condition (3.27), and considering (3.28) the learning error exponentially fast converges to zero. In other words, after mn + n2 number of samples are collected fast convergence of error to zero is ensured with a prescribed rate. Remark 3.4. The significance of this method in safe RL exploration is twofold. First, 49 since the learning error exponentially converges to zero, the error term can be taken as a vanishing perturbation to the approximated dynamics and therefore perturbation theory can be employed to design a controller based on the approximated dynamics which guarantees stability for the true dynamics. Second, an accurate bound of learning error can be derived. This bound can be taken as a non-conservative worst case in formation of the novel robustified CBF. 3.4.2 Stability Analysis Stability analysis is performed based on control Lyapunov function (CLF). However, due to uncertainty in the model, CLF is built based on the available approximated model and its validity for the original system needs to be investigated. Having the experience replay model learning enables designing stabilizing controllers for the true system based on the approximated dynamics. Theorem 3.2. Let x = 0 be an exponentially stable equilibrium point of the following closed loop approximated system with stabilizing feedback gain k ẋ = Âx − B̂kx (3.29) Let V (x) = xT P x be the Lyapunov function for (3.29), where P is the solution to the Lyapunov equation. Suppose that update law (3.26) is employed and (3.27) is satisfied. Then the origin is exponentially stable for the original system (3.1). Proof. By considering (3.14) in closed loop format and taking Âc = A − Bk where k is the feedback gain, (3.1) can be written as ẋ = Âc x + Gψ̃ (3.30) Model learning error Gψ̃ = W̃ [x, −kx] which is equal to zero at the origin. Therefore, error 50 term vanishes at the origin. Furthermore, satisfaction of (3.27) results in satisfaction of (3.33). Therefore, one has G(t)ψ̃ ≤ G(t)ψ̃(0) Thus, G(t)ψ̃ ≤ Ãc (0)x Thus, the error term satisfies linear growth bound and there exists a coefficient γ such that G(t)ψ̃ ≤ γ||x|| Therefore, the error term Gψ̃ is a vanishing perturbation to the approximated system ẋ = Âc x. Since Âc is Hurwitz and P = P T is the solution to the Lyapunov equation P Âc + ÂTc P = −Q Then, the quadratic Lyapunov function V = xT P x satisfies the following properties [55]. λmin (P )||x||2 ≤ V (x) ≤ λmax (P )||x||2 ∂V Âc x = −xT Qx ≤ −λmin (Q)||x||2 ∂x ∂V || || = ||2xT P || ≤ 2λmax (P )||x|| ∂x The derivative of V (x) along the trajectory of the original system (3.30) becomes ∂V ∂V V̇ (x) = Âc x + Gψ̃ ∂x ∂x 51 which satisfies V̇ (x) ≤ −λmin (Q)||x||2 + 2λmax (P )γ||x||2 Thus, if λmin (Q) γ≤ (3.31) 2λmax (P ) Then, the origin of the original system (3.1) is exponentially stable. This completes the proof. In other words, by employing experience-replay model learning, the modeling error can be taken as a vanishing perturbation to the approximated system, where the bound in (3.31) depends on choice of Q. Hence, stability analysis for the approximated system is valid in stability guarantee for the original system with proper design. 3.4.3 Adaptive Robustified CBF The ARCBF condition is a stricter version of the CBF based on a rough estimation and worst-case model-learning’s error which its satisfaction ensures the safety of the system. Definition 3.3. Consider the dynamical system (3.13) and the set C ⊂ Rn (3.2) defined using a C 1 function h(x). 
Let there exist a locally Lipschitz extended class K function α such that ∂h ∂h sup [ G(t)ψ̂ − || G(t)||a + α(h(x))] ≥ 0, ∀x ∈ D (3.32) u∈U ∂x ∂x where a is the bound of estimation error as ||ψ̃|| ≤ a. Then, the function h(x) is an ARCBF on D with C ⊆ D ⊂ Rn . ■ 52 The set of safe control inputs for h(x) is formed accordingly as ∂h ∂h Ur (x) = {u ∈ U | G(t)ψ̂ − || G(t)||a + α(h(x)) ≥ 0} ∂x ∂x Ensuring the forward invariance of a set using ARCBF is presented in the following theorem. Theorem 3.3. Consider dynamical system (3.1), and its compact form (3.14) with estima- tion update law given by (3.26) and error bound of a as defined in Definition 3.3, and the set C ⊆ D (3.2) defined for the C 1 function h(x). If h is an ARCBF on D, then any Lipschitz continuous controller {u : D → R|u ∈ Ur (x)} renders the set C forward invariant. Proof. Considering (3.32), one has ∂h ∂h G(t)ψ̂ − || G(t)||a + α(h(x)) ≤ ∂x ∂x ∂h ∂h G(t)ψ̂ − || G(t)ψ̃|| + α(h(x)) ≤ ∂x ∂x ∂h ∂h G(t)ψ̂ + G(t)ψ̃ + α(h(x)) ∂x ∂x Therefore, ∂h ∂h G(t)ψ̂ − || G(t)||a + α(h(x)) ≤ ∂x ∂x ∂h G(t)ψ + α(h(x)) ∂x Since the left-hand side of above equation is positive, it ensures positiveness of its right-hand side and therefore the original CBF. ∂h (Ax + Bu) + α(h(x)) ≥ 0 ∂x Considering Theorem 3.1, the safety of the system is ensured. This completes the proof. As seen above, in addition to the estimation of the dynamics, the bound of modeling 53 error is employed in formation of ARCBF. Improper model learning and inaccurate worst- case value of error result in conservatism of the controller. This issue is obviated using the results of Lemma 3.1. Considering (3.28) and comparison lemma, one has Vx ≤ Vx (t0 )e−2(βψ12 )(t−t0 ) Therefore, ||ψ̃|| ≤ ||ψ̃(t0 )||e−(βψ12 )(t−t0 ) (3.33) This gives an accurate bound of approximation error. By employing (3.33) in (3.32), the ARCBF criterion becomes ∂h ∂h [ G(t)ψ̂ − || G(t)||||ψ̃(t0 )||e−(βψ12 )(t−t0 ) ∂x ∂x + α(h(x))] ≥ 0, ∀x ∈ D (3.34) From Theorem 3.3, if the control policy satisfies (3.34), then the safety of the system is ensured. Remark 3.5. Note that the ARCBF provides an invariance safety criterion for the worst- case uncertainty by incorporation of error bound, while experience-replay model learning quantifies the exponential convergence rate of the error to zero. In other words, an accurate bound of uncertainty is obtained that rapidly vanishes and results in convergence of ARCBF to the original CBF. Remark 3.6. While (3.27) is satisfied by applying mn + n2 safe initial actions, ARCBF is formed and the rest of policies which are needed for data acquisition are chosen such that (3.34) is satisfied. Therefore, off policy RL and model learning are safe without human intervention. 54 Remark 3.7. Since safety is certified after that rank condition (3.27) is satisfied, one might consider ||ψ̃(0)|| + ϵ instead of ||ψ̃||(0) in (3.34), where ϵ > 0. This gives a room of safe action for the initiation of model learning. 3.4.4 Safe and Stable Controller Safety condition is encoded in (3.34), while quadratic CLF satisfying (3.31) ensures expo- nential stability of the system. Therefore, to have safety and stability, any policy is first minimally modified to satisfy these conditions through a quadratic programming optimiza- tion [27]. min ||u − u∗ || + ||ρ|| u,ρ s.t. (3.34), V̇ < −λmin (Q)||x||2 + ρ. (3.35) where V is the control Lyapunov function that encodes the performance objective which is relaxed by factor ρ, while safety is applied as the hard constraint. Remark 3.8. 
Quadratic programming formulation (3.35) is based on u; while considering G(t) = [x, u]T ⊗ In , ||u|| appears in (3.34). Therefore, ||u|| = sgn(u)u is used, and thus ARCBF criterion and and Lyapunov derivative condition incorporated in (3.35) are linear with respect to control policy u and therefore it is a well-defined optimization problem to be solved. 3.5 Barrier-certified Off-Policy Algorithm In the off-policy RL, two policies are defined. The behavior policy which is applied to the system to gather training data and target policy which is updated toward the optimal policy 55 using training data. This method is superior in safety critical applications for a couple of reasons. First, the target policy is updated without even being applied to the system. Second, it is efficient and repetitively uses the same data and therefore it demands application of less noisy inputs to the system. This method however faces safety risk at two stages. First, at the beginning, where no model about the system is available and noisy behavior policy is applied to the system and second when the learned target policy is applied to the system which is not necessarily safe. Therefore, it is desired to make sure the safety and stability of the system is preserved in the whole operation. It is shown so far that by employing the experience replay dynamics approximation, any behavior policy that stabilizes the approximated dynamics and satisfies (3.34) is safe and stabilizing for the original system. The outcome of this robustified design is integrated into off-policy RL in order to have a safe and stable data acquisition, exploration and exploitation. Note that although the system is getting approximated for the sake of reduced conservatism of controller, the controller does not need to wait until the identification is complete; rather, the identification and off-policy controller are working at the same time. For a given stabilizing Kk , (3.12) can be written in the matrix form [57]. To do so, 1 1 P̂ ∈ R 2 n×(n+1) and x̄ ∈ R 2 n×(n+1) are defined based on P ∈ Rn×n and x ∈ Rn , as P̂ = [p11 , 2p12 , ..., 2p1n , p22 , 2p23 , ...2p(n−1),n , pnn ]T x̄ = [x1 2 , x1 x2 , ..., x1 xn , x2 2 , x2 x3 , ..., xn−1 xn , xn 2 ]T The matrix form of (3.12) is   p̂k Θk   = Ξk (3.36)   vec(Kk+1 ) 56 1 where Θk ∈ Rl×( 2 n(n+1)+mn) , Ξk ∈ Rl are defined as Θk = [δxx , −2Ixx (In ⊗ Kk T R) − 2Ixu (In ⊗ R)] Ξk = −Ixx vec(Qk ) In which for a positive integer l and time sequence 0 ≤ t0 < t1 < ... < tl δxx = [x̄(t1 ) − x̄(t0 ), x̄(t2 ) − x̄(t1 ), ..., x̄(tl ) − x̄(tl−1 )]T , Z t1 Z t2 Z tl Ixx = [ x ⊗ xdτ, x ⊗ xdτ, ..., x ⊗ xdτ ]T t0 t1 tl−1 Z t1 Z t2 Z tl Ixu =[ x ⊗ udτ, x ⊗ udτ, ..., x ⊗ udτ ]T t0 t1 tl−1 1 2 where δxx ∈ Rl× 2 n(n+1) , Ixx ∈ Rl×n and Ixu ∈ Rl×mn . Note that if Θk is full rank, then (3.36) can be uniquely solved. This criterion is employed for rich data collection. Safe off-policy algorithm is achieved in the following three phases. Safe and Stable Data Collection: In this phase a few safe policies are applied to satisfy experience replay rank condition (3.27). The guaranteed prescribed behavior of the learning error is employed in deriving the condition on safe and stabilizing controller. The noisy input is modified accordingly and then is applied to the system for more safe data collection. The AR-CBF and the system approximation enhances at each iteration by collection of more data exponentially fast. Collection continues until it suffices for off-policy optimal controller calculation. 
Optimal Policy Approximation: In this phase, the safely collected data is used repeatedly at each target policy iteration toward the optimal controller.

Safe Target Policy Calculation: In this phase, the optimal controller is minimally altered to satisfy the AR-CBF condition and is then safely applied to the system. Experience replay approximation continues at each time instant by replaying the stored noisy inputs along with the current response of the system until it converges to the true dynamics, which is equivalent to convergence of the AR-CBF to the CBF.

The detailed steps of the algorithm are given in the following.

Algorithm 2 Safe and Stable Off-Policy RL
1: Initialization: Initiate a NN for the dynamics (3.13) with stabilizing K0, set counter k = 0.
2: procedure Safe and Stable Data Collection
3: Apply p initial policies until (3.27) is satisfied.
4: Update the dynamics NN weights based on (3.26).
5: Form the quadratic Lyapunov function such that (3.31) is satisfied.
6: Form the AR-CBF based on (3.34).
7: Form the noisy input u = K̂x + e and modify it based on (3.35).
8: Go to the procedure "Optimal Policy Approximation" if Θk is full rank in (3.36). Otherwise, repeat the procedure until enough data for optimal policy approximation is collected.
9: end procedure
10: procedure Optimal Policy Approximation
11: Solve for Pk and Kk+1.
12: If ||Pk − Pk+1|| < ϵ, go to "Safe Target Policy Calculation"; otherwise set k = k + 1 and repeat the procedure.
13: end procedure
14: procedure Safe Target Policy Calculation
15: Minimally modify the optimal controller u* using (3.35) and apply it to the system.
16: If the dynamics approximation has not converged, use the response of the system along with the previously stored data to update the dynamics NN.
17: Update the AR-CBF based on the new approximation. Repeat the procedure until the control objectives are met.
18: end procedure

3.6 Simulation

3.6.1 Simulation Setup

Consider the following dynamical system

ẋ = [0 1; 1 2] x + [1; 1.5] u (3.37)

where x = [x1, x2]. The safe set to which the states of the system should belong is defined as

C = {x | −a1 ≤ x1 ≤ a1, −a2 ≤ x2 ≤ a2} (3.38)

It is desired to ensure that (3.38) is forward invariant. Thus, to certify the safety of the system, the following CBFs are defined:

h1 = a1 − x1,  h2 = x1 + a1,  h3 = a2 − x2,  h4 = x2 + a2

To form the AR-CBF condition (3.34), a single-layer NN is used to learn the system dynamics,

x̂̇ = [ŵ1 ŵ2; ŵ4 ŵ5] x + [ŵ3; ŵ6] u

which is updated using the experience replay update law (3.26) after applying a few safe initial policies until (3.27) is satisfied. The bound of the learning error is obtained using (3.33) and denoted by w̄i for 1 ≤ i ≤ 6. The ARCBF criteria are formed accordingly based on (3.34) as

−x̂̇1 + α1 h1 − ||w̄1 x1|| − ||w̄2 x2|| − ||w̄3|| sgn(u)u ≥ 0
 x̂̇1 + α2 h2 − ||w̄1 x1|| − ||w̄2 x2|| − ||w̄3|| sgn(u)u ≥ 0
−x̂̇2 + α3 h3 − ||w̄4 x1|| − ||w̄5 x2|| − ||w̄6|| sgn(u)u ≥ 0
 x̂̇2 + α4 h4 − ||w̄4 x1|| − ||w̄5 x2|| − ||w̄6|| sgn(u)u ≥ 0 (3.39)

The noisy input in the form of (3.4) is modified through (3.35), with (3.39) as the hard inequality constraint and the linear quadratic regulator (LQR) solution with Q = I, R = 1, which satisfies (3.31), as its soft constraint. The output is then applied to the system for further data collection. The collected data is used iteratively for optimal policy approximation, and the resulting policy is then minimally modified using (3.35) for safe and optimal operation, as in Algorithm 2. The numerical details of the simulation setup are given in Table 3.1.
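To make the modification step concrete for this scalar-input example, the sketch below implements a simplified stand-in for (3.35): given the current state, the current dynamics estimate, and the error bounds w̄i, it returns the input closest to a nominal (noisy or optimal) input that satisfies the four ARCBF inequalities (3.39). The CLF soft constraint is omitted for brevity, a grid search replaces the quadratic program, and the numerical values (state, nominal input, dynamics estimate) are placeholders consistent with Table 3.1; this is an illustrative sketch, not the implementation used to generate the reported results.

```python
import numpy as np

def arcbf_satisfied(u, x, A_hat, B_hat, w_bar, alpha, a1, a2):
    """Check the four ARCBF inequalities (3.39) for a scalar input u."""
    x1, x2 = x
    xdot_hat = A_hat @ x + B_hat * u                  # current estimate of x_dot
    h = [a1 - x1, x1 + a1, a2 - x2, x2 + a2]           # box CBFs h1..h4
    margins = [
        -xdot_hat[0] + alpha[0]*h[0] - abs(w_bar[0]*x1) - abs(w_bar[1]*x2) - w_bar[2]*abs(u),
         xdot_hat[0] + alpha[1]*h[1] - abs(w_bar[0]*x1) - abs(w_bar[1]*x2) - w_bar[2]*abs(u),
        -xdot_hat[1] + alpha[2]*h[2] - abs(w_bar[3]*x1) - abs(w_bar[4]*x2) - w_bar[5]*abs(u),
         xdot_hat[1] + alpha[3]*h[3] - abs(w_bar[3]*x1) - abs(w_bar[4]*x2) - w_bar[5]*abs(u),
    ]
    return min(margins) >= 0.0

def minimally_modified_input(u_nom, x, A_hat, B_hat, w_bar, alpha, a1, a2,
                             u_max=10.0, n_grid=2001):
    """Return the admissible input closest to u_nom (simplified stand-in for (3.35))."""
    candidates = np.linspace(-u_max, u_max, n_grid)
    feasible = [u for u in candidates
                if arcbf_satisfied(u, x, A_hat, B_hat, w_bar, alpha, a1, a2)]
    if not feasible:
        return u_nom                                   # no admissible input found on the grid
    return min(feasible, key=lambda u: abs(u - u_nom))

if __name__ == "__main__":
    A_hat = np.array([[0.0, 1.0], [1.0, 2.0]])         # placeholder dynamics estimate
    B_hat = np.array([1.0, 1.5])
    w_bar = [1.8, 1.8, 1.0, 1.8, 1.8, 0.5]             # error bounds at t = 0 (Table 3.1)
    alpha = [40.0, 40.0, 10.0, 10.0]                   # alpha_1..alpha_4 (Table 3.1)
    x = np.array([0.8, -1.0])
    u_noisy = 3.0                                      # nominal exploratory input
    u_safe = minimally_modified_input(u_noisy, x, A_hat, B_hat, w_bar, alpha, a1=1.0, a2=1.4)
    print(f"nominal u = {u_noisy},  modified safe u = {u_safe:.2f}")
```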
3.6.2 Simulation Results and Discussion

The states of the system under the proposed RL controller are depicted in Figure 3.2, where the safety boundary is shown with dashed red lines. For safe performance, the trajectory of the system must stay between the two lines. As can be seen in this figure, the safety and stability of the system are preserved even at the beginning of the simulation, when the noisy input is applied to the system. To demonstrate the advantage of the proposed method, plain off-policy RL under the same setup is applied to the system. With the same value of added noise, the system becomes unstable in the data collection phase. To avoid instability, the value of the noise is manually reduced; the result is shown in Figure 3.3. As can be seen in this figure, although stability is obtained by manual modification of the noise, the safety of the system is violated.

Remark 3.9. Comparison of Figures 3.2 and 3.3 reveals two significant advantages of the proposed method. First, it ensures an automatic stability guarantee during exploration, which obviates the need for manual adjustment of the noise to avoid instability. Second, it ensures a safety guarantee during the challenging data collection phase, which is not tractable to achieve manually.

The weight errors and their exponential bound are shown in Figure 3.4. As can be seen in this figure, by replaying past experiences, the behavior of the learning error is properly prescribed and converges exponentially fast to zero.

Remark 3.10. Comparing the time scales of Figures 3.4 and 3.2 reveals that at the early stages of data collection, although the learning error is high, the safety of the system is still satisfied. As mentioned earlier, the system dynamics is approximated along with the operation of the off-policy controller, and the off-policy controller does not need to wait until the system approximation is finished. In other words, safety during learning is ensured.

The result of the iterations toward the optimal policy is shown in Figure 3.5. As can be seen in this figure, Kk and Pk successfully converge to their optimal values by repeated use of the safely collected data.

Table 3.1: Simulation Parameters
  α1, α2 = 40                          Q = I(2×2)
  α3, α4 = 10                          R = 1
  βψ1, βψ2 = 10                        H = [1, 0; 0, 10]
  w̄i = 1.8e^(−0.5t), i = 1, 2, 4, 5    F = [1, −1]
  w̄3 = e^(−0.5t)                       w̄6 = 0.5e^(−0.5t)
  a1 = 1                               a2 = 1.4

Figure 3.2: States of the system under the proposed framework
Figure 3.3: States of the system with plain off-policy RL and manually reduced noise
Figure 3.4: NN weight error
Figure 3.5: Convergence of Pk and Kk

3.7 Conclusion and Future Work

A barrier-certified safe RL framework with safety and stability guarantees in the exploration and exploitation phases is proposed. It is obtained by means of efficient learning with prescribed performance along with a robustified safe and stabilizing controller throughout the algorithm, including the data collection phase. Experience replay-based model approximation is employed, which ensures the exponential convergence of the learning error to zero after a mild rank condition is satisfied. This makes the learning error a vanishing perturbation to the approximated model, which facilitates designing a stabilizing controller using the available rough knowledge of the system. The accurate bound of the error is then employed in the formation of a novel non-conservative AR-CBF, which ensures safety during learning.
AR-CBF and stabi- lizing controller are integrated through quadratic programming and is used for further data collection needed for off-policy iteration. The noisy input is modified accordingly to result in safe and stable action. After collecting safe rich data, the optimal policy is approximated and then again is certified using AR-CBF for safe exploitation. The efficacy of the proposed method is demonstrated in simulation. Extension to nonlinear dynamics, considering the effect of network reconstruction error are future directions of this line of research. 64 Chapter4 Barrier-certified Learning-enabled Safe Control Design for Systems Operating in Uncertain Environments Contents of this chapter first appeared as [51] and have been reformatted to fit the requirements of this dissertation. 4.1 Introduction This chapter presents a method for designing a learning-enabled safe controller for systems that must operate in environments that are shared with other agents with uncertain behav- iors: The behaviors of surrounding agents affect the safe set and thus safe control design of the ego system, which are unknown and uncontrollable from the ego system’s perspective. This is in sharp contrast with existing safe control methods requiring complete knowledge of the safe set. The uncertainty of the safe set caused by the uncertain behaviors of sur- rounding agents makes safe control design much more challenging. Fast and sample-efficient learning of uncertainties is of vital importance to avoid an overly conservative control design (which can also result in infeasibility) or unsafe behavior. A slow model-learning approach also avoids proactive safe control design, which can jeopardize the performance. Moreover, and even more importantly, a naive model learning approach based on minimizing the mod- eling error cannot account for safety even if the expected estimation error decreases over 65 time; This is because different models with the same modeling errors might have different characteristics in preserving the invariant behaviors of the actual system: Novel learning algorithms are required to avoid misrepresentation of the safe set as much as possible. The interaction between agents is formulated using two sets of decoupled differential equations corresponding to the ego system and the risk-imposing external agent. A safety criterion is defined as a function of both subsystems’ states. This is in sharp contrast with the existing works, which only consider partial uncertainty in the system dynamics and define the safety criterion solely based on the ego-system’s states. The proposed framework is far more inclusive for safety-critical control scenarios where the agent operates in a cluttered uncertain environment shared with other agents. Since the trajectory information is required to form ZCBFs, the unknown external agent dynamics need to be learned. To make less conservative decisions and avoid misrepresentation of the safe set, a safety-aware model-learning approach that leverages safety-aware loss functions and the experience replay method is presented to learn uncertain and unknown behavior of the external agent. More specifically, the loss function is defined based on the barrier function error, instead of the system model error, and is minimized for both current samples and past samples stored in the memory to assure a fast and generalizable learning algorithm for approximating the safe set. 
Moreover, it provides an easy-to-verify metric on collected data to assure learning of the actual safe set, allowing to make more informative control decisions. Then, a learning-enabled ZCBF (L- ZCBF) is presented that integrates the proposed safety-aware model learning and a novel ZCBF to assure the safety of the ego system in the presence of uncertainty in the behavior of its surrounding agents. Since ensuring forward invariance of the approximated safe set does not necessarily ensure forward invariance of the actual safe set and strict safety might be violated, L-ZCBF employs the approximated trajectory of external agent and also the trajectory of ego-system, but shrinks the boundary of the safe set in case of an imminent risk that can be predicted using observations of the states of the external agents. These observations can be acquired using embedded sensors such as Light Detection and Ranging 66 (LIDAR). Guaranteeing forward invariance of the intersection of the approximated safe set and the actual safe set assures safety during learning and automatically shrinks the boundary of the safe set to the extent that safety of the overall system is guaranteed despite uncertainty. As learning enhances, this set expands to the actual safe set. In a nutshell, the contributions of the chapter are as follows. 1. The problem of safe control design for systems operating in uncertain shared environ- ments is formulated as two sets of decoupled dynamics with a safety criterion defined as a function of both ego and external agent’s states to have a more inclusive scheme for safety-critical systems operating in the cluttered environment. 2. A novel learning-enabled ZCBF is proposed, which is capable of safety guarantee during learning of unknown dynamics. 3. The safety-aware model learning is proposed for rapid convergence of the approximated safe set to the exact one. 4.1.1 Organization of the Chapter Section 2 provides the problem statement as well as preliminaries and background informa- tion. The main idea of learning-enabled ZCBFs for uncertain sets is presented in Section 3. Section 4 represents the overall control framework and the proposed algorithm. Case study, simulation results, and conclusion are given in Section 5, 6 and Section 7, respectively. 4.2 Problem Statement and Background In this section, the safe control problem in the presence of external agents is stated, and some background information on ZCBFs and safe control design is provided. 67 4.2.1 Problem Statement Consider the control system in the nonlinear affine form as ẋ = f (x) + g(x)u (4.1) where x ∈ X and u ∈ U are the states of the controlled system and the control input, respectively. f (x) ∈ Rn is its drift dynamics and g(x) ∈ Rn×m is its input dynamics. f (x) is C 1 and f (0) = 0. It is assumed that the ego system is stabilizable and U is non-empty. The goal is to ensure the safety of the control system (4.1) in a shared environment with external agents with uncertain and unknown behaviors. The dynamics of the external agents that affect the safety of (4.1) is given as ż = f2 (z) (4.2) which is unknown and out of control of the ego system and z = [z1 , ..., zp2 ] is the state vector of external agents which can be measured in real-time by the ego system (e.g., measuring the position of a leading vehicle using embedded sensors that measure the distance and relative steering) and f2 ∈ Rp2 is assumed to be locally Lipschitz. 
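Before the problem is stated formally, the following minimal sketch illustrates the safety-aware loss idea described above: rather than penalizing the raw prediction error of the external-agent model, the loss penalizes the error that the model induces in the safety criterion, and it is accumulated over a memory of past samples as well as the current one. The barrier-type criterion l, the feature map, the one-step prediction, and all data below are hypothetical placeholders; the actual safety criteria, model (4.8), and loss are defined in Sections 4.2 and 4.3.

```python
import numpy as np

def l_barrier(x, z, d_min=1.0):
    """Hypothetical safety criterion: squared distance margin to the external agent."""
    return np.sum((x - z) ** 2) - d_min ** 2

def safety_aware_loss(W_hat, memory, phi, dt=0.05):
    """Sum of squared safe-set errors over current and stored samples.

    Each memory entry is (x, z, z_next): ego state, measured external state, and
    its next measurement. The model predicts z_next from z via one Euler step of
    z_dot ~ W_hat @ phi(z); the loss compares the barrier values obtained with the
    measured and the predicted next state, not the states themselves."""
    loss = 0.0
    for x_k, z_k, z_next in memory:
        z_pred = z_k + dt * (W_hat @ phi(z_k))
        loss += (l_barrier(x_k, z_next) - l_barrier(x_k, z_pred)) ** 2
    return loss

if __name__ == "__main__":
    phi = lambda z: np.array([z[0], z[1], 1.0])        # hypothetical features
    W_true = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.2]])
    rng = np.random.default_rng(0)
    memory, z = [], np.array([2.0, 0.0])
    for _ in range(20):                                # build a small memory stack
        x = rng.normal(size=2)
        z_next = z + 0.05 * (W_true @ phi(z))
        memory.append((x, z.copy(), z_next))
        z = z_next
    print("loss with zero model:", safety_aware_loss(np.zeros((2, 3)), memory, phi))
    print("loss with true model:", safety_aware_loss(W_true, memory, phi))
```

Minimizing such a loss with respect to the model weights, over both the newest sample and the memory stack, is what drives the approximated safe set toward the actual one in the approach developed below.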
Note that (4.2) does not need to capture the complete dynamical behavior of external agents in the surrounding environment, as it might require high-dimensional dynamics, which makes their learning computationally intractable; rather, it concerns simplified dynamics that best captures the effect of external agents on the safety of the ego system. For example, in urban driving, distance to other agents and obstacles and how they are approaching the ego vehicle matter most when safety is the main concern. The safety of the ego system is then formulated as a function of both x and z, which is uncertain due to unknown dynamics of z. It is desired to satisfy uncertain safety criteria which are impacted by states of the system (4.1) and the external dynamics (4.2) and achieve stability and performance specifications 68 as long as it is safe. Control Objectives: The following objectives must be achieved for the system 4.1: 1. Assuring safety by guaranteeing that the following safety conditions are satisfied all the time li (x, z) ≥ 0, ∀1 ≤ i ≤ q where li (x, z) > 0 is the ith element of the safety criteria which is a smooth function describing a constraint on the system, and q is the total number of constraints. 2. Guaranteeing stability of the controlled system, i.e., x → 0 as t → ∞ in the case of no conflict with safety. The safe set is formed as the intersection of all the sets, each satisfying a safety constraint. That is, the safe set is defined as C = C1 ∩ C2 ... ∩ Cq (4.3) where Ci = {x|li (x, z) ≥ 0}, ∀1 ≤ i ≤ q (4.4) Safety imposes hard constraints on the control design, while performance is a soft constraint satisfied in the case of no conflict with safety. Remark 4.1. Note that two sets of dynamics are considered in this framework in which (4.1) represents the first one and is known, and (4.2) represents the second one and is assumed to be uncertain and unknown. The safety set is represented as a function of both dynamics’ states (4.4). Therefore, even when the dynamics of the ego system is partially available, this method is applicable since this unknown part is included in (4.2) which is learned. 69 Therefore, this covers not only disturbances that can be learned by collecting data but also a more general class of uncertainties in the environment and the ego-system dynamics. 4.2.2 Control Barrier Functions Guaranteeing positive invariance of a set of states has broad applications in control system design, such as control of constrained systems and region of attraction maximization. For a dynamical system, positive invariance of a set means that inclusion of states in a specific set at any time ensures the inclusion of states in that set in the future time. Extension of this notion to control systems is called controlled positive invariance of a set, which guarantees forward invariance of the set by designing a proper control input. One of the widely referred theorems in the characterization of positive invariant sets is the Nagumo’s theorem [62, 63, 64]. This theorem is presented using the concept of the tangent cone of a set [65, 64]. Theorem 4.1. Nagumo’s Theorem. Given a dynamical system ẋ = f (x) which has a globally unique solution for any initial condition x0 ∈ X , let S ⊂ X be a closed set. Then, S is positively invariant if and only if f (x) ∈ TS (x), ∀x ∈ ∂S (4.5) where ∂S is the boundary of the set S and TS (x) is the tangent cone to S . Proof. See Theorem 3.1 in [63] and Theorem 4.7 in [64]. Remark 4.2. 
The Nagumo’s theorem implies that to have a positive (forward) invariant set, ẋ should point inside the set at the boundary, or in the worst case, it should be tangent to the boundary. CBFs are used to ensure forward invariance of a specific set in a control system. ZCBF is a positive function within a set and zero at its boundary and thus, having a zeroing derivative 70 in the vicinity of the boundary prevents the states of the system from exceeding the limits. Theretofore, forward invariance of the set is ensured while handling unbounded functions are avoided [28]. Based on the definition of class K function in [55], extended class K function is defined as follows. Definition 4.1. A continuous function α : (−b, a) → (−∞, ∞) with a, b > 0 is an extended class K function, if it is strictly increasing and α(0) = 0 (Definition 1 in [34]). ■ Definition 4.2. ZCBF Properties. For the control system (4.1) and a given set M ⊆ D ⊂ Rn defined as M = {x|l(x) ≥ 0} (4.6) the C 1 function l : Rn → R is a ZCBF on the set D, if there exists an extended class K function α such that sup [Lf l(x) + Lg l(x)u + α(l(x))] ≥ 0, ∀x ∈ D (4.7) u∈U where Lf and Lg are Lie derivatives of l(x) along f and g, respectively, and dl ∂l = ẋ = Lf l(x) + Lg l(x)u dt ∂x Then, the set of inputs that satisfy (4.7) is Vzcbf = {u ∈ U |[Lf l(x) + Lg l(x)u + α(l(x))] ≥ 0}, ∀x ∈ D ■ Lemma 4.1. For the given set M ⊆ D ⊂ Rn with function l; if l is a ZCBF on D, then, any Lipschitz continuous controller u ∈ Vzcbf for the system (4.1) renders the set M forward 71 invariant. Proof. See Proposition 1 in [28]. In safety-critical control systems, the safe set is presented by M with a safety criterion expressed by l(x) ≥ 0 as (4.6). By starting from a safe initial condition x0 ∈ M and selection of a control input that satisfies (4.7), the system never leaves M and thus guaranteeing safety. Despite the incredible power of ZCBFs in ensuring the safety of control systems, this method faces a couple of challenges. First of all, to ensure the satisfaction of (4.7), complete information about the dynamics of the system is needed. Second, the safety criteria and the safe set are assumed to be certain and known. However, in many real scenarios, the safe set is uncertain and affected by unknown external dynamics as described in (4.4). In the following section, the application of ZCBFs to guarantee safety under uncertain safety criteria in the presence of unknown external dynamics is investigated. 4.3 Learning-enabled ZCBF with Uncertain Sets The system is considered to be operating in an environment that is shared with other agents. These external agents impose safety consideration on the system, while their dynamics are unknown and uncontrollable. This results in uncertainty in the environment and designing a safe controller. Therefore, in this section, the L-ZCBF platform is presented to ensure safety despite uncertainty in the behavior of external agents. The influential unknown dynamics of the external agents are learned, and consequently, an L-ZCBF is formed that assures the forward invariance of a set that is contained in the safe set, and its size becomes closer to the size of the actual safe set as the learning progresses. 72 4.3.1 Learning Safe Set Despite Uncertain Behaviors of External Agents In order to design a safe controller for (4.1) in the presence of uncertain external agents in the environment, first, influential dynamics of external agents need to be approximated. 
Considering the Lipschitz continuity assumption on f2 and the fact that any smooth function within a compact set can be approximated by an NN [66], (4.2) is approximated as żˆ = Ŵ Φ(ẑ) (4.8) where Ŵ is the estimated NN weights and Φ is its activation function. Then, considering (4.4), the approximated safe set is defined as formed by Cˆ = Cˆ1 ∩ Cˆ2 ... ∩ Cˆq where Cˆi = {x|li (x, ẑ) ≥ 0}, ∀1 ≤ i ≤ q (4.9) where ẑ is the state of the approximated external dynamics represented in (4.8). Fig. 4.1 shows an example with both the actual safe set and its approximated one for a specific time. As can be seen from Fig. 4.1(a), designing a controller based on the ZCBF (4.7) to ensure the forward invariance of this approximated set Cˆ does not guarantee the forward invariance of the actual safe set and safety boundary might be violated. On the other hand, while the actual safe set can be formed based on the real-time measurement of the state of the external agent z, its forward invariance requires knowing the entire trajectory of the external agent, which is not available, and it is impossible to design (4.7) to make the actual safe set forward invariant. However, as shown in Fig. 4.1(b), if the control input is designed to assure the 73 Figure 4.1: (a): Cˆ invariant, ∂C violated (b): Cc invariant (c): Cˆ converges to C forward invariance of the intersection of the actual safe set and its approximation, which is contained in the actual safe set, the safety of the system is guaranteed. This shrinks the boundary of the approximated safe set to assure that it is contained in the actual safe set. Note that this set can be made forward invariant using (4.7) since the approximate knowledge of the state trajectory of the external agent is available through (4.8). As learning progresses and the external dynamics becomes more accurate, as shown in Fig. 4.1(c), the approximated safe set becomes more accurate, and the system’s maneuverability improves. The faster the external dynamics converges, the faster the intersection of the approximated safe set and actual safe set expands which provides more room of safe maneuver of the ego system. In order to shrink the boundary of the approximated safe set and assure that it is con- tained in the actual safe set, the instantaneous sensory observations of the ego system from z are used to form the actual safe set, and the intersection of the safe set and its approximation is derived accordingly. Cc is defined as the intersection of C and Cˆ Cc = Cˆ ∩ C (4.10) Before presenting the proposed approach, the following assumptions are made. Assumption 4.1. Strict interiority of the initial condition. The initial condition of the system (4.1) belongs to the interior of the safe set C . That is, x0 ∈ intC 74 Assumption 4.2. The initial value of the approximated external dynamics ẑ satisfies li (x0 , ẑ0 ) > 0, ∀1 ≤ i ≤ q Remark 4.3. Considering (4.3) and (4.9), Assumptions 4.1 and 4.2 imply that Cc = Cˆ ∩ C is non-empty. Remark 4.4. Note that Assumptions 4.1 and 4.2 which state strict interiority of the initial condition and also its approximation, respectively, are mild and reasonable because if the initial condition is not safe, no controller can be designed to ensure safety in the future time. Lemma 4.2. Consider Assumptions 4.1, 4.2, and the set Cl defined as Cl = Cl1 ∩ Cl2 ... ∩ Clq where Cli = {x| min(li (x, z), li (x, ẑ)) ≥ 0}, ∀1 ≤ i ≤ q (4.11) Then, Cl = Cc where Cc is defined in (4.10) as the intersection of sets C and Cˆ. Proof. 
Given any x ∈ X and 1 ≤ i ≤ q, if x ∈ Cl , from (4.11), one has li (x, z) ≥ min(li (x, z), li (x, ẑ)) ≥ 0 ⇒ x ∈ C li (x, ẑ) ≥ min(li (x, z), li (x, ẑ)) ≥ 0 ⇒ x ∈ Cˆ 75 Therefore, ∀x ∈ Cl ⇒ x ∈ CC which means Cl ⊂ Cc (4.12) On the other hands, if x ∈ Cc li (x, z) ≥ 0, li (x, ẑ) ≥ 0 and therefore, min(li (x, z), li (x, ẑ)) ≥ 0 which implies Cc ⊂ Cl (4.13) From (4.12) and (4.13), one has Cc = Cl The boundary of Cc is defined as ∂Cc = {x| min(li (x, z), li (x, ẑ)) = 0}, ∀1 ≤ i ≤ q 76 Definition 4.3. Given the control system (4.1), the smooth function l = [l1 , ...., lq ] ∈ C 1 is L-ZCBF for the set Cc if for each 1 ≤ i ≤ q ∂li ∂li ˆ sup { ẋ + ż + α(min(li (x, z), li (x, ẑ)))} ≥ 0 (4.14) u∈U ∂x ∂ ẑ Moreover, the set of inputs that satisfy L-ZCBF condition is Uzcbf = {u ∈ U | ∂li ∂li ˆ ẋ + ż + α(min(l(x, z), l(x, ẑ))) ≥ 0, ∀1 ≤ i ≤ q} (4.15) ∂x ∂ ẑ where α is an extended class K funciton. ■ This definition is used to guarantee the safety of the system and forward invariance of Cc using tangent cone of practical sets and the Nagumo’s theorem. Definition 4.4. Practical Set (Definition 4.9 in [64]) Let O be an open set. Consider the set S1 ⊂ O defined by a set of inequalities in the form of S1 = {x|li (x) ≥ 0, i = 1, 2, ..., q} where li is continuously differentiable function in O. The set S1 is said to be a practical set if 1) For all x ∈ S1 , there exists y such that li (x) + ∇T li (x)y > 0, ∀i = 1, 2, ..., q (4.16) 2) There exists a Lipschitz continuous vector field ψ(x) such that for all x ∈ ∂S1 (x), ∇li (x)ψ(x) > 0 (4.17) 77 ■ For all x ∈ ∂S1 , the tangent cone of the practical set is TS1 (x) = {y|∇T li (x)y ≥ 0, ∀i ∈ S1Act (x)} (4.18) where S1Act (x) is the set of active constraints, which is defined as S1Act (x) = {x|li (x) = 0} For more details, see [64]. Theorem 4.2. Given the control system (4.1) and the set Cc (4.10), any Lipschitz controller u ∈ Uzcbf defined in (4.15) ensures safety criteria li (x, z) ≥ 0, ∀1 ≤ i ≤ q. Proof. For each 1 ≤ i ≤ q, if li (x, z) ≥ li (x, ẑ), the L-ZCBF condition (4.14) becomes ∂li ∂li ˆ ẋ + ż + α(li (x, ẑ)) ≥ 0 ∂x ∂ ẑ From direct result of Lemma 4.1, li (x, ẑ) ≥ 0 and since li (x, z) ≥ li (x, ẑ), then li (x, z) ≥ 0. In addition, existence of u ∈ Uzcbf implies that for all (x, ẑ) ∈ Cc ˆ + α(l(x, ẑ))) ≥ ∇T li (x, ẑ)[ẋ, ż] ˆ + α(min(l(x, ẑ), l(x, z)))) ≥ 0 ∇T li (x, ẑ)[ẋ, ż] Therefore, (4.16) is satisfied. If li (x, z) < li (x, ẑ), then L-ZCBF (4.14) turns to ∂li ∂li ˆ ẋ + ż + α(li (x, z)) ≥ 0 ∂x ∂ ẑ 78 Thus, at the boundary of Cc in which li (x, z) → 0, one has ∂li ∂li ˆ ẋ + ż ≥ 0 ∂x ∂ ẑ In other words, ∇T li (x)[ẋ, ż]ˆ >0 Therefore, (4.17) is satisfied for all (x, ẑ) ∈ ∂Cc . Considering the definition of practical set ˆ is within the tangent cone of Cc (4.11) as and from (4.18), [ẋ, ż] ˆ ∈ TCc (x, z) [ẋ, ż] This implies that if li (x, z) → 0, then (ẋ, ẑ)˙ point inside the set at the boundary of Cc or in the worst case is tangent to the boundary. According to the Nagumo’s Theorem, Cc is forward invariant and since Cc ⊂ C , therefore, the approximated trajectories do not exceed the boundary li (x, z) = 0 implying li (x, z) ≥ 0 for all t > 0. Since this proof is valid for all 1 ≤ i ≤ q, safety of the system is ensured. This completes the proof. Corollary 4.1. Given the control system (4.1), L-ZCBF introduced in (4.14) renders the intersection of the safe set and its approximation, Cc , forward invariant. Proof. 
According to Theorem 4.2, boundary of the positive invariant set is shrunk to a more conservative value that provides a bigger margin to the safety boundary. According to Lemma 4.2, this forms the intersection of the safe set and its approximation. Therefore, the introduced L-ZCBF renders Cc invariant indeed. Remark 4.5. The proposed L-ZCBF assures that at least a conservative safe set remains forward invariant, which guarantees safety. The conservativeness will be reduced next by presenting a fast and data-efficient learning approach for modeling the external agent. 79 Remark 4.6. It is shown that the external agent dynamics and, consequently, the unknown safe set are approximated using NNs. However, these approximations alone cannot be relied upon to ensure the forward invariance of the safe set. The reason is that the approximation might not be perfect and lead to exceeding the safety limits, which is not acceptable for safety-critical systems. Therefore, to design a more realistic and practical controller, the system observations and the approximated external agents dynamics are also combined with ZCBF. Although the safety of the system can be guaranteed with an inaccurate model of the external dynamics, as learning enhances, the intersection set Cc expands to the exact safe set C and the system would be able to take less conservative control actions. In other words, employing a proper learning approach that suits the application boosts the control system’s performance. In the following subsection, the application of the experience replay method is demonstrated in this problem to identify the dynamical behaviors of external agents. This method provides a fast convergence of the network leading to the control system’s fast response, which is crucial in safety-critical applications. 4.3.2 External Dynamics Identifier The motivation behind learning about the dynamics of external agents is to provide the ego system with a larger set of feasible actions and reduce the conservatism of the controller. In other words, enhancing the approximation of the safe set has higher importance compared to learning about the external agent states, and inspired by [67], an experience replay-based method is proposed which updates the identifier weights to reduce the set approximation error rather than the external state estimation error. Experience replay method uses recorded and stored data in the update law and provides fast convergence and an easy-to-check and verifiable the persistence of excitation (PE) condition, which is necessary to guarantee the convergence of the identifier weights. In contrast, online checking of this condition is generally difficult and even infeasible [67, 68, 69, 70]. 80 Considering (4.2) and (4.8), the external dynamics model is formulated into a filtered regressor form ż = W Φ(z) + ϵf (4.19) where W and Φ are the weight matrix and the activation function, respectively. Also, ϵf is the model approximation error. To convert the dynamics into the regressor form, let Az be added to the both sides of (4.19), where A = aIj×j , a > 0 ż = −Az + W Φ(z) + Az + ϵf (4.20) Assumption 4.3. There exists a constant 0 < ϵf ∗ < ∞ such that ∥ϵf (t)∥ ≤ ϵf ∗ (4.21) Note that ϵf ∗ is unknown and depends on the quality of selected basis functions. If the basis functions are chosen such that the unknown function dynamics is near the span of the basis functions, this error will be small. 
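As a small numerical illustration of this point, the sketch below computes the best weights in the span of two candidate basis sets and reports the resulting empirical bound on ϵf; the scalar dynamics and both bases are illustrative choices, not those of the case study. Enriching the basis so that the dynamics lies near its span drives the bound toward zero.

```python
import numpy as np

# Illustration of Assumption 4.3: the reconstruction error eps_f in (4.19) is
# small when the unknown dynamics lies near the span of the chosen basis.

rng = np.random.default_rng(0)
z = rng.uniform(-2.0, 2.0, 200)                   # samples of the external state
z_dot = -0.4 * z + 0.05 * np.sin(3.0 * z)         # illustrative "true" external dynamics

def residual_bound(basis):
    Phi = np.column_stack([f(z) for f in basis])      # regressor matrix
    W, *_ = np.linalg.lstsq(Phi, z_dot, rcond=None)   # best weights within the span
    return np.max(np.abs(z_dot - Phi @ W))            # empirical bound on eps_f

print(residual_bound([lambda s: s]))                          # linear basis only
print(residual_bound([lambda s: s, lambda s: np.sin(3 * s)])) # richer basis -> near zero
```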
Note also that the boundedness of reconstruction error and its gradient are standard assumptions in neural network identification literature. Furthermore, using neural networks, the approximation guarantees are limited to a compact set. Since for safety-critical systems, the safe set is generally compact, and the system must not leave this set, therefore, approximation over a compact set is reasonable (Chapter 1 in [66]). 81 Lemma 4.3. Considering (4.20), Eq. (4.19) can be written as z = W h(z) + ad(z) + ϵ (4.22) ḣ(z) = −ah(z) + Φ(z), h(0) = 0 ˙ = −Ad(z) + z, d(0) = 0 d(z) where Z t h(z) = e−a(t−τ ) Φ(z(τ ))dτ Z 0t d(z) = e−A(t−τ ) z(τ )dτ 0 Z t −At ϵ(t) = e z(0) + e−A(t−τ ) ϵf dτ 0 Proof. See Lemma 1 in [67]. Consider identifying weight estimator as ẑ(t) = Ŵ (t)h(z) + ad(z) (4.23) where Ŵ (t) is the estimated value of the weight matrix W at time t. The state estimation error is defined as ez (t) = ẑ(t) − z(t) (4.24) By considering (4.22), (4.23) and (4.24), one has the state estimation error as ez (t) = Ŵ (t)h(z) + ad(z) − W h(z) − ad(z) − ϵ 82 which is simplified to ez (t) = W̃ (t)h(z(t)) − ϵ where W̃ (t) = Ŵ (t) − W is the weight estimation error. The approximation of external agents dynamics is needed to expand the approximated safe set to the exact one, which reduces conservatism and provides more room of safe maneuver for the ego system. To accelerate the convergence of the approximated safe set and its approximation, weights are updated in a way to decrease set approximation error rather than the state estimation error. Set approximation error is defined as e(t) = l(x, ẑ) − l(x, z) By using the Taylor expansion around (x, z) and some manipulations, one has e(t) = ez (t)K(x, z) (4.25) with K(x, z) = ∂l(x, z) ∂ 2 l(x, z) ∂ q1 +1 l(x, z) + )e z (t) + ... + )ez (t)q1 ∂z 2!∂z 2 (q1 + 1)!∂z q1 +1 where q1 is the maximum degree of z in l(x, z). The derivatives with the order of higher than q1 + 1 are zero and eliminated from the Taylor expansion. In experience replay method, recorded samples are used in the update law. Define the state estimation error using the k th sample as ez (tk ) = ẑ(t, tk ) − z(tk ) (4.26) 83 where ẑ(t, tk ) = Ŵ (t)h(z(tk )) + ad(z(tk )) (4.27) Using (4.22) and (4.27), the error defined in (4.26) becomes ez (tk ) = Ŵ (t)h(z(tk )) + ad(z(tk )) − W h(z(tk )) − ad(z(tk )) − ϵ(tk ) which further is simplified to ez (tk ) = W̃ (t)h(z(tk )) − ϵ(tk ) (4.28) and the set estimation error at the k th sample is defined accordingly as e(tk ) = ez (tk )K(x(t), z(t)) The update law is then given as ˙ Ŵ (t) = − Γe(t)(h(z(t))K(x(t), z(t)))T XP −Γ e(tk )(h(z(tk ))K(x(t), z(t)))T (4.29) k=1 where P is the overall number of stored data and Γ is a positive-definite matrix which determines the learning rate. Let the matrix of stored data be Z = [h(z(t1 ), ..., h(z(tP )] (4.30) 84 Then, the persistence of excitation condition is defined as If h ∈ Ro1 , then rank(Z) = o1 (4.31) Remark 4.7. Using the history of data in the experience replay approach makes learning the safe set fast and data-efficient. This is of vital importance for safety-critical systems operating in an uncertain environment since the learning phase and operation phase in these systems are not separated. Therefore, control approaches with fast convergence capability in the learning process make control of safety-critical systems more practical. Remark 4.8. Adaptive optimal control schemes require a PE condition to ensure the suffi- cient exploration of the state space. 
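For concreteness, the identifier (4.22)–(4.29) is sketched below in a scalar setting. The external dynamics, the basis, the filter pole, the learning rate, and the linear choice of l(x, z) (for which the factor K in (4.25) is constant) are illustrative assumptions rather than the settings used later in the case study.

```python
import numpy as np

# A scalar sketch of the experience-replay identifier (4.22)-(4.29).  True external
# dynamics: z_dot = W*phi(z) with phi(z) = z; the safety function is taken as
# l(x, z) = z - x, so the Taylor factor K(x, z) in (4.25) reduces to dl/dz = 1.

dt, T = 0.01, 8.0
a = 2.0                        # filter pole in (4.20), A = a
W_true = -0.4                  # unknown weight to be identified
Gamma = 20.0                   # learning rate in (4.29)
K = 1.0                        # dl/dz for l(x, z) = z - x
phi = lambda zz: zz

z, h, d = 1.0, 0.0, 0.0        # external state and filtered signals of Lemma 4.3
W_hat = 0.1                    # initial weight estimate
memory = []                    # history stack of (h(z_k), d(z_k), z(t_k))

for step in range(int(T / dt)):
    # filtered regressor signals: h_dot = -a*h + phi(z), d_dot = -a*d + z
    h += dt * (-a * h + phi(z))
    d += dt * (-a * d + z)

    # store samples for replay once the filter transient epsilon(t) has decayed
    if step >= 300 and step % 100 == 0 and len(memory) < 10:
        memory.append((h, d, z))

    # set-approximation errors e = (W_hat*h + a*d - z)*K for the current sample
    # and for every stored sample (re-evaluated with the current W_hat)
    e_now = (W_hat * h + a * d - z) * K
    replay = sum((W_hat * hk + a * dk - zk) * K * (hk * K) for hk, dk, zk in memory)

    # experience-replay update law (4.29), scalar form
    W_hat += dt * (-Gamma * (e_now * (h * K) + replay))

    # propagate the true external dynamics
    z += dt * W_true * phi(z)

print(f"true weight: {W_true:.3f}, estimate: {W_hat:.3f}")   # estimate approaches W_true
```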
An exploratory signal consisting of sinusoids of varying frequencies can be added to the control input to ensure PE qualitatively. Note that the requirement of rank satisfaction is much less restrictive than the standard PE condition re- quirement and is much easier to verify online. The exploratory noise can be removed as soon as the rank condition is satisfied, which can be easily certified. Theorem 4.3. Consider the model (4.19), the update law (4.29) and assume full rank of the matrix Z in (4.30) 1) If there is no reconstruction error, i.e., ϵf = 0, then the set approximation error (4.28) converges to zero exponentially fast. 2) If ϵf ̸= 0, then the set estimation error is uniformly ultimately bounded (UUB), and the ultimate bound can be made small by recording rich data in the history stack. Proof. Let the Lyapunov function on weight error be as VW = 0.5tr(W̃ Γ−1 W̃ ) ˙ ˙ (t), By differentiating along the trajectory of (4.29) and considering the fact that Ŵ (t) = W̃ 85 one has V̇W = −tr(W̃ (t)[K T (t)hT (z(t))h(z(t))K(t)) XP + (K T (t)hT (z(tk ))h(z(tk ))K(t)]W̃ T (t)) k=1 X P T T + tr([ϵ(t)K (t)h (z(t)) + ϵ(tk )K T (t)hT (z(tk ))]W̃ T (t) (4.32) k=1 where K(t) stands for K(x(t), z(t)). Eq. (4.32) is simplified as V̇W = −tr(W̃ (t)K T (t)[hT (z(t))h(z(t))) XP + (hT (z(tk ))h(z(tk ))]K(t)W̃ T (t)) k=1 X P + tr([ϵ(t)K T (t)hT (z(t)) + ϵ(tk )K T (t)hT (z(tk ))]W̃ T (t) (4.33) k=1 If the rank condition on Z holds, then XP (hT (z(tk ))h(z(tk )) > 0 k=1 Therefore, for case of no reconstruction error, V̇W < 0 means that W̃ exponentially converges to 0. This completes the first part of the proof. For the second part and under reconstruction error, assume XP B = K T (t)[hT (z(t))h(z(t))) + (hT (z(tk ))h(z(tk ))]K(t) (4.34) k=1 XP ϵn = ϵ(t)K T (t)hT (z(t)) + ϵ(tk )K T (t)hT (z(tk )) (4.35) k=1 86 Using (4.33), one has ∥ϵn ∥ if W̃ ≥ ⇒ V̇w < 0 (4.36) λmin (B) where λmin is the smallest eigen value of B defined in (4.34). Therefore, if ϵf = o, W̃ converges to zero exponentially fast and thus e(t) in (4.25) converges to zero exponentially fast. According to (4.19), (4.21), (4.22), and (4.35), one has P +1 ∗ ∥ϵn ∥ ≤ (ϵf ) a where by proper selection of a as identifier design parameter, (4.36) is satisfied. For any value of a, V̇w is negative outside of the following compact set P +1 ω = {W̃ | W̃ ≤ (ϵf ∗ )} (4.37) aλmin (B) Based on (4.25), e(t) will also remain bounded, and this completes the proof of the second part of the theorem. Remark 4.9. Note that the proposed set identification method provides exponential con- vergence of the set approximation error to zero. This implies that there are times t1 , t2 , ... during learning that set approximation error e(tk ) = l(x(tk ), ẑ(tk )) − l(x(tk ), z(tk )) has decreased i.e., e(tk+1 ) < e(tk ). Considering the approximated safe set at this sequence {x(tk )|l(x(tk ), ẑ(tk )) ≥ 0} which is equivalent to the set {x(tk )|l(x(tk ), z(tk )) ≥ −e(tk )} re- veals that by decreasing the error, the approximated set gets closer to the exact safety set C . Utilizing experience replay technique has at least two advantages: 1) it significantly improves the decay rate of the approximation error and thus reduces the conservatism, and 2) it provides the ego system with an easy-to-verifiable metric to check if the approximated safe set converges to the actual safe set. Remark 4.10. Fast convergence of the model enables the control system to act in a less 87 conservative manner leading to enhanced performance. 
Even in the case of an inaccurate model with a non-zero reconstruction error, this method provides an acceptable performance with a UUB weight estimation error. 4.4 Control Framework The proposed control framework is demonstrated in Fig. 4.2. First, the control system gathers data, e.g., distance from other agents in the environment collected by camera or LIDAR sensor, by observing its surrounding environment. The observed data are labeled as risky and safe, respectively. The safe data coming from the external agents that do not impose any risk on the control system are removed from the collected data. Then, the risky data representing external agents that can impose risk on the control system are applied to identifier blocks that approximate the dynamics of risky external agents using the modified experience replay method. Next, the state of the system and the output of identifier modules are injected into the CBF block to form L-ZCBF constraints according to the strict safety criterion. Finally, these L-ZCBF constraints govern the performance of the controller block, and control action must satisfy L-ZCBF constraints. The combination of identifier networks and the CBF block is called the guardian block. The quadratic programming [27, 28] is employed to design the controller for this platform. The performance objective is formulated as a soft inequality constraint on derivative of the Lyapunov function. This constraint on the Lyapunov function and ZCBF constraint are unified by imposing them as constraints of quadratic programming problem, which aims to minimize a cost function. The cost function is a combination of the control input u and the relaxation factor η which is considered in the performance objective to make it a soft constraint. As a result, the minimum value of the control input, which satisfies safety is obtained and the system gets close to desired performance as much as possible. ZCBF-based quadratic programming is formulated as 88 Figure 4.2: Control scheme min ut T Hut + F ut u,η s.t. (4.14), V̇ < ρη. (4.38) where ut = [u, η], H and F are the weight matrices, V is the Lyapunov function and ρ is the coefficient of the relaxation factor η. Remark 4.11. If optimal controller u∗ for performance objective is available, such as linear quadratic regulator (LQR) solution in a linear control system, then, the Lyapunov inequality in (4.38) can be replaced by equality u = u∗ + ρη. The overall algorithm is given in Algorithm 3. 4.5 Case Study The effectiveness of the proposed approach is verified here by designing a safe maneuver controller for an autonomous vehicle in the presence of other vehicles on the road. 89 Algorithm 3 Barrier-certified Learning-enabled Controller 1: Start with a safe initial condition x0 ∈ intC . 2: procedure 3: procedure Observation 4: Store states of previously observed agents zi , i ∈ 1, ..., ni with ni as the number of the previously observed agents (store null if agent vanished). 5: Store states of new observed agents zni +j , j ∈ 1, ..., nj with nj as the number of the new observed agents . 6: Go to ”Initial Safety Assessment”. 7: end procedure 8: procedure Initial Safety Assessment 9: Check states of previously observed agents zi , i ∈ 1, ..., ni . Store them if they are still risky or null. Store null if they are safe. 10: Check states of new observed agents zni +j , j ∈ 1, ..., nj . Store if they are risky. Discard if they are safe. 11: Go to ”Update”. 
12: end procedure 13: procedure Update 14: If stored data corresponds from previously observed agents, go to ”Existing Agents”. 15: If stored data corresponds from new agents, go to ”New Agents”. 16: procedure Existing Agents 17: If stored data is null, discard corresponding identifier and L-ZCBF constraint. 18: If stored data is not null, update weights using (4.29). 19: end procedure 20: procedure New Agents 21: Initialize an identifier network Ŵni +j 0 and safe initial approximation ẑni +j 0 for each new state j ∈ 1, ..., nj . 22: Form L-ZCBF constraint using (4.14) and incorporate it in quadratic programming (4.38). 23: end procedure 24: Go to ”Quadratic Programming”. 25: end procedure 26: procedure Quadratic Programming 27: Solve the quadratic programming problem (4.38). 28: Apply the obtained controller to the system and update ni = ni + nj , nj = 0. 29: Go to ”Observation” and repeat until performance objectives are met. 30: end procedure 31: end procedure 90 4.5.1 Control Scenario Fig. 4.3 shows a safety-critical maneuver for autonomous vehicles in an urban area. The ego vehicle, specified by its position (xe , ye ), is traveling in the road, and the control objective is to reach a pre-defined destination, which is marked in Fig. 4.3, in an optimal manner. However, the road is shared with other vehicles with uncertain behaviors, and their objectives might be in conflict with the ego vehicle desired objective. Vehicle (x1 , y1 ) is traveling next to the ego vehicle; although it is very close, it looks safe. Vehicle (x2 , y2 ) was not previously observable to the ego vehicle while it is now reaching the cross-section and might impose risk on the ego vehicle passing the crossroad. Vehicle (x3 , y3 ) is farther but, it is moving in the same path as the ego vehicle and might impose risk on its maneuver in the future time. These types of maneuver scenarios are practically challenging but so common in everyday driving. This even becomes more challenging if instead of vehicles, bicycles or pedestrian are in the road which add more unpredictability and complexity to the control scenario. To elaborate on that, the effect of having an agent with a more complicated behavior instead of vehicle three is investigated as well. The following section mathematically formulates this scenario. 4.5.2 Mathematical Representation The simple mass point model for vehicles is used [46].      ẋ   v · cos ψ       ẏ   v · sin ψ   =      ψ̇   v · u     s      v̇ ua − µ · v where x, y are cartesian coordinate of the vehicle, v stands for vehicle’s velocity and ψ is the heading angle of the vehicle. us is the steering angle and ua is acceleration. µ is the friction 91 coefficient. For simplicity, the state vector is presented by X = [x y ψ v]. When moving in a straight line with zero friction coefficient, the dynamics of the ego vehicle and vehicles 1 and 3 are simplified as a double integrator        ẏi  0 1 yi  0  =    +   uai , i = e, 1, 3 v̇i 0 0 vi 1 and the dynamic of vehicle 2 is given as        ẋ2  0 1 x2  0  =    +   ua2 v̇2 0 0 v2 1   0 1  Note that although the open-loop system is unstable, its controllability matrix is  , 1 0 which is full rank. Therefore, the system satisfies the stabilizability assumption, and there is a control input to make the closed-loop system stable. As explained in Algorithm 3, after gathering observation data, an initial safety assessment is required. 
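As a compact illustration of the quadratic program (4.38) at the core of Algorithm 3, the sketch below solves one instance for a double-integrator vehicle, using the option of Remark 4.11 in place of the Lyapunov inequality. The barrier, the gains, the numerical values, and the use of the cvxpy package are illustrative assumptions, not the implementation of the case study that follows.

```python
import numpy as np
import cvxpy as cp

# One step of the ZCBF-based quadratic program (4.38) for a double-integrator
# vehicle p_dot = v, v_dot = u.  Following Remark 4.11, the Lyapunov inequality is
# replaced by u = u* + rho*eta, where u* is a nominal (e.g., LQR) input and eta is
# the penalized relaxation.  The barrier keeps a velocity-dependent margin to a
# position limit (a headway-type constraint).  F in (4.38) is taken as zero here.

p_lim, Ds, tau = 10.0, 1.0, 0.5       # position limit, standoff, headway time
alpha, rho = 5.0, 1.0                 # class-K gain and relaxation coefficient
H = np.diag([1.0, 10.0])              # weights on ut = [u, eta]

def qp_step(p, v, u_nominal):
    u, eta = cp.Variable(), cp.Variable()
    ut = cp.hstack([u, eta])

    l = p_lim - Ds - p - tau * v                  # barrier l(x) >= 0
    constraints = [
        -v - tau * u + alpha * l >= 0,            # ZCBF condition on l_dot
        u == u_nominal + rho * eta,               # Remark 4.11 performance constraint
    ]
    cp.Problem(cp.Minimize(cp.quad_form(ut, H)), constraints).solve()
    return float(u.value)

# the nominal controller asks for u = 2 (accelerate toward the goal); near the
# position limit the ZCBF constraint caps it, and the QP returns approximately u = 1
print(qp_step(p=7.5, v=2.0, u_nominal=2.0))
```

In the full algorithm, this step would be re-solved at every sampling instant with the L-ZCBF constraints of the currently risky agents.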
In this scenario, using distance alone to assess safety is not sufficient, since vehicle one is close to the ego vehicle yet safe, while other vehicles are farther away yet may impose risk. Therefore, the minimum distance of surrounding agents to the center of the road along which the ego vehicle is moving is used for the initial safety assessment and is called the minimum safe lateral distance r_min. In this scenario, r_min is measured along the x-coordinate and denoted x_min, which is taken to be the lane width here and can be modified based on the application. As a result, if surrounding agents are within this range, they are considered risky. Therefore, vehicle one is safe and is not included in the loop as long as its lateral distance remains in the safe range. The other agents, however, are considered risky, and headway safety criteria regarding them are applied to the guardian block. The headway rule stated in [27, 71] is employed:

D > v_e / 2

where v_e is the ego vehicle speed and D is the distance between two vehicles. Then, the safety criteria for vehicles 2 and 3 in this scenario are

l_2(y_e, y_2) = y_2 − y_e − v_e/2 > 0  when |x_2 − x_e| < x_min
l_3(y_e, y_3) = y_3 − y_e − v_e/2 > 0    (4.39)

This formulation shows that if any vehicle gets very close to the lane in which the ego vehicle is moving, then a minimum headway is required. Therefore, if the distance between the ego vehicle and any other vehicle shortens, the ego vehicle should reduce its velocity to operate under these safety criteria. The ego vehicle observes the identified external agents as black boxes in which only their current states are measurable. Thus, the identifier NNs for vehicles 2 and 3 are defined as

ŷ̇_2 = Ŵ_2 φ_2(y_2),  ŷ̇_3 = Ŵ_3 φ_3(y_3)

The ego vehicle identifies the dynamics of vehicle 2 only in the y-coordinate because x is only needed for the initial assessment and is not included in the headway criterion. One of the advantages of the proposed approach is that safety is ensured even with inaccurate modeling of the external agents. Thus, to reduce computational cost and learning time, simple single-layer perceptrons with polynomial activation functions are employed:

ŷ̇_2 = Ŵ_2 · y_2,  ŷ̇_3 = Ŵ_3 · y_3    (4.40)

Remark 4.12. Since vehicle 2 is crossing the lane, the corresponding identifier is activated after it becomes observable to the ego vehicle. However, the corresponding L-ZCBF is formed and incorporated in the quadratic programming when it reaches the lane in which the ego vehicle is moving. This setup can be adjusted based on the application. For example, one might decide to design a more conservative controller and incorporate the L-ZCBF at the time of observation.

The L-ZCBFs are defined using (4.14) and (4.39) as

(∂l_2/∂y_e) ẏ_e + (∂l_2/∂v_e) v̇_e + (∂l_2/∂ŷ_2) ŷ̇_2 + α_2(min(l_2(y_e, y_2), l_2(y_e, ŷ_2))) ≥ 0
(∂l_3/∂y_e) ẏ_e + (∂l_3/∂v_e) v̇_e + (∂l_3/∂ŷ_3) ŷ̇_3 + α_3(min(l_3(y_e, y_3), l_3(y_e, ŷ_3))) ≥ 0

For performance purposes, the LQR problem is solved as mentioned in Remark 4.11. Then, the overall controller is implemented using Algorithm 3. The values of the parameters used in the simulation can be found in Table 4.1.

Table 4.1: Simulation Parameters
Parameter     Value
Q             I_{2×2}
R             1
α_2, α_3      15
H, F, ρ       I_{2×2}, 0, 1
a             0.7
Ŵ_{20}        −0.2
Ŵ_{30}        0.1

Figure 4.3: Control scenario

4.6 Simulation Results

Simulation is performed for the aforementioned control scenario in three sub-scenarios. First, an accurate network model with zero reconstruction error is employed, which can converge to the exact vehicle model.
In the second sub-scenario, an inaccurate model is used, which cannot converge to the exact model. The third sub-scenario adds more complexity by considering an agent with a more complicated behavior in front of the ego vehicle, which the employed NN cannot accurately model. The purpose of using an inaccurate model is to demonstrate the strength of the proposed approach in guaranteeing safety in case of modeling error. 4.6.1 Zero Modeling Error Scenario The network defined in (4.40) is assumed to be accurate without any reconstruction error, so after learning its weights, it converges to the exact model. Fig. 4.4 shows the results, where y as coordination of the ego vehicle and two risky vehicles 2, 3 are demonstrated. Without loss of generality, the destination of the ego vehicle is assumed to be located at the origin. The ego vehicle starts from its initial position, but a crossing vehicle is reaching, so the ego vehicle slows down and proceeds in a smooth maneuver when the crossing vehicle passes. After passing the crossroad, the ego vehicle faces another slow-moving vehicle in front of it; as a result, it slows down to adapt to the flow of traffic. As can be seen in Fig. 4.4, because 95 Figure 4.4: Position of vehicles in ’y’ coordinate (Scenario1) of the presence of vehicle 3, the ego vehicle could not reach the destination; but, it reached as close as possible while safety is still ensured. Fig. 4.5a shows the convergence of the weights of networks. The LQR performance of the system in lack of safety considerations is demonstrated in Fig. 4.5b. As can be seen in this figure, without safety consideration, the ego vehicle would crash with either vehicle 2 or vehicle 3. To further clarify the advantage of employing the proposed learning method, a simulation is conducted to compare the weight convergence with and without using the past stored data in the update law as depicted in Fig. 4.7. As seen in this figure, the network weight has a fast and exponential convergence under the proposed method. 4.6.2 First Non-zero Modeling Error Scenario In this sub-scenario, the same network as (4.40) is employed while vehicle 3 demonstrates a different behavior as ẏ3 = a. In other words, the allocated network is not a proper one 96 Figure 4.5: (a) NN weights (Scenario 1), (b) Optimal solution without CBF for modeling the dynamics of vehicle 3. Fig. 4.6 shows the convergence of both networks’ weights. As can be seen, W3 could not converge. Fig. 4.8 shows the locations of the ego vehicle and vehicles 2 and 3 in this scenario. Despite inaccurate modeling, the safety of the system is ensured. The ego vehicle slows down to avoid crash with vehicle 2, and after that, it accelerates to reach the destination. However, it has faced vehicle 3 and has adjusted its velocity accordingly until it gets to the destination safely. 4.6.3 Second Non-zero Modeling Error Scenario One of the big challenges of safe urban driving is unpredicted and hard-to-model dynamics such as the jump of an animal to the road or human behavior. The proposed method is functional in handling these unpredicted behaviors. To further analyze the result of employ- ing this method, an agent with a more complicated dynamics is considered to be the only risky agent which is moving in front of the ego vehicle with dynamics of ẏ = 0.4t + 1.2 − 0.7 sin(2πy) + 0.1y (4.41) The network (4.40) is employed for the identification of this dynamics, which has non-zero reconstruction error. The network weight update is shown in Fig. 
4.9, which could not converge. The y-coordinate of both agents is shown in Fig. 4.10. As can be seen in this figure, despite the complexity in the behavior of the external agent and the existence of reconstruction error, the ego vehicle has a safe maneuver.

Figure 4.6: NN weights (Scenario 2)
Figure 4.7: NN weights with and without experience replay

Remark 4.13. The purpose of this simulation is to demonstrate the capability of the method to guarantee safety when facing an agent whose dynamics cannot be modeled using the pre-defined networks.

4.6.4 Discussion

The efficacy of the proposed method is examined in three different scenarios: 1) The assigned NN properly captures the dynamics of the external agent, but safety and performance are in conflict. It is shown that the agent maneuvers safely during and after learning and gets as close to its destination as safety permits. 2) There exists a reconstruction error, and the assigned NN cannot fully capture the dynamics of the external agents. It is shown that, despite this error, the ego vehicle still maintains a safe maneuver. This is of significant practical value since, in many real applications, an NN that fully captures the unknown dynamics is not always available or tractable; the use of a simplified model is therefore facilitated. 3) A more complicated dynamical behavior in the presence of reconstruction error is considered. It is shown that, despite the complex dynamics, the safety of the ego vehicle is still ensured.

Figure 4.8: Position of vehicles in 'y' coordinate (Scenario 2)
Figure 4.9: NN weight (Scenario 3)
Figure 4.10: Position of agents in 'y' coordinate (Scenario 3)

4.7 Conclusion and Future Work

In this chapter, a learning-enabled ZCBF controller for safety-critical systems under uncertainty has been proposed. It has been proved that the proposed method is capable of ensuring safety in complicated and uncertain environments in the presence of external agents with unknown dynamics. It has also been demonstrated that safety is guaranteed during learning and even with inaccurate modeling of external agents. As a result, this approach provides a practical method for control scenarios in which accurate modeling requires large amounts of data and computationally expensive learning schemes while unpredicted objects are still expected, such as autonomous driving in an urban area. Meanwhile, having a better model has enabled the controller to take less conservative actions and has resulted in better performance. To achieve this goal, a modified experience replay method has been proposed that identifies the external agents' dynamics so as to minimize the difference between the safe set and its approximation. This method provides fast convergence and ensures a bounded error with respect to the exact model even with inaccurate modeling, both of which are crucial in safety-critical control systems. Future work includes consideration of disturbances in the ego system's dynamics and extension to a robust framework. Furthermore, the reciprocal behavior of agents in the environment can be considered.

Chapter 5

Robust Satisficing Cooperative Control Barrier Functions for Multi-Robot Systems using Information-Gap Theory

Contents of this chapter first appeared as [52] and have been reformatted to fit the requirements of this dissertation.
5.1 Introduction Successful deployment of multi-robot and swarm systems demands safety guarantee of agents. While a centralized control approach can be used to design safe controllers for all agents in a swarm, the communication and computation complexity are expensive and do not scale up with the size of the swarm. Therefore, it is desired to prevent collision in a distributed manner using only local information exchange among agents, either explicitly through com- munications or implicitly through internal sensors. However, this local information is not certain and accurate due to imperfect communication, measurement errors, aging of sensors, weather conditions, and even failure of the sensing system. As a result, the failure of one agent in avoiding collision due to these uncertainties can lead to catastrophic failure of the whole system. This necessitates taking the uncertainty on the local information into account in any safe control design. In addition, it is desired to leverage the cooperative capability of the swarm system in ensuring safety via sharing responsibility in avoiding collision and 102 compensating for uncertainties as much as possible. Common approaches for solving collision avoidance problems include conflict resolution [72], model predictive control (MPC) [73], potential field function[74], geometric guidance [75], and barrier function-based methods [33, 76, 77, 78]. Conflict resolution approaches, such as reachability-based methods rely on the availability of trajectories of other agents to find an obstacle-free route. MPC solves an optimization problem at every sample and ac- counts for state and input constraints. Collision avoidance criteria are also represented as the state constraint, and thus, MPC is employed to address this problem. In the potential field approach, each agent follows the gradient of potentials from which the target is attracting to and obstacles are repelling from. Geometric guidance methods such as collision cone [79] and velocity obstacle [75] result in less computational cost than the MPC and the conflict resolution approach. However, coping with sensors’ errors which directly affects the safety criterion and thus the safety of the system and extension to decentralized safe frameworks which only rely on local information rather than global functions, remain as challenges [77]. Control barrier function (CBF)-based methods prevent collision between agents by en- suring the forward invariance of a safe set. This provides safety without neither the need for computing a safe reachable set in reachability-based methods nor solving an online nonlinear optimization for every instance of time in MPC-based methods, and, thus, is more computa- tionally tractable. In addition, it can be applied to the control loop in a minimally invasive manner, in the sense that a nominal (e.g., optimal) controller can be modified as little as possible to ensure safety. Therefore, it is a versatile method that can be integrated into a variety of control approaches [27, 50, 80]. This is superior to the collision avoidance methods, which employ a secondary controller in the face of collision risk. This is because switching between primary and secondary controllers can delay or even prevent safe task completion, especially in a dense environment. In [34], the robustness of CBFs under model perturbation is investigated and asymptotic stability of the safe set is established. In [81], a distributed CBF-based approach is presented for multi-agent systems. 
This work is then extended in 103 [33] and [76] to heterogeneous swarm systems in which the maximum acceleration of agents is not equal, and the CBF is shared between agents based on their maximum acceleration. In [82], each agent is modeled as an ellipsoid and a distributed CBF is used to ensure the safety of a swarm system while disturbance and parametric uncertainty in gravity term of Lagrangian dynamics is considered and estimated. However, in a distributed CBF approach, which relies on the local information, measurement uncertainty (i.e., inaccuracy in the rela- tive distance to other agents measured by the ego-agent) might jeopardize the safety of the overall system, which is not considered in the existing distributed CBF methods. Therefore, it is desired to investigate the effect of measurement uncertainty on the safe performance of swarm systems. Considering a worst-case uncertainty limits the action of agents and results in an overly conservative controller. Especially in a situation that error bound might be substantial, unknown, and time-varying, the worst-case method might not be feasible at all. In addition, when the measurement confidence of agents is not uniform, higher uncer- tainty in one agent’s measurement might lead to catastrophic failure of the overall system; therefore, agents need to cooperatively decide on their roles in ensuring pairwise safety. In addition, sources of measurement uncertainty are not always known a priori. For example, different weather and lighting conditions can alter the accuracy of the reading or make it unreliable. Therefore, to tackle this issue, rather than considering the probabilistic model or bound of the measurement error, we propose to use the cooperative capability of agents to maximize the horizon of uncertainty under which the safety of the overall system is ensured. When agents have different measurement confidences, the agents with higher certainty de- cide to take more responsibility for ensuring the safety of the overall system. Even in an extreme case that one agent fails in sensing, other agents can compensate by taking responsi- bility for the safety guarantee. This is achieved by sharing CBFs using information-gap (IG) theory to maximize the safe horizon of uncertainty between every two agents. IG theory is a decision-making tool for prioritizing alternatives when neither the probabilistic distribu- tion of uncertainty nor its worst case is available. The uncertainty in IG theory rather is 104 modeled with an ambiguity set of possible outcomes with an unknown bound or uncertainty horizon. The robust satisficing IG method takes action that results in the highest horizon of uncertainty up to which a critical requirement is satisfied. IG theory is employed in var- ious engineering problems [83, 84, 85, 86]. However, its application is less investigated in conjunction with control theory. In this chapter, a safe and robust satisficing control protocol is proposed for the multi- agent collision avoidance problem in the presence of measurement uncertainty. In the pro- posed approach, it is assumed that agents are unaware of their neighboring agents’ trajec- tories, and only uncertain local measurement information about neighbors is available (e.g., through embedded sensors) in the sense that neither a probabilistic model of the uncertainty nor its worst case is known. 
To maximize the robustness of the system to measurement uncertainty and satisfy the safety of the overall system, the IG theory along with the CBF approach is employed to determine the contribution of each agent in constructing shared CBFs between agents. It is shown that based on the IG theory, agents with more certain measurements must take more responsibility to ensure pairwise safety, and consequently, the overall safety of the multi-agent system, as they are allowed to have more agile behaviors and have more influence on the overall agility of the system. It is also shown that using the proposed approach, agents with more accurate measurements can cooperatively compensate for agents with less accurate measurements without sacrificing performance. In a nutshell, the contributions of the chapter are as follows. 1. Accounting for the measurement uncertainty in a distributed multi-agent barrier cer- tified control framework. 2. Employing the cooperative capability of agents for safety guarantee and compensating for each other’s measurement inaccuracy. 3. Presenting a robust satisficing approach to maximize acceptable horizon of uncertainty using information gap theory which enables a non-conservative robust design. 105 5.1.1 Organization of the Chapter Section II is allocated for problem overview and background information on CBFs and IG theory. The problem statement and the proposed framework of robust-satisficing distributed safe control are presented in Section III. Section IV represents the simulation results and Section V concludes the chapter. 5.2 Problem Overview and Background The problem overview and motivation are first presented in this section. Then, background on CBFs and IG theory is provided. 5.2.1 Problem Overview The controller for safety-critical multi-agent systems that share an environment must be carefully designed to not only achieve their tasks but also satisfy coupled safety constraints. Satisfying these coupled safety constraints under uncertainty (e.g., measurement uncertainty) is challenging and without the agent’s collaboration (e.g., shared collision avoidance strat- egy), might result in conservative control design and even infeasibility. To achieve this cooperation in ensuring safety, CBFs and IG theory are integrated. CBFs are employed to ensure forward invariance of the safe set. However, since the trajectory information of involving states is needed in each pairwise CBF (i.e., dynamics of each two agents, which collision should be avoided between them). Therefore, the pairwise CBF is broken into two distributed CBFs in which each of them only relies on each agent’s trajectory and local information. Satisfaction of distributed CBFs for each two agents in the vicinity of each other results in collision avoidance between them. However, measurement uncertainty leads to inaccuracy of distributed CBFs and, therefore collision. Considering the worst-case uncertainty severely limits the action of agents and leads to the conservatism 106 of the controller. In addition, the worst-case uncertainty is not always known. Therefore, as the main contribution of this chapter, another approach to tackle this problem is employed. Share of each agent in ensuring safety is determined in a cooperative manner using IG theory. In other words, CBF is shared between agents such that the system tolerates the highest horizon of uncertainty, while usage of uncertainty information is avoided in the formation of CBFs, resulting in agile and non-conservative controllers. 
It is shown that agents with more accurate measurements are able to compensate inaccuracy of other agents by taking more responsibility in ensuring safety. The formulation of this method unfolds in subsequent sections. 5.2.2 Background In this section, the background on CBFs and IG theory is briefly provided. An affine non- linear system is considered as ẋ = f (x) + g(x)u (5.1) where x ∈ Rn and u ∈ U ⊂ Rm are the state of the system and the control input, respec- tively. It is assumed that f and g are locally Lipschitz and the equilibrium point of the system is stabilizable. 5.2.2.1 Control Barrier Functions CBFs are used to guarantee the forward invariance of a predefined set in a control system. Zeroing CBF (ZCBF), as one major form of CBFs, is positive within a set of interest and reaches zero at the set’s boundary. Imposing a proper condition on its derivative prevents the system trajectories from passing the boundary of the set of interest, which guarantees its forward invariance. The safety criterion is represented as h(x) ≥ 0, where h(x) : Rn → R is a smooth 107 function. The safe set is defined accordingly as C = {x ∈ Rn |h(x) ≥ 0} (5.2) Definition 5.1. [55, 34] A continuous function β : (−b, a) → (−∞, ∞) with a, b > 0 is an extended class K function, if it is strictly increasing and β(0) = 0 . ■ Definition 5.2. [28] Considering the dynamical system (5.1) and the set C ⊂ Rn (5.2) defined using a h(x) ∈ C 1 function, if there exists a locally Lipschitz extended class K function β such that sup [Lf h(x) + Lg h(x)u + β(h(x))] ≥ 0, ∀x ∈ D (5.3) u∈U then, the function h(x) is a ZCBF on domain of interest D with C ⊆ D ⊂ Rn . ■ The set of feasible control inputs for h(x) is formed accordingly as Um (x) = {u ∈ U |Lf h(x) + Lg h(x)u + β(h(x)) ≥ 0} Ensuring the forward invarinace of a set using ZCBFs is the result of the following theorem. Theorem 5.1. [28]. Given dynamical system (5.1) and the set C ⊆ D (5.2) defined for h(x) ∈ C 1 , if h is a ZCBF on D, any Lipschitz continuous controller {u : D → R|u ∈ Um (x)} renders the set C forward invariant. Remark 5.1. Note that based on the definition of the safe set (5.2) and the ZCBF criterion defined in (5.3), if the initial state of the system (5.1) is inside the safe set, i.e., the initial condition x(0) satisfies h(x(0)) > 0, then even if h(x) decreases and the system’s trajectory gets close to the safety boundary, since, the derivative of h(x(t)) is positive in the boundary, i.e., Lf h(x)+Lg h(x)u ≥ 0, the value of h(x) starts increasing. This pushes back the systems’ 108 trajectories inside the safe set for which h(x(t)) > 0. Furthermore, β(h(x)) determines how fast the states of the system can reach the safety boundary. 5.2.2.2 Information-Gap Theory Uncertainty is typically modeled by either probability distribution or its worst case. However, in a scenario in which the system changes over time and future variations in the condition is poorly known in advance, a poor or overly-conservative controller might be induced. In such scenarios, IG theory can be employed to drastically improve robustness and performance of the system. Robust satisficing IG theory is a non-probabilistic decision making method that prioritizes the choices to maximize robustness against uncertainty. First, an ambiguity set is leveraged to model the uncertainty. Based on the application and the available knowledge about the uncertain entity, different IG models can be employed such as hybrid IG model, slope bound model, and Fourier bound model [83, 84]. 
Note that the horizon of uncertainty in the ambiguity set is unknown. To clarify this, assume that the parameter K is the entity of interest while limited knowledge about it is available and let K̂ be a rough estimation of K, while the exact value of deviation from the true value is unknown. The following fractional IG model can be used as an ambiguity set K − K̂ U (s, t) = {K | | < s} (5.4) ϕ(t) where the parameter s is the horizon of uncertainty which is unknown. The true value of K 1 can deviate from the available estimation by at most ±s|ϕ(t)|. Note that ϕ(t) is considered as the measurement confidence which is a rough measure on validity of the sensed or measured data. The sensed data with higher measurement confidence is more reliable. Depending on the application, there are different methods in the literature to quantify measurement con- fidence such as sensor fusion-based methods [87, 88]. In the simplest form, with no external 109 processing method to extract the reliability of measurement, measurement confidence is the measurement accuracy of the instrument in which the manufacturer has provided. Mea- surement confidence depends on the situation, the operating environment, and the accuracy of the sensing equipment of agents. For example, a change in the weather condition and the discrepancy between different sensors’ measurements of an agent result in less confident measurements, and, therefore, the agent demands recommunication. An illustrative example is provided in the subsequent section to clarify the basis for calculation and change of the measurement confidence. Note that IG approach is entirely different from the standard practice in robust control for which the uncertainty bounds are known and the goal is to design a controller that satisfies the system’s requirement within known uncertainty boundaries. Instead, the goal of IG is to maximize the uncertainty horizon under which the system achieves its requirement. All IG models including fractional-error model (5.4) have two properties of contraction and nesting. Contraction property states that U (s) is a singleton when s = 0, while nesting property means that increment in s makes U (s) more inclusive and s2 < s1 =⇒ U (s2 ) ⊂ U (s1 ) This reveals the importance of our desire to maximize the horizon of uncertainty. When horizon of uncertainty gets bigger, the ambiguity set (5.4) expands and thus more uncertainty will be tolerated. 5.3 Robust-Satisficing Control Barrier Function This section presents the problem of collision-free and safe control of multi-agent systems using CBF and IG theory in the presence of uncertainty in the agent’s local measurement information. To cope with uncertainty, a robust satisficing approach is proposed to determine the share of each agent in the CBF between each two agents to tolerate the highest horizon 110 of uncertainty under which the pairwise safety is still ensured. Considering the system model (5.1) and the uncertainty model (5.4) where the uncertain entity h(x) is the relative distance between each pair of agents, h(x) − ĥ(x) U (x, s, t) = {h(x) | | < s} ϕ(t) by determining the minimum performance requirement that must be satisfied, the IG safety robustness is defined as s∗ (xc ) = max{s|( min h(x)) ≥ xc } (5.5) s h(x)∈U (s,t) where h(x) is used to represent safety requirement and xc is a critical value (which is 0 here) from which h(x) should not exceed. Eq. 
(5.5) implies that our goal is to maximize the horizon of uncertainty s∗ which in the worst case, the system requirement h ≥ 0 is satisfied. Illustrative Example: As mentioned earlier, the quantification of measurement confidence depends on the sys- tem’s application and sensing capability. Thus, a variety of methods, such as sensor fusion approaches, can be employed to examine the reliability of the sensed data. Here we provide a simple example to clarify the concept further. Assume the system is equipped with two different distance sensors, a laser scanner with the measured value of dli , i = 1, ..., N , where N is the number of the nearby objects; and a radar with the measured distance to the nearby objects of dri , i = 1, ..., N . To determine the reliability of the sensed data dli , one might examine the norm of the discrepancy between these two readings as ϕi (t) = (||dli − dri ||) + eli 111 where eli is the nominal measurement error of the laser scanner. As the discrepancy between 1 two measurements increases, the measurement confidence ϕi decreases. On the other hand, 1 if this discrepancy is negligible, then the measurement confidence simplifies to eli . To have a discrete index, one might define this function as ϕi (t) = f (||dli − dri ||) + eli where ( a1 , if 0 < x < b1 .. f (x) = . ai , if bi−1 < x < bi It will be shown later that agents need to recommunicate in case of a change in measurement confidence. Therefore, by having a discrete index they only need to re-communicate if a critical value of discrepancy is passed resulting in a reduced communication cost. 5.3.1 Problem Formulation Consider a swarm system with N agents and index set of M = {1, ..., N }. Each agent is modeled as a single integrator ṗi (t) = ui (t), ∀i ∈ M (5.6) For simplicity, in the rest of paper, pi (t) and ui (t) are written as pi and ui , respectively. pi ∈ R2 is the position vector in the Cartesian space and ui ∈ R2 is the velocity of the agent which is considered as the control input. It is desired to satisfy the following pairwise safety 112 criterion for collision avoidance between each two agents i and j ∥∆pij ∥ ≥ Ds , ∀i, j ∈ M , i ̸= j where ∆pij = pi − pj is the relative position between agents i and j and Ds is the minimum safe distance. The pairwise safety criterion hij between each two agents i and j is then defined as hij = ∥∆pij ∥ − Ds ≥ 0, ∀i, j ∈ M , i ̸= j (5.7) which specifies that the pairwise distances between agents should be kept above a critical value Ds , resulting in collision avoidance. According to (5.7), the following pairwise safety sets are formed Cij = {(pi , pj )|hij ≥ 0}, ∀i, j ∈ M , i ̸= j (5.8) The pairwise ZCBF constraint which ensures forward invariance of (5.8) and consequently, their pairwise safety based on Theorem 5.1 and using (5.3) with taking β(hij ) = αij hij is ∆pTij ∆uij + αij (||∆pij || − Ds ) ≥ 0, ||∆pij || ∀i, j ∈ M , i ̸= j (5.9) where αij > 0 is a design parameter which determines how fast the trajectories of the system can approach the safety boundary and ∆uij = ui − uj . The overall safety set is formed using (5.8) as follows [33] Y \ C = { Cij } (5.10) i∈M j∈M ,j̸=i 113 This implies that in order to have a collision-free maneuver for the overall system, the collision should be avoided between each two agents. This result is presented in the following theorem. Lemma 5.1. [33]. 
The multi-agent system represented by M is safe and C in (5.10) is forward invariant, if the control input u = [u1 T , ..., uN T ]T satisfies all pairwise ZCBF con- straints (5.9). As can be seen in (5.9), the information about trajectories of both agents i and j is needed in ZCBF inequality constraint. However, in reality, the exact trajectories of agents are not available due to measurement inaccuracy or communication noise and sensor or communication failure. As a result, alternative ZCBFs should be employed for which each agent takes responsibility on ensuring safety based on its own trajectory information and local uncertain measurements about the position of other agents. 5.3.2 Robust-satisficing Distributed CBF In this subsection, the effect of measurement uncertainty in certifying the safety of the system is investigated. Afterwards, the idea of distributing ZCBF constraints in order to achieve highest robustness considering the measurement uncertainty is presented. 5.3.2.1 Distributed ZCBF It is desired to guarantee safety of the overall system in a distributed network using only local information. By considering the fact that CBF provides safe and admissible set of inputs which can be divided into safe and admissible subsets, [33] and [76] propose to distribute the pairwise ZCBF constraint (5.9) between agents i and j as follows ∆pTij ZCBFi : ui + αi · (||∆pij || − Ds ) ≥ 0 (5.11) ||∆pij || −∆pTij ZCBFj : uj + αj · (||∆pij || − Ds ) ≥ 0 (5.12) ||∆pij || 114 where αi + αj = αij (5.13) with αij as a design parameter set in advance to achieve a specific performance while safety is ensured. ZCBFi and ZCBFj are ZCBF constraints that agent i and agent j follow based on their local information. Thus, if each agent’s controller ui and uj are designed such that (5.11) and (5.12) hold, then their summation, which is the pairwise ZCBF constraint (5.9), is satisfied as well. In this formulation, each agent only needs its own trajectory and local measurement information about the relative distance to surrounding agents to satisfy the corresponding ZCBF constraint; therefore, a distributed implementation is feasible. Param- eters αi and αj specify how the pairwise ZCBF constraint (5.9) is shared between agents. The greater αi and therefore the faster αi · (||∆pij || − Ds ) gets close to zero, the faster the derivative terms become positive, which pushes harder and faster the trajectory of the system back into the safe set. In other words, the agent with greater allocated αi is allowed to have a more agile maneuver. Since measurement uncertainty is inevitable and must be considered when designing safe controllers, it is desired to determine a method for allocation of these parameters to achieve the best possible robustness against measurement uncertainty. This will be covered in the following subsection. 5.3.2.2 Robust-satisficing distributed ZCBF In a distributed safe control framework, each agent relies on its own local measurement information. However, measurement uncertainty and accuracy reduction due to aging of sensors or uncertainty due to imperfect communication can affect the safety of the overall system. Therefore, it is important to model and incorporate uncertainty in control design. In this chapter, IG theory is employed to address the question of how to design distributed ZCBFs which are capable of tolerating the highest horizon of uncertainty while avoiding 115 collision. It is assumed that agent i is capable of measuring instantaneous relative position of surrounding agents. 
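Before measurement uncertainty is introduced, the nominal split (5.11)–(5.13) can be made concrete with a short sketch; the single-integrator agents, the numerical values, and the closed-form minimal correction are illustrative choices rather than the formulation developed in the remainder of this section.

```python
import numpy as np

# Nominal distributed ZCBF split (5.11)-(5.13) for two single-integrator agents
# p_i_dot = u_i: each agent enforces only its own half of the pairwise constraint
# from the relative position it measures locally, and the two halves add up to the
# pairwise ZCBF (5.9).

Ds = 1.0                                   # minimum safe distance
alpha_ij = 4.0                             # pairwise design parameter
alpha_i = alpha_j = alpha_ij / 2           # equal split, alpha_i + alpha_j = alpha_ij

def half_constraint(dp, u, alpha_share, sign):
    """Left-hand side of (5.11) for sign=+1 (agent i) or (5.12) for sign=-1 (agent j)."""
    n = np.linalg.norm(dp)
    return sign * (dp / n) @ u + alpha_share * (n - Ds)

def enforce(dp, u_nom, alpha_share, sign):
    """Minimally modify u_nom so that the agent's half-constraint is nonnegative."""
    n = np.linalg.norm(dp)
    a = sign * dp / n                       # unit gradient direction, ||a|| = 1
    c = a @ u_nom + alpha_share * (n - Ds)
    return u_nom if c >= 0 else u_nom - c * a

dp = np.array([1.5, 0.0])                                   # Delta p_ij = p_i - p_j
u_i = enforce(dp, np.array([-2.0, 0.0]), alpha_i, +1.0)     # agent i drives toward j
u_j = enforce(dp, np.array([+2.0, 0.0]), alpha_j, -1.0)     # agent j drives toward i

# each half is nonnegative, and their sum equals the pairwise constraint (5.9)
print(u_i, u_j,
      half_constraint(dp, u_i, alpha_i, +1.0) + half_constraint(dp, u_j, alpha_j, -1.0))
```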
Since measurement uncertainty is inevitable and must be considered when designing safe controllers, it is desired to determine a method for allocating these parameters that achieves the best possible robustness against measurement uncertainty. This is covered in the following subsection.

5.3.2.2 Robust-satisficing distributed ZCBF

In a distributed safe control framework, each agent relies on its own local measurement information. However, measurement uncertainty, accuracy reduction due to the aging of sensors, or uncertainty due to imperfect communication can affect the safety of the overall system. Therefore, it is important to model and incorporate uncertainty in the control design. In this chapter, IG theory is employed to address the question of how to design distributed ZCBFs that are capable of tolerating the highest horizon of uncertainty while avoiding collision. It is assumed that agent i is capable of measuring the instantaneous relative positions of surrounding agents. However, the local information of agent i about this relative position is uncertain,

∆p̂_i = ∆p_ij + e_i(t)   (5.14)

where ∆p̂_i is a rough estimate by agent i of ∆p_ij and the measurement error is denoted by e_i(t), which is unknown to agent i. Therefore, an ambiguity set is employed instead to model the measurement uncertainty

U(s_i, t) = {∆p_ij : ‖(∆p̂_i − ∆p_ij) / ϕ_i(t)‖ ≤ s_i}   (5.15)

where s_i is the horizon of measurement uncertainty of agent i and 1/ϕ_i(t) indicates the confidence of agent i in its measurement. Note that ∆p_ij and ∆p̂_i are vectors, and therefore both radial and angular uncertainties are reflected in (5.15).

Assumption 5.1. The measurement error e_i(t) is bounded as

‖e_i(t) / ϕ_i(t)‖ ≤ s_i

Note that in deeply uncertain scenarios, the exact value of the error e_i(t) is unknown. In contrast with the robust framework, in which a known worst-case horizon of uncertainty is respected, here the goal is to make decisions that maximize the unknown horizon of uncertainty; the highest tolerable worst case is then derived from the robust-satisficing decisions that were made. Due to uncertainty, agents do not have access to the exact value of ‖∆p_ij‖; instead, they have uncertain measurements of it, denoted by ‖∆p̂_i‖ and ‖∆p̂_j‖, and thus each agent perceives the safety criterion (5.7) differently as

h_i = ‖∆p̂_i‖ − D_s,   h_j = ‖∆p̂_j‖ − D_s   (5.16)

where h_i and h_j are the perceptions of agents i and j of the safety criterion h_ij, respectively. Note that in the case of no measurement uncertainty, h_j = h_i = h_ij = ‖∆p_ij‖ − D_s; in the presence of measurement uncertainty, however, the exact value of h_ij is not available to the agents, and this uncertainty is reflected in their perceptions of the safety criterion as in (5.16). The distributed ZCBF constraints with uncertain measurements become

ZCBF̄_i: (∆p̂_i^T / ‖∆p̂_i‖) u_i + α_i (‖∆p̂_i‖ − D_s) ≥ 0   (5.17)

ZCBF̄_j: (−∆p̂_j^T / ‖∆p̂_j‖) u_j + α_j (‖∆p̂_j‖ − D_s) ≥ 0   (5.18)

where ZCBF̄_i and ZCBF̄_j are the uncertain interpretations of ZCBF_i and ZCBF_j, respectively. The robust design of (5.17) and (5.18) to guarantee the safety criterion (5.7) is the result of the following problem.

Problem 5.1. Consider the multi-agent system (5.6) and define the measurement ambiguity sets U(s_i, t) and U(s_j, t) for agents i and j by (5.15). Consider the pairwise ZCBF constraints (5.17) and (5.18). The goal is to distribute the ZCBF constraint between agents i and j by assigning α_i and α_j so as to design a robust safe control mechanism that maximizes the uncertainty horizons in U(s_i, t) and U(s_j, t) under which the system still remains safe.

The goal in Problem 5.1 can be achieved by solving the following max-min problem

max_{S_ij} min_{∆p̂_i ∈ U(s_i,t), ∆p̂_j ∈ U(s_j,t)} [ZCBF̄_i + ZCBF̄_j] ≥ 0   (5.19)

where S_ij = √(s_i · s_j) is the pairwise horizon of uncertainty based on each agent's horizon of uncertainty. A uniform horizon of uncertainty, in which s_i = s_j, is desired, because a lack of robustness in one agent affects the safety of the overall system. The inner minimization in (5.19) gives the worst case of the ZCBF constraint over the ambiguity set (5.15); since positiveness of the ZCBF constraint ensures safety of the system, the worst case corresponds to the smallest value of the ZCBF constraint. The outer maximization gives the maximum horizon of uncertainty under which the ZCBF constraint still remains positive.
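To see what the inner minimization in (5.19) is protecting against, the sketch below (Python/NumPy; the true relative position, error magnitudes, inputs, and α values are hypothetical, chosen only for illustration) builds noisy measurements as in (5.14), forms each agent's perceived safety criterion (5.16), and evaluates the uncertain one-sided constraints (5.17) and (5.18). The two agents generally disagree with each other and with the true h_ij; that gap is precisely what the robust-satisficing allocation must absorb.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth and measurement errors (unknown to the agents):
dp_true = np.array([0.5, 0.2])                      # true Delta p_ij
e_i = 0.02 * rng.standard_normal(2)                 # agent i: accurate sensing
e_j = 0.15 * rng.standard_normal(2)                 # agent j: poor sensing
dp_hat_i, dp_hat_j = dp_true + e_i, dp_true + e_j   # noisy measurements, (5.14)

D_s = 0.3
h_true = np.linalg.norm(dp_true) - D_s              # true safety criterion (5.7)
h_i = np.linalg.norm(dp_hat_i) - D_s                # agent i's perception, (5.16)
h_j = np.linalg.norm(dp_hat_j) - D_s                # agent j's perception, (5.16)

# Each agent evaluates its own uncertain one-sided constraint (5.17)/(5.18):
u_i, u_j = np.array([0.1, 0.0]), np.array([-0.1, 0.0])
alpha_i = alpha_j = 0.5
zcbf_i = dp_hat_i @ u_i / np.linalg.norm(dp_hat_i) + alpha_i * h_i
zcbf_j = -dp_hat_j @ u_j / np.linalg.norm(dp_hat_j) + alpha_j * h_j
print(f"h_true={h_true:.3f}  h_i={h_i:.3f}  h_j={h_j:.3f}  "
      f"ZCBF_i={zcbf_i:.3f}  ZCBF_j={zcbf_j:.3f}")
```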
Theorem 5.2. The highest horizon of uncertainty that guarantees uniform robustness for all agents is obtained when the agility parameter α_ij is distributed between the agents based on their measurement confidence as

α_i(t) = α_ij (1 − |ϕ_i(t)| / (|ϕ_i(t)| + |ϕ_j(t)|)),   α_j(t) = α_ij (1 − |ϕ_j(t)| / (|ϕ_i(t)| + |ϕ_j(t)|))

Proof. IG robustness is the highest horizon of uncertainty under which the safety of the system is ensured. Based on the available uncertain measured values, ZCBF̄_i + ZCBF̄_j in (5.19) is

ZCBF̄_i + ZCBF̄_j = (∆p̂_i^T / ‖∆p̂_i‖) u_i − (∆p̂_j^T / ‖∆p̂_j‖) u_j + α_i(t) ‖∆p̂_i‖ + α_j(t) ‖∆p̂_j‖ − α_ij D_s   (5.20)

Define the inner minimization problem in (5.19) as m(s_i, s_j). That is,

m(s_i, s_j) = min_{∆p̂_i ∈ U(s_i,t), ∆p̂_j ∈ U(s_j,t)} [ZCBF̄_i + ZCBF̄_j]

Note that m(s_i, s_j) is the minimum value of (5.20), obtained when the smallest values of ‖∆p̂_i‖ and ‖∆p̂_j‖ within the ambiguity set (5.15) occur, which, using the triangle inequality, are

‖∆p̂_i‖ = ‖∆p_ij‖ − |ϕ_i(t)| |s_i|,   ‖∆p̂_j‖ = ‖∆p_ij‖ − |ϕ_j(t)| |s_j|   (5.21)

Therefore, by substituting (5.21) into (5.20) and after some manipulations, one has

m(s_i, s_j) = (∆p̂_i^T / ‖∆p̂_i‖) u_i − (∆p̂_j^T / ‖∆p̂_j‖) u_j + α_ij (‖∆p_ij‖ − D_s) − α_i(t) |ϕ_i(t)| |s_i| − α_j(t) |ϕ_j(t)| |s_j|   (5.22)

Note that (∆p̂_i^T / ‖∆p̂_i‖) u_i = (∆p_ij^T / ‖∆p_ij‖) u_i cos θ_i, where θ_i is the deviation of the direction of ∆p̂_i from ∆p_ij. Since θ_i ≪ 1, cos θ_i ≈ 1 and therefore (∆p̂_i^T / ‖∆p̂_i‖) u_i = (∆p_ij^T / ‖∆p_ij‖) u_i. Considering (5.22), problem (5.19) simplifies to

max_S m(s_i, s_j) ≥ 0   (5.23)

Since the coefficients corresponding to s_i and s_j in (5.22) are negative, their maximum occurs when m(s_i, s_j) = 0. Therefore, denoting the maxima of s_i and s_j as s_i* and s_j*, respectively, one has

(α_i(t) |ϕ_i(t)|) s_i* + (α_j(t) |ϕ_j(t)|) s_j* = (∆p̂_i^T / ‖∆p̂_i‖) u_i − (∆p̂_j^T / ‖∆p̂_j‖) u_j + α_ij (‖∆p_ij‖ − D_s)   (5.24)

To maximize the pairwise horizon of uncertainty, previously defined as S_ij = √(s_i · s_j), one must solve the following optimization problem

max_{α_i, α_j} s_i* s_j*   s.t. (5.24)

which is a maximization problem over the product of two parameters linked by a linear relation; the product is maximized when the two terms on the left-hand side of (5.24) are equal, each contributing half of the right-hand side. Therefore, denoting the right-hand side of (5.24) by v, one has

s_i* = v / (2 α_i(t) |ϕ_i(t)|),   s_j* = v / (2 α_j(t) |ϕ_j(t)|)   (5.25)

In addition, it is desired to have uniform robustness for all agents, i.e., s_i* = s_j*. That is,

α_i(t) |ϕ_i(t)| = α_j(t) |ϕ_j(t)|   (5.26)

By considering (5.13) and (5.26), one has

α_i(t) = α_ij (1 − |ϕ_i(t)| / (|ϕ_i(t)| + |ϕ_j(t)|)),   α_j(t) = α_ij (1 − |ϕ_j(t)| / (|ϕ_i(t)| + |ϕ_j(t)|))   (5.27)

This completes the proof.

Equation (5.27) provides a rule for sharing the ZCBF constraint between two agents so as to achieve the highest robustness against measurement uncertainty. Based on (5.27), the agent with the higher measurement confidence takes a higher share of the responsibility for ensuring pairwise safety and behaves in an agile manner, while the agent with the lower confidence behaves conservatively; this leads to higher overall robustness.

Remark 5.2. The proposed method employs the cooperative capability of agents to deal with measurement uncertainty. It is also applicable to extreme cases. Assume that the sensing system of agent i fails, and the agent realizes this through the discrepancy between the measurements of its two sensors. This results in a very small measurement confidence 1/ϕ_i(t). Therefore, according to (5.27), α_i = 0 and α_j = α_ij. This means that agent j takes the whole responsibility for ensuring pairwise safety and compensates for the failure of agent i. This example clarifies the advantage of the proposed method in handling rare failure cases with an unbounded horizon of uncertainty, which cannot be obtained using worst-case analysis.
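The allocation rule of Theorem 5.2 is straightforward to state in code. The sketch below (Python; the ϕ values are illustrative) implements (5.27) and also shows the extreme case of Remark 5.2, where one agent's confidence collapses (ϕ grows very large) and the other agent takes over the whole pairwise constraint.

```python
def share_alpha(alpha_ij, phi_i, phi_j):
    """Confidence-based sharing rule (5.27).

    phi_i and phi_j are the agents' uncertainty indices (a larger phi means a
    lower measurement confidence 1/phi); returns (alpha_i, alpha_j) satisfying
    alpha_i + alpha_j == alpha_ij and alpha_i*|phi_i| == alpha_j*|phi_j|.
    """
    total = abs(phi_i) + abs(phi_j)
    alpha_i = alpha_ij * (1.0 - abs(phi_i) / total)
    alpha_j = alpha_ij * (1.0 - abs(phi_j) / total)
    return alpha_i, alpha_j

print(share_alpha(1.0, phi_i=0.05, phi_j=0.05))  # equal confidence -> (0.5, 0.5)
print(share_alpha(1.0, phi_i=0.05, phi_j=0.20))  # j less confident -> i more agile
print(share_alpha(1.0, phi_i=1e6,  phi_j=0.05))  # sensor failure of i: alpha_i ~ 0
```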
Remark 5.3. Note that even if agents share their positions through a communication network, the uncertainty in that information still needs to be considered, because the agents' knowledge of their own positions might have drifted and the reliability and accuracy of the communication network must also be taken into account.

Remark 5.4. Note that the exact measurement error might be greater than ϕ(t); the measurement confidence 1/ϕ(t) is the best knowledge the agents have of the reliability of their measurements, and this does not affect the strict safety of the system as long as the horizon of uncertainty is not exceeded. However, a more accurate measurement confidence leads to a more robust distribution of the ZCBF constraint between the agents.

5.3.2.3 Discussion

The proposed approach can be extended to more general dynamics. Assume that the dynamics of agents i and j, with states x_i and x_j, are modeled as ẋ_i = f_i(x_i) + g_i(x_i) u_i and ẋ_j = f_j(x_j) + g_j(x_j) u_j, respectively, with the safety criterion (5.7). Then, the distributed ZCBF conditions analogous to (5.11) and (5.12) are

ZCBF_i: (∂h_ij/∂x_i) (f_i(x_i) + g_i(x_i) u_i) + α_i h_ij ≥ 0

ZCBF_j: (∂h_ij/∂x_j) (f_j(x_j) + g_j(x_j) u_j) + α_j h_ij ≥ 0

Now, if ∂h_ij/∂x_i is only a function of x_i and ∂h_ij/∂x_j is only a function of x_j, then the distributed ZCBFs are functions of each agent's own state and its local measurement. Therefore, by defining ambiguity sets for ĥ_i and ĥ_j, one can form the uncertain interpretations of the ZCBFs as in (5.17) and (5.18), which are then used in solving Problem 5.1, and similar results apply. However, if the partial derivatives of h_ij with respect to x_i and x_j depend on both states, as happens for higher-order derivatives of norm functions, further calculations are required to derive a distributed formulation based only on each agent's own state and local measurement. This generalized formulation is one of the future research directions.

5.3.3 Controller Design

Considering Figure 5.1, each agent is a node of a graph whose edges carry the measurement confidences of the agents. Each agent communicates with its surrounding agents at the initiation time, and they exchange their measurement confidences through a bidirectional communication graph. After that, the distributed ZCBF constraints are formed and the agents no longer need to communicate; they rely on their own measurements until a change in the measurement confidence of an agent is observed (e.g., a discrepancy in the data obtained from two different sensors) or a new agent comes close. In that case, the agent re-communicates and the pairwise safety responsibility is reset based on (5.27). Therefore, communication and a change of α_i and α_j are needed only on the event that an agent's measurement confidence changes. Thus, α_i and α_j can be considered piecewise constant, remaining constant between two events. Designing a controller that solves the safe and robust collision-avoidance problem using the proposed approach involves two loops (a minimal sketch of the inner-loop filtering step is given after this list):

1. an outer loop that determines the share of each agent in the pairwise ZCBF constraints;

2. an inner loop that solves an optimization problem to find a safe controller that satisfies the safety of the overall multi-agent system by imposing the ZCBF constraint obtained from the outer loop while minimizing the intervention with the optimal controller of each agent.
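For intuition on the inner loop, note that for the single-integrator model each one-sided constraint such as (5.17) is linear in u_i. In the special case of a single neighbor, the minimally invasive correction therefore has a simple closed form: project the nominal input onto the half-space defined by the constraint. The sketch below (Python/NumPy; the function name, the nominal input, and the numbers are hypothetical) illustrates only that special case and is not the dissertation's implementation, which solves a quadratic program over all of an agent's neighbors, as described in Algorithm 4 next.

```python
import numpy as np

def safety_filter_single_pair(u_nom, dp_hat, alpha_i, D_s):
    """Minimally invasive correction of a nominal input for one pairwise
    constraint of the form (5.17), written as a^T u + b >= 0 with
    a = dp_hat/||dp_hat|| and b = alpha_i*(||dp_hat|| - D_s).

    Closed-form solution of min ||u - u_nom|| s.t. a^T u + b >= 0:
    if u_nom violates the constraint, project it onto the half-space boundary.
    """
    d = np.linalg.norm(dp_hat)
    a = dp_hat / d
    b = alpha_i * (d - D_s)
    violation = a @ u_nom + b
    if violation >= 0.0:
        return u_nom                        # nominal input is already safe
    return u_nom - violation * a / (a @ a)  # smallest correction onto boundary

# Illustrative use: the nominal input drives agent i straight at its neighbor.
u_nom = np.array([-1.0, 0.0])
dp_hat = np.array([0.6, 0.0])              # neighbor measured 0.6 units away
u_safe = safety_filter_single_pair(u_nom, dp_hat, alpha_i=0.5, D_s=0.3)
print(u_nom, "->", u_safe)
```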
Algorithm 4 shows the proposed approach.

Algorithm 4: Safe and Robust Control Design for each Agent i
1: Initialization: Start with safe initial conditions for all agents.
2: procedure
3: Outer-loop control design. Use Theorem 5.2 to find the distributed pairwise ZCBF constraints, i.e., the responsibility of each agent in ensuring pairwise safety for every two agents in the vicinity of each other. The overall distributed ZCBF constraint for agent i is then formed as ZCBF_i^t = [ZCBF̄_i1; ...; ZCBF̄_iN_i], where ZCBF̄_il, l ∈ {1, ..., N_i}, is the pairwise ZCBF constraint between agent i and its neighboring agent l and N_i is the number of neighbors of agent i.
4: Inner-loop control design. Use the stacked distributed ZCBF constraints obtained for each agent by the outer loop as a hard constraint on the control design and solve an optimization problem that finds a safe controller which is robust to measurement uncertainty and minimally invasive to the optimal controller found by the linear quadratic regulator (LQR). This formulation integrates the LQR controller and the ZCBF for each agent using quadratic programming, inspired by [27],

u_i* = arg min_{u_i} ‖u_i − û_i‖   s.t.   ZCBF_i^t ≥ 0   (5.28)

where û_i is the nominal controller obtained from the LQR and u_i* is the safe controller, the minimally altered version of û_i for agent i such that the ZCBF constraints, and therefore safety, are ensured. Note that the nominal controller û_i can be designed based on any performance objective or any other control approach.
5: end procedure

Figure 5.1: Graph topology.

Remark 5.5. Note that Algorithm 4 is performed simultaneously by each agent using its own local information. Therefore, the obtained control input for the overall multi-agent system is u* = [u_1*, u_2*, ..., u_N*]. With an initial communication between agents in the vicinity of each other, the ZCBF sharing parameters are formed and the safe control input for each agent is obtained independently. After the initial communication, the agents only need to re-communicate if the measurement confidence of one agent or the graph topology changes. In that case, the agent informs its surrounding agents, they agree again on the shares of the ZCBF constraint, and the cycle continues. This provides a robust-satisficing distributed framework in which agents rely on their local measurements, and it is efficient in the sense that only occasional communication with surrounding agents is needed.

5.4 Simulation

A multi-robot system with five agents with integrator dynamics (5.6) is considered. The agents are located around a circle of radius r = 3 and are supposed to reach the opposite point on the circle in a safe, collision-free manner, in the sense that a pre-defined minimum safety distance D_s = 0.3 is respected between every two agents. The agents are aware of their destinations, and an LQR with weights Q = R = 1 is employed as the nominal controller for this task objective; a sketch of how such a simulation can be assembled is given below. The simulation is conducted in three different scenarios: 1) the agents have accurate measurements, and accurate distributed ZCBFs are employed and integrated into the controller as in (5.28); 2) measurement uncertainty is considered, and the results are compared with those of scenario 1; 3) measurement uncertainty is considered, and the proposed IG-based method is employed to share the ZCBF constraints between agents so as to maximize robustness against uncertainty.
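A hedged sketch of how such a simulation could be assembled is given below (Python with NumPy and the cvxpy QP solver; these tools, the proportional go-to-goal law standing in for the LQR nominal controller, the time step, and the noise-free measurements are assumptions of this sketch, not the setup used to produce the figures). Five integrator agents start on a circle of radius 3, and each agent filters its nominal input through its stacked distributed ZCBF constraints as in (5.28), with the α shares fixed by (5.27) from per-agent uncertainty indices matching the error levels described above.

```python
import numpy as np
import cvxpy as cp

N, D_s, dt, T = 5, 0.3, 0.02, 15.0
alpha_pair = 1.0                                  # alpha_ij for every pair
phi = np.array([0.01, 0.01, 0.2, 0.1, 0.01])      # per-agent uncertainty indices

theta = 2 * np.pi * np.arange(N) / N
p = 3.0 * np.column_stack([np.cos(theta), np.sin(theta)])   # initial positions
goal = -p.copy()                                  # opposite point on the circle

def alpha_share(i, j):
    """Confidence-based split of alpha_ij between agents i and j, eq. (5.27)."""
    return alpha_pair * (1.0 - phi[i] / (phi[i] + phi[j]))

for _ in range(int(T / dt)):
    u = np.zeros((N, 2))
    for i in range(N):
        u_nom = 1.0 * (goal[i] - p[i])            # stand-in for the LQR nominal law
        A, b = [], []
        for j in range(N):
            if j == i:
                continue
            dp = p[i] - p[j]                      # relative position (noise-free here)
            d = np.linalg.norm(dp)
            A.append(dp / d)                      # row of agent i's stacked constraint
            b.append(alpha_share(i, j) * (d - D_s))
        A, b = np.array(A), np.array(b)
        ui = cp.Variable(2)
        prob = cp.Problem(cp.Minimize(cp.sum_squares(ui - u_nom)),
                          [A @ ui + b >= 0])      # distributed ZCBF rows, as in (5.28)
        prob.solve()
        # Zero input is always feasible while the state is safe; use it as a fallback.
        u[i] = ui.value if ui.value is not None else np.zeros(2)
    p += dt * u                                   # single-integrator update (5.6)

dists = [np.linalg.norm(p[i] - p[j]) for i in range(N) for j in range(i + 1, N)]
print("minimum final pairwise distance:", min(dists))
```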
The simulation results for each scenario are given in two subplots. Subplots (a) show the trajectories of the agents, in which their initial locations are depicted with triangle markers and their desired positions with star markers; the trajectories themselves are shown with dashed lines. Since these plots are in the x-y plane, the positions of the agents at a time t1 are also shown with filled circles to give a sense of their maneuvers in time. Subplots (b) show the pairwise distances ‖∆p_ij‖ between all agents from the beginning to the end of the simulation. The minimum safety distance D_s is shown with a horizontal line; for a safe maneuver, all pairwise distances should remain above this minimum value. Figure 5.2 depicts the result for the first scenario, in which measurement uncertainty is not considered and accurate ZCBFs are available and equally shared between agents. Note that without the incorporation of ZCBFs, all agents would have crashed at the origin; in a barrier-certified fashion, however, the agents get close to each other, turn around, and move toward their desired positions. Figure 5.2 (b) shows that the pairwise distances always remain above the minimum distance, so the safety of the system is preserved. In the second scenario, measurement uncertainty is incorporated as well. Numbering the agents clockwise from their initial positions, the absolute measurement errors of agents 1, 2, and 5 are 0.01, while the measurement errors of agents 3 and 4 are 0.2 and 0.1, respectively. Note that these are large errors considering that the minimum safety distance is 0.3. The exact values of the errors are not known to the agents, and their ZCBFs deviate from the actual values. The result of employing the same approach as in the previous scenario, equally distributing the ZCBF constraints between agents without using info-gap theory, is depicted in Figure 5.3. As can be seen in this figure, the agents approach the origin carelessly and the measurement error causes a safety violation. The last scenario employs the proposed method to handle the measurement uncertainty. The measurement confidence of each agent is taken to be proportional to the inverse of its error, and the pairwise ZCBF constraints are shared between agents using (5.27). The result is shown in Figure 5.4. As shown in Figure 5.4 (a), the agents with higher confidence have more agile maneuvers: they rapidly approach the origin, turn around, and move toward their desired positions. Agent 4, with the next highest measurement confidence, gives the right of way to the agile agents while approaching the origin faster than agent 3; agent 3, which has the lowest measurement confidence, moves slowly and conservatively until the path is clear. As shown in Figure 5.4 (b), the safety of the system is ensured despite the ZCBFs being inaccurate due to measurement error. Note that the measurement errors of agents 3 and 4 are significantly high, approximately 67 and 33 percent of the minimum safety distance, respectively. Note also that the exact values of the errors and their horizons are unknown to the agents, and the pairwise horizon of uncertainty based on each agent's horizon is S_ij = √(s_i · s_j). Thus, considering agents 3 and 4, which have the highest measurement uncertainties of 0.2 and 0.1, a pairwise uncertainty of at least 0.14 is safely tolerable. To show the effectiveness of the method for different values of the safety distance, the simulation is conducted for D_s = 0.1, 0.3, 0.6, and 1. The pairwise distances between every two agents are depicted in Figure 5.5.
As can be seen in this figure, the pairwise distances stay above the critical safety line, indicating that the task is accomplished safely. To better show the maneuvers of the agents, a time-lapse is also shown in Figure 5.6.

Remark 5.6. Note that the proposed approach has several advantages over the worst-case approach. First, the worst-case approach requires information about the worst case of the measurement error, and inaccuracy in this information results in a violation of safety. Second, since the measurement error is high relative to the minimum safety distance (67 percent for one of the agents), even if the worst case is exactly known, an overly conservative safety distance would have to be kept among all agents, which is not needed for some of them. In addition, this makes the overall system slow and conservative and might result in infeasibility of the solution as well. Furthermore, the worst-case approach cannot handle rare cases of failure. Finally, the cooperative capability of the agents in ensuring safety remains unused. In the proposed approach, by contrast, non-conservative robustness against measurement uncertainty and rare failure cases is achieved by properly distributing the ZCBF constraints between agents and giving them the benefit of cooperation for a safe maneuver.

Figure 5.2: (a) Agents' trajectories, no measurement error; (b) corresponding ‖∆p_ij‖.
Figure 5.3: (a) Trajectories, measurement error without IG; (b) corresponding ‖∆p_ij‖.
Figure 5.4: (a) Trajectories, measurement error with IG; (b) corresponding ‖∆p_ij‖.
Figure 5.5: Pairwise distances between agents for different values of the safety distance D_s.
Figure 5.6: Time lapse of the agents' positions with measurement error using the IG method.

5.5 Conclusion

In this chapter, the design of safe controllers for the collision-avoidance problem in multi-agent systems in the presence of measurement uncertainty is considered, and a robust-satisficing CBF-based approach is proposed for safe cooperative maneuvers of agents. It is assumed that neither a probabilistic model of the measurement nor its worst-case uncertainty is available. IG theory is then employed to find the best way of sharing the safety responsibility in the sense of robustness toward uncertainty. It is shown that the certainty of one agent's measurement can compensate for the lack of accuracy of other agents. Simulation results with five agents in different scenarios are presented to show the performance of the proposed method.
Future work includes providing a method for deriving and employing local confidence levels, which would enable communication-free safety task assignment.

Chapter 6

Conclusion

In this dissertation, the safety of systems using CBFs in the face of two major challenges, uncertain system dynamics and an uncertain environment, is investigated. For model uncertainty, a novel RL framework is proposed which augments the performance cost function with a barrier-type safety cost to form a safety-aware performance metric. It is shown that the presented performance metric also assures stability of the learned solution when there is no conflict between safety and stability. It is also shown that safety is guaranteed during the successive approximation of control policies. A safe off-policy algorithm is employed to implement the proposed method. Afterward, the challenging problem of safe exploration is tackled with a barrier-certified safe RL framework, obtained by means of efficient learning with prescribed performance along with a robustified safe and stabilizing controller throughout the algorithm, including the data collection phase. Experience replay-based model approximation is employed, which ensures the exponential convergence of the learning error to zero once a mild rank condition is satisfied. This makes the learning error a vanishing perturbation to the approximated model, which facilitates designing a stabilizing controller using the available rough knowledge of the system. The accurate error bound is then employed in the formation of a novel non-conservative AR-CBF, which ensures safety during learning. The AR-CBF and the stabilizing controller are integrated through quadratic programming and are used for the further data collection needed for off-policy iteration. The noisy input is modified accordingly to result in safe and stable actions. After collecting safe, rich data, the optimal policy is approximated and then certified again using the AR-CBF for safe exploitation. Afterward, the impact of uncertainty in the operating environment of the system is investigated. A learning-enabled ZCBF controller for safety-critical systems under uncertainty has been proposed. It has been proved that the proposed method is capable of ensuring safety in complicated and uncertain environments in the presence of external agents with unknown dynamics. It has also been demonstrated that safety is guaranteed during learning and even with inaccurate modeling of the external agents. As a result, this approach provides a practical method for control scenarios in which accurate modeling requires a large amount of data and computationally expensive learning schemes while unpredicted objects are still expected, such as autonomous driving in an urban area. Meanwhile, having a better model enables the controller to take less conservative actions and results in better performance. To achieve this goal, a modified experience replay method has been proposed that identifies the external agents' dynamics so as to minimize the difference between the safe set and its approximation. This method provides fast convergence and ensures a bounded error with respect to the exact model even with inaccurate modeling, both of which are crucial in safety-critical control systems. Finally, the cooperative capability of multi-agent systems is employed for a robust safety guarantee in the presence of measurement uncertainty. A robust-satisficing CBF-based approach is proposed for safe cooperative maneuvers of agents.
It is assumed that neither a probabilistic model of the measurement nor its worst-case uncertainty is available. Information-gap theory is then employed to determine the share of each agent in the safety guarantee so as to achieve the highest robustness against uncertainty. It is shown that the certainty of one agent's measurement can compensate for the lack of accuracy of other agents, and that even a rare failure of one agent can be compensated by the others. Future research directions include the extension of the safe exploratory RL framework to nonlinear systems and the further use of nonlinear systems theory in characterizing the learning behavior of learning-based controllers. It is also suggested to investigate the reciprocal interaction of agents in a cluttered environment and to develop safe cooperative methods in which conservatism is further reduced by an efficient communication methodology.

BIBLIOGRAPHY

[1] Safety-critical Systems - Wikipedia @ONLINE.

[2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (2nd Edition, in preparation). Vol. 1. MIT Press, Cambridge, 2017.

[3] F. L. Lewis and D. Vrabie. “Reinforcement learning and adaptive dynamic programming for feedback control”. In: IEEE Circuits and Systems Magazine 9.3 (2009), pp. 32–50.

[4] B. Kiumarsi et al. “Optimal and Autonomous Control Using Reinforcement Learning: A Survey”. In: IEEE Transactions on Neural Networks and Learning Systems 29.6 (2018), pp. 2042–2062.

[5] J. García and F. Fernández. “A comprehensive survey on safe reinforcement learning”. In: Journal of Machine Learning Research 16 (Aug. 2015), pp. 1437–1480.

[6] A. Tamar et al. “Sequential Decision Making With Coherent Risk”. In: IEEE Transactions on Automatic Control 62.7 (2017), pp. 3323–3338.

[7] O. Mihatsch and R. Neuneier. “Risk-sensitive reinforcement learning”. In: Machine Learning 49.2 (2014), pp. 267–290.

[8] T. Mannucci et al. “Safe Exploration Algorithms for Reinforcement Learning Controllers”. In: IEEE Transactions on Neural Networks and Learning Systems 29.4 (2018), pp. 1069–1081.

[9] Brenna D. Argall et al. “A survey of robot learning from demonstration”. In: Robotics and Autonomous Systems 57.5 (2009), pp. 469–483.

[10] Kurt Driessens and Sašo Džeroski. “Integrating Guidance into Relational Reinforcement Learning”. In: Machine Learning 57.3 (2004), pp. 271–304.

[11] Brijen Thananjeyan et al. “Safety Augmented Value Estimation From Demonstrations (SAVED): Safe Deep Model-Based RL for Sparse Cost Robotic Tasks”. In: IEEE Robotics and Automation Letters 5.2 (2020), pp. 3612–3619.

[12] Jingyu Niu et al. “Two-Stage Safe Reinforcement Learning for High-Speed Autonomous Racing”. In: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2020, pp. 3934–3941.

[13] Canhuang Dai et al. “Reinforcement Learning with Safe Exploration for Network Security”. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2019, pp. 3057–3061.

[14] Brijen Thananjeyan et al. “Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones”. In: IEEE Robotics and Automation Letters 6.3 (2021), pp. 4915–4922.

[15] Shuo Li and Osbert Bastani. “Robust Model Predictive Shielding for Safe Reinforcement Learning with Stochastic Dynamics”. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). 2020, pp. 7166–7172.

[16] Zhaojian Li, Tianshu Chu, and Uroš Kalabić.
“Dynamics-Enabled Safe Deep Reinforce- ment Learning: Case Study on Active Suspension Control”. In: 2019 IEEE Conference on Control Technology and Applications (CCTA). 2019, pp. 585–591. [17] Shuojie Mo, Xiaofei Pei, and Chaoxian Wu. “Safe Reinforcement Learning for Au- tonomous Vehicle Using Monte Carlo Tree Search”. In: IEEE Transactions on Intelli- gent Transportation Systems (2021), pp. 1–8. [18] Wei Wang et al. “Safe Off-Policy Deep Reinforcement Learning Algorithm for Volt- VAR Control in Power Distribution Systems”. In: IEEE Transactions on Smart Grid 11.4 (2020), pp. 3008–3018. [19] Yangyang Ge et al. “Safe Q-Learning Method Based on Constrained Markov Decision Processes”. In: IEEE Access 7 (2019), pp. 165007–165017. [20] Zhehua Zhou et al. “A General Framework to Increase Safety of Learning Algorithms for Dynamical Systems Based on Region of Attraction Estimation”. In: IEEE Trans- actions on Robotics 36.5 (2020), pp. 1472–1490. [21] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin. “A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games”. In: IEEE Transactions on Automatic Control 50.7 (2005), pp. 947–957. [22] A. K. Akametalu et al. “Reachability-based safe learning with Gaussian processes”. In: 53rd IEEE Conference on Decision and Control. 2014, pp. 1424–1431. [23] Jaime F. Fisac et al. “A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems”. In: IEEE Transactions on Automatic Control 64.7 (2019), pp. 2737–2752. [24] M. Chen, S. Herbert, and C. J. Tomlin. “Exact and efficient Hamilton-Jacobi guaran- teed safety analysis via system decomposition”. In: 2017 IEEE International Confer- ence on Robotics and Automation (ICRA). 2017, pp. 87–92. 136 [25] Yifei Simon Shao et al. “Reachability-Based Trajectory Safeguard (RTS): A Safe and Fast Reinforcement Learning Safety Layer for Continuous Control”. In: IEEE Robotics and Automation Letters 6.2 (2021), pp. 3663–3670. [26] “Barrier Lyapunov Functions for the control of output-constrained nonlinear systems”. In: Automatica 45.4 (2009), pp. 918 –927. issn: 0005-1098. [27] A. D. Ames, J. W. Grizzle, and P. Tabuada. “Control barrier function based quadratic programs with application to adaptive cruise control”. In: 53rd IEEE Conference on Decision and Control. 2014, pp. 6271–6278. [28] A. D. Ames et al. “Control Barrier Function Based Quadratic Programs for Safety Critical Systems”. In: IEEE Transactions on Automatic Control 62.8 (2017), pp. 3861– 3876. [29] M. Ohnishi et al. “Barrier-Certified Adaptive Reinforcement Learning With Appli- cations to Brushbot Navigation”. In: IEEE Transactions on Robotics 35.5 (2019), pp. 1186–1205. [30] M. Srinivasan, S. Coogan, and M. Egerstedt. “Control of Multi-Agent Systems with Finite Time Control Barrier Certificates and Temporal Logic”. In: IEEE Conference on Decision and Control (CDC). 2018, pp. 1991–1996. [31] Ayush Agrawal and Koushil Sreenath. “Discrete Control Barrier Functions for Safety- Critical Control of Discrete Systems with Application to Bipedal Robot Navigation”. In: Proceedings of Robotics: Science and Systems. Cambridge, Massachusetts, 2017. [32] L. Wang, E. A. Theodorou, and M. Egerstedt. “Safe Learning of Quadrotor Dynam- ics Using Barrier Certificates”. In: IEEE International Conference on Robotics and Automation (ICRA). 2018, pp. 2460–2465. [33] L. Wang, A. Ames, and M. Egerstedt. “Safety barrier certificates for heterogeneous multi-robot systems”. In: 2016 American Control Conference (ACC). 2016, pp. 5213– 5218. 
[34] Xiangru Xu et al. “Robustness of Control Barrier Functions for Safety Critical Con- trol”. In: IFAC-PapersOnLine 48.27 (2015), pp. 54 –61. [35] A. J. Taylor and A. D. Ames. “Adaptive Safety with Control Barrier Functions”. In: 2020 American Control Conference (ACC). 2020, pp. 1399–1405. [36] B. T. Lopez, J. J. E. Slotine, and J. P. How. “Robust Adaptive Control Barrier Func- tions: An Adaptive and Data-Driven Approach to Safety”. In: IEEE Control Systems Letters 5.3 (2021), pp. 1031–1036. 137 [37] Andrew Taylor et al. “Learning for Safety-Critical Control with Control Barrier Func- tions”. In: Proceedings of the 2nd Conference on Learning for Dynamics and Control. Vol. 120. Proceedings of Machine Learning Research. PMLR, 2020, pp. 708–717. [38] Richard Cheng et al. “End-to-End Safe Reinforcement Learning through Barrier Func- tions for Safety-Critical Continuous Control Tasks”. In: Proceedings of the AAAI Con- ference on Artificial Intelligence 33 (July 2019), pp. 3387–3395. [39] L. Wang, D. Han, and M. Egerstedt. “Permissive Barrier Certificates for Safe Stabiliza- tion Using Sum-of-squares”. In: American Control Conference (ACC). 2018, pp. 585– 590. [40] “Nearly optimal state feedback control of constrained nonlinear systems using a neural networks HJB approach”. In: Annual Reviews in Control 28.2 (2004), pp. 239 –251. issn: 1367-5788. [41] Y. Yang et al. “Safety-Aware Reinforcement Learning Framework with an Actor-Critic- Barrier Structure”. In: 2019 American Control Conference (ACC). 2019, pp. 2352– 2358. [42] Adrian G. Wills and William P. Heath. “Barrier function based model predictive con- trol”. In: Automatica 40.8 (2004), pp. 1415–1422. [43] C. Feller and C. Ebenbauer. “Weight recentered barrier functions and smooth poly- topic terminal set formulations for linear model predictive control”. In: 2015 American Control Conference (ACC). 2015, pp. 1647–1652. [44] Z. Marvi and B. Kiumarsi. “Safety Planning Using Control Barrier Function: A Model Predictive Control Scheme”. In: 2019 IEEE 2nd Connected and Automated Vehicles Symposium (CAVS). 2019, pp. 1–5. [45] N. Wen et al. “UAV online path planning algorithm in a low altitude dangerous envi- ronment”. In: IEEE/CAA Journal of Automatica Sinica 2.2 (2015), pp. 173–185. [46] Dorsa Sadigh. “Safe and Interactive Autonomy: Control, Learning, and Verification”. PhD thesis. EECS Department, University of California, Berkeley, 2017. [47] Z. Marvi and B. Kiumarsi. “Safe Off-policy Reinforcement Learning Using Barrier Functions”. In: 2020 American Control Conference (ACC). 2020, pp. 2176–2181. [48] Zahra Marvi and Bahare Kiumarsi. “Barrier-certified Learning-based Control of Sys- tems with Uncertain Safe Set”. In: 2021 American Control Conference (ACC). 2021, pp. 3482–3487. 138 [49] Zahra Marvi and Bahare Kiumarsi. “Barrier-Certified Model-Learning and Control of Uncertain Linear Systems using Experience Replay Method”. In: 2021 Conference on Decision and Control (CDC). 2021. [50] Zahra Marvi and Bahare Kiumarsi. “Safe reinforcement learning: A control barrier function optimization approach”. In: International Journal of Robust and Nonlinear Control (2020), pp. 1–18. [51] Z. Marvi and B. Kiumarsi. “Barrier-certified learning-enabled safe control design for systems operating in uncertain environments”. In: IEEE/CAA J. Autom. Sinica (2021), pp. 1–13. [52] Zahra Marvi and Bahare Kiumarsi. “Robust Satisficing Cooperative Control Barrier Functions for Multi-Robots Systems using Information-Gap Theory”. 
In: International Journal of Robust and Nonlinear Control(Accepted) (2021). [53] Z. Marvi and B. Kiumarsi. “Reinforcement Learning based Control Design with Safety and Stability guarantees during Exploration”. In: (Under Review) (). [54] F.L. Lewis and V.L. Syrmos. Optimal Control. A Wiley-interscience publication. Wiley, 1995. [55] H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002. [56] G. N. Saridis and C. G. Lee. “An Approximation Theory of Optimal Control for Trainable Manipulators”. In: IEEE Transactions on Systems, Man, and Cybernetics 9.3 (1979), pp. 152–159. [57] Yu Jiang and Zhong-Ping Jiang. “Computational adaptive optimal control for continuous- time linear systems with completely unknown dynamics”. In: Automatica 48.10 (2012), pp. 2699–2704. [58] Y. Jiang and Z. Jiang. “Robust Adaptive Dynamic Programming and Feedback Stabi- lization of Nonlinear Systems”. In: IEEE Transactions on Neural Networks and Learn- ing Systems 25.5 (2014), pp. 882–893. [59] H. Modares, F. L. Lewis, and Z. Jiang. “H∞ Tracking Control of Completely Unknown Continuous-Time Systems via Off-Policy Reinforcement Learning”. In: IEEE Trans- actions on Neural Networks and Learning Systems 26.10 (2015), pp. 2550–2562. [60] Eric J. Rossetter and J. Christian Gerdes. “Lyapunov-based performance guarantees for the potential field lane-keeping assistance system”. In: Journal of Dynamic Systems, Measurement, and Control 128.3 (Aug. 2005), pp. 510–522. 139 [61] C. Chen et al. “Reinforcement Learning-Based Adaptive Optimal Exponential Track- ing Control of Linear Systems With Unknown Dynamics”. In: IEEE Transactions on Automatic Control 64.11 (2019), pp. 4423–4438. [62] Mitio Nagumo. “Uber die Lage der Integralkurven gewo hnlicher Differentialgleichun- gen.” In: Proceedings of the Physico-Mathematical Society of Japan. 3rd Series 24 (1942), pp. 551–559. [63] F. Blanchini. “Set invariance in control”. In: Automatica 35.11 (1999), pp. 1747 –1767. [64] Franco Blanchini and Stefano Miani. Set-Theoretic Methods in Control. Birkhäuser Basel, 2015. [65] Georges Bouligand. Introducion a la Geometrie Infinitesimale Directe. Gauthiers–Villars, 1932. [66] F. L. Lewis, A. Yesildirak, and Suresh Jagannathan. Neural Network Control of Robot Manipulators and Nonlinear Systems. Bristol, PA, USA: Taylor & Francis, Inc., 1998. isbn: 0748405968. [67] H. Modares, F. L. Lewis, and M. Naghibi-Sistani. “Adaptive Optimal Control of Un- known Constrained-Input Systems Using Policy Iteration and Neural Networks”. In: IEEE Transactions on Neural Networks and Learning Systems 24.10 (2013), pp. 1513– 1525. [68] Paul J. Werbos. “Approximate dynamic programming for real-time control and neural modeling”. In: Handbook of Intelligent Control. 1992. [69] P. J. Werbos. “Neural networks for control and system identification”. In: IEEE Con- ference on Decision and Control. 1989, 260–265 vol.1. [70] D. Zhao et al. “Experience Replay for Optimal Control of Nonzero-Sum Game Sys- tems With Unknown Dynamics”. In: IEEE Transactions on Cybernetics 46.3 (2016), pp. 854–865. [71] Katja Vogel. “A comparison of headway and time to collision as safety indicators”. In: Accident Analysis and Prevention 35.3 (2003), pp. 427–433. [72] C. Tomlin, G. J. Pappas, and S. Sastry. “Conflict resolution for air traffic management: a study in multiagent hybrid systems”. In: IEEE Transactions on Automatic Control 43.4 (1998), pp. 509–521. [73] Bassam Alrifaee, Kevin Kostyszyn, and Dirk Abel. 
“Model Predictive Control for Collision Avoidance of Networked Vehicles Using Lagrangian Relaxation”. In: IFAC- 140 PapersOnLine 49.3 (2016). 14th IFAC Symposium on Control in Transportation Sys- tems CTS 2016, pp. 430 –435. [74] O. Khatib. “Real-time obstacle avoidance for manipulators and mobile robots”. In: Proceedings. 1985 IEEE International Conference on Robotics and Automation. Vol. 2. 1985, pp. 500–505. [75] Paolo Fiorini and Zvi Shiller. “Motion Planning in Dynamic Environments Using Ve- locity Obstacles”. In: The International Journal of Robotics Research 17.7 (1998), pp. 760–772. [76] L. Wang, A. D. Ames, and M. Egerstedt. “Safety Barrier Certificates for Collisions-Free Multirobot Systems”. In: IEEE Transactions on Robotics 33.3 (2017), pp. 661–674. [77] Sunan Huang, Rodney Swee Huat Teo, and Kok Kiong Tan. “Collision avoidance of multi unmanned aerial vehicles: A review”. In: Annual Reviews in Control 48 (2019), pp. 147 –164. [78] S. Chung et al. “A Survey on Aerial Swarm Robotics”. In: IEEE Transactions on Robotics 34.4 (2018), pp. 837–855. [79] A. Chakravarthy and D. Ghose. “Obstacle avoidance in a dynamic environment: a collision cone approach”. In: IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 28.5 (1998), pp. 562–574. [80] Fangyuan Xu, Li Tang, and Yan-Jun Liu. “Tangent barrier Lyapunov function-based constrained control of flexible manipulator system with actuator failure”. In: Interna- tional Journal of Robust and Nonlinear Control (2021), pp. 1–14. [81] Urs Borrmann et al. “Control Barrier Certificates for Safe Swarm Behavior”. In: IFAC- PapersOnLine 48.27 (2015). Analysis and Design of Hybrid Systems ADHS, pp. 68 – 73. [82] C. K. Verginis and D. V. Dimarogonas. “Closed-Form Barrier Functions for Multi- Agent Ellipsoidal Systems With Uncertain Lagrangian Dynamics”. In: IEEE Control Systems Letters 3.3 (2019), pp. 727–732. [83] M. Majidi, B. Mohammadi-Ivatloo, and A. Soroudi. “Application of information gap decision theory in practical energy problems: A comprehensive review”. In: Applied Energy 249 (2019), pp. 157 –165. [84] V. Marchau et al. Decision Making under Deep Uncertainty: From Theory to Practice. Springer, Jan. 2019. 141 [85] Jun-Ming Hu, Hong-Zhong Huang, and Yan-Feng Li. “Reliability growth planning based on information gap decision theory”. In: Mechanical Systems and Signal Pro- cessing 133 (2019), p. 106274. [86] S. G. Pierce et al. “Evaluation of Neural Network Robust Reliability Using Information- Gap Theory”. In: IEEE Transactions on Neural Networks 17.6 (2006), pp. 1349–1361. [87] J. Frolik, M. Abdelrahman, and P. Kandasamy. “A confidence-based approach to the self-validation, fusion and reconstruction of quasi-redundant sensor data”. In: IEEE Transactions on Instrumentation and Measurement 50.6 (2001), pp. 1761–1769. [88] V. Zambianchi et al. “Distributed Nonasymptotic Confidence Region Computation Over Sensor Networks”. In: IEEE Transactions on Signal and Information Processing over Networks 4.2 (2018), pp. 308–324. 142