ASSURING THE ROBUSTNESS AND RESILIENCY OF LEARNING-ENABLED AUTONOMOUS SYSTEMS By Michael Austin Langford A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2022

ABSTRACT

ASSURING THE ROBUSTNESS AND RESILIENCY OF LEARNING-ENABLED AUTONOMOUS SYSTEMS By Michael Austin Langford

As Learning-Enabled Systems (LESs) have become more prevalent in safety-critical applications, addressing the assurance of LESs has become increasingly important. Because machine learning models in LESs are not explicitly programmed like traditional software, developers typically have less direct control over the inferences learned by LESs, relying instead on semantically valid and complete patterns to be extracted from the system's exposure to the environment. As such, the behavior of an LES is strongly dependent on the quality of its training experience. However, run-time environments are often noisy or not well-defined. Uncertainty in the behavior of an LES can arise when there is inadequate coverage of relevant training/test cases (e.g., corner cases). It is challenging to assure that safety-critical LESs will perform as expected when exposed to run-time conditions that have never been experienced during training or validation. This doctoral research contributes automated methods to improve the robustness and resiliency of an LES. For this work, a robust LES is less sensitive to noise in the environment, and a resilient LES is able to self-adapt to adverse run-time contexts in order to mitigate system failure. The proposed methods harness diversity-driven evolution-based methods, machine learning, and software assurance cases to train robust LESs, uncover robust system configurations, and foster resiliency through self-adaptation and predictive behavior modeling. This doctoral work demonstrates these capabilities by applying the proposed framework to deep learning and autonomous cyber-physical systems.

Copyright by MICHAEL AUSTIN LANGFORD 2022

To Jinny Shows, who kept me sane throughout this process and fed me when I was hungry.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the support and assistance of others, whom I appreciate a great deal. Many thanks go to my advisor, Dr. Betty H.C. Cheng, who has always been a welcoming presence from the first day I stumbled onto campus. I appreciate our many brainstorming sessions, the tangents we explored, and her reliably cheerful attitude, even during a global pandemic and complete social isolation. Even more, I appreciate the many red ink-filled pages of mark-ups she returned to me on a near-weekly basis. While often a source of frustration, I always valued her constructive criticisms and know that she has helped me to become a better communicator of my own ideas. There is no doubt that this dissertation could not have been created without her guidance and support.

Recognition also goes to my many collaborators at Michigan State University. Dr. Philip K. McKinley has been equally welcoming and supportive throughout all my studies. Thanks go to Glen Simon, Jared Clark, and Jonathon Fleck for their collaboration on AC-ROS and the EvoRally platform. Thanks also go to Kira Chan for his collaboration on MoDALAS. Finally, thanks go to Dr. Wolfgang Banzhaf and Dr. Sandeep Kulkarni for agreeing to be members of my doctoral committee. I want to thank Dr.
Banzhaf specifically for sparking my interest in evolutionary computation and also his genuine enthusiasm to hear about my research. None of this work would have been possible without the input of others, and it has been a joy to spend these past years studying at Michigan State University. Finally, I would like to thank my family and friends for their support with each life decision. I thank my parents for providing me the freedom and love to grow and learn on my own, to explore whatever passions interest me the most, rather than limiting myself to those that make me most comfortable. Likewise, I give love to all the friends I have met along the way and to those I have yet to meet, because life is no fun without friends to share both the best and worst moments. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix KEY TO ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Organization of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Elements of a Deep Neural Network . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Challenges for Deep Neural Networks . . . . . . . . . . . . . . . . . . . . 8 2.2 Evolutionary Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Search-Based Software Engineering with Evolution . . . . . . . . . . . . . 9 2.2.2 Evolutionary Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Novelty Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Self-Adaptive Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Self-Adaptation at Run Time . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.2 Run-Time Monitoring with Utility Functions . . . . . . . . . . . . . . . . . 13 2.4 Validation Datasets and Implementations for Deep Neural Networks . . . . . . . . 13 2.4.1 CIFAR-10 Dataset for Image Recognition . . . . . . . . . . . . . . . . . . 14 2.4.2 KITTI Vision Benchmark Suite for Autonomous Driving Object Detection . 15 2.4.3 Waymo Open Dataset for Autonomous Driving Object Detection . . . . . . 17 2.5 Demonstration Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.1 Robot Control Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5.2 The EvoRally Autonomous Vehicle . . . . . . . . . . . . . . . . . . . . . 19 2.5.3 Deep Learning-Driven Autonomous Rover . . . . . . . . . . . . . . . . . . 20 CHAPTER 3 LEARNING ROBUSTNESS THROUGH DIVERSIFIED TRAINING . . . 23 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Assessing and Improving the Robustness of Learning Models . . . . . . . . . . . . 
27 3.2.1 Definition of an Environmental Condition . . . . . . . . . . . . . . . . . . 27 3.2.2 Assessing System Behavior When Exposed to an Uncertain Condition . . . 28 3.2.3 Improving System Robustness for Exposure to an Uncertain Condition . . . 37 3.3 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.3.1 Deep Neural Network Applications . . . . . . . . . . . . . . . . . . . . . . 41 vi 3.3.2 Evaluation of System Behavior Under Uncertain Environmental Conditions 42 3.3.2.1 System Performance on Synthetic Test Data . . . . . . . . . . . . 42 3.3.2.2 Analysis of Resulting System Behavior . . . . . . . . . . . . . . 44 3.3.3 Evaluation of System Robustness . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.4 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Automated Testing with Novelty Search . . . . . . . . . . . . . . . . . . . 52 3.4.2 Automated Testing for Deep Learning . . . . . . . . . . . . . . . . . . . . 53 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 CHAPTER 4 ENABLING THE EVOLUTION OF ROBUST ROBOTIC SYSTEMS . . . . 57 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Digital Twin-Based Simulation of a Cyber-Physical System . . . . . . . . . . . . . 59 4.3 Exploring Environmental Diversity with Novelty Search . . . . . . . . . . . . . . . 60 4.3.1 Fitness-Driven Optimization of System Settings . . . . . . . . . . . . . . . 61 4.3.2 Diversity-Driven Evolution of Simulation Settings . . . . . . . . . . . . . . 62 4.4 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.1 The EvoRally Throttle Controller . . . . . . . . . . . . . . . . . . . . . . 65 4.4.2 Evolving a New Controller . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.3 Assessing the Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.4 Further Enhancing the Controller . . . . . . . . . . . . . . . . . . . . . . . 69 4.4.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 CHAPTER 5 UNCOVERING INADEQUATE LEARNING MODELS AT RUN TIME . . 72 5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2 Predictive Behavior Modeling for System Resilience . . . . . . . . . . . . . . . . . 76 5.3 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3.3 Evaluation of Context Generation . . . . . . . . . . . . . . . . . . . . . . 89 5.3.4 Evaluation of Predictive Behavior Models . . . . . . . . . . . . . . . . . . 89 5.3.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.4.1 Model Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.4.2 Predictive Behavior Modeling . . . . . . . . . . . . . 
. . . . . . . . . . . 94 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 CHAPTER 6 ASSURANCE CASES FOR SELF-ADAPTIVE SYSTEMS . . . . . . . . . 96 6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2 Functional Assurance for Robotic Self-Adaptive Systems . . . . . . . . . . . . . . 98 6.2.1 Goal Structuring Notation for Functional Assurance . . . . . . . . . . . . . 98 6.2.2 Constructing and Evaluating an Assurance Case Model . . . . . . . . . . . 100 vii 6.2.3 A Shared Knowledge Base for the Managing Adaptations . . . . . . . . . . 103 6.2.4 Design-time Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.2.5 Assuring System Adaptations at Run time . . . . . . . . . . . . . . . . . . 105 6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 CHAPTER 7 GOAL MODELS FOR LEARNING-ENABLED SYSTEMS . . . . . . . . . 110 7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.2 Constructing Goal Models for Autonomous Systems . . . . . . . . . . . . . . . . . 112 7.2.1 Developing Assurance Cases . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.2.2 Developing Goal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.2.3 Relaxing Goal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.3 Goal Model-Driven Self-Adaptation of Learning-Enabled Systems . . . . . . . . . 121 7.3.1 Preparation at Design Time . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.3.2 Self-Adaptation at Run Time . . . . . . . . . . . . . . . . . . . . . . . . . 126 7.4 Demonstration With Autonomous Rover . . . . . . . . . . . . . . . . . . . . . . . 130 7.4.1 Assessing Visual Inputs With Behavior Oracles . . . . . . . . . . . . . . . 132 7.4.2 Run-Time Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 CHAPTER 8 ADDRESSING ROBUSTNESS AND RESILIENCY AT RUN TIME . . . . 141 8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.2 A Service-Oriented Framework for Trusted AI . . . . . . . . . . . . . . . . . . . . 144 8.2.1 Resiliency Through Predictive Behavior . . . . . . . . . . . . . . . . . . . 145 8.2.2 Robustifying Learning Models . . . . . . . . . . . . . . . . . . . . . . . . 146 8.2.3 Constructing Goal Models . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.2.4 Model-Driven Monitor and Control . . . . . . . . . . . . . . . . . . . . . 149 8.2.5 MAPE-K Microservices . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 8.2.6 Software Composability For Trusted AI . . . . . . . . . . . . . . . . . . . 152 8.3 Demonstration With Autonomous Rover . . . . . . . . . . . . . . . . . . . . . . . 152 8.3.1 Creating Behavior Oracles . . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.3.2 Creating Robustified Learning Models . . . . . . . . . . . . . . . . . . . . 155 8.3.3 Implementing Anunnaki Microservices . . . . . . . . . . . . . . . . . . . 156 8.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
160 CHAPTER 9 CONCLUSIONS AND FUTURE INVESTIGATIONS . . . . . . . . . . . . 161 9.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 9.2 Future Investigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 viii LIST OF TABLES Table 3.1: Enki Configuration Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Table 3.2: Environmental Parameter Ranges . . . . . . . . . . . . . . . . . . . . . . . . . 41 Table 3.3: Paired Samples 𝑡-Test For Significance of CIFAR10-Tenki Accuracy Comparisons 44 Table 3.4: Paired Samples 𝑡-Test For Significance of KITTI-Tenki Recall Comparisons . . 44 Table 3.5: Paired Samples 𝑡-Test Results For CIFAR10-DNNenki Robustness Comparisons 49 Table 3.6: Paired Samples 𝑡-Test Results For KITTI-DNNenki Robustness Comparisons . 49 Table 4.1: PID Controller Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Table 4.2: Fitness EA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 4.3: Enki EA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 5.1: Parameters for Environmental Effects . . . . . . . . . . . . . . . . . . . . . . . 87 Table 5.2: Context Generation Results for KITTI Data . . . . . . . . . . . . . . . . . . . . 90 Table 5.3: Context Generation Results for Waymo Data . . . . . . . . . . . . . . . . . . . 90 Table 5.4: Accuracy of KITTI Behavior Oracles . . . . . . . . . . . . . . . . . . . . . . . 91 Table 5.5: Accuracy of Waymo Behavior Oracles . . . . . . . . . . . . . . . . . . . . . . . 91 Table 7.1: Example of RELAX operators and uncertainty factors [63, 222] . . . . . . . . . 117 Table 7.2: Behavior categories for an object detector. . . . . . . . . . . . . . . . . . . . . . 125 Table 7.3: Example evaluation of non-RELAX-ed KAOS goal model (Figure 7.3) . . . . . 137 Table 7.4: Example evaluation of RELAX-ed KAOS goal model (Figure 7.5) . . . . . . . . 137 ix LIST OF FIGURES Figure 2.1: High-level illustration of a simple DNN. In practice, the network topology may be more complex. Hidden layers enable the network to model a target function that maps input features (𝑥𝑖 ) to output labels (𝑦𝑖 ). Neurons are represented by circles and links represent the flow of data from input to output. The output of each neuron is determined by applying an activation function to a linear transformation of input values by weight matrix (W𝑖 ) and bias vector (b𝑖 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Figure 2.2: Illustration of a generic MAPE-K control loop for an SAS. An autonomic manager monitors, analyzes, plans, and executes changes to a collection managed components at run-time. . . . . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.3: High-level illustration of a CIFAR-10 DNN with a ResNet-20 architecture. Images are provided as input. The network is composed of a series of residual blocks with “shortcut connections.” Output is a single classification category determined from a probability distribution. . . . . . . . . . . . . . . . . . . . . 15 Figure 2.4: Examples of an image input and output from a KITTI object detector DNN. Images are taken from a vehicle’s onboard camera (a). Output is a set of labeled bounding boxes corresponding to each detected object (b). . . . . . 
. . 17 Figure 2.5: High-level illustration of a RetinaNet DNN with a ResNet-50 backbone ar- chitecture. Images are provided as input. The network is composed of a feature extraction sub-network (yellow) with classification and regression sub-networks (blue and purple, respectively) at each level of a feature pyra- mid (green). Output is a collection of classification categories and bounding boxes for each respective object detected in the image. . . . . . . . . . . . . . . 17 Figure 2.6: Examples of an image input and output from a Waymo object detector DNN. Images are taken from a vehicle’s onboard camera (a). Output is a set of labeled bounding boxes corresponding to each detected object (b). . . . . . . . 18 Figure 2.7: Typical ROS configuration [176]. Software for a ROS-based system executes ROS nodes over multiple onboard and offboard processors that communicate over a wireless bridge via ROS topics and services. . . . . . . . . . . . . . . . 19 Figure 2.8: The EvoRally autonomous driving platform. The physical platform is shown on left (a), and a simulated version is shown on right (b). . . . . . . . . . . . . 20 x Figure 2.9: For demonstration, an autonomous rover has been assembled to explore deep learning on embedded systems. Sensors include a camera and an ultrasonic range finder. The physical platform is shown on left (a), and a simulated version is shown on right (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Figure 3.1: Examples of how occluding raindrops with changing size and position can im- pact an image recognition DNN trained with 91% test accuracy for CIFAR-10 [119] images. Examples images are provided for an automobile (a) and a deer (b). Unaltered images (i.) are shown with resulting DNN classifications labeled on the bottom. Synthetic raindrops can be introduced with either no impact on classification (ii.) or negative impact (iii.). The impact any given raindrop will have on the resulting classification is not known a priori. . . . . . . . . . . 25 Figure 3.2: Examples of real-world and simulated water droplets. Images in (a) are not obscured by water droplets on the lens. Images in (b) have had real water droplets placed on the lens. Images in (c) show simulated droplets superimposed over the original images. . . . . . . . . . . . . . . . . . . . . . . 29 Figure 3.3: DFD for using Enki to assess a DNN with an uncertain environmental condi- tion. Circles represent computational steps, boxes represent external entities, parallel lines mark persistent data stores, and connecting arrows show the flow of data between steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 3.4: Plot of novelty scores for Enki-generated raindrop contexts over 50 genera- tions. The blue curve shows the mean novelty score of all contexts archived. The green-shaded area shows the bounds for the highest and lowest novelty score in the archive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 3.5: Example activation patterns for the same DNN. Activated neurons (marked green), are assigned a value of 1, and inactive neurons are assigned a value of 0. Finally, all designated values are concatenated into a vector form. . . . . . 34 Figure 3.6: Plots of Enki-archived raindrops, evolved over 30 generations. On top, raindrops are plotted as overlapping circles at the same relative image position and with the same relative raindrop radius. On bottom, accuracy is shown for when the DNN is exposed to each raindrop. 
Enki starts with a random assortment of raindrops and evolves raindrops that produce diverse effects on the DNN, rather than evolving raindrops with diverse appearances. After 30 generations, Enki found that larger raindrops towards the center created the widest distribution of DNN behavior. . . . . . . . . . . . . . . . . . . . . . . . 36 Figure 3.7: DFD for using Enki-generated contexts to retrain and evaluate a more robust DNN with synthetic data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 xi Figure 3.8: Test accuracy of CIFAR10-DNNbase for each respective test dataset and each environmental condition. Displayed values are the mean over multiple trial runs, with 95% confidence intervals shown by whiskers. . . . . . . . . . . 43 Figure 3.9: Test recall of KITTI-DNNbase for each respective test dataset and each environmental condition. Displayed values are the mean over multiple trial runs, with 95% confidence intervals shown by whiskers. . . . . . . . . . . . . . 43 Figure 3.10: Random-generated (a) and Enki-generated (b) contexts are plotted to show the relationship between the neuron coverage and accuracy of CIFAR10-DNNbase under each respective environmental condition. Bold (i.e., colored) points show contexts resulting from a single trial run. Dashed lines are regres- sion lines with a Pearson’s correlation coefficient (𝑟). Dotted vertical lines mark the mean accuracy over all contexts. Compared to random-generation, Enki-generated contexts are more evenly distributed and result in more mis- classifications overall (i.e., lower accuracy). . . . . . . . . . . . . . . . . . . . 45 Figure 3.11: Random-generated (a) and Enki-generated (b) contexts are plotted to show the relationship between the neuron coverage and accuracy of KITTI-DNNbase under each respective environmental condition. Bold (i.e., colored) points show contexts resulting from a single trial run. Dashed lines are regres- sion lines with a Pearson’s correlation coefficient (𝑟). Dotted vertical lines mark the mean recall over all contexts. Compared to random-generation, Enki-generated contexts are more evenly distributed and result in more mis- classifications overall (i.e., lower recall). . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.12: Test performance of CIFAR10-DNNrand (a) and CIFAR10-DNNenki (b) for each respective test dataset and for each environmental condition. Dis- played values are the mean accuracy over multiple trial runs, with 95% confi- dence intervals shown by whiskers. When compared to CIFAR10-DNNbase (Figure 3.8), synthetic test data accuracy has improved for both CIFAR10-DNNrand and CIFAR10-DNNenki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Figure 3.13: Test performance of KITTI-DNNrand (a) and KITTI-DNNenki (b) for each respective test dataset and for each environmental condition. Displayed values are the mean recall over multiple trial runs, with 95% confidence intervals shown by whiskers. When compared to KITTI-DNNbase (Fig- ure 3.9), synthetic test data recall has improved for both KITTI-DNNrand and KITTI-DNNenki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 xii Figure 3.14: Box plots of the robustness of CIFAR10-DNNs (rows) when evaluated against each test dataset for each environmental condition (columns). Robustness (y- axis) was measured over multiple trials for each test dataset (x-axis). Each box shows the interquartile range for the DNN’s measured robustness. Me- dians values are marked orange, and whiskers show the full range. 
For each plot, the mean (𝜇) robustness found across all test datasets for an environ- mental condition is also shown (top-left corner and gray line). On average, CIFAR10-DNNenki was found to be the most robust to each introduced environmental condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Figure 3.15: Box plots of the robustness of KITTI-DNNs (rows) when evaluated against each test dataset for each environmental condition (columns). Robustness (y-axis) was measured over multiple trials for each test dataset (x-axis). Each box shows the interquartile range for the DNN’s measured robustness. Me- dians values are marked orange, and whiskers show the full range. For each plot, the mean (𝜇) robustness found across all test datasets for an environ- mental condition is also shown (top-left corner and gray line). On average, KITTI-DNNenki was found to be the most robust to each introduced envi- ronmental condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Figure 4.1: High-level DFD of the Evo-Enki framework. Circles represent computational steps, boxes represent external entities, parallel lines mark persistent data stores, and connecting arrows show the flow of data between steps. . . . . . . . 61 Figure 4.2: Data flow between the steps responsible for optimizing a robust system con- figuration for a diverse set of operating scenarios. . . . . . . . . . . . . . . . . 62 Figure 4.3: Data flow between the steps responsible for generating diverse operating scenarios for a given system configuration . . . . . . . . . . . . . . . . . . . . 63 Figure 4.4: A visualization of phenotype distances between scenarios explored by Enki. Blue points show archived scenarios. Gray points show all other evaluated scenarios. Archived scenarios increase in diversity over generations. . . . . . . 64 Figure 4.5: Comparison of the default 𝐶0 PID controller (left) settings with the evolved 𝐶1 PID controller (right) settings over a variety of different speed signals. Subplots (a) and (b) show the speed signal against which the 𝐶1 was evolved. Subplots (c), (d), (e), and (f) show performance of 𝐶0 and 𝐶1 against other random test signals, and subplots (g) and (h) show performance of 𝐶0 and 𝐶1 on a test signal while braking is allowed. . . . . . . . . . . . . . . . . . . . . . 67 Figure 4.6: MSE observed from 𝐶0 for signals from 𝑆enki and 𝑆rand . Regions are shaded by quartile ranges, with green being the bottom quartile, blue being the middle two quartiles, and red being the top quartile. . . . . . . . . . . . . . . . . . . . 69 xiii Figure 4.7: Error observed on controller settings 𝐶0 , 𝐶1 , and 𝐶2 when tested against signals from 𝑆0 . Regions are shaded by quartile ranges, with green being the bottom quartile, blue being the middle two quartiles, and red being the top quartile. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Figure 5.1: Example of an Enlil behavior model use-case. An object detector is given a clear image (a) and identifies all automobiles (b). A raindrop is introduced into the scene, which causes the system to fail to detect occluded automobiles (c). The Enlil model is able to correctly detect the interfering raindrop (d) and predicts that the system will fail. . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 5.2: High-level DFD of the Enlil framework. Computational steps are shown as circles. Boxes represent external entities. Directed lines indicate data flow. 
Persistent data stores are marked within parallel lines. . . . . . . . . . . . . . . 77 Figure 5.3: Demonstrative examples for how rainfall can affect an image-based DNN for object detection. Detected objects have been marked with overlaying bounding boxes, as shown in (a). As “rainfall” increases in intensity, as seen in (b) and (c), the effectiveness of the DNN diminishes to a point where it fails to detect the cyclist, as seen in (d). . . . . . . . . . . . . . . . . . . . . . . 78 Figure 5.4: Scatter plot comparison of Enlil-generated (a) versus Monte Carlo-generated (b) raindrop contexts (where a raindrop occludes the view). Points correspond to raindrops with varying size/position. Colors show whether a raindrop resulted in failure for the LES (red) or had no significant impact (green). Enlil contexts show distinct clusters of equal quantity. Monte Carlo contexts contain few problem-causing raindrops. . . . . . . . . . . . . . . . . . . . . . . 82 Figure 5.5: High-level illustration of the Enlil-Oracle DNN architecture. The archi- tecture comprises four sub-networks. First, the suspected adverse noise is separated from the sensor input by the interference isolation sub-network. Second, a set of “latent” features are distilled from the isolated noise via a feature extraction sub-network. Next, a context regression sub-network converts these latent features into a “perceived” context for the current envi- ronment. Finally, a behavior classification sub-network predicts a behavior category for the LES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Figure 5.6: Examples of contexts generated for each environmental effect: (a) shows the original, unaltered image, (b) shows examples of fog applied to the scene, (c) shows examples of lens flare applied to the camera, (d) shows examples of raindrops on the camera lens, and (e) shows examples of rainfall applied to the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 xiv Figure 5.7: Proof-of-concept validation of Enlil-Oracle on real-world examples. (a) and (b) show images with real raindrops from the Waymo dataset [205]. (c) and (d) show that Enlil-Oracle was able to perceive the possibly disruptive raindrop in each image (highlighted in red). . . . . . . . . . . . . . . . . . . . 93 Figure 6.1: Example module of a GSN model. This module depicts an assurance ar- gument for the claim that an autonomous rover can successfully patrol its environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Figure 6.2: Screenshot of the GSN-Draw user interface. GSN-Draw enables the creation of standard GSN models that can be imported into AC-ROS. . . . . . . . . . . 101 Figure 6.3: A standard GSN model of an assurance case for a patrolling rover. (a) Module M0 argues that a rover can successfully patrol its environment. (b) M1 argues that a rover can navigate its environment safely. (c) M2 argues that a rover can safely navigate with full sensor capabilities. (d) M3 argues that a rover can safely navigate with only limited sensor capabilities. (e) M4 argues that a rover can localize its position. (f) M5 argues that a rover can determine its speed. 102 Figure 6.4: Data flow for the process AC-ROS takes to prepare GSN models and ROS configuration at design time. Circles, boxes, arrows, and parallel lines re- spectively represent computational steps, external entities, data flow, and data stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
104 Figure 6.5: Data flow for run-time processes involved with managing adaptations of a ROS-based platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Figure 7.1: Example GSN assurance case for design-time and run-time validation of a rover. At design time, validation is supported by formal proofs, test results, and simulation (highlighted in green). At run time, verification is supported by the evaluation of a KAOS goal model (highlighted in blue). . . . . . . . . . 113 Figure 7.2: Legend key for interpreting the KAOS goal model notation. . . . . . . . . . . . 114 Figure 7.3: Example KAOS goal model. Goals (blue parallelograms) represent system objectives. The top-level goal (G1) is refined by sub-goals down to leaf-level system requirements. Agents (white hexagons) represent entities responsible for accomplishing requirements. Obstacles (red parallelograms) represent threats to the satisfaction of a goal (e.g., O1 and O2). . . . . . . . . . . . . . . 115 Figure 7.4: Screenshot of the KAOS-Draw user interface. KAOS-Draw enables the creation of standard KAOS models that can be imported into MoDALAS. . . . 116 xv Figure 7.5: A RELAX-ed version of the KAOS goal model shown in Figure 7.3. Goals G20, G21, and G22 (shown as green) have been RELAX-ed and use fuzzy logic to express a degree of goal satisficement. . . . . . . . . . . . . . . . . . . 120 Figure 7.6: Range of values returned by the utility function for evaluating the friction sensor (G21 in Figure 7.5) using the RELAX language with fuzzy logic. . . . . 122 Figure 7.7: Utility function for evaluating friction sensor satisficement (G21 in Figure 7.5). 122 Figure 7.8: Range of values returned by the utility function for evaluating the tire pressure monitor system (G22 in Figure 7.5) using the RELAX language with fuzzy logic.122 Figure 7.9: Utility function for evaluating the tire pressure monitor satisficement (G22 in Figure 7.5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Figure 7.10: High-level data flow diagram of MoDALAS. Processes are shown as circles, external entities are shown as boxes, and persistent data stores are shown within parallel lines. Directed lines between entities show the flow of data. . . . 123 Figure 7.11: Scatter plot of object detector recall when exposed to simulated dust clouds from Enlil. Each point represents a different dust cloud context with the corresponding density and intensity. Green, yellow, and red points correspond to behavior categories 0, 1, and 2, respectively. . . . . . . . . . . . . . . . . . . 127 Figure 7.12: Example behavior oracle input/output for an image-based object detector LEC. Input is identical to the input given to the LEC. Output is a “perceived context” to describe the environmental condition and a “behavior category” to describe the expected LEC behavior. Examples of behavior categories 0, 1, and 2 are shown in (a), (b), and (c), respectively. . . . . . . . . . . . . . . . . 127 Figure 7.13: Logical expression parsed from KAOS model in Figure 7.3. . . . . . . . . . . . 128 Figure 7.14: Example tactic for reconfiguring a rover to a “cautious mode”. Precondi- tions and post-conditions refer to KAOS elements and utility functions (see Figure 7.3). Actions are abstract operations to achieve the post-condition. . . . . 129 Figure 7.15: Elided ROS graph for MAPE-K loop in rover software. ROS nodes shown as green ellipses and ROS topics as yellow boxes. Arrows indicate data flow. . . . 
132 Figure 7.16: Example of object detection at a construction site. A pedestrian is detected by an image-based object detector (a). New environmental phenomena are introduced in simulation, such as (b) rainfall, (c) dust clouds, and (d) lens flares. Enlil explores different contexts to find examples that have (i) little impact, (ii) degrade, or (iii) compromise the object detector’s ability to achieve validated design-time performance. . . . . . . . . . . . . . . . . . . . . . . . . 134 xvi Figure 7.17: Example set of utility parameter values for Scenario 3 . . . . . . . . . . . . . . 137 Figure 8.1: A high-level DFD of the Anunnaki framework. Anunnaki processes are shown as circles, interacting with external systems, such as the managed AI system and a simulator, shown as rectangles. Labeled arrows show data communicated between processes, with persistent data stores shown within parallel lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Figure 8.2: An example KAOS goal model to graphically depict system requirements of a robot rover as a hierarchy of logically interconnected goals. Blue paral- lelograms represent system goals and red parallelograms represent potential obstacles to the satisfaction of goals. White hexagons represent system com- ponents responsible for achieving leaf-level goals. The Anunnaki framework extends goal models by attaching utility functions to goals/obstacles (shown in yellow ellipses). Agents can be associated with specific message topics (also shown in yellow ellipses) to inform a MAPE-K controller. . . . . . . . . . 148 Figure 8.3: A logic tree representation of the KAOS goal model in Figure 8.2. The Anunnaki framework automatically parses and interprets goal models as logic trees of utility functions for run-time evaluation of goal satisfaction. . . . . 148 Figure 8.4: Example adaptation tactic defined in XML. This “fail-safe” tactic is triggered when precondition goal G3 (from Figure 8.2) is unsatisfied. Fail-safe ac- tions include a request for the managed robot system to switch to “manual” mode and a request to email the user. Upon executing these actions, the postcondition states that goal G3 is expected to be satisfied. . . . . . . . . . . . 151 Figure 8.5: Examples of real adverse phenomena for vision-based object detection. Ob- jects are correctly identified in normal lighting ((a)i. and (b)i.). Detection is degraded in dim lighting ((a)ii. and (a)iii.). Detection is also degraded when a bright light is placed behind the camera ((b)ii.) or behind the objects ((b)iii.). The boundary between contexts leading to acceptable and unaccept- able performance is unknown. Hence, these are known unknown phenomena. . . 153 Figure 8.6: Scatter plots of Enlil’s automated assessments of a vision-based object de- tector under HSL variations (a) and with raindrop occlusion (b). Each point represents a unique context of the respective phenomena. Green points repre- sent acceptable conditions, yellow points represent degraded conditions, and red points represent fully compromised conditions. With this data, Enlil can generate a behavior oracle for each respective phenomena. . . . . . . . . . . . . 155 Figure 8.7: Anunnaki ROS graph. Ellipses represent ROS nodes. ROS topics and ROS services are shown as rectangles. Anunnaki nodes dynamically subscribe and publish to topics of the managed AI system by referencing agents found in given goal models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
156 xvii Figure 8.8: Example of Utu monitoring an autonomous rover. The rover’s vision-based object detector can correctly identify pedestrians in bright lighting (a). A GUI displays the state of each Utu MAPE-K microservice (b). Utu evaluates the goal model in Figure 8.2 as a logic tree, highlighting satisfied goals green and unsatisfied goals red. Because the behavior oracle detects no adverse interference (Category 0), the overall goal model is satisfied. . . . . . . . . . . 158 Figure 8.9: Example of Utu reconfiguring an autonomous rover to prevent use of its ob- ject detector LEC in poor lighting (a). A GUI displays the state of each Utu MAPE-K microservice (b). Utu evaluates the goal model in Figure 8.2, high- lighting satisfied goals green and unsatisfied goals red. Because the behavior oracle detects an adverse environment condition that degrades object detec- tion (Category 1), a fail-safe tactic has been executed by Utu to reconfigure the rover into manual operation. . . . . . . . . . . . . . . . . . . . . . . . . . . 158 xviii LIST OF ALGORITHMS Algorithm 3.1 Enki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Algorithm 3.2 Evaluate Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Algorithm 3.3 Evaluate Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Algorithm 3.4 Train DNN with Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 38 Algorithm 5.1 Enlil: Context Generation (Step 1) . . . . . . . . . . . . . . . . . . . . . 81 Algorithm 5.2 Enlil: Behavior Modeling (Step 3) . . . . . . . . . . . . . . . . . . . . . . 85 xix KEY TO ABBREVIATIONS AC Assurance Case AI Artificial Intelligence API Application Programming Interface CNN Convolution Neural Network DFD Data Flow Diagram DNN Deep Neural Network EA Evolutionary Algorithm EEA Estimation-Exploration Algorithm ES Evolutionary Search DFD Data Flow Diagram DNN Deep Neural Network GAN Generative Adversarial Network GSN Goal Structuring Notation GUI Graphical User Interface KAOS Keep All Objectives Satisfied LEC Learning-Enabled Component LES Learning-Enabled System MAPE-K Monitor-Analyze-Plan-Execute over shared Knowledge MSE Mean Squared Error PID Proportional-Integral-Derivative ReLU Rectified Linear Unit ROS Robot Operating System SAS Self-Adaptive System SBSE Search-Based Software Engineering SBT Search-Based Testing xx SOA Service-Oriented Architecture SUT System Under Test XAI Explainable Artificial Intelligence XML Extensible Markup Language xxi CHAPTER 1 INTRODUCTION New methods are needed to address the assurance of safety-critical, autonomous Learning-Enabled Systems (LESs), particularly when faced with uncertainties presented by real-world environments. For this dissertation, an LES is defined as any system that has one or more Learning-Enabled Components (LECs), where an LEC is any functional component with behavior that can be refined and optimized based on information gathered through experience (i.e., learning models) [187]. Many state-of-the-art LESs are trained via supervised learning and rely on finite sets of training and validation examples [73]. For example, an autonomous vehicle may include a vision-based LEC for obstacle detection that is trained to identify common driving obstacles based on thousands of hours of video recordings [70, 205]. However, once deployed, autonomous systems will likely encounter conditions not addressed at design time, such as adverse weather conditions for a cyber- physical system. 
Stakeholders require some level of confidence that these systems will deliver safe behavior in all contexts, even those that have not been explicitly evaluated at design time. For assurance purposes, LESs must be shown to enforce system-level requirements even when faced with real-world uncertainties. Assurance cases are often used by the software engineering community as a formal means to certify that a system satisfies its requirements [167]. Each assurance case is a structured argument, comprising claims and supporting evidence. To build an assurance argument for an LES, this dissertation uses assurance cases constructed with Goal Structuring Notation (GSN), which depicts each assurance case as a hierarchically organized graphical model [2]. This dissertation also uses Keep All Objectives Satisfied (KAOS) goal models to hierarchically decompose high-level system objectives into system-level requirements [124]. In order to certify that a system satisfies its requirements, assurance cases and goal models must be developed to consider the role and impact of uncertainty. For LESs, however, additional steps must be taken to assure that a system's learned behavior will respond appropriately to conditions that deviate from training experience.

This dissertation presents a framework to improve the robustness and resiliency of an LES by evaluating and mitigating the impact of uncertainty. Robustness refers to the ability of an LES to adequately process contexts that deviate from previously seen data. Resiliency refers to the ability of an LES to adapt and mitigate the impact of adverse operating conditions. This dissertation investigates how automatic techniques can evaluate the impact of uncertain conditions on an LES for assurance purposes. Furthermore, this dissertation contributes new automated methods to address the robustness and resiliency of an LES. To improve LES robustness when subject to environmental uncertainties, both design-time and run-time methods have been developed to identify gaps in training coverage and harden the LES with diverse synthetic training examples. Finally, to improve LES resiliency, self-adaptive methods have been developed to monitor and control an LES at run time to mitigate potential requirements violations due to inadequately trained learning models in uncertain environmental conditions.

1.1 Problem Description

The purpose of software assurance is to provide a level of confidence to stakeholders that a software system conforms to established requirements [167]. In contrast to traditional software programs, where an explicit algorithm exists to determine which actions will be performed, machine learning models in an LEC are typically trained and evaluated end-to-end in a black-box manner, with internal logic that is difficult for humans to interpret. Training effective machine learning models is also highly dependent on the quality of training data. If all run-time scenarios are not covered during training, then the behavior of an LEC may unexpectedly violate system requirements for corner cases. Software assurance for an LES must be able to show that even when training coverage is insufficient, mitigating actions are taken to ensure the system delivers acceptable behavior (i.e., fail-safe actions). Using assurance case models that consider the impact of uncertain hazardous conditions on an LES, this dissertation focuses on the robustness and resiliency of an LES to uncertainty, which are described in the context of LECs as follows.

Robustness.
Robustness for an LES refers to the ability of its LECs to adequately process data that deviate from training experience. Learning models are optimized to perform well on a given set of scenarios during a training phase. It is assumed that these trained models will generalize to new data drawn from a distribution similar to that of the provided training data. However, if training data fails to fully cover all real-world conditions, the behavior of the learning model can be unpredictable (i.e., corner cases may exist for which there has been no training or testing). Combined with issues of overfitting, the behavior of learning models for any data that deviates from seen training examples is uncertain. This dissertation explores techniques to improve the robustness of an LES to targeted forms of operational uncertainty, supported by established GSN assurance cases and KAOS goal models.

Resiliency. For an LES, resiliency refers to the ability of the system to adapt and compensate for inadequately trained LECs, in order to maintain satisfaction of system requirements at run time. When an LEC implements a fixed supervised learning model that has been trained on insufficient data, steps must be taken to determine the contexts for which the learning model is unprepared, and actions must be taken to mitigate any requirement violations resulting from the inadequately trained learning model. This dissertation explores self-adaptive methods to enable an LES to assess its operating environment, determine when its LECs are insufficient for the current context, and perform alternative actions to mitigate failures that would result from the use of inadequate learning models.

1.2 Thesis Statement

This doctoral research explores assurance-guided methods to assess and mitigate the impact of uncertainty on systems with machine learning components. In particular, techniques are considered for assessing and improving the robustness and resiliency of an LES based on the analysis of system-level requirements with respect to assurance cases at both design time and run time.

Thesis Statement. The combined application of machine learning and diversity-driven search-based software engineering techniques can be used to assess and improve the robustness and resiliency of an LES when faced with uncertain operating conditions.

1.3 Research Contributions

The objective of this dissertation is to address key challenges surrounding the robustness and resiliency of an LES when faced with uncertain operating conditions. The research contributions of this doctoral work are as follows.

1. This doctoral work introduces an automated method to assess the behavior of an LES in response to uncertain operating conditions. This method observes the statistical performance and structural behavior of an LES under a range of simulated operating contexts to create an archive of operating contexts that lead to unique system behaviors. This method then generates an archive of diverse contexts (i.e., mutually unique with respect to system behavior) to characterize the response of an LES to uncertain operating conditions [128].

2. This doctoral work introduces an automated method to augment the training and validation of learning models for operating conditions that are inadequately covered by training at design time [126, 130].

3. This doctoral work introduces an automated method to detect when learning models have encountered operating conditions for which they have been inadequately trained [127].

4.
This doctoral work introduces a run-time framework to monitor and control the robustness and resiliency of learning models in response to run-time operating conditions, in order to mitigate failure resulting from the use of inadequately trained learning models [129].

This doctoral work has validated the above techniques on real-world autonomous vehicle datasets and on real-world autonomous platforms, such as a 1:5-scale autonomous vehicle and an autonomous rover with an embedded artificial intelligence (AI) processor for real-time computer vision and deep learning [34, 59, 70, 205]. This work has benefited from collaborators in the aerospace industry, the United States Air Force Research Laboratory (AFRL), and the automotive industry.

1.4 Organization of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 provides background information on key enabling technologies and foundational concepts for this work. Chapter 3 describes an approach to improve the robustness of an LES through the evolution of diverse synthetically augmented training/validation data. Chapter 4 describes an approach to improve the robustness of an onboard controller for a cyber-physical system. Chapter 5 describes methods to predict how LECs behave under uncertain run-time scenarios. Chapter 6 describes an approach to ensure autonomous systems maintain acceptable run-time behavior with respect to established assurance cases. Chapter 7 describes an approach to dynamically adapt the behavior of autonomous systems at run time with respect to high-level system requirements models. Chapter 8 describes an approach to monitor and control LECs at run time to prevent their use under conditions for which they are inadequately trained. Finally, Chapter 9 presents conclusions, summarizes research contributions, and overviews future investigations for this doctoral work.

CHAPTER 2 BACKGROUND

This chapter provides background information on enabling technologies and foundational concepts used to support this doctoral work. Section 2.1 describes Deep Neural Networks (DNNs) and challenges associated with their use in LESs. Section 2.2 describes evolutionary computation techniques that are relevant for this doctoral work. Section 2.3 describes fundamental concepts for autonomous systems with self-adaptive capabilities. Section 2.4 describes the various DNN datasets and implementations used to validate this doctoral work. Finally, Section 2.5 describes the autonomous platforms used to demonstrate this doctoral work.

2.1 Deep Neural Networks

This section describes fundamental concepts for DNNs and current challenges facing the use of DNNs in software systems.

2.1.1 Elements of a Deep Neural Network

As a form of machine learning, DNNs are universal approximators [89], capable of approximating any complex function with sufficient training. In the case of autonomous vehicles, they may be used to process data from onboard sensors [99] for tasks such as lane-keeping [191] and collision-avoidance [204]. Represented with multi-layered architectures (as shown in Figure 2.1), DNNs comprise multiple intermediate hidden layers that connect a set of input features (𝑥𝑖) to target output labels (𝑦𝑖). Input features can include any observable property, such as pixels for camera images. Output labels can describe any piece of information that may be inferred from the input, such as a safe steering angle or brake pedal pressure for a vehicle.
Each hidden layer, comprising a number of units called neurons, represents a single linear transformation of the layer's input followed by a non-linear activation function that acts as a mechanism to filter or amplify information derived from the layer's input. DNNs are trained to approximate target functions by adjusting the values of the weight matrix (W𝑖) and bias vector (b𝑖) associated with each layer. Typically, weights and biases are adjusted to minimize an objective loss function that measures the amount of error between a DNN's actual output and the expected (i.e., trained) output. Once the objective loss is minimized, the training phase is terminated, and the DNN is verified by evaluating it against separate test data.

Figure 2.1: High-level illustration of a simple DNN. In practice, the network topology may be more complex. Hidden layers enable the network to model a target function that maps input features (𝑥𝑖) to output labels (𝑦𝑖). Neurons are represented by circles, and links represent the flow of data from input to output. The output of each neuron is determined by applying an activation function to a linear transformation of input values by weight matrix (W𝑖) and bias vector (b𝑖).

By linking layers together, DNNs are capable of learning increasingly abstract relationships between features in the source input [73]. However, when weights are adjusted for the network as a whole with an objective defined only in terms of the network's final output, as with typical training methods, developers have no direct control over how any specific hidden layer infers information in isolation. As such, intermediate representations of information inferred by the DNN (i.e., the decision logic) can be difficult to interpret. Even when the final output of a DNN may appear to be correct for all evaluated test data, the exact reasoning for how a DNN determines its output is often a "mystery" [112].

2.1.2 Challenges for Deep Neural Networks

With DNNs, an increasing number of hidden layers (i.e., the depth) has been associated with an ability to derive and learn increasingly abstract features from the source input [116]. AlexNet [120], an image recognition application, popularized the use of DNNs for vision tasks by showing significant improvements over existing computer vision techniques. Research even suggests that DNNs can compete with the human visual cortex [108]. However, this intuition has been challenged by the discovery of "adversarial examples" that exploit a DNN, tricking it into making incorrect predictions in unexpected ways [54, 75, 122]. Adversarial examples have been shown to force a DNN to make false negative predictions by adding small amounts of noise (typically imperceptible to a human) to images [209] and to make false positive predictions for images constructed entirely from noise [160]. Furthermore, studies have shown that when both training and test data contain the same surface statistical regularities, it is possible for a DNN to learn and make predictions based on superficial details of an image's composition rather than the high-level semantics of the image's contents [101]. In such cases, a DNN can show a high degree of accuracy on test data while not successfully generalizing to data outside of the training or test datasets.

This doctoral work is complementary to the field of Explainable Artificial Intelligence (XAI), which aims to address the accessibility of machine learning systems for human understanding [10].
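To make the layer computation described in Section 2.1.1 concrete, the sketch below (an illustration only, not code from this dissertation's implementation) evaluates a small feed-forward network of the form shown in Figure 2.1 using NumPy. The layer sizes, activation choices, and random weights are assumptions chosen for the example; the point is that each layer is only a weight matrix W𝑖, a bias vector b𝑖, and an activation function, and that the intermediate activations it produces are exactly the representations that are difficult for a human to interpret.

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit (ReLU) activation: suppresses negative values.
    return np.maximum(0.0, z)

def softmax(z):
    # Converts the final-layer output into a probability distribution over labels.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, layers):
    """Evaluate a simple feed-forward DNN.

    Each layer is a (W, b, activation) triple; its output is
    activation(W @ x + b), matching the description in Section 2.1.1.
    """
    hidden = []               # intermediate (hidden) representations
    for W, b, act in layers:
        x = act(W @ x + b)    # linear transformation followed by activation
        hidden.append(x)
    return x, hidden

# Illustrative network: 4 input features, two hidden layers, 3 output labels.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(8, 4)), rng.normal(size=8), relu),    # hidden layer 1
    (rng.normal(size=(8, 8)), rng.normal(size=8), relu),    # hidden layer 2
    (rng.normal(size=(3, 8)), rng.normal(size=3), softmax), # output layer
]

y, hidden = forward(rng.normal(size=4), layers)
print("predicted label:", int(np.argmax(y)))
# The vectors in `hidden` are the intermediate representations whose meaning
# is opaque to developers; XAI methods target exactly this interpretability gap.
```

Training would adjust each W𝑖 and b𝑖 to minimize a loss over a training set; this sketch only performs inference with fixed, randomly chosen weights.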
Active research in XAI has developed methods to quantify the interpretability of hidden layers [12], visualize intermediate representations [229], and verify software implementations of an LES [26, 50, 193]. One emphasis of XAI research is to develop methods that lead to more robust and reliable machine learning systems [80]. In the context of XAI, robustness refers to the ability of a DNN to reduce its sensitivity to malicious noise in its input. Chapter 3 proposes a method for improving the robustness of a DNN by identifying gaps in training data that make the DNN perform poorly in the presence of a given environmental condition and "filling in" the gaps with synthetic training data. This doctoral work supports the goal of XAI to help humans understand, trust, and effectively manage DNNs in noisy environments.

2.2 Evolutionary Computation

This section describes how evolutionary computation has been harnessed for search-based software engineering (SBSE) and evolutionary robotics; for a more comprehensive review, see the survey by Harman et al. [82]. Additional information is provided on novelty search, a diversity-driven form of evolution.

2.2.1 Search-Based Software Engineering with Evolution

Test case generation for software verification and validation has been a long-standing challenge, including considerable research in automated Search-Based Testing (SBT) techniques [82]. These search-based methods explore a space of candidate test cases to find a subset that optimizes a given objective (e.g., software coverage, program faults, etc.) [148]. Common parameter optimization methods include random search and grid search [17]. Random search methods generate candidate solutions with random parameter values until a desirable solution is found. Grid search methods exhaustively search every parameter value to determine the optimal solution. While random search can typically find solutions more efficiently than grid search, a high amount of variation in results is possible, where the resulting solution may be far from optimal. Grid search guarantees that an optimal solution will be found, but it is too computationally expensive to implement for high-dimensional solution spaces or when the evaluation time for each candidate solution is non-trivial. More advanced techniques introduce heuristics to help guide the search towards relevant solutions more reliably than random generation and more efficiently than grid search [82]. Testing tools such as EvoSuite [60] and Sapienz [147] use Evolutionary Algorithms (EAs) [52]. These tools require software engineers to specify a test case representation as a genotype for the EA, and each encoded instance of a test case is referred to as a genome. The EA includes an iterative process of test case generation and selection to evolve better quality test cases. Genomes are manipulated through variation operations, such as mutation and recombination, and are selected for preservation between generations by a fitness heuristic. Fitness is typically a metric based on observable properties of the System Under Test (SUT), such as code coverage, specified as the phenotype. Thus, EAs explore the space of test cases by their genomes and compare the quality of each test case by the fitness of its associated phenotype. Through this process, SBT tools can use evolution to discover test cases more pertinent to the SUT with respect to the chosen fitness metric.
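The following minimal sketch shows the generate-evaluate-select loop that underlies such evolutionary test generation. The genome encoding (a fixed-length vector of test parameters), the mutation scheme, and the fitness function are placeholders invented for illustration; they are not the representations used by EvoSuite or Sapienz. Replacing the fitness heuristic with a novelty (diversity) score yields the diversity-driven variant discussed in Section 2.2.3.

```python
import random

POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 50, 0.2

def random_genome(length=5):
    # Genotype: a fixed-length vector of test-case parameters in [0, 1].
    return [random.random() for _ in range(length)]

def mutate(genome):
    # Variation operator: perturb each gene with a small probability.
    return [min(1.0, max(0.0, g + random.gauss(0, 0.1)))
            if random.random() < MUTATION_RATE else g
            for g in genome]

def fitness(genome):
    # Placeholder fitness heuristic: a real tool would execute the System
    # Under Test (SUT) with the decoded test case and measure an observable
    # property of the resulting phenotype (e.g., code coverage).
    return sum(genome)

population = [random_genome() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Selection: preserve the fitter half of the population...
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]
    # ...and refill the population with mutated offspring of survivors.
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP_SIZE - len(survivors))]

best = max(population, key=fitness)
print("best test-case genome:", [round(g, 2) for g in best])
```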
For solution spaces with large regions of suboptimal solutions, the use of a fitness heuristic allows the search process to avoid much of the unnecessary evaluation of an exhaustive search (i.e., the evaluation of undesirable candidates). However, since EAs do rely on stochastic elements, there can be some variation in resulting solutions. The amount of variation is often determined by the quality of the chosen fitness metric, the corresponding fitness landscape [180], and on how well the fitness heuristic is able to converge onto an optimal solution. 2.2.2 Evolutionary Robotics The field of evolutionary robotics [58] harnesses the open-ended search capabilities of EAs by using genomes to encode specifications of the robot’s control system or even aspects of its morphology (external structure), allowing for evolution to be leveraged in the process of designing more robust and resilient robots. Individuals in a population are evaluated with respect to one or more tasks, with the best performing individuals selected to pass their genes to the next generation. Simulation is typically used to evaluate individuals, greatly reducing the time to evolve solutions while avoiding possible damage to physical robots. From an engineering perspective, a major advantage of evolutionary search is the possible discovery of solutions (as well as potential problems) that the engineer might not otherwise have considered. 2.2.3 Novelty Search Novelty search is an evolution-based method to explore large solution spaces for interesting candi- date solutions in the absence of an explicit fitness objective [133, 134]. Greedy algorithms that focus on maximizing or minimizing an objective function have been found to be efficient and effective for simple solution spaces, but they are prone to discovering suboptimal solutions when the solution 10 space contains many local optima or the global optima covers a very small region with no smooth gradient (i.e., non-convex spaces). In contrast, novelty search ignores any specific objective other than diversifying each candidate solution. Diversity is encouraged by comparing each candidate with its neighbors and favoring candidates that are least similar to their nearest neighbors. Thus, through an iterative process, novelty search can discover a collection of candidate solutions that cover a wide region of the solution space while preventing all candidates from being attracted to the same local optima. When implemented as an EA, a population of individual candidates can be evolved based on how diverse they are in relation to an archive of the most diverse candidates discovered. 2.3 Self-Adaptive Systems This section describes basic concepts for Self-Adaptive Systems (SASs) and methods for run-time monitoring to support run-time adaptations for autonomic computing. 2.3.1 Self-Adaptation at Run Time Research in autonomic computing has led to systems that can manage and re-configure (i.e., self- adapt) to maintain high-level objectives in complex and dynamic run-time environments [105]. The MAPE-K (Monitor-Analyze-Plan-Execute over shared Knowledge) control loop (see Figure 2.2) was introduced by Kephart and Chess [105] as a framework to manage adaptations for SASs. First, the Monitor step gathers information about managed system components and the run-time environment. Second, the Analyze step infers a context for the current system states and selects an appropriate response (i.e., system adaptation). 
Third, the Plan step identifies an appropriate procedure to perform the selected adaptation. Finally, the Execute step realizes the adaptation by re-configuring managed components. Each step has access to a shared Knowledge database that includes information such as resource availability, performance constraints, and functional objectives. Thus, an autonomic manager enables continuous monitoring and re-configuration of an SAS at run time.

This dissertation uses the terminology described by Cheng et al. [37] to describe adaptations for an SAS. An adaptation tactic is a procedure that comprises a condition, action, and effect, where a condition describes when an adaptation is applicable, an action (or sequence of actions) describes what operations to perform on managed components, and an effect describes the intended result of the adaptation. For each cycle of the autonomic manager’s control loop, a tactic is chosen based on the determined status of managed components. Once a tactic has been selected, the control loop is paused while the actions associated with the tactic are executed from beginning to end. Finally, the effectiveness of the tactic is determined by comparing its intended effect to the resulting status of managed components. Success or failure of a tactic can inform future iterations of the control loop.

Figure 2.2: Illustration of a generic MAPE-K control loop for an SAS. An autonomic manager monitors, analyzes, plans, and executes changes to a collection of managed components at run time.

By modeling the behavior of an SAS as finite state automata, Zhang and Cheng [233] categorize adaptations into three general categories: one-point adaptation, guided adaptation, and overlap adaptation. These categories focus on the behavior of the managed element before an adaptation (i.e., source behavior) and after an adaptation (i.e., target behavior). Since adaptations may not be safe to perform in all states of a managed element, focus is also given to those states for which adaptation can be safely executed (i.e., quiescent states). For one-point adaptation, the managed element is able to transition from the source behavior to the target behavior in a single state transition. For guided adaptation, a request is made before executing the adaptation, and the managed element is forced into a restricted mode, where the source behavior is constrained such that a quiescent state can be found for executing the adaptation. Finally, for overlap adaptation, the managed element begins exhibiting target behavior before the source behavior ceases, such that for a period it exhibits both behaviors. Particularly for safety-critical applications, it is important to consider the appropriate category of adaptation to avoid an accidental transition into a hazardous system state.

2.3.2 Run-Time Monitoring with Utility Functions

When monitoring run-time systems for self-adaptation [105], utility functions have been used to simplify system behavior assessments [36, 215]. Utility functions map system attributes (i.e., the system state) into real scalar values to express a degree of goal (i.e., requirements) satisfaction [45]. Explicitly, a utility function takes the following form:

𝑢 = 𝑓(v)   (2.1)

The utility value is a real scalar value (𝑢 ∈ [0, 1]), and the system state vector (v = [𝑠0, ..., 𝑠𝑛]) reflects specific attributes (𝑠𝑖) of a system and its environment.
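For illustration, a utility function of the form in Equation (2.1) might combine two normalized attributes of an autonomous rover's state. The attribute names and weights below are hypothetical and serve only to demonstrate the mapping of a state vector to a scalar in [0, 1].

def utility(state):
    # state = [battery_level, obstacle_clearance], each assumed normalized to [0, 1].
    battery_level, obstacle_clearance = state
    # Weighted combination expressing how well high-level goals are satisfied;
    # the weights are illustrative assumptions, not values from this dissertation.
    u = 0.4 * battery_level + 0.6 * obstacle_clearance
    return max(0.0, min(1.0, u))

print(round(utility([0.8, 0.9]), 2))  # 0.86, i.e., a fairly desirable system state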
Utility functions can be elicited manually by domain experts and application users (i.e., subjective utility), or derived automatically from software artifacts (i.e., objective utility) [45]. Whether derived manually or automatically, the intent of a utility function is to rank desirable system states higher than undesirable system states. Thus, utility functions enable a quantifiable comparison of low-level system states in terms of high-level task-oriented objectives.

2.4 Validation Datasets and Implementations for Deep Neural Networks

A variety of datasets have been selected to demonstrate and validate DNNs using the methods proposed by this doctoral work. This section describes each respective dataset and DNN application in detail.

2.4.1 CIFAR-10 Dataset for Image Recognition

Portions of this doctoral work have been applied to image recognition DNNs trained for the CIFAR-10 benchmark, a dataset widely cited in deep learning research [119]. For image recognition, the goal is to classify the contents of each CIFAR-10 image into one of ten evenly distributed categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. The benchmark includes two datasets: a set of 50,000 labeled training images and a set of 10,000 labeled test images. Performance for a CIFAR-10 DNN is typically measured by its accuracy when given a set of validation images, where accuracy is the percentage of correct predictions over the total number of given images. State-of-the-art DNNs have been reported to achieve accuracy above 91% on default CIFAR-10 test data by use of highly-tuned architectures, optimization methods, and data augmentation [68, 232]. All CIFAR-10 image recognition DNNs constructed for this doctoral work use PyTorch [174] machine learning libraries to implement a ResNet-20 [87] architecture. The ResNet architecture (illustrated in Figure 2.3) comprises a series of residual blocks with decreasing resolutions and increasing numbers of filters. Each residual block can be bypassed when processing a given input via a “shortcut connection.” Individual blocks include convolution layers to transform incoming images, batch normalization operations to reduce covariate shift [95], and rectified linear unit (ReLU) [157] activations to filter out or amplify relevant features. Source images begin with a resolution of 32 × 32 pixels, and intermediate representations are gradually reduced in size down to an 8 × 8 resolution. The network then produces a feature vector by performing a global average pooling operation to compute the average value for each feature image. A softmax1 operation is applied to the final layer to predict a corresponding classification category. For example, in Figure 2.3, the deer classification category has been assigned by the DNN with the highest probability for the given input image.

1 Softmax is a function used to estimate a classification probability distribution [73, p. 179].

The baseline CIFAR-10 DNN for this doctoral study has been trained on default CIFAR-10 training images to minimize the categorical cross entropy2 of its predictions against the expected training output. The training procedure implements an Adaptive Moment Estimation (Adam) gradient descent method [109] with a learning rate that decays from 10−3 to 10−7 over 200 epochs.3 These settings were chosen by empirical analysis and cross-validation on a subset of the training data.
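The exact training scripts are not reproduced in this dissertation; the following is a minimal PyTorch sketch consistent with the setup just described (Adam, categorical cross entropy, and a learning rate decayed from 10−3 to roughly 10−7 over 200 epochs). The model construction, batch size, and data-loading details are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

def train_cifar10(model, epochs=200):
    # Assumes `model` is a ResNet-20-style classifier with 10 output categories.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    data = torchvision.datasets.CIFAR10(root="data", train=True, download=True,
                                        transform=T.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()                       # categorical cross entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Exponential decay from 1e-3 down to roughly 1e-7 across the full schedule.
    gamma = (1e-7 / 1e-3) ** (1.0 / epochs)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model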
Standard data augmentation methods such as image shifting and image flipping have also been used to add variation to the training data. Trained under these conditions, the baseline CIFAR-10 DNN can achieve an accuracy of 91% on default CIFAR-10 test data.

2 Cross entropy measures the relative difference between two probability distributions [73, p. 73].
3 An epoch refers to a single training iteration [73, p. 239].

Figure 2.3: High-level illustration of a CIFAR-10 DNN with a ResNet-20 architecture. Images are provided as input. The network is composed of a series of residual blocks with “shortcut connections.” Output is a single classification category determined from a probability distribution.

2.4.2 KITTI Vision Benchmark Suite for Autonomous Driving Object Detection

Portions of this doctoral work have also been applied to DNNs trained for image-based object detection. The KITTI Vision Benchmark Suite [70], produced by Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago, provides a real-world benchmark for data-driven autonomous driving tasks. KITTI data has been collected from an autonomous car equipped with a laser scanner, a stereo camera, and a global positioning system. However, for two-dimensional object detection, only static images taken from the vehicle’s camera system are referenced in this dissertation. Separate training and validation images have been selected and prepared from the benchmark dataset [166]. In contrast to the CIFAR-10 image recognition problem, each image in the KITTI dataset can contain multiple objects of interest. Object detection consists of two sub-tasks: classifying and locating each object in the image with a bounding box. Possible classification categories for each object include car, van, truck, pedestrian, sitting-person, cyclist, bus, and miscellaneous. Bounding boxes for each object are defined as the smallest rectangle in the source image to contain all pixels related to the object of interest. Thus, for each given input image, an object detector will output a corresponding set of classifications and bounding boxes for each relevant object detected in the image. For demonstration, Figure 2.4 provides an example image input (Figure 2.4a) and the corresponding output (Figure 2.4b) produced by an object detector. All KITTI object detection DNNs constructed for this doctoral work use the RetinaNet [139] architecture for two-dimensional object detection with the PyTorch machine learning libraries. The implemented RetinaNet architecture (illustrated in Figure 2.5) uses a ResNet-50 “backbone” to extract a pyramid of features from the source image. These features are then fed into a set of classification sub-networks to classify each detected object and a set of regression sub-networks to compute bounding boxes for each detected object. The baseline KITTI DNN referenced in this dissertation has been trained by an Adam gradient descent method to minimize a focal loss4 for given training images. Training began with an initial learning rate set to 10−4 and ended after 25 epochs, when the focal loss converged to a minimum. These settings were chosen by empirical analysis and cross-validation with a subset of the training data.
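The focal loss referenced above adapts cross entropy so that well-classified examples are down-weighted and training concentrates on hard detections. A minimal sketch of its binary form is given below; the α = 0.25 and γ = 2.0 values are common defaults following Lin et al. [139], not necessarily the settings used for the baseline KITTI DNN.

import torch
import torch.nn.functional as F

def focal_loss(pred_logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()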
In contrast to the CIFAR-10 image recognition application, where an image is either correctly classified or misclassified, object detection can result in false positives (i.e., detecting false objects) and false negatives (i.e., failing to detect an object), and therefore, the performance of an object detector is measured by its precision, recall, and F-scores.5 For this doctoral study, the baseline KITTI DNN has been trained on 6,373 training images to achieve a precision of 77%, a recall of 88%, and an F-score of 82% on 1,108 validation images. 4 Focal loss is an adaptive form of cross entropy to account for class imbalance [139]. 5 Precision is the ratio of correctly detected objects to all detected objects (i.e., the ratio of true positives to both true positives and false positives). Recall is the ratio of correctly detected objects to all real objects (i.e., the ratio of true positives to both true positives and false negatives). An F-score is the harmonic mean of both precision and recall [202]. 16 (a) Example Input Image (b) Example Output From Object Detector Figure 2.4: Examples of an image input and output from a KITTI object detector DNN. Images are taken from a vehicle’s onboard camera (a). Output is a set of labeled bounding boxes corresponding to each detected object (b). Figure 2.5: High-level illustration of a RetinaNet DNN with a ResNet-50 backbone architecture. Images are provided as input. The network is composed of a feature extraction sub-network (yellow) with classification and regression sub-networks (blue and purple, respectively) at each level of a feature pyramid (green). Output is a collection of classification categories and bounding boxes for each respective object detected in the image. 2.4.3 Waymo Open Dataset for Autonomous Driving Object Detection In addition to training object detection DNNs on KITTI benchmark data, portions of this doctoral work have also trained object detection DNNs on the Waymo Open Dataset [205] comprising real-world autonomous driving data. Separate training and validation datasets have been selected 17 and prepared from the forward-facing camera images [166] in the Waymo dataset. Possible classification categories for each object in the Waymo dataset include vehicle, pedestrian, sign, cyclist, and unknown. For demonstration, Figure 2.6 provides an example image input (Figure 2.6a) and the corresponding output (Figure 2.6b) produced by an object detector on an image from the Waymo dataset. (a) Example Input Image (b) Example Output From Object Detector Figure 2.6: Examples of an image input and output from a Waymo object detector DNN. Images are taken from a vehicle’s onboard camera (a). Output is a set of labeled bounding boxes corresponding to each detected object (b). All Waymo object detection DNNs constructed for this doctoral work use the same RetinaNet architecture illustrated in Figure 2.5 for two-dimensional object detection. The baseline Waymo DNN referenced in this dissertation has been trained by an Adam gradient descent method to minimize a focal loss for given training images. Training began with an initial learning rate set to 10−4 and ended after 25 epochs, when the focal loss converged to a minimum. These settings were chosen by empirical analysis and cross-validation with a subset of the training data. For this doctoral study, the baseline Waymo DNN has been trained on 158,081 training images to achieve a precision of 88%, a recall of 74%, and an F-score of 80% on 39,987 validation images. 
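For reference, the precision, recall, and F-score values reported throughout this section follow directly from counts of true positives, false positives, and false negatives. The counts in the following sketch are made up solely to illustrate the arithmetic; they are not taken from the KITTI or Waymo experiments.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f_score

# Hypothetical counts: 45 correct detections, 15 false detections, 5 missed objects.
p, r, f = precision_recall_f1(tp=45, fp=15, fn=5)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.75 0.9 0.82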
2.5 Demonstration Platforms Two separate autonomous platforms have been used to demonstrate and validate the methods proposed by this doctoral work. This section describes the underlying control software and imple- 18 mentations of each autonomous system. 2.5.1 Robot Control Software In order to manage heterogeneous hardware and enable software reuse in robotic applications, many developers implement the control logic of sensors and actuators as components of a robot middleware [53]. The Robot Operating System (ROS) [176] is an open-source robot middleware that has been widely adopted by both academia and industry [114]. The fundamental elements of a ROS-based system are nodes, topics, and services. ROS enables the controlling algorithms for a single application to be divided into multiple independent processes (i.e., ROS nodes). ROS nodes can publish/subscribe data unidirectionally through message buses (i.e., ROS topics) and also handle bidirectional request/reply interactions (i.e., ROS services). As a peer-to-peer network of nodes, a ROS-based system can be implemented over multiple processing units with a common registry service to facilitate communication between nodes (illustrated in Figure 2.7). Figure 2.7: Typical ROS configuration [176]. Software for a ROS-based system executes ROS nodes over multiple onboard and offboard processors that communicate over a wireless bridge via ROS topics and services. 2.5.2 The EvoRally Autonomous Vehicle Portions of this doctoral work have been validated on the EvoRally autonomous vehicle platform. EvoRally is based on the AutoRally [71, 223] open-source platform designed and developed by the Georgia Institute of Technology (Georgia Tech) as a test bed for various autonomous vehicle sensing and control methods. EvoRally is modeled to represent a 1:5 scale remote control truck 19 and has a top speed of 27 m/s (60 mph). Figure 2.8 provides images of the EvoRally platform in both its physical and simulated form. Software for controlling EvoRally is modular and highly customizable, implemented using the Kinetic [184] distribution of ROS packages. The rover can be controlled either autonomously or manually by a remote operator. An advantage for using EvoRally for research purposes is that it is smaller and less expensive than a full-sized autonomous vehicle, yet it uses much of the same control software and mechanical structures. (a) Physical platform (b) Simulated platform Figure 2.8: The EvoRally autonomous driving platform. The physical platform is shown on left (a), and a simulated version is shown on right (b). Simulation of EvoRally is managed by the Gazebo simulator [113], chosen for its support of complex environments and sensors modeled after many commercially available devices. An accurate model of the EvoRally platform has been constructed within Gazebo, closely matching all components and physical characteristics of the real platform. With the capabilities offered by Gazebo and an accurate model of the vehicle, the reality gap between what is observed in simulation and the behavior of a physical system is expected to be minimal. 2.5.3 Deep Learning-Driven Autonomous Rover Portions of this doctoral work have also been validated on a robotic rover assembled with a suite of sensors and actuators to enable autonomous behavior. Photographed in Figure 2.9, the dimensions of the rover are approximately 30.5 × 20.5 × 22.0 centimeters. 
The rover includes an NVIDIA Jetson Nano processor to support efficient onboard deep learning computations [59] for computer vision. Sensors include a forward-facing camera, an ultrasonic range finder, and a touch-sensitive bumper. Control software for the rover is implemented using the Melodic [185] distribution of ROS packages. The rover can be controlled either autonomously or manually by a remote operator. When operating autonomously, the rover uses both an ultrasonic range finder and a vision-based object detector to detect obstacles in the environment.

(a) Physical platform (b) Simulated platform
Figure 2.9: For demonstration, an autonomous rover has been assembled to explore deep learning on embedded systems. Sensors include a camera and an ultrasonic range finder. The physical platform is shown on left (a), and a simulated version is shown on right (b).

In autonomous mode, the rover relies on computer vision to identify the types of obstacles present in its environment. The rover’s vision-based object detector is implemented as a RetinaNet DNN, using PyTorch deep learning libraries. The object detector has been trained to detect objects from two-dimensional images taken from the rover’s forward-facing camera. For each object detection, both a category label and bounding box are given to identify the type of object and what region of the image it covers. To train and validate the object detector, a set of 2,500 labeled images were collected with images of both humans and deer scattered throughout a controlled test environment, where 2,000 of the images were reserved for training and 500 were reserved for validation purposes only. The object detector DNN was trained until the training error converged to a minimum (after 25 epochs). When evaluated against the reserved validation images, it was found to correctly detect images of humans and deer with a precision of 98.8%, a recall of 94.8%, and an F-score of 96.8%.

CHAPTER 3
LEARNING ROBUSTNESS THROUGH DIVERSIFIED TRAINING

Data-driven LESs are limited by the quality of available training data, particularly when trained offline. For systems that must operate in real-world environments, the space of possible conditions that can occur is vast and difficult to comprehensively predict at design time. Environmental uncertainty arises when run-time conditions diverge from design-time training conditions. To address this problem, automated methods can generate synthetic data to fill in gaps for training and test data coverage. This chapter proposes an evolution-based technique to assist developers with uncovering limitations in existing data when previously unseen environmental phenomena are introduced [126, 130]. This technique explores unique contexts for a given environmental condition, with an emphasis on diversity. Synthetic data generated by this technique may be used for two purposes: (1) to assess the robustness of a system to uncertain environmental factors and (2) to improve the system’s robustness. This technique is demonstrated to outperform random and greedy methods for multiple adverse environmental conditions applied to image-processing DNNs. The remainder of this chapter is organized as follows. Section 3.1 overviews the motivation and objectives of this chapter. Section 3.2 describes a methodology for assessing and improving the robustness of a DNN. Section 3.3 presents results from an empirical evaluation of DNNs trained with benchmark data. Section 3.4 reviews related work in automated testing and DNNs.
Finally, Section 3.5 provides a concluding summary for this chapter. 3.1 Overview For cyber-physical LESs in real-world environments, unpredictable behavior can occur when design-time training conditions deviate from run-time conditions [44, 212]. When different forms of adversity are introduced into the environment, such as inclement weather or a malicious security attack, the outcome can result in costly damage to hardware, human injury, or even fatalities (e.g., recent accidents with autonomous vehicles [162, 163, 164]). This chapter focuses on systems that 23 can refine or optimize functional behavior based on information gathered through experience (e.g., obstacle detection for a navigation system) in contrast to the use of machine learning to manage self-adaptations [121]. Model-driven LESs use domain knowledge and a semantic model of the learned task [227]. In contrast, the behavior of a data-driven LES (e.g., a system comprising a DNN) is determined entirely from patterns inferred from training data without a well-defined semantic model [16]. Trustworthiness for these systems must be established at the design stage, but accounting for every possible example of a run-time condition is challenging. For example, consider an autonomous vehicle with a DNN that has been trained for object detection but has only been exposed to images taken in clear weather. Due to rainy conditions at run time, a situation may arise where the lens of the camera is partially obstructed by a falling raindrop on the camera lens. Examples illustrated in Figure 7.16 show how raindrops with changing size and position can adversely impact image recognition in potentially subtle and unexpected ways. Absent of any training or test data that include this particular raindrop condition, it is difficult for developers to assess the impact occluding raindrops will have on the object detector or autonomous vehicle. This chapter introduces an automated technique to explore the impact of such uncertain environmental conditions on an LES and to uncover specific contexts that produce a wide range of system behav- ior. Results from this approach can be leveraged to assess and improve a system’s robustness to the environment, where robustness refers to an LES’s ability to deliver consistent and acceptable behavior in the presence of a noisy environment [230]. Despite a widespread movement to incorporate deep learning [73] into software applica- tions [85], proving the correctness of software driven by deep learning remains challenging [81]. For the scope of this chapter, an LES is any system with one or more learning components that have been trained offline via supervised deep learning. Since deep learning methods allow a system to solve tasks by example without a full description of the problem space, it is attractive for use in problem domains that are dynamic, poorly defined, or otherwise too complex to solve by classic algorithms. However, introducing deep learning into safety-critical systems is troublesome for soft- ware assurance purposes [213]. Classical algorithms contain well-defined, logic-based rules that 24 can be verified by formal methods, and traditional software systems developed through program- ming languages have a broad collection of tools/techniques for testing and quality assurance. In contrast, deep learning systems such as DNNs are difficult to interpret and have been shown to have limited robustness when training examples inadequately cover the problem domain. 
Collecting “good” data to train and test DNNs is non-trivial [25, 86, 197] and expensive to produce, in terms of both time and human effort. Furthermore, manual data selection is susceptible to cognitive biases [27]. Computer-assisted techniques can alleviate these issues by augmenting data to expand coverage of the problem space.

(a) Impact of raindrops on automobile detection (b) Impact of raindrops on deer detection
Figure 3.1: Examples of how occluding raindrops with changing size and position can impact an image recognition DNN trained with 91% test accuracy for CIFAR-10 [119] images. Example images are provided for an automobile (a) and a deer (b). Unaltered images (i.) are shown with resulting DNN classifications labeled on the bottom. Synthetic raindrops can be introduced with either no impact on classification (ii.) or negative impact (iii.). The impact any given raindrop will have on the resulting classification is not known a priori.

This chapter describes an evolution-based technique to construct synthetic variants of existing datasets by introducing contexts of environmental conditions that are not present in default training/validation data and therefore have an uncertain impact on the system. Furthermore, in order to assess and improve system robustness, this technique seeks out contexts that result in the most unique and extreme system behavior, to uncover sets of training/test examples that cover a wide range of system behavior. For example, an environmental condition may describe where and how a raindrop appears on an image from an autonomous vehicle’s onboard camera, and the proposed technique can generate multiple raindrop variations that impact the performance of the autonomous vehicle in mutually unique (i.e., diverse) ways. This technique implements an evolution-based search method that simulates environmental conditions on sensor data via parameterized transformation functions to find specific parameter values (i.e., contexts) that result in increasingly different behavior for the LES over each generation. Synthetic data is then produced by transforming existing sensor data to reflect the evolved contexts. This technique enables a developer to automatically assess a machine learning system under a wide range of contexts for an environmental condition not covered by default. Furthermore, it provides a means to train machine learning systems to be more robust to a wider range of otherwise unexposed environmental contexts. This chapter introduces Enki,1 a black box automation tool that supports the proposed technique [126, 130]. Enki can be used to discover diverse operating contexts for any general SUT, based on the SUT’s observable behavior. This chapter addresses two key research questions. First, can we use Enki to identify gaps in training that cause an LES to perform poorly? Second, can Enki improve the robustness of an LES to the effects of an uncertain environmental condition? To answer these questions, this chapter focuses on DNNs designed for image recognition and object detection, key capabilities needed for several autonomous driving features (e.g., obstacle avoidance, lane management, and adaptive cruise control). Experiments have been conducted on DNNs under a variety of environmental conditions to demonstrate how Enki can be used to construct useful synthetic test and training data.
Results from this study show that by using Enki with established benchmark datasets for image recognition (CIFAR-10 [119]) and object detection (KITTI [70]), relevant examples of environ- mental conditions that negatively impact the performance of DNNs (e.g., decreased lighting, haze, and the presence of rain) can be automatically identified. Furthermore, this chapter demonstrates that synthetic training data generated by Enki can be used to improve the robustness of these DNNs to the introduced environmental conditions. 1 Enki is an ancient Sumerian deity associated with knowledge, mischief, and creation [118]. 26 3.2 Assessing and Improving the Robustness of Learning Models This section provides an overview of how Enki can be used to assess and improve the robustness of a DNN when exposed to uncertain environmental conditions. Section 3.2.1 clarifies how an environmental condition is defined for this study, Section 3.2.2 describes the steps involved with assessing a DNN’s performance under a specific uncertain environmental condition, and Section 3.2.3 describes the steps involved with improving the robustness of a DNN. 3.2.1 Definition of an Environmental Condition This approach is intended for use when the given training data for a DNN contains no real-world examples of an environmental phenomenon of interest. It is assumed that the phenomenon can be simulated using examples of real-world data and a simulation environment. For illustrative purposes, this section considers a specific environmental condition where a raindrop may partially occlude the view of an autonomous system’s camera sensor. However, any environmental phenom- ena that the simulation environment is capable of replicating may be used in place of the raindrop condition. The raindrop occlusion condition is considered a “known unknown” for DNNs that have only been trained with clear-weather images. It is an environmental condition that has known appearance (i.e., its impact on sensor input can be described) but also has unknown effects on the DNN’s behavior. For this work, such environmental conditions can be described by a parameterized transformation function (𝑇), x′ = 𝑇 (g, x) (3.1) where g is a set of parameter values that describe a specific appearance of the condition (i.e., an operational context), x is an example of sensor data absent of the condition, and x′ is an example of how the sensor data would appear in the presence of the given context for the condition. For raindrop occlusion, parameters in g may describe a raindrop’s appearance (e.g., position, size, etc.), and each instance of g corresponds to an alternative context of a raindrop (e.g., a large raindrop in 27 the center of view versus a small one in the upper-right corner). Given a real-world camera image absent of any raindrop occlusion (x), transformation function 𝑇 is capable of generating an example of how the image would appear (x′) with a raindrop on the camera’s lens. In absence of real-world examples of raindrops occluding the camera’s view, synthetic exam- ples are generated via a ray tracing2 method. Modeling the visual appearance of rain to a high degree of detail has been studied extensively [67]. However, these models are often complex and computationally expensive when applied to datasets as large as those normally required for training or evaluating a DNN. 
Raindrops falling onto a camera lens can either settle as stationary pools of water or even produce streaks, depending on factors such as the angle of gravity, lens curvature, and wind. For simplicity, this chapter assumes raindrops are hemispherical and stationary on the lens (i.e., circular). These simulated raindrops can be described by the following properties: position, radius, and blur. For comparison, Figure 3.2 shows example images taken with a clean lens (Figure 3.2a), images of occluding real-world water droplets from the same view (Figure 3.2b), and synthetic droplets inserted into the scene (Figure 3.2c). Since the appearance of real-world raindrops includes rays of light coming from sources outside of the camera’s view, these simulated raindrops may not completely replicate all details of the real-world phenomenon. For cases where more realistic raindrop phenomena are needed, an alternative method for raindrop simulation may be used; however, for this study, it is assumed that the “reality gap” [96] is sufficiently small to ignore.

3.2.2 Assessing System Behavior When Exposed to an Uncertain Condition

To assess a DNN under a new, previously-unseen environmental condition, synthetic examples are generated using contexts (e.g., alternative raindrop variations) evolved by Enki to produce a wide range of behavior for the DNN. Synthetic test data is generated by transforming default test data with the contexts evolved by Enki. A data flow diagram (DFD) for this stepwise process is depicted in Figure 3.3, where circles represent computational steps, boxes represent external entities, arrows represent data flow, and data stores are represented within parallel lines. Each step is described in turn as follows.

2 Ray tracing is a method of image generation that simulates photorealistic effects by tracing the path rays of light project in a scene from a specific viewpoint (the image plane) back to the originating light sources [91].

(a) No Droplets (b) Real Droplets (c) Simulated Droplets
Figure 3.2: Examples of real-world and simulated water droplets. Images in (a) are not obscured by water droplets on the lens. Images in (b) have had real water droplets placed on the lens. Images in (c) show simulated droplets superimposed over the original images.

Figure 3.3: DFD for using Enki to assess a DNN with an uncertain environmental condition. Circles represent computational steps, boxes represent external entities, parallel lines mark persistent data stores, and connecting arrows show the flow of data between steps.

Step 1. Enki

Enki searches a space of possible contexts for the environmental condition of interest (e.g., every possible variation of a raindrop), assesses the behavior of the DNN under each selected context, and then archives the most diverse contexts based on the DNN’s observed behavior. The intent is to apply this technique for situations where the full impact of an environmental condition is not known for the DNN, and it is not possible to exhaustively assess every possible context. An evolution-based search procedure enables Enki to evaluate a population of multiple contexts for the DNN in parallel and guide the population towards the most diverse selection of contexts thus far observed. Enki requires a specification for the environmental condition and a specification for the system behavior to be observed.

Step 1.1.
Evolutionary Search

Each individual operational context is defined by a set of independent parameter values (g in Equation (3.1)) that describe characteristics of the operational environment. The user must specify a name for each parameter and permissible values. Each individual context generated by Enki is a sample taken from these specified value ranges. Parameter values may be real, integer, or categorical. Sampled values are then encoded by Enki into a numeric vector form, designated as a genome, with normalized values ranging between 0 and 1, based on the user-specified value ranges. For the raindrop occlusion condition, contexts can be specified by 𝑥, 𝑦, 𝑟𝑎𝑑𝑖𝑢𝑠, and 𝑏𝑙𝑢𝑟 parameters, where a specific context corresponds to explicit values selected for these parameters. For example, the context {𝑥 = 0.5, 𝑦 = 0.5, 𝑟𝑎𝑑𝑖𝑢𝑠 = 0.1, 𝑏𝑙𝑢𝑟 = 0.02} would correspond to a raindrop in the center of an image, radius covering 10% of the image, and with a small amount of blur. Enki evaluates a DNN’s behavior when exposed to each generated context (see Step 2 for details about behavior metrics). With each generated context, a user-defined evaluation procedure is executed to monitor the behavior of the DNN according to the given behavior metrics. The results are then encoded into a numeric vector form, designated as a phenome, with normalized values ranging between 0 and 1. Enki uses an EA (Algorithm 3.1) to explore the space of possible operational contexts and select those that lead to the most diversity. Diversity is determined by comparing the phenomes (i.e., system behavior) associated with each context within a population (see Step 1.3 for details about how diversity is computed). The initial population is created from random-generated contexts. With each generation of evolution, the population is said to become more diverse when the phenome of each context becomes less similar to its nearest neighbors. A very diverse population of contexts will each produce mutually unique system behavior (regardless of how similar the operating environment may appear between contexts). For each generation, the population is also compared to a separate archive that is maintained to track the least similar contexts discovered. Standard evolutionary operations are used to evolve the population’s genomes [52], such as selection, recombination, and mutation, over multiple generations to guide the population towards contexts that improve the phenome diversity of the archive. Round-robin tournament selection chooses which parent genomes are recombined, favoring genomes that result in more diverse phenomes (see Step 1.3). Random single-point crossover produces offspring genomes by merging the beginning of one parent’s genome with the end of the other’s at a randomly selected point within the genomes. Finally, creep mutation is implemented to slightly increase or decrease the value of each element in an offspring genome according to a mutation rate probability. The amount of change for each mutated element is randomly drawn from a uniform distribution bounded by a mutation shift variable.

Step 1.2. Commit to Archive

For each generation, the current population is evaluated and compared to an archived collection of contexts. This archive preserves the least similar contexts across generations (with respect to how the DNN’s behavior is affected), and each context in the current population is ranked according to how it compares to those in the archive. Contexts that exceed a novelty threshold are added to the archive.
When the archive is full, it is truncated by discarding the archived contexts that contribute the least to its overall diversity.

Step 1.3. Rank Individuals

Enki ranks each individual context according to how similar its phenome (i.e., system behavior) is to its nearest neighbors in the archive, favoring those that are less similar. Similarity is determined by the Euclidean distance between phenomes as follows,

\[
\mathrm{novelty}(\mathbf{p}, P, k) = \operatorname{mean}\Big( \operatorname{min\text{-}k}\big( \{\, \lVert \mathbf{p} - \mathbf{p}_i \rVert_2 \;\; \forall\, \mathbf{p}_i \in P : \mathbf{p}_i \neq \mathbf{p} \,\},\ k \big) \Big) \tag{3.2}
\]

where the novelty score is computed as the average distance between a phenome (p) and its 𝑘 nearest neighbors in the set of all archived phenomes (P). Contexts with phenomes that are more distant from the closest matching phenomes in the archive (i.e., less similar) are given priority. By focusing on phenome diversity over genome diversity, Enki is able to uncover and archive cases where even slight changes to the operational context produce significantly different system behavior (e.g., adversarial examples for DNNs). Figure 3.4 plots how novelty scores for archived contexts typically evolve over each generation. The blue curve shows the mean novelty score for all archived contexts. The green shaded area shows the range between the minimum and maximum archived novelty score. An increasing mean novelty score indicates that archived contexts have become more distant from each other, and a narrowing range of novelty scores indicates contexts have become more uniformly spaced.

Step 2. Evaluate Context

To assess the effect a specific context has on the DNN, a subset of the original, unaltered training data is selected for use as a basis for evaluation. Using the transformation function (Equation (3.1)) for the environmental condition of interest, synthetic examples (x′) are generated by applying the parameter values associated with a given context (g) to each sampled training example (x). The behavior of the DNN is then observed when processing the entire set of synthetic examples. This procedure is described by Algorithm 3.2. Three behavior metrics are considered for this study: the DNN’s classification F-scores, neuron activation coverage, and neuron activation pattern. Alternative metrics may also be considered for future investigation, such as the Kullback-Leibler divergence [203] or the Receiver Operating Characteristic (ROC) curve [203]. Classification accuracy is defined as the ratio of correct classifications to the total number of classifications made. However, when considered alone, accuracy can be a misleading metric, especially when classification categories are not equally balanced. Therefore, F-scores are considered for each classification category, defined as the harmonic mean of the precision and recall for each respective category [192]. A DNN’s neuron activation coverage is computed by monitoring the cumulative output of each neuron in each activation layer of the DNN and then computing the ratio of activated neurons to the entire set of neurons. Thus, the neuron coverage indicates the fraction of the network exercised by the entire set of given synthetic images. To determine when a neuron is activated, an activation threshold must be defined, and any neuron output with an absolute value that exceeds the threshold is considered activated. DNNs for this study primarily use the rectified linear unit (ReLU) [157] activation function (commonly used for image processing DNNs [120]), and neurons are considered activated with any output value greater than zero.
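For a PyTorch model, neuron activation coverage as described above could be computed with forward hooks. The following sketch is one possible instrumentation, limited to ReLU modules and using the zero activation threshold noted above; it is an illustrative assumption, not the exact implementation used in this work.

import torch
import torch.nn as nn

def neuron_coverage(model, images, threshold=0.0):
    # Accumulate, per ReLU layer, which neurons ever exceed the activation threshold.
    activated = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Collapse the batch dimension: a neuron counts as activated if its
            # absolute output exceeds the threshold for any input image.
            fired = (output.detach().abs() > threshold).any(dim=0)
            activated[name] = activated[name] | fired if name in activated else fired
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(images)                     # `images` is a batch tensor of synthetic inputs
    for h in handles:
        h.remove()

    total = sum(v.numel() for v in activated.values())
    fired = sum(int(v.sum()) for v in activated.values())
    return fired / total if total else 0.0

Recording the concatenated boolean vectors from `activated`, rather than only their ratio, would yield the neuron activation pattern discussed next.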
Neuron activation patterns are computed in a manner similar to the neuron coverage. A vector is constructed with elements having a one-to-one correspondence to each neuron in each activation layer of the DNN. Elements corresponding to activated neurons are assigned a value of 1, whereas all others are assigned a value of 0. These patterns are observed as a means for measuring which specific portions of the DNN are exercised by the given synthetic images, analogous to the execution paths for a traditional software program. Including the activation pattern as a metric for Enki to diversify will encourage each archived context to activate mutually unique sections of the DNN under test. Figure 3.5 illustrates two examples of possible activation patterns for the same simple neural network. Activated neurons (depicted as green) contribute to the final output, while non-activated neurons have zero contribution.

Enki Archive Novelty Scores vs. Generation
Figure 3.4: Plot of novelty scores for Enki-generated raindrop contexts over 50 generations. The blue curve shows the mean novelty score of all contexts archived. The green-shaded area shows the bounds for the highest and lowest novelty score in the archive.

(a) Example 1 (b) Example 2
Figure 3.5: Example activation patterns for the same DNN. Activated neurons (marked green) are assigned a value of 1, and inactive neurons are assigned a value of 0. Finally, all designated values are concatenated into a vector form.

Algorithm 3.1 Enki
1: function evolutionary-search(n_generations, eval_func)
2:   archive ← ∅
3:   pop ← random-population()  ⊲ Initialize population with random genomes.
4:   for 0 to n_generations do  ⊲ Evolve over given number of generations.
5:     pop ← selection(pop)  ⊲ Select genomes via tournament selection.
6:     pop ← recombination(pop)  ⊲ Recombine genomes via single-point crossover.
7:     pop ← mutation(pop)  ⊲ Mutate genomes via creep mutation.
8:     pop ← evaluation(pop, eval_func)  ⊲ Evaluate phenomes with given procedure.
9:     archive, pop ← commit-to-archive(archive, pop)
10:  return archive
11: function commit-to-archive(archive, pop)
12:   scores ← rank-individuals(archive, pop)  ⊲ Compute novelty scores via Eq. (3.2).
13:   pop ← pop : scores > novelty_threshold  ⊲ Filter out any below minimum threshold.
14:   archive ← truncate(archive ∪ pop)  ⊲ Combine and truncate to desired size.
15:   return archive, pop

Algorithm 3.2 Evaluate Context
1: function evaluate-context(context)
2:   train_subset ← sample-training-data()  ⊲ Sample unaltered training data.
3:   synth_subset ← transform(context, train_subset)  ⊲ Transform w/ the given context.
4:   behavior ← evaluate-dnn-behavior(synth_subset)  ⊲ Evaluate synthetic data.
5:   return behavior

Since Enki’s evolved contexts are encouraged to exhibit mutually unique behavior in the DNN (i.e., phenome diversity), each context exercises different aspects of the DNN regardless of how similar operational characteristics may be between test cases (i.e., genome similarity). For example, Figure 3.6 illustrates both genotypic and phenotypic properties of each raindrop context evolved by Enki for the raindrop occlusion condition. Plots on top depict archived contexts as circles with a position and radius in proportion to each occluding raindrop. On bottom, the DNN accuracy is shown for processing images under each context. Enki focuses on evolving an archive of raindrops that have a diverse impact on the DNN rather than having diverse appearances.
After 30 generations, Enki found that large raindrops towards the center of the image produced the most varied behavior for the DNN. Furthermore, since Enki’s objective is diversity (rather than optimizing test cases to favor any specific behavior metric), the resulting suite of test cases is not expected to be strictly failure cases. Contexts generated by Enki are expected to cover a wide range for each behavior metric, and therefore, the resulting contexts enable an assessment of the DNN that ignores any bias towards a specific level of performance or any bias towards the likelihood of any specific context to occur.

(a) Generation 1 (b) Generation 15 (c) Generation 30
Figure 3.6: Plots of Enki-archived raindrops, evolved over 30 generations. On top, raindrops are plotted as overlapping circles at the same relative image position and with the same relative raindrop radius. On bottom, accuracy is shown for when the DNN is exposed to each raindrop. Enki starts with a random assortment of raindrops and evolves raindrops that produce diverse effects on the DNN, rather than evolving raindrops with diverse appearances. After 30 generations, Enki found that larger raindrops towards the center created the widest distribution of DNN behavior.

Step 3. Evaluate Robustness

To evaluate the robustness of a DNN to a specific environmental condition, each Enki-generated context is applied to the default test data. A robustness measure (𝜓) is then computed for each test input with the following formula [230],

\[
\psi(x) = \frac{1}{\max_{\delta \in \mathit{set}} D_{KL}\big(P(x),\, P(x + \delta)\big)} \tag{3.3}
\]

where D_KL is the Kullback-Leibler divergence3 between the DNN’s predictions on default test data, P(x), and its predictions on synthetically altered data, P(x + δ). A DNN’s robustness for any given input (x) is the inverse of the maximum D_KL found for the entire set of generated contexts. Thus, a more robust DNN implies that its output has been disturbed less when exposed to the noise introduced by the given contexts. To determine the robustness of a set of test inputs (X = [x0, ..., xn]), the mean robustness is taken over the whole set. The entire evaluation procedure is shown in Algorithm 3.3.

3 Kullback-Leibler divergence is a measure of the relative entropy between two probability distributions [203].

Algorithm 3.3 Evaluate Robustness
1: function evaluate-dnn-robustness(dnn, test_data, contexts)
2:   dataset_robustness ← ∅
3:   for x in test_data do  ⊲ Iterate through each test example.
4:     d_max ← 0
5:     y ← evaluate-dnn(dnn, x)  ⊲ Evaluate the unaltered example.
6:     for context in contexts do  ⊲ Iterate through each given context.
7:       x′ ← transform(context, x)  ⊲ Transform x into synthetic example x′.
8:       y′ ← evaluate(dnn, x′)  ⊲ Evaluate the synthetic example.
9:       d ← kl-divergence(y, y′)  ⊲ Compute divergence for synthetic example.
10:      d_max ← max(d, d_max)  ⊲ Track which context created the most divergence.
11:    robustness ← 1/d_max
12:    dataset_robustness ← dataset_robustness ∪ robustness
13:  return mean(dataset_robustness)

3.2.3 Improving System Robustness for Exposure to an Uncertain Condition

Aside from using the diverse contexts from Enki to generate test data for the DNN, additional steps can be taken to create a more robust DNN by introducing synthetic training examples, created in a similar manner. Data flow for these additional steps is depicted in Figure 3.7, and descriptions for each step follow.

Step 4.
Train Deep Neural Network With Synthetic Data

A new DNN is trained by mixing synthetic training data with the default training dataset during the training phase (Algorithm 3.4). With each epoch, a fraction (𝜌) of the training images are chosen to be transformed to match a randomly selected context from Enki’s archive. After multiple training iterations (i.e., epochs), the DNN is exposed to each training image in its original form and also a mixture of different synthetic variants (based on Enki’s archived contexts). The new DNN is structurally identical to the original DNN, with the only difference being that the weights chosen by this training procedure are optimized to better handle Enki’s archived contexts, as opposed to the weights chosen for the original DNN, where only the unaltered training data is considered (i.e., the targeted uncertain environmental conditions are not included).

Figure 3.7: DFD for using Enki-generated contexts to retrain and evaluate a more robust DNN with synthetic data.

Algorithm 3.4 Train DNN with Synthetic Data
1: function train-with-synth-data(dnn, train_data, contexts, n_epochs)
2:   dnn ← init-weights(dnn)  ⊲ Initialize weight values for the given DNN.
3:   for 0 to n_epochs do
4:     synth_data ← mix-data(train_data, contexts)  ⊲ Mix synthetic & unaltered data.
5:     dnn ← fit-weights(dnn, synth_data)  ⊲ Fit weights to data (gradient descent).
6:   return dnn
7: function mix-data(train_data, contexts)
8:   mixed_data ← ∅
9:   for x in train_data do
10:    if random() < 𝜌 then  ⊲ Transform according to transformation rate 𝜌.
11:      context ← select-random(contexts)  ⊲ Select random context.
12:      x ← transform(context, x)  ⊲ Transform data with selected context.
13:    mixed_data ← mixed_data ∪ x  ⊲ Add current example to set of mixed data.
14:  return mixed_data

Additional factors may be considered when retraining DNNs with synthetic data in practice. To help reduce overfitting to the contexts archived by Enki, some fuzzing could be introduced when data is transformed (the transform step of Algorithm 3.4). For this study, fuzzing has not been introduced, in order to fairly compare the quality of the exact contexts selected by Enki. Furthermore, this study assumes 𝜌 = 0.5 to give equal balance to unaltered and synthetic training examples. Alternative values of 𝜌 will bias the training to either emphasize or de-emphasize the presence of the environmental condition of interest. Finally, additional constraints should be considered when applying a specific context to a specific training input. For example, when applying raindrops to an image, an additional step should be taken to ensure that the photographed object of interest is not completely obscured by the introduced raindrop. In cases where it is completely obscured, the resulting synthetic image should not be included in the training dataset.

Step 5. Evaluate Robustness

Since the new DNN is structurally identical to the original DNN (i.e., trained only on default data), it can be evaluated with the same procedure as Step 3 (Algorithm 3.3), where synthetic test data is generated to reflect the contexts archived by Enki. By using the same evaluation procedure, the performance results of the new DNN can be directly compared to the original DNN.

3.3 Empirical Validation

This section presents results from experiments conducted with Enki on DNNs trained with the CIFAR-10 and KITTI datasets. For these experiments, Enki has been configured with the settings in Table 3.1.
For comparison, experiments have also been conducted using DeepTest [211] and a random (i.e., Monte Carlo) generation method. Results from these experiments answer the following research questions with accompanying null hypotheses (𝐻0) and alternative hypotheses (𝐻1):

RQ1.) Can we use Enki to identify gaps in training that cause an LES to perform poorly?
𝐻0: Testing with Enki-generated data does not significantly decrease DNN performance.
𝐻1: Testing with Enki-generated data significantly decreases DNN performance.

RQ2.) Can we use Enki to improve the robustness of an LES to the effects of uncertain environmental conditions?
𝐻0: Retraining with Enki-generated data does not significantly improve DNN robustness.
𝐻1: Retraining with Enki-generated data significantly improves DNN robustness.

Synthetic test datasets resulting from each method are labeled T with a corresponding subscript to indicate the source method for generation (e.g., Tdeep, Tenki, and Trand to represent DeepTest-generated, Enki-generated, and random-generated test data, respectively). Similarly, DNNs are labeled with a subscript to indicate a source for synthetic training data (e.g., DNNenki is trained with Enki-generated data). Default (i.e., unaltered) test datasets are referred to as Tbase, and DNNs trained only on unaltered training data are referred to as DNNbase. When referring to a specific application, a prefix will be attached (e.g., CIFAR10-DNNbase and CIFAR10-Tbase). For brevity, the application prefix is omitted when the discussion applies to both applications.

Table 3.1: Enki Configuration Settings
Setting                  Value
num_generations          50 generations
archive_size             50 individuals
population_size          10 individuals
tournament_size          3 comparisons
mutation_rate            14%
mutation_shift           20%
num_nearest_neighbors    3 individuals

In addition to raindrop occlusion (see Section 3.2), additional environmental conditions have been included in these experiments, such as variable brightness and contrast. Transformation functions for these conditions have been implemented using standard image enhancement functions included in the Pillow [143] module for Python. For variable brightness, pixel values for each input image have been altered by a uniform brightness factor, where brightness = 0 results in a black image, and brightness = 1 results in the original pixel intensity. For variable contrast, pixel values for each input image have been altered by a contrast factor, where contrast = 0 results in a gray image, and contrast = 1 results in the original image contrast. Experiments have been conducted with these transformations implemented in isolation as well as in combination with the raindrop condition. Parameters with value ranges for all environmental conditions are provided in Table 3.2 when applied to both the CIFAR-10 and KITTI applications.

Table 3.2: Environmental Parameter Ranges
Parameter          CIFAR-10 Value Range    KITTI Value Range
brightness         0 to 1                  0 to 1
contrast           0 to 1                  0 to 1
raindrop_x         0 to 31 pixels          0 to 1241 pixels
raindrop_y         0 to 31 pixels          0 to 374 pixels
raindrop_radius    0 to 9 pixels           0 to 123 pixels
raindrop_blur      1 to 2 pixels           9 to 18 pixels

3.3.1 Deep Neural Network Applications

For validation, this chapter’s approach has been applied to two benchmark applications (described in Chapter 2). Both applications implement image-processing DNNs to learn from labeled training data and make predictions for new data.
3.3.1 Deep Neural Network Applications

For validation, this chapter's approach has been applied to two benchmark applications (described in Chapter 2). Both applications implement image-processing DNNs that learn from labeled training data and make predictions for new data. The first application implements a DNN for image recognition, where each image input is a photograph of a single object, and the task is to classify the type of object photographed. The second application implements a DNN for object detection, where each image input is a photograph of zero or more objects, and the task is to both locate and classify each object photographed. The focus of this study is how to assess and improve any pre-trained DNN with respect to conditions not covered by existing training or validation data. A baseline CIFAR10-DNNbase has been trained on default training images to achieve a 91% accuracy on default test images. Similarly, a baseline KITTI-DNNbase has been trained on default training images to achieve an 88% recall on default test images. Any specific specializations in architecture or hyper-parameter selection that might incrementally improve performance of the described DNNs on default test data are considered tangential to this study. The objective of this study is to assess and improve each DNN's performance when new environmental phenomena are introduced into each respective dataset.

3.3.2 Evaluation of System Behavior Under Uncertain Environmental Conditions

To assess the performance of CIFAR10-DNNbase and KITTI-DNNbase under each environmental condition, Enki creates an archive of diverse contexts for each respective condition. The Tenki synthetic test dataset was created by applying Enki-generated contexts to the default test data. Synthetic test datasets Trand and Tdeep were also created for comparison. Trand was created by applying random-generated contexts to the default test data, and Tdeep was created by using the DeepTest method [211].

3.3.2.1 System Performance on Synthetic Test Data

To address the first research question (RQ1), the quality of synthetic test data produced by Enki (Tenki) is compared to alternative test datasets (Tbase, Tdeep, and Trand) under each environmental condition introduced to the CIFAR-10 and KITTI datasets. The null hypothesis (H0) is that there is no difference between the mean test performance of each DNNbase for Tenki when compared to Tdeep and when compared to Trand (i.e., H0: μdiff = 0, where μdiff is the mean of the paired differences in performance between the comparison dataset and Tenki). The alternative hypothesis (H1) is that the mean test performance of each DNNbase will be lower for Tenki than for Tdeep and Trand (i.e., H1: μdiff > 0). To test each hypothesis and validate the statistical significance of these results, paired samples t-tests [214] have been conducted when comparing Tenki versus Tdeep and Tenki versus Trand.

Plots in Figure 3.8 and Figure 3.9 show the respective performance of CIFAR10-DNNbase and KITTI-DNNbase when evaluated with both unaltered and synthetic test data from each method. Results shown are the mean over multiple trials to account for variation (a 95% confidence interval is shown by whiskers to illustrate variation). Figure 3.8 shows the mean accuracy of CIFAR10-DNNbase for test datasets generated for each environmental condition in isolation (e.g., Brightness, Contrast, and Raindrop) as well as in combination. For all trials, CIFAR10-DNNbase had the lowest mean accuracy for CIFAR10-Tenki. Similarly, Figure 3.9 shows the mean recall of KITTI-DNNbase for each test dataset and each environmental condition. KITTI-DNNbase had the lowest mean recall for KITTI-Tenki.

Test Accuracy for CIFAR10-DNNbase

Figure 3.8: Test accuracy of CIFAR10-DNNbase for each respective test dataset and each environmental condition. Displayed values are the mean over multiple trial runs, with 95% confidence intervals shown by whiskers.
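The paired samples t-tests described above can be carried out with SciPy as sketched below; the per-trial accuracy values are illustrative placeholders rather than the actual experimental measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical per-trial accuracies of DNNbase on two generated test datasets
# (placeholder values; not the measurements reported in this chapter).
acc_on_T_deep = np.array([0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.71, 0.70, 0.69, 0.72])
acc_on_T_enki = np.array([0.21, 0.19, 0.24, 0.20, 0.22, 0.18, 0.23, 0.20, 0.19, 0.22])

# Paired differences: performance on the comparison dataset minus performance on Tenki.
diff = acc_on_T_deep - acc_on_T_enki
mu_diff, sigma_diff = diff.mean(), diff.std(ddof=1)

# One-sided paired t-test of H0: mu_diff = 0 against H1: mu_diff > 0.
t_stat, p_value = stats.ttest_rel(acc_on_T_deep, acc_on_T_enki, alternative="greater")
print(f"mu_diff={mu_diff:.3f}, sigma_diff={sigma_diff:.3f}, t={t_stat:.2f}, p={p_value:.4f}")
```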
Test Recall for KITTI-DNNbase

Figure 3.9: Test recall of KITTI-DNNbase for each respective test dataset and each environmental condition. Displayed values are the mean over multiple trial runs, with 95% confidence intervals shown by whiskers.

Tables 3.3 and 3.4 provide t-test results and corresponding p-values for these experiments. In Table 3.3, the value μdiff is the mean difference in CIFAR10-DNNbase accuracy for each pair of generated test datasets over multiple trials, and σdiff is the corresponding standard deviation. In Table 3.4, μdiff is the mean difference in KITTI-DNNbase recall, and σdiff is the corresponding standard deviation. For both the CIFAR-10 and KITTI experiments, there is strong evidence (p ≤ 0.01) to reject H0 and support H1. Therefore, these results show that Enki was able to generate test examples that cause both DNNs to perform more poorly than they do on Tbase, Tdeep, and Trand, thus addressing RQ1.

Table 3.3: Paired Samples t-Test For Significance of CIFAR10-Tenki Accuracy Comparisons

               CIFAR10-Tenki vs. Tdeep             CIFAR10-Tenki vs. Trand
               μdiff    σdiff    t       p         μdiff    σdiff    t        p
Brightness     50.34    1.86     85.59   <0.01     52.79    0.62     269.11   <0.01
Contrast       44.58    1.38     97.07   <0.01     48.38    1.12     136.56   <0.01
Raindrop       16.88    2.71     18.71   <0.01     17.85    2.12     26.61    <0.01
Combination    37.59    4.07     29.19   <0.01     37.93    1.49     80.38    <0.01

Table 3.4: Paired Samples t-Test For Significance of KITTI-Tenki Recall Comparisons

               KITTI-Tenki vs. Tdeep               KITTI-Tenki vs. Trand
               μdiff    σdiff    t       p         μdiff    σdiff    t        p
Brightness     9.71     1.60     14.91   <0.01     6.33     1.25     12.40    <0.01
Contrast       8.09     2.66     7.46    <0.01     3.12     1.19     6.40     <0.01
Raindrop       7.37     1.20     13.78   <0.01     7.16     0.74     21.50    <0.01
Combination    0.99     0.45     4.40    0.01      4.89     0.81     12.12    <0.01

3.3.2.2 Analysis of Resulting System Behavior

To illustrate the diversity of DNN behavior produced by Enki-generated contexts, Figures 3.10 and 3.11 compare random-generated and Enki-generated contexts from each trial run. Results for the neuron coverage and accuracy of CIFAR10-DNNbase for random-generated and Enki-generated contexts are shown in Figures 3.10a and 3.10b, respectively. Each point corresponds to an individual context. Bold (i.e., colored) points show contexts from a single trial run, while gray points show contexts from all trial runs. Regression lines are drawn as dashed lines, with a Pearson's correlation coefficient (r) computed for each set of results and displayed in the bottom corner of each plot; the Pearson's correlation coefficient measures the linear association of two variables and ranges between −1 and 1 [110]. The overall mean accuracy is marked by a vertical dotted line in each plot for each environmental condition. Similarly, Figures 3.11a and 3.11b plot the relationship between neuron coverage and recall of KITTI-DNNbase for random-generated and Enki-generated contexts. Contexts generated by Enki (Figure 3.10b and Figure 3.11b) are more evenly distributed along both axes of each plot, such that each context activates a different percentage of neurons and provides a different level of difficulty for the DNN. In contrast, a majority of the random-generated contexts are concentrated in the upper-right corner of each plot, resulting in a higher percentage of neurons covered and fewer misclassifications. These results imply that a random selection of environmental properties is less likely to produce challenging test examples for CIFAR10-DNNbase or KITTI-DNNbase than test examples generated by Enki.
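For reference, the correlation coefficients and regression lines shown in these plots can be computed directly from the per-context measurements; the sketch below assumes two arrays holding neuron coverage and accuracy for each generated context (placeholder values).

```python
import numpy as np
from scipy import stats

# Placeholder per-context measurements (not the experimental data): fraction of
# neurons activated and resulting accuracy for each generated context.
neuron_coverage = np.array([0.42, 0.55, 0.61, 0.37, 0.70, 0.48, 0.66, 0.52])
accuracy        = np.array([0.31, 0.58, 0.63, 0.22, 0.77, 0.49, 0.71, 0.55])

# Pearson's correlation coefficient r ranges between -1 and 1.
r, p = stats.pearsonr(neuron_coverage, accuracy)

# A least-squares regression line like the dashed lines in Figures 3.10 and 3.11.
slope, intercept = np.polyfit(neuron_coverage, accuracy, deg=1)
print(f"r = {r:.2f} (p = {p:.3f}); accuracy ~ {slope:.2f} * coverage + {intercept:.2f}")
```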
CIFAR10-DNNbase Neuron Coverage vs. Accuracy for Generated Contexts
(a) Random-Generated Contexts   (b) Enki-Generated Contexts

Figure 3.10: Random-generated (a) and Enki-generated (b) contexts are plotted to show the relationship between the neuron coverage and accuracy of CIFAR10-DNNbase under each respective environmental condition. Bold (i.e., colored) points show contexts resulting from a single trial run. Dashed lines are regression lines with a Pearson's correlation coefficient (r). Dotted vertical lines mark the mean accuracy over all contexts. Compared to random generation, Enki-generated contexts are more evenly distributed and result in more misclassifications overall (i.e., lower accuracy).

For the variable brightness and contrast conditions, there is a strong positive correlation (r > 0.5) between neuron coverage and accuracy, which implies that more challenging contexts are likely to activate fewer neurons in the DNN. An intuitive rationale for this result is that information is lost when decreasing brightness or contrast, and weaker input signals will result in fewer activated neurons. However, greedy techniques that select test examples by maximizing neuron coverage (e.g., DeepTest) assume that adverse properties of the environment will correlate with increased neuron coverage. Contexts created by Enki are more evenly distributed along the y-axis (i.e., neuron coverage). Since no assumption is made about how neuron coverage relates to failing test examples, more challenging test examples can be found by Enki. For the raindrop occlusion condition, no significant variation in the DNN's neuron coverage has been observed between contexts, most likely because raindrops only affect a small portion of the source image, and therefore, no strong correlation could be found between neuron coverage and accuracy. When combining all environmental conditions, there is a strong correlation between neuron coverage and accuracy. Finally, in comparison to random-generated contexts, Enki-generated contexts resulted in the lowest overall accuracy for CIFAR10-DNNbase and the lowest overall recall for KITTI-DNNbase.

KITTI-DNNbase Neuron Coverage vs. Recall for Generated Contexts
(a) Random-Generated Contexts   (b) Enki-Generated Contexts

Figure 3.11: Random-generated (a) and Enki-generated (b) contexts are plotted to show the relationship between the neuron coverage and recall of KITTI-DNNbase under each respective environmental condition. Bold (i.e., colored) points show contexts resulting from a single trial run. Dashed lines are regression lines with a Pearson's correlation coefficient (r). Dotted vertical lines mark the mean recall over all contexts. Compared to random generation, Enki-generated contexts are more evenly distributed and result in more misclassifications overall (i.e., lower recall).
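Neuron coverage, as used in these comparisons, is typically defined (in the style of DeepXplore and DeepTest) as the fraction of neurons whose scaled activation exceeds a threshold for a given input; the sketch below illustrates that idea over precomputed layer activations and is an assumption-laden stand-in rather than the exact measurement used in these experiments.

```python
import numpy as np

def neuron_coverage(layer_activations, threshold=0.25):
    """Fraction of neurons whose min-max-scaled activation exceeds a threshold
    for one input. layer_activations is a list of 1-D arrays, one per layer."""
    covered, total = 0, 0
    for act in layer_activations:
        span = act.max() - act.min()
        scaled = (act - act.min()) / span if span > 0 else np.zeros_like(act)
        covered += int((scaled > threshold).sum())
        total += act.size
    return covered / total

# Illustrative usage with random activations standing in for a real DNN's layers.
rng = np.random.default_rng(0)
activations = [rng.random(64), rng.random(128), rng.random(10)]
print(f"neuron coverage: {neuron_coverage(activations):.2%}")
```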
3.3.3 Evaluation of System Robustness

To address the second research question (RQ2), the CIFAR-10 and KITTI DNNs were retrained with a mixture of synthetic training data to improve each DNN's performance in the presence of the introduced environmental conditions (see Section 3.2.3). DNNenki was retrained with a mixture of unaltered training images and training images transformed with Enki-generated contexts. For further comparison of Enki-generation versus random-generation, experiments have been conducted to compare DNNenki to a DNN retrained similarly but with a mixture of random-generated contexts (DNNrand).

The hypothesis for RQ2 is that DNNenki will be more robust to each introduced environmental condition when compared to DNNbase. Therefore, the null hypothesis (H0) is that there is no difference between the mean robustness of DNNenki when compared to DNNbase (i.e., H0: μdiff = 0). The alternative hypothesis (H1) is that DNNenki will be more robust than DNNbase for each respective environmental condition (i.e., H1: μdiff > 0). To validate the statistical significance of these results and test each hypothesis, paired samples t-tests [214] have been conducted when comparing DNNenki versus DNNbase.

Plots in Figures 3.12 and 3.13 show the respective test performance of the retrained DNNs (DNNenki and DNNrand) when evaluated with both unaltered and synthetic test datasets. Figure 3.12 shows that both CIFAR10-DNNenki and CIFAR10-DNNrand are more accurate on the introduced synthetic data when compared to CIFAR10-DNNbase. Similarly, Figure 3.13 shows that both KITTI-DNNenki and KITTI-DNNrand have higher recall on the introduced synthetic data when compared to KITTI-DNNbase.

Test Accuracy for Synthetically Retrained CIFAR-10 DNNs
(a) Test Accuracy for CIFAR10-DNNrand   (b) Test Accuracy for CIFAR10-DNNenki

Figure 3.12: Test performance of CIFAR10-DNNrand (a) and CIFAR10-DNNenki (b) for each respective test dataset and for each environmental condition. Displayed values are the mean accuracy over multiple trial runs, with 95% confidence intervals shown by whiskers. When compared to CIFAR10-DNNbase (Figure 3.8), synthetic test data accuracy has improved for both CIFAR10-DNNrand and CIFAR10-DNNenki.

Test Recall for Synthetically Retrained KITTI DNNs
(a) Test Recall for KITTI-DNNrand   (b) Test Recall for KITTI-DNNenki

Figure 3.13: Test performance of KITTI-DNNrand (a) and KITTI-DNNenki (b) for each respective test dataset and for each environmental condition. Displayed values are the mean recall over multiple trial runs, with 95% confidence intervals shown by whiskers. When compared to KITTI-DNNbase (Figure 3.9), synthetic test data recall has improved for both KITTI-DNNrand and KITTI-DNNenki.

When comparing the robustness of each DNN to the introduced environmental conditions, robustness is evaluated according to Equation (3.3). This form of robustness compares a DNN's output when processing an unaltered image to its output when processing each synthetic variant. A higher robustness value indicates that the DNN's output has been disturbed less by the associated environmental condition. Plots in Figures 3.14 and 3.15 show the robustness of each retrained DNN when exposed to synthetic test datasets for each respective environmental condition. Results are shown as the mean over multiple trials to account for variation (a 95% confidence interval is shown by whiskers to illustrate variation). The mean robustness across all test datasets is shown by the horizontal gray line in each plot (also displayed as μ in the top corner). Both CIFAR10-DNNbase (Figure 3.14) and KITTI-DNNbase (Figure 3.15) were observed to be the least robust to the introduced environmental conditions. Introducing random-generated synthetic training data improved DNN robustness, but on average, use of Enki-generated synthetic training data led to a more robust DNN.
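Equation (3.3), defined earlier in this chapter, quantifies how much a DNN's output is disturbed by an environmental condition. As a rough, assumption-laden stand-in for that idea (not the chapter's exact formulation), the sketch below scores robustness as the fraction of inputs whose predicted label is unchanged when the transformation is applied.

```python
import numpy as np

def robustness_score(preds_on_clean, preds_on_synthetic):
    """Illustrative robustness stand-in: the fraction of inputs whose prediction
    is unchanged when the environmental transformation is applied. The chapter's
    Equation (3.3) may be formulated differently."""
    preds_on_clean = np.asarray(preds_on_clean)
    preds_on_synthetic = np.asarray(preds_on_synthetic)
    return float(np.mean(preds_on_clean == preds_on_synthetic))

# Placeholder predicted class labels for ten images, clean vs. raindrop-transformed.
clean_preds = [3, 5, 1, 7, 0, 3, 8, 2, 6, 4]
synth_preds = [3, 5, 2, 7, 0, 3, 8, 2, 1, 4]
print(f"robustness: {robustness_score(clean_preds, synth_preds):.2f}")  # higher = less disturbed
```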
Tables 3.5 and 3.6 provide t-test results and corresponding p-values for these experiments. The value μdiff is the mean difference in robustness for each pair of DNNs, and σdiff is the corresponding standard deviation. For both the CIFAR-10 and KITTI experiments, when comparing DNNenki to DNNbase, there is evidence (p ≤ 0.05) to reject H0 and support H1. These results show that retraining a DNN with Enki-generated synthetic training data produced a more robust DNN than DNNbase, thus rejecting H0 and supporting H1 for RQ2. Furthermore, these results show that retraining with Enki-generated synthetic data is more likely to produce a more robust DNN than retraining with random-generated synthetic data.

Table 3.5: Paired Samples t-Test Results For CIFAR10-DNNenki Robustness Comparisons

               CIFAR10-DNNenki vs. DNNbase         CIFAR10-DNNenki vs. DNNrand
               μdiff    σdiff    t       p         μdiff    σdiff    t       p
Brightness     4.70     1.38     8.98    <0.01     0.96     0.37     6.88    <0.01
Contrast       4.99     1.20     10.20   <0.01     1.68     0.12     35.46   <0.01
Raindrop       2.93     0.30     23.91   <0.01     0.95     0.55     4.24    <0.01
Combination    2.67     0.95     6.88    <0.01     0.45     0.28     3.54    0.01

Table 3.6: Paired Samples t-Test Results For KITTI-DNNenki Robustness Comparisons

               KITTI-DNNenki vs. DNNbase           KITTI-DNNenki vs. DNNrand
               μdiff    σdiff    t       p         μdiff    σdiff    t       p
Brightness     1.42     0.16     15.33   <0.01     0.72     0.24     5.18    0.02
Contrast       2.59     1.06     5.45    <0.01     0.65     0.12     7.62    0.04
Raindrop       1.19     0.24     7.02    0.05      2.38     0.33     10.18   0.03
Combination    5.76     0.10     82.38   <0.01     1.51     0.36     5.88    0.05

Robustness Results for CIFAR-10 DNNs

Figure 3.14: Box plots of the robustness of the CIFAR-10 DNNs (rows) when evaluated against each test dataset for each environmental condition (columns). Robustness (y-axis) was measured over multiple trials for each test dataset (x-axis). Each box shows the interquartile range of the DNN's measured robustness. Median values are marked in orange, and whiskers show the full range. For each plot, the mean (μ) robustness found across all test datasets for an environmental condition is also shown (top-left corner and gray line). On average, CIFAR10-DNNenki was found to be the most robust to each introduced environmental condition.

Robustness Results for KITTI DNNs

Figure 3.15: Box plots of the robustness of the KITTI DNNs (rows) when evaluated against each test dataset for each environmental condition (columns). Robustness (y-axis) was measured over multiple trials for each test dataset (x-axis). Each box shows the interquartile range of the DNN's measured robustness. Median values are marked in orange, and whiskers show the full range. For each plot, the mean (μ) robustness found across all test datasets for an environmental condition is also shown (top-left corner and gray line). On average, KITTI-DNNenki was found to be the most robust to each introduced environmental condition.

3.3.4 Summary of Results

Both research questions (RQ1 and RQ2) can be answered with consistent results from two different applications and their respective datasets (CIFAR-10 and KITTI). To answer RQ1, these results show that Enki generates test data that is more difficult for the evaluated DNNs, thus highlighting training gaps for the introduced environmental conditions. To answer RQ2, these results show that retraining with Enki-generated training data leads to a DNN that is more robust to the introduced environmental conditions. Results from these experiments show that Enki-generated contexts are more likely to correspond to more difficult test examples.
When compared to random-generated and DeepTest test data, Enki-generated test data caused CIFAR10-DNNbase and KITTI-DNNbase to produce more misclassifications. Furthermore, an analysis of the contexts archived by Enki can assist a developer with identifying environmental properties that lead to DNN failure. As Figure 3.6 shows, Enki can automatically discover that large raindrops focused on the center of the camera image produced the most varied performance in DNNs. This information could potentially be used by developers or automated processes to preemptively identify run-time contexts that lead to DNN failure.

Enki can also enable a developer to improve the robustness of an LES to the effects of environmental conditions that are absent from default training data. Results from these experiments show that retraining with synthetic training data from Enki-generated contexts is able to improve the overall accuracy of a DNN in the presence of a newly introduced environmental condition. Furthermore, these results show that DNNenki is significantly more robust than DNNbase and comparable to or better than DNNrand when faced with each respective environmental condition.

3.3.5 Threats to Validity

We consider two main threats to the validity of this study. First, we acknowledge the use of simulation and possible discrepancies with reality (i.e., the "reality gap") [96]. Second, we acknowledge possible variation in results due to stochastic components involved with each method. Each point is addressed in turn.

This chapter's approach depends on a simulation environment capable of emulating sources of uncertainty, and therefore, a reality gap may exist. The quality of synthetic data will need to be assessed to determine if it is sufficiently realistic before use in a safety-critical application. However, even if synthetic data is not used directly to test and retrain the learning component of an LES, the contexts selected by Enki can provide insight to help guide follow-on studies with real-world phenomena. For example, additional images can be taken with real water droplets placed with the same sizes and locations identified by Enki.

All DNN performance results presented are subject to variation, since both the training of DNNs and each data generation method include stochastic functions. To account for variation in these results, multiple independent trials were conducted for statistical analysis (at least ten trials each). Results displayed in Figures 3.8, 3.9, 3.12, 3.13, 3.14, and 3.15 are the mean over multiple trials with confidence intervals displayed to show the degree of variation observed. Finally, paired samples t-tests have been conducted to validate the statistical significance of comparisons between generated test datasets and retrained DNNs.

3.4 Related Work

This section compares this chapter's work to related work in the areas of automated testing with novelty search and automated testing for DNNs.

3.4.1 Automated Testing with Novelty Search

This chapter's approach is inspired by Loki [178], a novelty search method that generates environmental conditions leading to diverse system behavior as specified by high-level KAOS goal models [43]. Additional research has been conducted on using novelty search for automated testing to generate test data for object-oriented systems [22, 23]. However, such approaches use novelty search to diversify test cases by their testing parameters rather than by the resulting system behavior.
By using the system's behavior as the basis for diversity, Enki is able to discover test cases that may have similar operating conditions but unexpectedly lead to very different system behavior.

3.4.2 Automated Testing for Deep Learning

Alternative methods have been proposed to address the problem of automatic test generation for DNNs. Related techniques, such as DeepXplore [171], DeepTest [211], DeepGauge [144], DeepHunter [226], DeepRoad [235], and DeepConcolic [207], generate test inputs by maximizing structural coverage metrics. DeepXplore and DeepTest measure neuron coverage over the entire DNN for a given test input. However, preliminary results show that test inputs created to maximize neuron coverage are not able to account for all possible behaviors that a DNN may exhibit and that neuron coverage alone can be insufficient for finding corner cases [195]. Finer-grained variants of neuron coverage have also been considered with DeepGauge, where test cases are generated to maximize coverage over subsections or specific activation patterns within a DNN at the layer level. DeepConcolic supports neuron coverage but also modified condition/decision coverage (MC/DC) variants [208]. Further studies have shown that even techniques that attempt to maximize these more fine-grained structural coverage metrics can be misleading, since selected subsets of data with significantly different error rates can produce similar amounts of coverage [137]. Thus, more recent techniques such as DeepGini [56] have shifted the focus of test case prioritization from neuron coverage to more statistical performance metrics.

One key difference between this work and related work on testing DNNs is that this work encourages diversity in terms of how a DNN performs across all generated test cases. In contrast to the related work, Enki explicitly searches for operating contexts that lead to unique system responses. Instead of only discovering test cases that highlight weaknesses in a DNN, Enki can be used to discover a broad range of test cases that may uncover strengths, weaknesses, or otherwise unknown (latent) behavior. Furthermore, Enki's emphasis on diversity identifies more diverse weaknesses, thus enabling better coverage of vulnerabilities. Another key difference between these related techniques and Enki is the form of each technique's output. The aforementioned techniques generate transformations that are linked to a specific source input, and therefore, no general trends can be inferred from the synthetic data produced. In contrast, Enki generates transformations based on DNN behavior over multiple inputs rather than one specific input, which enables more generalized patterns to be inferred. By uncoupling problematic test cases from a single source input, Enki helps developers identify general limitations of a DNN (e.g., what sizes of raindrops are more likely to cause objects to be misidentified).

Though the approaches differ in technique, all related methods for testing DNNs rely on a generative procedure to construct synthetic test data. Methods like DeepTest and DeepXplore depend on a manually crafted procedure to transform real data into a manipulated, synthetic form (e.g., an image processing procedure to add fog to an existing image).
Using a rule-based algorithmic transformation procedure gives developers direct control over how synthetic data is generated, but it also relies on the assumption that a transformation procedure can be developed to accurately reproduce the desired phenomenon. Alternatively, methods like DeepRoad use Generative Adversarial Networks (GANs) [74] that perform image-to-image translation [236] to transform the appearance of any input data to match real-world examples of the desired phenomenon (e.g., transform a clear-weather image of a road into a snowy version). Image-to-image translation does not require a formal understanding or model of the introduced environmental condition, but it does require an adequate collection of real-world examples (i.e., a sufficiently large collection of snowy images). Use of GANs in this manner gives developers less control over how the resulting synthetic data will appear, as the result is entirely dependent on the example images given to the GAN and whatever features the GAN learns to focus on during its training phase. This chapter assumes a parameterized transformation procedure can be constructed to reproduce the environmental condition of interest (similar to DeepTest and DeepXplore).

3.5 Summary

This chapter has described a technique to both assess and improve the robustness of an LES to previously unseen environmental conditions. When applied to image-processing DNNs for two different vision tasks (image recognition and object detection), Enki can introduce and discover unique contexts of an environmental condition that lead to more difficult test examples than random generation or the DeepTest method. Likewise, this chapter has demonstrated how these contexts can be used to augment training data to retrain a DNN to be more robust to the introduced environmental condition.

Two key benefits of the diversity-driven approach described in this chapter are that (1) multiple behavior criteria may be considered and (2) generated training/test cases are encouraged to elicit unique effects on the LES. Existing techniques that generate test cases by optimizing specific behavior metrics, such as neuron coverage, assume that the given behavior metric has some meaningful contribution to the failure of a DNN. Results from this study have shown that for the raindrop occlusion condition, neuron coverage has no correlation to failing test cases, and therefore, other behavior metrics should be considered. Since Enki can observe multiple criteria for diversification, it can uncover failing cases even when it is unknown how any specific metric will correlate to the failing cases. Furthermore, by generating cases that result in diverse behavior, Enki can create a wider spectrum of failing cases, uncovering corresponding operating contexts that lead to each form of failure. Existing greedy optimization methods may be used in conjunction with Enki to generate additional test cases that target each specific case uncovered by Enki.

Since developers are required to supply a simulation environment able to emulate the conditions of interest, Enki is not able to address conditions of uncertainty completely unknown to developers a priori. For example, Enki will not explore the effects of snowy weather without a simulation environment with a predefined function to emulate snow conditions. However, Enki can uncover unconsidered combinations of known conditions that result in a novel condition when combined.
Additionally, as a tool for developers, the quality of data produced by this technique will only be as realistic as the simulation environment allows. In cases where synthetic data cannot be trusted for retraining purposes, contexts generated by Enki can still inform developers of which characteristics of the real-world phenomena should be targeted for follow-on study (e.g., raindrops of a certain size and in certain regions of the camera image). As described, this technique is intended for use during design time to enable developers to assess an LES and train it to be more robust to uncertain operational conditions. Chapter 4 extends the scope of this work to applications beyond DNNs, such as using EAs to evolve more robust configuration settings for onboard controllers. Chapter 5 considers additional run-time processes to determine if the underlying learning components are robust enough to handle the present conditions. Chapter 6 describes how a self-adaptive LES may mitigate anticipated undesirable behavior and switch to alternative strategies depending on run-time assessments [39, 121, 181].

CHAPTER 4
ENABLING THE EVOLUTION OF ROBUST ROBOTIC SYSTEMS

This chapter investigates the integration of evolution-based optimization and novelty search in order to improve the robustness of autonomous systems. This chapter introduces the Evo-Enki [128] framework, which comprises Enki and Evo-ROS. Simon et al. [199] developed Evo-ROS to integrate EC with physics-based simulations of autonomous systems controlled by ROS [176]. Enki uses novelty search to discover operational scenarios1 that lead to the most diverse behavior in the target system. Combining these techniques and tools into a single framework yields an automated approach to explore the operational landscape of the target system, identify regions of poor performance, and evolve system settings that better manage wide ranges of adversity. Experiments have been conducted with the throttle controller for EvoRally, a 1:5-scale autonomous vehicle for the study of autonomous driving systems. Preliminary results demonstrate the ability of the Evo-Enki framework to identify and characterize input speed signals that cause the existing controller to perform poorly. By identifying these problematic signals and autonomously switching among optimized controller modes, a control system can be developed to handle a wider range of conditions.

The remainder of this chapter is organized as follows. Section 4.1 overviews the motivation and objectives of this chapter. Section 4.2 provides background information on simulation for cyber-physical systems. Section 4.3 describes the Evo-Enki framework. Section 4.4 presents an empirical evaluation of Evo-Enki applied to the EvoRally platform. Section 4.5 overviews related work in the topics covered by this chapter. Finally, Section 4.6 provides a concluding summary for this chapter.

1 A scenario is defined as a specific set of parameter values that describe the environment and internal conditions of a system.

4.1 Overview

Autonomous cyber-physical systems are required for tasks where the burden of having a human operator is too high. In the absence of human supervision, effort must be made to mitigate sources of failure before the system is deployed or to include a capability for the system to adapt to unexpected conditions. This problem is particularly difficult due to the many sources of uncertainty in natural environments.
Autonomous systems are described as robust when they are capable of mitigating a wide range of adverse conditions. This chapter proposes an automated approach to assist software developers with making design decisions that result in more robust autonomous systems. Developing autonomous systems to meet software requirements at run time is challenging, because subsystems must interact with uncertain conditions in both internal mechanisms and the external environment. Techniques are needed to identify, at design time, the key scenarios that will have the most significant impact on system performance, in order to help developers form strategies that can mitigate potential sources of failure. Existing techniques either optimize the system to perform on a manual selection of scenarios, randomly generate scenarios, or use some heuristic-driven approach to create scenarios. Manual selection often requires expert knowledge of the problem domain and can be subject to confirmation bias [27]. Random generation may not be useful for discovering "corner-case" scenarios that cover small regions of the operational landscape [38, 78]. Finally, iterative heuristic-driven approaches rely on objectives that may be difficult to define and can often lead to sub-optimal solutions when the operational landscape is not amenable to hill-climbing or gradient search [134].

This chapter presents the Evo-Enki framework, a two-phase evolution-based approach to improve the robustness of an autonomous system. Leveraging prior research with Evo-ROS [199], the first phase optimizes system settings for a given set of scenarios. The second phase uses Enki (introduced in Chapter 3) to identify sets of scenarios that produce both extreme and diverse (i.e., mutually unique) system behavior. Using these two phases in tandem, the proposed Evo-Enki framework can systematically improve autonomous systems with less reliance on a priori expertise in the problem domain and less susceptibility to bias that might mask harmful corner cases. For an empirical evaluation, this chapter applies Evo-Enki to the EvoRally autonomous vehicle. Experiments have been conducted to improve the robustness of EvoRally's throttle controller to a wide (i.e., diverse) range of operating scenarios. Preliminary results demonstrate that new throttle controller settings derived from the Evo-Enki framework exhibit less overall error in the presence of an increasingly varied set of scenarios. Thus, Evo-Enki is demonstrated to improve the robustness of EvoRally to its operating environment.

4.2 Digital Twin-Based Simulation of a Cyber-Physical System

The term "digital twin" [76] has been adopted to refer to alternative, virtual versions of a physical platform under test. Virtual simulation can address many of the challenges associated with testing cyber-physical systems. Design changes on a physical platform or alterations to a physical test environment require a considerable amount of human effort and time. Furthermore, when testing faults occur, there is risk of damage to hardware and the external environment, which can be expensive to replace or repair. Digital twins enable engineers to determine modes of failure before a physical version of a system is ever produced, resulting in less expensive, more efficient, and less risky procedures for system evaluation.

System exploration with digital twins enables developers to identify unforeseen, emergent behavior during the design stage of a system [76]. Four categories of behavior have been identified within this context.
Any desired, intentional behavior of the system is categorized as Predicted Desirable (PD). Problematic behaviors that are known a priori are categorized as Predicted Undesirable (PU). Unexpected but beneficial behaviors are categorized as Unpredicted Desirable (UD). Finally, any behavior that is unpredicted and leads to system failure is categorized as Unpredicted Undesirable (UU). The final category is of the most concern, because UU behaviors have the most potential for catastrophic problems. The earlier UU behaviors are identified in a system's design, the less expensive they will be to address. Through the use of digital twins, exploratory techniques can uncover such unexpected emergent behaviors before a physical platform is deployed.

Both partial knowledge of the physical world and the state explosion problem [40] are challenges that should be considered when using digital twins for evaluation. Simulations must systematically model the environment and all external interfaces for the cyber-physical system (e.g., sensors and actuators). However, partial knowledge of the physical world can lead to potential "reality gaps" [96]. Since it is not feasible to model all physical laws of the universe in a simulation environment, simulation results will be limited by the specific subset of physical phenomena implemented. Any relevant real-world phenomena missing from the simulation could threaten the validity of results derived from the evaluation of a digital twin. Furthermore, as features are added to the system interface and additional phenomena are simulated, the state explosion problem can become an issue [76]. To uncover unexpected emergent behavior, it is possible that thousands of parameters must be considered, leading to an explosion in the number of possible states to be evaluated. Though each state could be evaluated in parallel, the computational cost can be expensive. To address the state explosion problem, heuristic search procedures could be considered to navigate the state search space more efficiently. Concerning the reality gap, the use of digital twins should be considered one component of a larger framework for system evaluation that also includes follow-on real-world tests for validation.

4.3 Exploring Environmental Diversity with Novelty Search

The proposed Evo-Enki framework executes two steps to support the evolution of a more robust system configuration. Figure 4.1 illustrates the data flow between the two main computational steps of Evo-Enki, where data flow is indicated by labeled arrows, steps are depicted by circles, external entities are depicted by boxes, and data stores are depicted by double lines. Step 1 enables the evolution of a system configuration that works optimally across a range of given operational scenarios. Step 2 enables the evolution of a new set of operational scenarios that exercise the resulting configuration in diverse ways. The resulting diverse scenarios can then enable an even more robust system configuration to be evolved through future executions of Steps 1 and 2. Next, each step is described in turn.

Figure 4.1: High-level DFD of the Evo-Enki framework. Circles represent computational steps, boxes represent external entities, parallel lines mark persistent data stores, and connecting arrows show the flow of data between steps.
4.3.1 Fitness-Driven Optimization of System Settings

The first step of the Evo-Enki framework (isolated in Figure 4.2) is responsible for evolving a system configuration that performs optimally over a given set of operating scenarios. A standard fitness-guided EA (Step 1.a) evolves a population of candidate configurations of the target platform (e.g., settings for a PID controller) towards an optimal configuration for a given set of scenarios. Candidate configurations are encoded into a vector form (i.e., a genome), and through standard operations of evolution, the genomes are recombined and mutated to uncover new candidates. Each candidate is evaluated by using Evo-ROS (Step 1.b) to spawn an instance of the simulation environment. Evo-ROS was developed by Simon et al. to enable an EA to use a ROS simulation environment when evaluating each candidate for evolution [199]. The simulation environment is initialized to run the given scenarios, and the simulated platform's state is monitored to determine the fitness of the given system configuration. The final output is the candidate configuration with the highest fitness score after multiple generations of evolution.

Figure 4.2: Data flow between the steps responsible for optimizing a robust system configuration for a diverse set of operating scenarios.

4.3.2 Diversity-Driven Evolution of Simulation Settings

The second step of the Evo-Enki framework (isolated in Figure 4.3) is responsible for evolving a new set of operating scenarios that exercise the given system configuration in mutually unique ways. Enki (Step 2.a) assesses the operation of a target software system to discover operating contexts that lead to unique, extreme, and possibly unexpected behavior. Enki uses an evolution-based novelty search algorithm to manage populations of individual scenarios. Each scenario is represented by a genome that encodes its operational characteristics and is associated with a phenome that encodes the target system's behavior when exposed to the scenario. Enki evolves this population for multiple generations, using standard operations of evolution (i.e., recombination, mutation, and selection). However, unlike traditional evolutionary search methods, Enki does not guide the population towards a single fitness objective. Instead, a novelty archive is maintained across generations that ranks all individual scenarios in the current population with a novelty score and only archives the scenarios that exhibit the most diverse behavior from the target system. The novelty score for an individual scenario is determined by comparing its phenome to its nearest neighbors in the archive and averaging the distance (a minimal sketch of this scoring is shown below). Thus, through evolution, Enki guides the search for scenarios outward in the operational landscape in terms of the type of behavior they produce from the target system. A benefit of this approach versus traditional evolutionary search is that the output is a set of individual scenarios that produce unique results, which may be adverse or favorable for the target system across multiple metrics, instead of only scenarios that maximize a single objective metric.

Figure 4.3: Data flow between the steps responsible for generating diverse operating scenarios for a given system configuration.
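The novelty scoring and archiving loop described above can be sketched as follows; the phenome is assumed to be a numeric behavior vector, Euclidean distance is used as an illustrative metric, and the function names are hypothetical rather than Enki's actual implementation.

```python
import numpy as np

def novelty_score(phenome, others, k=3):
    """Mean Euclidean distance from a scenario's behavior vector (phenome) to its
    k nearest neighbors among the other evaluated scenarios."""
    if not others:
        return float("inf")  # with nothing to compare against, any scenario is maximally novel
    dists = sorted(float(np.linalg.norm(np.asarray(phenome) - np.asarray(o))) for o in others)
    return float(np.mean(dists[:k]))

def update_archive(archive, population, archive_size=50, k=3):
    """Keep only the scenarios whose behavior is most diverse (highest novelty)."""
    candidates = archive + population
    scored = [(novelty_score(p, [q for q in candidates if q is not p], k), p)
              for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:archive_size]]
```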
Figure 4.4 illustrates the diversifying effect of Enki's novelty search process on the archived collection of scenarios. Each point corresponds to an individual scenario. Blue points correspond to scenarios currently in the archive, and gray points correspond to those that have been evaluated but not archived. All points are projected into two dimensions with distances scaled to match the relative distance between the scenarios' phenomes. In early generations, where the scenarios are close to a random selection, the archived scenarios produce very similar behavior from the target system. As the search progresses, the archive "pushes" outward, demonstrating that each archived scenario affects the target system in increasingly different ways (i.e., the archive becomes more diverse).

Visualization of Phenotype Distances in Enki's Archive
(a) Generation 1   (b) Generation 5   (c) Generation 15

Figure 4.4: A visualization of phenotype distances between scenarios explored by Enki. Blue points show archived scenarios. Gray points show all other evaluated scenarios. Archived scenarios increase in diversity over generations.

4.4 Empirical Validation

This section describes an experiment to evaluate the Evo-Enki framework on the EvoRally platform. The overall objective is to improve the ability of EvoRally's throttle controller to handle a wider range of reference speed signals. EvoRally uses a PID throttle controller that is calibrated by adjusting four tuning constants: K_P, K_I, K_D, and I_max. This section designates the default values for these tuning constants with the label C0. For this experiment, Evo-Enki first evolved new values for the PID tuning constants (C1) based on a single reference speed signal (i.e., Step 1 in Figure 4.1). Evo-Enki then evolved a new collection of speed signals that result in a diverse range of behavior in the throttle controller (i.e., Step 2 in Figure 4.1). New PID constant values (C2) were then evolved yet again but, in contrast to C1, these were evolved to work optimally across all Enki-generated scenarios. The resulting values for C0, C1, and C2 are provided in Table 4.1. Through this experiment, we aim to answer the following research questions:

RQ1.) Can Evo-Enki evolve tuning constants for the controller (C1) that are more robust than the default constants (C0)?

RQ2.) Can Enki discover more challenging test speed signals to identify weaknesses in the controller, when compared to a random-generation technique?

RQ3.) Can Evo-Enki leverage Enki-generated scenarios to evolve even more robust tuning constants for the controller (C2), when compared to tuning constants evolved based on a single scenario (C1)?

4.4.1 The EvoRally Throttle Controller

This section includes experiments conducted with a Proportional-Integral-Derivative (PID) throttle controller [13]. In general, controllers are responsible for monitoring and adjusting a system's state to ensure the system behaves correctly. More specifically, controllers monitor process variables and take corrective actions on control variables to ensure the system's output remains within expected limits. PID controllers are given a reference signal as the target value for a process variable, and an error signal is computed as the difference between the target value and the actual value of the process variable. In the case of a PID throttle controller, the process variable is the actual speed of the vehicle, the reference signal is the target speed, and the error is the difference between the target speed and the actual speed. PID controllers attempt to minimize error by adjusting the control variable via three control terms (P, I, and D). In order for the controller to respond correctly, the tuning constants associated with each of these terms (K_P, K_I, and K_D, respectively) must be tuned for the application. Improper values for these constants can lead to problems such as oscillation and overshoot.
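As a minimal illustration of how these terms interact, the sketch below implements a discrete-time PID update; the constant values are the default C0 settings from Table 4.1, I_max is assumed here to bound the integral term (anti-windup), and the simple first-order vehicle model is purely illustrative (this is not EvoRally's controller code).

```python
class PIDController:
    """Minimal discrete-time PID controller sketch (not EvoRally's implementation)."""

    def __init__(self, kp, ki, kd, i_max, dt=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.i_max = i_max            # assumed role: clamp on the integral term (anti-windup)
        self.dt = dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, reference, actual):
        """Compute a throttle command from the target speed and the measured speed."""
        error = reference - actual
        self.integral = max(-self.i_max, min(self.i_max, self.integral + error * self.dt))
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Default C0 constants from Table 4.1, driving a crude first-order "vehicle" model.
pid = PIDController(kp=0.200, ki=0.000, kd=0.001, i_max=0.150)
speed = 0.0
for _ in range(100):
    throttle = pid.update(reference=5.0, actual=speed)  # target speed: 5 m/s
    speed += 0.5 * throttle                              # toy plant dynamics, not a real model
```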
Table 4.1: PID Controller Settings

                          K_P      K_I      K_D      I_max
C0 (Default)              0.200    0.000    0.001    0.150
C1 (Evolved)              0.203    0.045    0.092    0.511
C2 (Evolved with Enki)    0.679    0.658    0.772    0.445

4.4.2 Evolving a New Controller

For this experiment, we used Step 1 of Evo-Enki to evolve a controller configuration (C1) that improves upon the default configuration (C0). Improvement is measured by the reduction in controller error when exposing EvoRally to a single reference signal (see Figures 4.5a and 4.5b). The configuration settings for the fitness-driven EA (Step 1.a) are provided in Table 4.2. Upon completion, the fitness-driven EA reduced the mean squared error (MSE) between the actual speed of the vehicle and the reference speed from 0.528 for C0 to 0.018 for C1. However, further assessment is required to show that C1 also exhibits improvement over a wider range of reference signals. Figure 4.5 compares the performance of C0 and C1 on three other reference signals. Notably, C1 was not optimized for these signals. In Figures 4.5c through 4.5f, C1 tracks acceleration for the reference signal better than C0, but deceleration remains a challenge without the use of brakes. When braking is allowed, C1 is able to track both the acceleration and the deceleration better than C0 (Figures 4.5g and 4.5h, respectively).

4.4.3 Assessing the Controllers

To further assess the effectiveness of C0, test reference signals were generated via Step 2 of Evo-Enki. Enki generated 1,250 test reference signals (Senki), and for comparison, an additional 1,250 signals were created via a random-generation method (Srand). All signals were generated to cover a period of 60 seconds with speeds capped at a maximum of 10 meters per second (m/s). The MSE was computed for the performance of C0 on each generated reference signal. A probability distribution of MSE for signals generated from each method is displayed in Figure 4.6. Regions are shaded by quartile ranges, with green being the bottom quartile, blue being the middle two quartiles, and red being the top quartile. Signals from Srand showed an average MSE of 1.047 (Figure 4.6b) compared to an average MSE of 2.430 for signals in Senki (Figure 4.6a). Compared to random generation, signals from Enki are more likely to result in a higher MSE, and therefore, Enki is more likely to find reference signals that cause the system to perform poorly.

Figure 4.5: Comparison of the default C0 PID controller settings (left) with the evolved C1 PID controller settings (right) over a variety of different speed signals. Subplots (a) and (b) show the speed signal against which C1 was evolved. Subplots (c), (d), (e), and (f) show performance of C0 and C1 against other random test signals, and subplots (g) and (h) show performance of C0 and C1 on a test signal while braking is allowed.

The random-generation method creates reference signals by uniformly selecting values for each signal. When error-inducing reference signals occupy a small region of the domain of all possible reference signals, a uniform selection method is not likely to uncover such challenging signals.
Instead, a random-generation method will more likely result in reference signals that exhibit similar degrees of error. In contrast, Enki seeks out reference signals that produce unique differences in error, and therefore, Enki can produce reference signals with a more uniform distribution of MSE, which may then lead to sets of more diverse and challenging reference signals. Because the high-level objective is to evolve a more robust controller, the purpose of Enki is to discover less common and more adverse reference signals.

Table 4.2: Fitness EA Configuration

Parameter                 Setting
Generations               25
Population Size           25
Tournament Selection      Size: 2
Two-Point Crossover       Rate: 50%
Gaussian Mutation         Sigma: 1.0, Rate: 20%
Objective                 Minimize MSE
Run Time                  ~12 hours

Table 4.3: Enki EA Configuration

Parameter                 Setting
Generations               50
Population Size           50
Tournament Selection      Size: 3
Single-Point Crossover    Rate: 100%
Creep Mutation            Range: ±20%, Rate: 25%
Objective                 Diversify Error
Run Time                  ~5 hours

To compare the effectiveness of C1 to C0 on a wider range of possible reference signals, we used Enki to generate reference signals based on the performance of C0 (i.e., the error produced by using C0 for each generated reference signal). Again, all reference signals were generated for 60 seconds and capped at 10 m/s. The error was defined as the absolute difference between the actual speed of the vehicle and the reference signal. After running Enki with the settings provided in Table 4.3, we generated 2,489 signals (S0), and Figure 4.7 shows a comparison of the results from each controller. We found that the average MSE for C0 was 2.672 (Figure 4.7a). When exposing C1 to the same test reference signals, we found that the average MSE for C1 was 2.250 (Figure 4.7b). Each subplot shows the distribution of MSE observed by Enki for each controller. Since the distribution for C1 is skewed further right than that of C0 (i.e., its mass is concentrated at lower MSE values), C1 has been observed to produce less error in general for the given reference signals. Therefore, assessment with Enki further supports the claim that Evo-Enki successfully evolved more robust controller settings for C1 than C0.

MSE Observed from Enki and Random Signals on C0
(a) Distribution of MSE for C0 for signals in Senki   (b) Distribution of MSE for C0 for signals in Srand

Figure 4.6: MSE observed from C0 for signals from Senki and Srand. Regions are shaded by quartile ranges, with green being the bottom quartile, blue being the middle two quartiles, and red being the top quartile.

4.4.4 Further Enhancing the Controller

To explore whether use of Enki-generated scenarios as the basis of evolution can further improve the robustness of the controller, a second set of controller settings (C2) was evolved. C2 was created in a process similar to C1 (see Section 4.4.2). However, for C2, instead of only evolving the controller against the single reference signal shown in Figure 4.5a and Figure 4.5b, the controller was also evolved against the five most unique reference signals generated by Enki. After deriving the values for C2, we evaluated the new settings against the signals in S0, and the results are displayed in Figure 4.7c. We found that C2 further reduced the average MSE to 2.148, with the overall MSE distribution skewed slightly more right. The reduced error exhibited by C2 shows that by assessing the controller with Enki, we can find scenarios to further harden the controller to a wider range of reference signals.
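The controller error reported throughout this section is the mean squared difference between the reference speed signal and the vehicle's measured speed over the test period; the sketch below assumes both signals are sampled at the same time steps, and the signal values are placeholders.

```python
import numpy as np

def controller_mse(reference_signal, actual_speed):
    """Mean squared error between the reference speed and the measured speed,
    assuming both are sampled at the same time steps."""
    reference_signal = np.asarray(reference_signal, dtype=float)
    actual_speed = np.asarray(actual_speed, dtype=float)
    return float(np.mean((reference_signal - actual_speed) ** 2))

# Illustrative 60-second test sampled at 1 Hz, capped at 10 m/s (placeholder values).
reference = np.clip(5 + 3 * np.sin(np.linspace(0, 6 * np.pi, 60)), 0, 10)
actual = reference + np.random.default_rng(1).normal(0, 0.5, size=60)  # imperfect tracking
print(f"MSE: {controller_mse(reference, actual):.3f}")
```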
MSE Observed from Enki Signals on C0, C1, and C2
(a) Distribution of MSE for C0 for signals in S0   (b) Distribution of MSE for C1 for signals in S0   (c) Distribution of MSE for C2 for signals in S0

Figure 4.7: Error observed for controller settings C0, C1, and C2 when tested against signals from S0. Regions are shaded by quartile ranges, with green being the bottom quartile, blue being the middle two quartiles, and red being the top quartile.

4.4.5 Threats to Validity

Since this chapter's framework relies on simulation, it is assumed that the simulator is capable of accurately matching reality. Any deviation of the simulator from the physical vehicle and its environment will impact the effectiveness of this approach on a real platform. Future work includes validating the results with the physical EvoRally platform. Additionally, this experiment has only considered reference signals with a maximum speed of 10 m/s, and therefore, the comparative performance of the controllers may not be valid when the speed is allowed to exceed 10 m/s. Finally, the intent of this chapter has been to assess the potential value of the Evo-Enki framework as a proof-of-concept, but since evolution-based techniques contain stochastic elements, multiple trials will be required to determine the statistical significance of these results.

4.5 Related Work

The work described in this chapter extends research conducted by Clark et al. [39] on discovering execution mode boundaries for adaptive controllers on robotic fish. Their work describes a mode discovery algorithm that evolves the controller over a fixed set of scenarios and then incrementally adds new scenarios to the set for each round of evolution. Their work considers two different adaptive random techniques for scenario generation, where scenarios are randomly generated from a base scenario and selected based on specific criteria. In contrast to their work, the Evo-Enki framework uses novelty search to generate new scenarios without any need for a pre-defined "base" scenario.

4.6 Summary

This chapter has demonstrated that evolution-based techniques can enable the discovery of more robust configurations for an autonomous system with limited input by the user. As demonstrated with EvoRally's throttle controller, Evo-Enki can evolve a more robust system configuration by first evolving a diverse set of operating scenarios and then optimizing a new system configuration to minimize error across all of the diverse scenarios. The ultimate goal is to apply this work to a physical system that can independently and effectively detect changing adverse environmental conditions, such as sharp inclines or slippery terrain, and safely transition into a better-suited system configuration or mode to navigate efficiently through the adverse environment.

Additional investigations address these issues. Chapter 5 introduces machine learning techniques that additionally learn to infer execution mode boundaries for a given system configuration. Chapter 6 proposes how to construct an adaptive system that can switch between predetermined system configurations when assurance cases are no longer satisfied at run time. Chapter 7 proposes how to adapt systems to maintain system requirements at run time. Finally, Chapter 8 proposes how to construct a service-oriented framework for managing the trustworthiness of LECs in an autonomous system at run time, when faced with environmental uncertainty.
CHAPTER 5
UNCOVERING INADEQUATE LEARNING MODELS AT RUN TIME

Since deep learning systems do not generalize well when training data is incomplete and missing coverage of corner cases, it is difficult to ensure the robustness of safety-critical self-adaptive systems with deep learning components. Nonetheless, stakeholders require a reasonable level of confidence that a safety-critical system will behave as expected in all contexts. However, uncertainty in the behavior of safety-critical LESs arises when run-time contexts deviate from training and validation data. To this end, this chapter proposes an approach to develop a more robust safety-critical LES by predicting its learned behavior when exposed to uncertainty and thereby enabling mitigating countermeasures for predicted failures [127]. By combining evolutionary computation with machine learning, an automated method is introduced to assess and predict the behavior of an LES when faced with previously unseen environmental conditions. Through experiments with DNNs under a variety of adverse environmental changes, the proposed method is compared to a Monte Carlo (i.e., random sampling) method. Results indicate that when Monte Carlo sampling fails to capture uncommon system behavior, the proposed method trains better behavior models while requiring fewer training examples.

The remainder of this chapter is organized as follows. Section 5.1 overviews the motivation and objectives of this chapter. Section 5.2 describes the proposed method for building a more resilient LES. Section 5.3 presents results from an empirical evaluation. Section 5.4 reviews related work. Finally, Section 5.5 provides a concluding summary for this chapter.

5.1 Overview

As artificially intelligent software systems migrate from the purview of research and development into commercial applications with widespread use, trust in the ability of such systems to operate safely and as intended has become paramount. When deployed in the real world, autonomous systems will likely encounter adverse conditions not considered during their design, such as zero-day security attacks or system degradation from inclement weather. This problem is exacerbated when the system being tested is also self-adaptive or relies on a learning component. As system behavior adapts, new test cases may need to be considered and existing test cases may become obsolete [62, 63]. Furthermore, when a run-time context diverges from the training experience of an LES, it can be difficult to establish confidence in the system's behavior. This chapter proposes an automated approach to synergistically combine evolutionary computation with machine learning to model and predict system behavior for such uncertain contexts (i.e., known unknowns). This approach can be used by autonomous systems to predict and mitigate failure of a learning component at run time by deploying fail-safes for situations in which the component has not been adequately trained. By leveraging such context-aware self-assessment, the proposed techniques can help instill confidence that an autonomous system will recognize and use learning components only in contexts that they can handle. It is difficult to determine the limitations of safety-critical systems that rely on DNNs [73], especially for situations that deviate from training experience [20].
In real-world environments, adverse corner cases (e.g., the effect of a scratched camera lens or a lens flare caused by the setting sun) are frequently missing when establishing training and validation examples for an LES. Even for known adverse environmental factors included in data, confirmation and selection bias [27] can negatively influence learned behavior. GANs [74] have been used to synthetically augment datasets to fill in gaps (e.g., image-to-image translation [236]); however, the quality of such GANs is highly contingent on existing examples of the adverse phenomena. When examples of the adverse phenomena are lacking, other solution strategies are necessary to increase training/validation coverage for safety-critical LESs.

This chapter introduces the Enlil framework (named for the ancient Sumerian deity who possesses the "Tablet of Destinies" [118]) to produce a behavior oracle, which is used to predict how an LES will respond to operating conditions not covered by existing datasets [127]. Enlil combines evolutionary computation with deep learning to support the automatic creation of a wide range of test cases for an LES, to assess how robust an LES is to specific forms of sensory noise, and to model system behavior under uncertainty. Enlil synthesizes examples of new operating contexts with newly-introduced sensory phenomena by transforming existing data. Enlil then generates diverse synthetic examples that correspond to predefined behavior categories, which are then used to train a predictive behavior model via machine learning [9]. The resulting behavior model can then be used to predict how the target LES will respond to sensory data with the newly-introduced phenomena. Machine learning plays two roles in this work. First, the LES uses machine learning to accomplish its system objectives (e.g., detect obstacles). Second, Enlil uses machine learning to create a behavior oracle for predicting the behavior of an LES when facing unexpected conditions. Ultimately, this work aims to enable the construction of a more robust safety-critical LES through automated methods that assess and predict how the LES will behave under previously untested conditions. Enlil's behavior oracles can enable a safety-critical system to preemptively determine when learning components are inadequate for a given run-time context, thereby prompting transitions/adaptations to mitigate faults (i.e., fail-safes).

Experiments have been conducted to evaluate use of the Enlil framework on an object detector for an autonomous vehicle under inclement environmental conditions (e.g., poor lighting, rain, and fog). Specifically, Enlil is used to predict the object detector's performance in the presence of the newly-introduced environmental conditions. Figure 5.1 provides an illustrative use case. Figure 5.1b shows how a trained object detector can correctly identify all vehicles in a clear image (outlined in boxes). Figure 5.1c shows that when a synthetic raindrop is introduced onto the camera lens, the object detector fails to detect the occluded vehicles. Finally, Figure 5.1d shows how a behavior model trained by Enlil can perceive the raindrop and predict that it will disrupt object detection (marked with red shading). This chapter assumes that these environmental conditions have not been sufficiently covered by default training/test data, and therefore, it is unknown a priori how they will impact the object detector's performance.
For validation, example training cases created by Enlil have been compared to those generated by a baseline Monte Carlo random sampling method, and the accuracy of Enlil's predictive models has been evaluated against models trained only on Monte Carlo training cases.

Figure 5.1: Example of an Enlil behavior model use-case. An object detector is given a clear image (a) and identifies all automobiles (b). A raindrop is introduced into the scene, which causes the system to fail to detect occluded automobiles (c). The Enlil model is able to correctly detect the interfering raindrop (d) and predicts that the system will fail.

Results show that a significant percentage of system failures can be predicted preemptively with the Enlil framework, which can enable developers to better implement mitigation strategies. Furthermore, results demonstrate that Enlil generates balanced sets of training cases with respect to the types of system behavior induced. As such, Enlil can train comparably accurate or better predictive models with fewer training examples for the considered object detector.

5.2 Predictive Behavior Modeling for System Resilience

This section describes the Enlil framework, an automated method to create a more robust LES through predictive behavior modeling. First, Enlil automatically assesses LES behavior under synthetic environmental phenomena to generate a collection of operating contexts that lead to specific behavior categories (i.e., types of resulting behavior). Next, Enlil uses the generated contexts to train a predictive behavior model of the LES (i.e., a behavior oracle [9]) that can be used either internally or by an external system (e.g., an autonomic manager [105]) to determine how the LES will behave when faced with any given context of the environmental phenomena. Figure 5.2 provides a DFD for the Enlil framework, where circles represent computational steps, boxes represent external entities, arrows represent data flow, and data stores are represented within parallel lines. Enlil has two primary objectives: to generate a diverse variety of operating contexts for the LES (Step 1 and Step 2) and to infer a behavior model of the LES based on the generated contexts (Step 3).

For example, Figure 5.3 demonstrates how rainfall can impact an object detector for an autonomous vehicle to the point of failure. The object detector can successfully identify an automobile and a cyclist in the absence of rainfall (Figure 5.3a). Light synthetic rainfall is introduced with no impact on detection (Figure 5.3b). However, more intense rainfall can result in a failure to detect the cyclist (Figure 5.3c). When default training data does not include any such examples of rainfall, the Enlil framework generates synthetic examples and trains a behavior model to predict the object detector's performance under rainfall (i.e., if it is likely to miss an object).

Figure 5.2: High-level DFD of the Enlil framework. Computational steps are shown as circles. Boxes represent external entities. Directed lines indicate data flow. Persistent data stores are marked within parallel lines.

Step 1. Context Generation.

To assess the behavior of an LES under a previously unseen environmental condition, data must be collected to determine how the introduced condition impacts the behavior of the LES.
To accomplish this task, the Enlil framework requires an operating specification, behavior specification, behavior categories, and evaluation procedure from the user.

Operating specification. The operating specification defines each configurable parameter of the LES's operating environment (e.g., rainfall angle, intensity, etc.). A label and permissible value ranges must be specified for each parameter, where values can be numeric or categorical. Each candidate operating context generated by Enlil is sampled from the values defined in this specification. For the example rainfall phenomena (Figure 5.3), an operating specification (OP) could be defined as follows:

OP = { rainfall_angle ∈ [π/4, 3π/4],
       rainfall_density ∈ [0, 1],
       rainfall_intensity ∈ [0, 1],
       rainfall_seed ∈ [0, 100] }

Figure 5.3: Demonstrative examples for how rainfall can affect an image-based DNN for object detection: (a) an unaltered image from the KITTI dataset [70], followed by (b) light, (c) medium, and (d) heavy simulated rainfall. Detected objects have been marked with overlaying bounding boxes, as shown in (a). As "rainfall" increases in intensity, as seen in (b) and (c), the effectiveness of the DNN diminishes to a point where it fails to detect the cyclist, as seen in (d).

Behavior specification. The behavior specification defines a set of observable behavior metrics for the LES. For an object detector, this may include properties such as precision and recall for a set of validation images, specified as follows:

BV = { precision ∈ [0, 1],
       recall ∈ [0, 1] }

In addition to these component-level properties, system-level properties may also be considered when utility functions [36] exist to evaluate functional objectives at run time.

Behavior categories. The Enlil framework assumes that a set of behavior categories has been established by the user, which are defined in relation to the behavior metrics in the behavior specification. For example, behavior categories could be specified as "expected object misdetection" (Cat 1), "expected pedestrian misdetection" (Cat 2), and "expected successful detection" (Cat 3), where each category is based on thresholds (θ) for the recall (r) according to the following user-specified mapping:

BV-CAT(r) = Cat 1, if r_total < θ_total
            Cat 2, if r_total ≥ θ_total ∧ r_ped < θ_ped
            Cat 3, if r_total ≥ θ_total ∧ r_ped ≥ θ_ped

Evaluation procedure. The evaluation procedure specifies how Enlil must instantiate the simulation environment and monitor the LES's behavior for each given operating context (see Step 2).

To efficiently search the space of all possible operating contexts and uncover a wide range of examples for each specified behavior category, the Enlil framework uses an EA search procedure. The EA generates candidate operating contexts, assesses the behavior of an LES under each candidate context, and archives the most diverse contexts found to match the given behavior category. The result is an archive of operating contexts with widely varying appearance but similar resulting LES behavior to match each respective behavior category. The intent is to apply this technique for situations where the full impact of an operating condition is not known for the LES, and it is not possible to exhaustively assess every explicit context.
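To make the specifications above concrete, the following is a minimal sketch of how the rainfall operating specification and the BV-CAT mapping might be encoded in Python. The names OPERATING_SPEC and categorize, and the dictionary layout, are illustrative assumptions rather than Enlil's actual interface.

import math

# Operating specification (OP): each configurable parameter of the rainfall
# phenomenon with its permissible value range, per the specification above.
OPERATING_SPEC = {
    "rainfall_angle":     (math.pi / 4, 3 * math.pi / 4),
    "rainfall_density":   (0.0, 1.0),
    "rainfall_intensity": (0.0, 1.0),
    "rainfall_seed":      (0, 100),
}

# BV-CAT: map observed recall metrics to a behavior category using the
# user-specified thresholds theta_total and theta_ped.
def categorize(r_total, r_ped, theta_total, theta_ped):
    if r_total < theta_total:
        return "Cat 1"  # expected object misdetection
    if r_ped < theta_ped:
        return "Cat 2"  # expected pedestrian misdetection
    return "Cat 3"      # expected successful detection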
When considering how rainfall might impact a camera-based object detector for an autonomous vehicle, it is not practical to evaluate every potential appearance of rainfall. Given a parameterized simulation procedure to emulate rainfall, Enlil evolves a population of multiple rainfall contexts in parallel and encourages each generation to exhibit increasingly different appearances of rainfall (e.g., angle and intensity) that result in the same behavior category for the LES (e.g., "expected pedestrian misdetection").

Enlil's EA for context generation is described in Algorithm 5.1. Each candidate context is represented by a genome and phenome. Genomes encode the operating characteristics of each context (e.g., rainfall angle, intensity, etc.), where the number of specified operating parameters will determine the length of each genome. Phenomes encode both the operating characteristics and resulting system behavior for each context (e.g., angle and intensity values for rainfall and resulting object detector accuracy). An initial population of operating contexts is generated with random operating characteristics. Operating characteristics for each context in the population are manipulated through the recombination and mutation of selected genomes. After modifying genomes, a population's phenomes are determined by evaluating the behavior of the LES for each individual context. To encourage diversity, the population is compared to an archive that tracks phenomes that are mutually unique and fit the given behavior category. Enlil ranks each context in the current population based on how its phenomes compare to the archived contexts. Context diversity is determined by the Euclidean distance between an individual's phenome and the phenomes of its nearest neighbors as follows,

nov(p, P, k) = mean( min-k { ‖p − p_i‖₂ : p_i ∈ P, p_i ≠ p } )

where the novelty score (nov) is computed as the average distance between a phenome (p) and its k nearest neighbors in the set of all archived phenomes (P). Contexts with phenomes that are more distant from the closest matching phenomes in the archive (i.e., less similar) are given priority. Since Enlil aims to generate contexts with diverse appearance but similar resulting system behavior, an additional step is included to filter out any candidate contexts from the population that do not match the specified behavior category (Line 9 in Algorithm 5.1). After ranking each context in a population, any context that results in behavior other than the target behavior category is excluded from the population, such that it will not contribute to future generations. Thus, as Enlil evolves the population over consecutive generations, individual contexts within the population will be favored for survival and reproduction when they exhibit operating characteristics that are different from those found in the archive yet produce similar behavior in the LES (e.g., "expected pedestrian misdetection"). In essence, Enlil attempts to generate the broadest set of example contexts that yield the same resulting system behavior for a given behavior category.
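A minimal sketch of the novelty computation above follows, assuming each phenome is represented as a fixed-length numeric vector (operating parameters plus the resulting behavior metrics); the function name and data layout are illustrative.

import numpy as np

def novelty(phenome, archived_phenomes, k):
    """Average Euclidean distance from `phenome` to its k nearest neighbors
    among the archived phenomes (excluding the phenome itself)."""
    p = np.asarray(phenome, dtype=float)
    others = [np.asarray(q, dtype=float) for q in archived_phenomes]
    distances = sorted(np.linalg.norm(p - q) for q in others
                       if not np.array_equal(q, p))
    if not distances:
        return float("inf")  # an empty archive makes any candidate maximally novel
    return float(np.mean(distances[:k]))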
Algorithm 5.1 Enlil: Context Generation (Step 1)
 1: function evolutionary-search(n_generations, behavior_pattern, eval_func)
 2:   archive ← ∅
 3:   pop ← random-population()
 4:   for 0 to n_generations do
 5:     pop ← selection(pop)
 6:     pop ← recombination(pop)
 7:     pop ← mutation(pop)
 8:     pop ← evaluation(pop, eval_func)
 9:     pop ← filter(pop, behavior_pattern)
10:     archive, pop ← commit-to-archive(archive, pop)
11:   return archive
12: function commit-to-archive(archive, pop)
13:   scores ← rank-individuals(archive, pop)
14:   pop ← pop : scores > novelty_threshold
15:   archive ← truncate(archive ∪ pop)
16:   return archive, pop

Step 2. Context Evaluation.

For each given context, a subset of the default training data is transformed to simulate the environment described by the context. Then the transformed data is processed by the DNN to determine the resulting confusion matrix (a confusion matrix tallies the number of correct and incorrect classifications for each type of object [202]), which can be used to derive the DNN's precision and recall behavior metrics.

The scatter plot in Figure 5.4 shows how Enlil's archived contexts compare to Monte Carlo (i.e., randomly generated) contexts. For this example, Enlil introduced a variety of raindrops onto the lens of a camera for an LES. Operating characteristics were defined by parameters such as the raindrop's image position and radius. Behavior categories were defined to include a category for raindrops that have no significant impact on the LES's ability to detect objects (green) and a category for raindrops that cause the LES to consistently fail to detect objects (red). Enlil was able to generate an equal number of contexts for each behavior category, whereas Monte Carlo generation produced a severely imbalanced set of contexts, favoring raindrops that had no significant impact on the LES. Since Enlil uses these generated contexts as training cases for predictive behavior models, a balanced distribution is preferred to reduce bias towards a specific behavior category. To reach a comparable number of failure cases, the Monte Carlo method would need to generate significantly more contexts (e.g., over 10X for the examples shown in Figure 5.4). This form of between-class imbalance is a known problem for learning algorithms [86]. Enlil addresses this problem using novelty search and parallel evolution to produce archives of equal sizes for each behavior category.

Figure 5.4: Scatter plot comparison of Enlil-generated (a) versus Monte Carlo-generated (b) raindrop contexts (where a raindrop occludes the view). Points correspond to raindrops with varying size/position. Colors show whether a raindrop resulted in failure for the LES (red) or had no significant impact (green). Enlil contexts show distinct clusters of equal quantity. Monte Carlo contexts contain few problem-causing raindrops.
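Step 2 reduces each evaluated context to precision and recall values derived from the confusion matrix. A minimal sketch of that derivation follows, assuming per-class counts of true positives, false positives, and false negatives; the dictionary layout is illustrative.

def precision_recall(confusion):
    """confusion: {class_name: {"tp": int, "fp": int, "fn": int}}"""
    tp = sum(c["tp"] for c in confusion.values())
    fp = sum(c["fp"] for c in confusion.values())
    fn = sum(c["fn"] for c in confusion.values())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Per-class recall (e.g., for pedestrians) can be computed the same way by
# restricting the sums to that class's entry.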
Step 3. Behavior Modeling.

To predict an LES's behavior under a condition of uncertainty (e.g., the presence of rainfall), the Enlil framework constructs and trains a separate meta-level DNN from synthetically-generated data to act as a behavior oracle for the LES. Data for training the behavior oracle is derived from a combination of unaltered training data for the LES and the contexts archived by Enlil in Step 1. Each entry in Enlil's training dataset consists of the following values: synthetic LES sensor data (x), the context parameter values (t0) used to generate the synthetic data, and a label (t1) for the observed behavior category for the LES when exposed to the synthetic data. Enlil trains the behavior oracle DNN (Enlil-Oracle) by learning to model the relationship between x and the target variables t0 and t1. Once trained, the resulting Enlil-Oracle DNN may be used as a function to preemptively determine how the LES will react to new examples of sensor data and thus enable adaptation of the LES at run time to mitigate failure.

A high-level topology for the Enlil-Oracle DNN is illustrated in Figure 5.5. Enlil-Oracle takes sensor input data for the LES (x) and outputs an estimation of the corresponding operating context parameters (y0) and a prediction for the LES's behavior category (y1). Enlil-Oracle contains four sub-networks: an interference isolation sub-network, a feature extraction sub-network, an operating context regression sub-network, and a behavior classification sub-network.

Figure 5.5: High-level illustration of the Enlil-Oracle DNN architecture. The architecture comprises four sub-networks. First, the suspected adverse noise is separated from the sensor input by the interference isolation sub-network. Second, a set of "latent" features are distilled from the isolated noise via a feature extraction sub-network. Next, a context regression sub-network converts these latent features into a "perceived" context for the current environment. Finally, a behavior classification sub-network predicts a behavior category for the LES.

The interference isolation and feature extraction sub-networks are responsible for reducing sensor input (x) into a set of learned features to enable behavior classification. The interference isolation sub-network learns to isolate any adverse noise introduced into the sensor input. The feature extraction sub-network learns to reduce the isolated noise into a set of latent features. The specific architectures for the interference isolation and feature extraction sub-networks are dependent on the type of sensor data used by the LES. For this study, sensor data is limited to static images, and the feature extraction sub-network has been implemented with a ResNet [87] architecture (a common method for image processing tasks).

The extracted features are provided as input to the context regression sub-network, which is responsible for estimating an operating context (y0) to describe the given sensor input (i.e., a perceived context). For this study, the context regression sub-network comprises a fully-connected [73] layer with ReLU [157] activation and a second fully-connected layer of n units with linear activation (where n is the number of context parameters).

Output from the context regression sub-network is relayed to the behavior classification sub-network to classify an expected behavior category (y1) for the perceived context (i.e., an inferred behavior). The behavior classification sub-network for this study comprises a fully-connected layer with ReLU activation and a second fully-connected layer of m units with softmax activation (where m is the number of behavior categories). Thus, Enlil-Oracle receives sensor data, perceives an operating context for the given data, and infers a behavior category for the LES in response to the context.
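The following sketch outlines one way the four sub-networks might be assembled with tf.keras. Layer sizes, the interference-isolation layers, and input dimensions are assumptions for illustration, not the exact architecture used in this study.

import tensorflow as tf

def build_enlil_oracle(input_shape=(224, 224, 3), n_context_params=4, m_categories=3):
    x = tf.keras.Input(shape=input_shape, name="sensor_input")

    # Interference isolation: a small convolutional stack standing in for the
    # sub-network that isolates adverse noise from the sensor input.
    noise = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    noise = tf.keras.layers.Conv2D(3, 3, padding="same", activation="relu")(noise)

    # Feature extraction: a ResNet backbone reduces the isolated noise to latent features.
    backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                              input_shape=input_shape, pooling="avg")
    features = backbone(noise)

    # Context regression: perceived operating context (y0).
    h = tf.keras.layers.Dense(128, activation="relu")(features)
    y0 = tf.keras.layers.Dense(n_context_params, activation="linear", name="context")(h)

    # Behavior classification: inferred behavior category (y1).
    g = tf.keras.layers.Dense(64, activation="relu")(y0)
    y1 = tf.keras.layers.Dense(m_categories, activation="softmax", name="behavior")(g)

    return tf.keras.Model(inputs=x, outputs=[y0, y1], name="enlil_oracle")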
A gradient descent training method (Algorithm 5.2) minimizes the error of the Enlil-Oracle DNN by adjusting its weight values via backpropagation [73]. The training procedure is implemented in two stages. First, the behavior classification sub-network is trained in isolation to minimize the categorical cross entropy between t1 and y1 for synthetic examples of each of the archived contexts. Second, the behavior classification sub-network is frozen, and the whole network is trained to minimize the mean squared error of t0 and y0 for synthetic examples of each of the archived contexts. The two-stage approach prevents the output of the behavior classification sub-network from directly influencing the types of features extracted by the feature extraction sub-network. Instead, features are extracted to have a greater relevance to the network's perceived context.

Algorithm 5.2 Enlil: Behavior Modeling (Step 3)
 1: function train-behavior-model(archive, train_data, n_epochs)
 2:   dnn ← initialize-dnn()
 3:   subnet ← get-behavior-subnet(dnn)
 4:   for 0 to n_epochs do
 5:     for (context, behavior) in archive do
 6:       x ← select-random(train_data)
 7:       x ← transform(context, x)
 8:       t0, t1 ← context, behavior
 9:       subnet ← fit-weights(subnet, x, (t0, t1))
10:   dnn ← freeze-behavior-subnet(dnn, subnet)
11:   for 0 to n_epochs do
12:     for (context, behavior) in archive do
13:       x ← select-random(train_data)
14:       x ← transform(context, x)
15:       t0, t1 ← context, behavior
16:       dnn ← fit-weights(dnn, x, (t0, t1))
17:   return dnn
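Building on the previous sketch, a minimal tf.keras rendering of the two-stage procedure in Algorithm 5.2 might look as follows, assuming `oracle` is the full model and `behavior_subnet` is a tf.keras.Model wrapping the shared behavior classification layers; dataset contents, optimizer, and epoch counts are assumptions.

def train_two_stage(oracle, behavior_subnet, stage1_data, stage2_data, n_epochs=10):
    # Stage 1: train the behavior classification sub-network in isolation,
    # minimizing cross entropy between the category label t1 and prediction y1
    # (stage1_data is assumed to yield (context parameters, category) pairs).
    behavior_subnet.compile(optimizer="adam", loss="categorical_crossentropy")
    behavior_subnet.fit(stage1_data, epochs=n_epochs)

    # Stage 2: freeze the behavior sub-network, then train the whole oracle to
    # minimize the mean squared error between the true context t0 and estimate y0
    # (stage2_data is assumed to yield (synthetic image, {"context": t0}) pairs).
    behavior_subnet.trainable = False
    oracle.compile(optimizer="adam", loss={"context": "mse"})
    oracle.fit(stage2_data, epochs=n_epochs)
    return oracle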
5.3 Empirical Validation

This section presents results from an empirical evaluation of the Enlil framework. Experiments have been performed on an LES that uses an image-processing DNN for object detection. Multiple trials have been conducted to evaluate the impact of specific weather conditions on the LES. Synthetic examples of each weather condition are generated using both Enlil and a Monte Carlo method. Next, predictive behavior oracles are trained using examples generated by each method for comparison. Results from these experiments answer the following research questions:

RQ1.) Can Enlil generate a balanced distribution of operating contexts with respect to system behavior more efficiently than Monte Carlo sampling?

RQ2.) Can operating contexts from Enlil train a more accurate behavior oracle when compared to an oracle trained by Monte Carlo contexts?

5.3.1 Datasets

Training and validation images have been acquired from a forward-facing camera mounted on an automobile, sourced from the KITTI Vision Benchmark Suite [70] and the Waymo Open Dataset [205] for autonomous driving. These datasets have been chosen because they are publicly available, well-documented, and contain high-quality images taken from real-world driving scenarios. A majority of images from each dataset depict clear-weather scenes, and therefore, it is unknown how a DNN trained only by default data will react under introduced adverse weather conditions. Validating our approach on two independently collected datasets helps to illustrate that our techniques are not dataset dependent.

5.3.2 Experimental Setup

Object detection DNNs for this study have been implemented using TensorFlow [1] machine learning libraries and a RetinaNet [65, 139] architecture. When trained and evaluated with unaltered KITTI images, the object detector achieved a mean recall of 88% overall (94% for vehicles and 70% for pedestrians/cyclists). For Waymo images, the object detector achieved a mean recall of 95% overall (95% for vehicles and 91% for pedestrians/cyclists).

This study considers an assortment of weather conditions. Specific conditions include fog, lens flare on the camera lens, raindrops placed on the camera lens, and rainfall occluding portions of the image. These weather conditions are simulated by performing parameterized image processing techniques on real-world images taken from the original KITTI and Waymo datasets. Examples of each weather condition can be seen in Figure 5.6. An unaltered image from the KITTI dataset is shown in Figure 5.6a, followed by example contexts of fog, lens flares, raindrops, and rainfall in Figures 5.6b, 5.6c, 5.6d, and 5.6e, respectively. Table 5.1 lists parameters for each weather condition with respective value ranges. Each set of sample values drawn for these parameters is considered a unique operating context for the corresponding weather condition.

The behavior categories referenced in these experiments are the same as those specified in Section 5.2. Thresholds have been set such that Cat 1 ("expected object misdetection") is assigned when an introduced phenomenon causes recall to decrease more than 5% over a set of validation images, Cat 2 ("expected pedestrian misdetection") is assigned when there is greater than a 1% decrease in recall for pedestrians over a set of validation images, and Cat 3 ("expected successful detection") is assigned for all other cases. Categories are assigned to a given operating context after executing the evaluation procedure in Step 2 (Section 5.2) to compare the recall for a set of validation images with and without the newly-introduced phenomenon.

Table 5.1: Parameters for Environmental Effects

Parameter                         Permissible Values
fog_density                       0 to 100%
fog_intensity                     0 to 100%
lensflare_blur_radius             3 to 20% image width
lensflare_chromatic_distortion    0 to 100%
lensflare_ghost_spacing           0 to 100% image radius
lensflare_halo_width              0 to 100% image radius
lensflare_light_x                 0 to 100% image width
lensflare_light_y                 0 to 50% image height
lensflare_light_radius            0 to 10% image width
raindrop_x                        0 to 100% image width
raindrop_y                        0 to 100% image height
raindrop_radius                   0 to 33% image width
raindrop_blur                     3 to 6% image width
rainfall_angle                    π/4 to 3π/4 radians
rainfall_density                  0 to 100%
rainfall_intensity                0 to 100%
rainfall_seed                     0 to 100

Figure 5.6: Examples of contexts generated for each environmental effect: (a) shows the original, unaltered image, (b) shows examples of fog applied to the scene, (c) shows examples of lens flare applied to the camera, (d) shows examples of raindrops on the camera lens, and (e) shows examples of rainfall applied to the scene.

5.3.3 Evaluation of Context Generation

To evaluate the distribution of operating contexts generated by Enlil, multiple archives have been created for each environmental condition, with contexts divided into the specified behavior categories (Cat 1, Cat 2, and Cat 3). Separate trials were executed with alternative random seeds and varying archive sizes, ranging from 300 to 1,200 contexts. The distribution of contexts in each archive is compared to a Monte Carlo selection of contexts (i.e., the parameters in Table 5.1 have been randomly sampled).
Tables 5.2 and 5.3 show the distribution of contexts generated by Enlil and Monte Carlo sampling when applied to the KITTI and Waymo datasets, respectively. All archives produced by Enlil were found to contain an even distribution of contexts belonging to each behavior category. In contrast, a Monte Carlo sampling method was found to consistently produce an uneven distribution of contexts with respect to each behavior category. Tables 5.2 and 5.3 also show the number of samples each method would need to generate in order to achieve an even distribution at each respective archive size. For Enlil, this is the number of candidate contexts produced by its EA. For the Monte Carlo method, this number is a projection based on the observed distribution. A key benefit of using Enlil over a Monte Carlo method is its ability to efficiently uncover an equal proportion of contexts in cases where one behavior category is significantly less likely to occur (i.e., corner cases). To train a model that can effectively classify contexts into each behavior category, it is necessary to have relatively equal representation for each data category.

5.3.4 Evaluation of Predictive Behavior Models

To compare the usefulness of Enlil-generated operating contexts to Monte Carlo contexts, separate behavior oracles have been trained with each archive generated by each method using the procedure described in Step 3 (Section 5.2). The accuracy of each behavior oracle has been measured for an independent set of test images with random environmental contexts applied, with mean results shown in Tables 5.4 and 5.5 for the KITTI and Waymo datasets, respectively.

Table 5.2: Context Generation Results for KITTI Data

                               Enlil                        Monte Carlo
             Archive    Ratio        Samples to     Ratio        Samples to
Effect       Size       Cat 1:2:3    Balance        Cat 1:2:3    Balance
Fog          300        1:1:1        574            12:1:4       1,250
             600        1:1:1        910            25:1:7       4,500
             1,200      1:1:1        1,691          19:1:6       15,285
Lens Flare   300        1:1:1        500            1:1:15       1,810
             600        1:1:1        817            1:1:15       3,384
             1,200      1:1:1        1,431          1:1:13       6,286
Raindrop     300        1:1:1        438            1:1:4        703
             600        1:1:1        810            1:1:4        1,312
             1,200      1:1:1        1,428          1:1:4        2,712
Rainfall     300        1:1:1        610            22:1:1       2,457
             600        1:1:1        883            19:1:1       4,279
             1,200      1:1:1        1,581          19:1:1       8,581

Table 5.3: Context Generation Results for Waymo Data

                               Enlil                        Monte Carlo
             Archive    Ratio        Samples to     Ratio        Samples to
Effect       Size       Cat 1:2:3    Balance        Cat 1:2:3    Balance
Fog          300        1:1:1        528            4:2:1        711
             600        1:1:1        789            5:2:1        1,670
             1,200      1:1:1        1,437          6:2:1        3,650
Lens Flare   300        1:1:1        482            1:7:55       6,667
             600        1:1:1        797            1:7:51       11,966
             1,200      1:1:1        1,409          1:8:50       24,233
Raindrop     300        1:1:1        481            1:4:17       2,566
             600        1:1:1        826            1:4:18       4,690
             1,200      1:1:1        1,421          1:3:16       8,103
Rainfall     300        1:1:1        533            8:1:1        1,087
             600        1:1:1        826            8:1:1        2,094
             1,200      1:1:1        1,473          9:1:1        4,362

Table 5.4: Accuracy of KITTI Behavior Oracles

                               Enlil                           Monte Carlo
             Archive    Accuracy (%)                    Accuracy (%)
Effect       Size       Cat 1 / 2 / 3        Mean       Cat 1 / 2 / 3        Mean
Fog          300        97.9 / 34.2 / 87.3   73.1       95.8 / 0.0 / 93.1    63.0
             600        98.7 / 89.0 / 73.5   87.1       96.6 / 0.0 / 93.1    63.2
             1,200      98.5 / 90.1 / 76.4   88.3       96.9 / 0.0 / 93.2    63.4
Lens Flare   300        84.6 / 53.9 / 82.3   73.6       32.4 / 0.0 / 95.6    42.7
             600        85.4 / 67.7 / 83.0   78.7       50.5 / 0.0 / 92.6    47.7
             1,200      85.0 / 75.3 / 84.5   81.6       75.0 / 40.2 / 94.8   70.0
Raindrop     300        87.9 / 67.9 / 92.0   82.6       84.3 / 26.9 / 96.5   69.2
             600        91.8 / 76.1 / 85.4   84.4       83.0 / 57.1 / 94.5   78.2
             1,200      89.3 / 79.2 / 88.3   85.6       93.2 / 60.1 / 79.7   77.5
Rainfall     300        87.9 / 75.1 / 26.5   63.1       99.6 / 1.6 / 41.6    47.6
             600        88.2 / 74.5 / 26.6   63.1       99.3 / 3.3 / 45.7    49.4
             1,200      89.7 / 73.3 / 27.3   63.4       96.6 / 4.7 / 71.5    57.6
Table 5.5: Accuracy of Waymo Behavior Oracles

                               Enlil                           Monte Carlo
             Archive    Accuracy (%)                    Accuracy (%)
Effect       Size       Cat 1 / 2 / 3        Mean       Cat 1 / 2 / 3        Mean
Fog          300        95.5 / 78.3 / 72.4   82.1       98.7 / 75.5 / 72.2   82.1
             600        97.2 / 83.9 / 69.0   83.4       98.7 / 84.9 / 62.0   81.9
             1,200      97.8 / 81.1 / 76.4   85.1       98.0 / 88.2 / 53.3   79.8
Lens Flare   300        64.8 / 21.1 / 89.7   58.5       10.6 / 0.9 / 99.9    37.1
             600        80.3 / 38.4 / 75.8   64.8       21.8 / 2.5 / 99.4    41.2
             1,200      80.4 / 40.3 / 80.3   67.0       51.2 / 3.8 / 99.2    51.4
Raindrop     300        69.8 / 44.5 / 65.7   60.0       0.0 / 0.0 / 99.9     33.3
             600        75.3 / 53.6 / 86.3   71.7       13.8 / 6.3 / 98.3    39.5
             1,200      77.0 / 52.8 / 88.4   72.8       56.1 / 37.1 / 96.7   63.3
Rainfall     300        82.5 / 58.2 / 24.3   55.0       97.9 / 12.7 / 24.4   45.0
             600        81.5 / 69.6 / 25.1   58.8       97.9 / 10.3 / 32.6   46.9
             1,200      82.1 / 60.7 / 36.0   59.6       97.7 / 12.7 / 37.9   49.4

For each archive generated, the accuracy of behavior oracles trained with Enlil-generated contexts was either comparable to or better than that of oracles trained with Monte Carlo contexts. For smaller archive sizes, Enlil behavior oracles were significantly more accurate. Since the distribution of Monte Carlo contexts is heavily biased, more examples are required to adequately represent all behavior categories, making behavior classification difficult for small sample sizes. These results show that Enlil is more likely to produce an accurate behavior oracle when compared to an oracle trained only with randomly-sampled contexts, independent of archive size.

5.3.5 Threats to Validity

Since these experiments use synthetically generated data, results may not directly apply to the real world. Use of Enlil is recommended when the "reality gap" [96] is insignificant. The simulated phenomena have been implemented to visually match real-world analogues and may help guide follow-on studies under real weather conditions to alleviate concerns about the reality gap. Figure 5.7 has been included to demonstrate an Enlil-Oracle applied to images of real raindrops. In these examples, Enlil-Oracle was able to correctly perceive the presence of a raindrop and predict that it could possibly interfere with the performance of the object detector.

The use of stochastic functions may influence the results of this study. Since the DNNs in this study have been trained by a stochastic gradient descent method and since Monte Carlo sampling is also stochastic, some variability in the results presented may be expected. To account for this variation, all results reported have been averaged over multiple runs.

5.4 Related Work

This section overviews work related to this chapter, including model testing and predictive behavior modeling.

5.4.1 Model Testing

Automated techniques are essential when evaluating the numerous possible interactions between a cyber-physical system and its physical environment. Model testing has been advocated to shift testing away from the physical implementation of a cyber-physical system to a more abstract representation [24]. The removal of hardware constraints facilitates a more efficient search of the system's interactions within a simulated environment. Related tools such as Veritas [63] and Proteus [62] have been introduced to perform adaptive run-time testing of a system by monitoring utility functions [36, 215] that measure a degree of system requirements satisfaction [45]. Enlil's approach can be considered a form of statistical model checking [132], where multiple runs of a system are evaluated under varying operating conditions and statistical evidence is gathered to support claims about the system's behavior.
For example, when evaluating a DNN, statistical behavior metrics may include accuracy or a confusion matrix based on validation data. By model testing with a widely diverse set of contexts, Enlil can efficiently gather unique counterexamples [33] that inform a more generalized description of how the system interacts with its environment.

Figure 5.7: Proof-of-concept validation of Enlil-Oracle on real-world examples. (a) and (b) show images with real raindrops from the Waymo dataset [205]. (c) and (d) show that Enlil-Oracle was able to perceive the possibly disruptive raindrop in each image (highlighted in red) and report "expected misdetection."

5.4.2 Predictive Behavior Modeling

This chapter extends prior research in the use of machine learning to infer behavior models for software testing. Walkinshaw et al. [61, 170] described a framework that uses model inference to emulate software components. In their work, the inferred behavior model is trained on functional inputs and outputs of a tested software component. This chapter describes a comparable conceptual framework but differs in a few key aspects. First, Enlil interacts with the external interface of a cyber-physical LES, rather than modeling the direct input and output of a single unit software function or component. Enlil trains a model that relates a cyber-physical system's sensor data to its expected behavior patterns. Second, based on the hypothesis that a diverse training set will enable accurate behavior models to be trained more efficiently, Enlil uses a diversity-driven method to generate training examples for its behavior models, rather than a greedy optimization method. Finally, a major motivation of this chapter is to assess the robustness of DNNs and model the impact of uncertain conditions on DNNs, which has not been addressed in the work of Walkinshaw et al.

5.5 Summary

It is non-trivial but necessary to determine the behavior of safety-critical LESs under all operating contexts. This chapter has shown how the Enlil framework enables the creation of a behavior oracle (e.g., Enlil-Oracle) that can predict how an LES will behave under conditions of uncertainty. Enlil has been demonstrated to uncover a wide variety of training contexts that result in behaviors that would be underrepresented by Monte Carlo sampling. Furthermore, Enlil can train behavior models more efficiently (i.e., requiring fewer training examples) to achieve equivalent or greater accuracy than Monte Carlo methods for introducing adversity. Because it uses novelty search to achieve a more diverse sampling, Enlil requires additional overhead in comparison to Monte Carlo sampling. The number of parameters in the operating specification, the archive size, and the complexity of the given evaluation procedure are major factors to consider when using Enlil. For this study, each trial took approximately one additional hour for context generation with Enlil versus Monte Carlo sampling on a single workstation. Since these procedures are intended to be executed at design time, the additional overhead is considered tolerable. Behavior oracles can potentially inform developers about training gaps and system limitations that require additional development to enable proactive mitigation strategies. Self-adaptive systems can use behavior oracles to determine how the managed system will react to conditions for which it has not explicitly been trained.
For example, an autonomic manager can harness an Enlil-Oracle to anticipate failure of a learning component and trigger preventative actions at run time. Chapter 7 explores in detail a self-adaptive framework that harnesses behavior oracles to prevent unsafe operations of an autonomous vehicle at run time. Chapter 8 further explores how behavior oracles can be used in a service-oriented architecture for the purpose of establishing more trustworthy AI in the presence of environmental uncertainty.

CHAPTER 6
ASSURANCE CASES FOR SELF-ADAPTIVE SYSTEMS

For self-adaptive cyber-physical systems, such as autonomous robots, steps need to be taken to ensure requirements remain satisfied across run-time adaptations. ROS is widely used in both research and industrial applications as a middleware infrastructure for robotic systems [176]. However, ROS by default does not provide any software assurance for systems that utilize it. This chapter introduces AC-ROS (developed in collaboration with Cheng et al. [34]), a framework to manage run-time adaptations for ROS systems with respect to established assurance cases. Assurance cases provide structured arguments to certify that a system satisfies requirements and can be modeled graphically with GSN. AC-ROS implements a ROS-based autonomic manager, informed by GSN models, to assure system behavior adheres to requirements across run-time adaptations. For this study, AC-ROS is implemented and tested on EvoRally, a 1:5-scale autonomous vehicle (described in Chapter 2).

The remainder of this chapter is organized as follows. Section 6.1 overviews and motivates this study. Section 6.2 describes how functional assurance cases can be modeled for run-time assessment with AC-ROS. Section 6.3 describes related work. Finally, Section 6.4 provides a concluding summary for this chapter.

6.1 Overview

When autonomous cyber-physical systems are introduced into safety-critical domains, requirements violations can lead to property damage, physical injuries, and even loss of life (e.g., automobiles, aircraft, medical devices, power grids, etc.). Assurance certification is one approach to ensure such systems operate correctly, where assurance cases [158] are used to specify relationships between development artifacts (e.g., models and code) and the required evidence (e.g., test cases) needed to verify the system operates as intended. However, many cyber-physical applications rely on frameworks that do not directly address software assurance (e.g., ROS-based systems). Therefore, a challenge is how to integrate assurance cases into such applications to certify they operate and adapt as expected. This chapter demonstrates how assurance case driven self-adaptation can be realized in ROS-based systems.

Over the past decade, research efforts have addressed assurance for self-adaptive systems [30, 218, 219], including the development of formal analysis techniques [28, 57, 234], formal modeling [93], and proactively mitigating uncertainty [8, 152, 222]. For assurance certification purposes, GSN has been developed to specify assurance cases graphically [2]. A GSN model defines specific claims for how requirements can be satisfied and what evidence is required to support those claims. Traditionally, the certification process has largely been done manually [72, 104, 125, 186].
While work has been conducted to automate analysis of GSN models [46, 138], including security and safety properties analysis for self-adaptive systems [29, 97], GSNs have not been used to address assurance for self-adaptive ROS-based systems. This chapter introduces AC-ROS, a MAPE-K framework that uses GSN assurance case models to manage run-time adaptations for ROS-based systems. A key insight is that the modular ROS publish-subscribe infrastructure is ideal for (1) implementing the MAPE-K control loop and event-driven adaptation and (2) supporting run-time monitoring of the environment and onboard system state. AC-ROS realizes the MAPE-K infrastructure in ROS, using GSN models at run time to guide adaptations that satisfy assurance requirements. Through automatic run-time analysis of GSN models, AC-ROS facilitates the development of assured, adaptive, and autonomous robotic systems.

This chapter presents the main components of AC-ROS and describes a "proof-of-concept" implementation for EvoRally, a 1:5-scale off-road autonomous vehicle. The sensing, control, and actuation software of EvoRally is entirely ROS-based. In addition, a simulation of EvoRally provides a means to validate AC-ROS prior to deployment. Once validated, the ROS code used for simulation can be directly uploaded to EvoRally, thereby providing an assurance-driven digital twinning capability [115] for self-adaptive ROS-based systems.

6.2 Functional Assurance for Robotic Self-Adaptive Systems

This section describes the core components and processes of AC-ROS, a framework for assuring the functional behavior of self-adaptive ROS-based systems. AC-ROS systematically integrates assurance information from GSN models with platform-specific information to guide run-time adaptations for a self-adaptive ROS-based autonomous platform. As such, AC-ROS enables ROS-based platforms to conform to assurance cases at run time, while adapting as necessary. The remainder of this section describes GSN modeling in more detail, how GSN models are created and analyzed at design time, and how they are used at run time to manage adaptations.

6.2.1 Goal Structuring Notation for Functional Assurance

Assurance cases provide a means to certify that software operates as intended [72]. With assurance arguments, claims are made about how functional and non-functional requirements are met, and each claim requires supporting evidence for validation. One way to document an assurance case argument is through GSN modeling [2]. GSN allows an argument to be defined as a graphical model, with each claim represented as a goal and each piece of supporting evidence (e.g., test cases, documentation, etc.) represented as a solution. Additional elements, such as assumptions, justifications, contexts, and strategies, are provided within the notation to further expound upon an assurance argument.

A simple GSN example is shown in Figure 6.1 along with an abbreviated legend of selected elements and relationships defined by the GSN standard [2]. The figure depicts a single module (M0) of an assurance argument that claims a rover can successfully patrol its environment (M0-G1). Three sub-goals are included to support the top-level claim (M0-G1.1, M0-G1.2, and M0-G1.3, respectively). M0-G1.1 claims that each waypoint is visited within an allotted timeframe, M0-G1.2 claims that the rover avoids collisions, and M0-G1.3 claims that the rover maintains a safe power level. M0-G1.1 has an assumption, M0-A1.1.1, indicating that all the waypoints are reachable.
Both M0-G1.1 and M0-G1.2 reference a separate module (M1), where additional supporting assurance arguments are defined. Strategies M0-S1.1.1 and M0-S1.3.1 provide scope and clarification on how supported claims should be resolved. Solutions M0-S1.1.1.1.1 and M0-S1.3.1.1.1 define supporting evidence.

Figure 6.1: Example module of a GSN model. This module depicts an assurance argument for the claim that an autonomous rover can successfully patrol its environment.

As a graphical notation, GSN enables stakeholders with varying degrees of expertise to trace and audit the logic of an assurance case. In large-scale systems, however, the volume of documentation can become cumbersome and difficult to navigate manually [125], motivating recent work to explore automated GSN creation and refinement [46, 47] and automated GSN analysis [29, 97, 138]. Models have also been used at run time to manage self-adaptation [15, 18, 153], including assurance-focused approaches [35]. AC-ROS takes inspiration from these works to support a GSN-model driven approach to address assurance of a cyber-physical ROS-based system at run time.

6.2.2 Constructing and Evaluating an Assurance Case Model

Since AC-ROS requires assurance cases in the form of GSN models, a graphical tool is needed for developers to construct GSN models and export them into a format that can be parsed for run-time evaluation. This chapter introduces GSN-Draw for this purpose. We developed GSN-Draw as a graphical user interface that enables developers to construct an assurance case with standard GSN (Version 2) conventions [2]. GSN-Draw has been implemented as a web application to enable the construction of a GSN model. GSN elements can be selected and instantiated into a "drawing" pane. Instantiated elements can then be linked together by any supported relationship type to form a connected graph. Constructed GSN models can be exported into a parsable XML format (the Extensible Markup Language encodes information in a format that is both human-readable and machine-readable [225]). To demonstrate, Figure 6.2 shows a GSN model being constructed with GSN-Draw.

The GSN standard allows for a single GSN model to be composed of multiple interconnected modules. These modules are developed individually within a given project using the GSN-Draw tool. Figure 6.3 depicts a set of GSN modules that describe an assurance case for the EvoRally platform. For example, Figure 6.3a shows the top-level module (M0) of a GSN model created for an autonomous rover, whose mission is to visit a set of waypoints. This module makes the most general claim (M0-G1) about a rover's ability to complete its mission. A supporting argument is expressed by all elements connecting to M0-G1. Additional sub-modules, M1 through M5 (depicted in Figure 6.3b through Figure 6.3f, respectively), are developed in a similar manner but constructed separately to improve readability and enable the reuse of common sub-arguments. (Pink-tabbed boxes refer to sub-modules for a given project.)

The assurance argument laid out by a GSN model must be supported by evidence. Solution elements (such as M0-S1.1.1.1.1 in Figure 6.3a) state what specific evidence is required to satisfy parent claims. For this study, solution elements are expressed by utility functions [37, 106, 152, 215]. Utility functions are derived from system and/or environmental properties that can be monitored at run time to determine the satisfaction of system requirements [177]. The utility function for
M0-S1.1.1.1.1 checks the last time each waypoint was visited by the rover, and it is satisfied when each waypoint has been visited within the maximum allotted time. In a similar manner, the entire GSN model can be assessed by evaluating all utility functions against the current state of the rover. Once all utility functions have been evaluated, graph tracing can determine if the root-level claim is satisfied. GSN-Interpret, described by Cheng et al. [34], implements the functionality to evaluate GSN models from GSN-Draw for the AC-ROS framework.

Figure 6.2: Screenshot of the GSN-Draw user interface. GSN-Draw enables the creation of standard GSN models that can be imported into AC-ROS.

Other assurance case editing tools provide varying degrees of support for the GSN standard [145]. A primary motivation for developing GSN-Draw is to facilitate the use of utility functions as evidence for GSN solution elements, which is not supported by existing tools. Additionally, for adaptive systems, it is expected that assurance cases will include optional branches and alternative strategies for assuring different system adaptations. GSN-Draw enables a GSN model to be created fully compliant with the GSN (Version 2) standard, including the extensions to support argument patterns (i.e., optionality and modules).

Figure 6.3: A standard GSN model of an assurance case for a patrolling rover. (a) Module M0 argues that a rover can successfully patrol its environment. (b) M1 argues that a rover can navigate its environment safely. (c) M2 argues that a rover can safely navigate with full sensor capabilities. (d) M3 argues that a rover can safely navigate with only limited sensor capabilities. (e) M4 argues that a rover can localize its position. (f) M5 argues that a rover can determine its speed.

6.2.3 A Shared Knowledge Base for Managing Adaptations

AC-ROS extends the generic MAPE-K control loop design, which shares a knowledge base across each iteration of the control loop. For AC-ROS, the knowledge base is an aggregate collection of static and dynamic data stores relevant to the ROS-based platform and the GSN models for its assurance. For brevity, this section uses the prefix "KB" to refer to individual data stores within the aggregate knowledge base for AC-ROS. Descriptions for each data store follow.

KB1. GSN Models. GSN models for all relevant assurance cases are stored in KB1 in an XML format.

KB2. Mapping to ROS Topics. In order to determine how utility function parameters relate to platform-specific ROS topics, KB2 contains a mapping for each utility parameter referenced in KB1 to a ROS topic for the managed platform. For example, the utility function for M0-S1.3.1.1.1 in Figure 6.3a references a rover.power parameter, which is linked to the charge_percent ROS topic via this mapping. This mapping is not generated automatically, since it is platform-specific and requires an understanding of the available ROS topics.

KB3. ROS Launch Files. Information needed to configure and initialize the run-time monitoring processes for AC-ROS is stored in KB3. This information is stored as ROS launch files, which are automatically generated to provide the launched ROS nodes with details about which ROS topics to monitor and their corresponding utility parameters (based on the data in KB2).

KB4. Adaptation Tactics. Adaptation tactics [37] for the ROS-based platform are stored in KB4. Each tactic comprises a context (i.e., the conditions under which a given adaptation is applicable), a sequence of actions, and the consequences of adaptation (i.e., the effect(s) of adaptation). Each adaptation tactic is a high-level sequence of commands to enable changes in behavior "modes" (e.g., a mode change could switch sensors from cameras to lidar for obstacle detection). These tactics are described by configuration files that are loaded at the beginning of run time.

KB5. System State. Platform-specific information gathered by AC-ROS at run time is stored in KB5. This data store may include raw data (e.g., wheel speeds), meta information (e.g., frequency and distributions of data stream samples), and various configuration parameters (e.g., settings for obstacle detection mode).
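For illustration, a minimal sketch of the kind of information kept in the knowledge base follows: a KB2-style mapping from utility parameters to ROS topics and a utility function used as GSN solution evidence. Apart from charge_percent (named in the text), the topic names, the threshold value, and the function name are illustrative assumptions.

# KB2: mapping from utility parameters referenced in the GSN model to
# platform-specific ROS topics on the managed platform.
UTILITY_PARAM_TO_TOPIC = {
    "rover.power":    "charge_percent",     # topic named in the text
    "rover.position": "odometry/filtered",  # assumed topic
    "rover.speed":    "wheel_speeds",       # assumed topic
}

MIN_SAFE_POWER = 0.25  # assumed threshold for the safe-power-level claim

def utility_safe_power(system_state):
    """Evidence for the claim that the rover maintains a safe power level
    (e.g., solution M0-S1.3.1.1.1): satisfied while rover.power stays above
    the minimum threshold, based on the current system state (KB5)."""
    return system_state.get("rover.power", 0.0) >= MIN_SAFE_POWER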
6.2.4 Design-time Activities

The DFD in Figure 6.4 illustrates the design-time steps for modeling an assurance case and using it to configure the run-time ROS environment. Descriptions for each step follow.

Figure 6.4: Data flow for the process AC-ROS takes to prepare GSN models and ROS configuration at design time. Circles, boxes, arrows, and parallel lines respectively represent computational steps, external entities, data flow, and data stores.

Step D1. Construct GSN Model. GSN-Draw enables a requirements engineer to interactively develop a graphical model for an assurance case in the form of one or more GSN modules.

Step D2. Export GSN Model. GSN-Draw exports XML files to KB1, with each corresponding to a respective GSN module. Exported files preserve the structure and contents of the GSN model, including element properties, relationship dependencies, and associated utility functions.

Step D3. Parse Utility Parameters. XML files exported from GSN-Draw are parsed to determine the utility functions associated with each solution of the GSN model, and the referenced functional parameters are then extracted. For example, this step would parse the utility parameter rover.power from solution M0-S1.3.1.1.1 in Figure 6.3a.

Step D4. Create ROS Launch Files. For run-time monitoring, AC-ROS references the utility parameters specified in the GSN model (e.g., rover.power). However, to collect relevant data from the managed ROS-based platform for each utility parameter, AC-ROS requires a mapping to the relevant ROS topics. Step D4 creates ROS launch files that define how its run-time monitoring ROS nodes should be instantiated (i.e., to which topics each should subscribe and the corresponding utility function in the GSN model). The resulting launch files are stored in KB3 for later use by AC-ROS at run time. The mapping between utility parameters and ROS topics is provided by KB2. This use of indirection and abstraction enables the requirements engineer to design a GSN model without needing to know implementation details of a specific ROS-based platform, thereby facilitating reuse of the GSN model for alternative platforms.
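To make Step D3 concrete, the following sketch extracts the utility parameters referenced by each solution element from an exported GSN module. Since the GSN-Draw export schema is not reproduced in this chapter, the XML tag and attribute names below are hypothetical.

import re
import xml.etree.ElementTree as ET

def parse_utility_parameters(gsn_xml_path):
    """Return {solution_id: [utility parameters]} for one exported GSN module."""
    root = ET.parse(gsn_xml_path).getroot()
    params = {}
    for solution in root.iter("solution"):                  # hypothetical element tag
        expression = solution.get("utility_function", "")   # hypothetical attribute
        # Utility parameters are assumed to take the dotted form 'rover.power'.
        params[solution.get("id")] = re.findall(r"\b[a-z_]+\.[a-z_]+\b", expression)
    return params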
6.2.5 Assuring System Adaptations at Run time

The DFD in Figure 6.5 illustrates the run-time steps for AC-ROS to manage system adaptations such that they satisfy an assurance case for the managed ROS-based autonomous platform. These steps follow the general design of a MAPE-K control loop, and descriptions for each step follow.

Figure 6.5: Data flow for run-time processes involved with managing adaptations of a ROS-based platform.

Step R1. Monitor. Step R1 comprises one or more monitor processes (i.e., ROS nodes) that observe the managed autonomous platform and update the knowledge base (KB5) accordingly. Launch files (KB3) for each monitor process determine which subsystem of the managed platform to observe (e.g., camera, lidar, etc.), the relevant ROS topics, and the associated utility parameters. Monitor processes are also capable of performing preprocessing to further refine measurements from raw ROS data (e.g., measure frequency, infer speed from GPS measurements, etc.). As each monitor process observes data from the managed platform and updates the knowledge base (Step R1.a), it also checks for any violations of its associated utility function (Step R1.b). Monitor processes are instantiated to account for every corresponding utility function parameter referenced by the GSN model, including all graph branches in the case of optionality. During run time, it is assumed all optional branches of the GSN model graph are available to be satisfied and are thus monitored. However, if needed, AC-ROS supports adaptive monitoring by allowing nodes to subscribe to, and unsubscribe from, the relevant ROS topics during run time.

Step R2. Analyze. Step R2 (i.e., the analyze process) determines whether the managed platform must adapt to ensure that it delivers acceptable behavior with respect to the assigned assurance case. When Step R2 is initiated, GSN models are loaded from KB1 and stored in an internal graph representation (Step R2.a). The source XML data specifies a root goal element for each GSN module and parent-child relationships for each GSN element. Subgraphs are constructed for each module to preserve the specified relationships from root to leaf nodes. Links are then established to the module subgraphs in order to produce an internal graph of the entire GSN model. Step R2 is activated periodically, as well as when a monitor process from Step R1 reports a utility function violation. AC-ROS retrieves current state information from KB5 and references the given GSN models to determine if the managed platform complies with the modeled assurance case (Step R2.b). AC-ROS evaluates the GSN graph via a depth-first traversal to determine the satisfaction of leaf-level solution elements (by evaluating their associated utility functions). Example utility functions can be found in the leaf nodes of GSN modules in Figure 6.3. Based on this evaluation of the GSN model, AC-ROS determines whether any adaptations are necessary. When the GSN model is not satisfied, the existing adaptation tactics are reviewed based on the descriptions of their contexts and their consequences of adaptation. For this work, it is assumed that corresponding adaptation tactics have been predefined to cover each combination of utility function violations. Ongoing research includes the use of reinforcement learning to determine which adaptation tactics are best suited for the given context. Once an appropriate adaptation tactic has been identified (Step R2.c), a reference to the selected adaptation is sent to the plan process.
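To make the traversal in Step R2.b concrete, the following sketch evaluates an internal GSN graph depth-first, assuming that solution elements hold utility functions and that every other element is satisfied only when all of its supporting children are satisfied; optional branches and strategies are not modeled here, and the node structure is illustrative.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class GsnNode:
    element_id: str
    children: List["GsnNode"] = field(default_factory=list)
    utility: Optional[Callable[[dict], bool]] = None  # set only on solution elements

def satisfied(node: GsnNode, system_state: dict) -> bool:
    """Depth-first traversal: a solution is satisfied when its utility function
    holds for the current system state (KB5); any other element is satisfied
    when all of its supporting children are satisfied."""
    if node.utility is not None:
        return node.utility(system_state)
    return all(satisfied(child, system_state) for child in node.children)

# The root-level claim (e.g., M0-G1) holds only if the whole traversal succeeds;
# otherwise an adaptation tactic is selected (Step R2.c).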
Execute Step R4 (i.e., the execute process) receives an adaptation procedure and translates it into platform- specific commands. These commands are published to ROS topics to be carried out by the managed platform. 6.3 Related Work This section overviews related work in the area of assurance-driven cyber-physical systems, self- adaptive cyber-physical systems, and the evaluation of assurance cases. While other work has explored model engineering with ROS [7, 19, 88] and self-adaptation with ROS [4, 32], this chapter specifically brings together self-adaptation with MAPE-K feedback to support assurance through GSN assurance cases for ROS-based systems. Calinescu et al. introduced ENTRUST [29], a method for assurance-driven adaptation. Like AC-ROS, ENTRUST implements a MAPE-K control loop for self-adaptive systems, guided by GSN models. However, during the analysis step of the MAPE-K loop, ENTRUST uses a probabilistic verification engine to assess stochastic finite state transition models (i.e., Markov chains) of the controlled system. In contrast, AC-ROS has been designed to take advantage of ROS, and instead of relying on a model of the autonomous platform for assurance assessments, AC-ROS directly monitors real system data via ROS topics and evaluates assurance arguments based on utility functions. The RoCS framework [179], created by Ramos et al., implements a MAPE-K control loop to manage adaptations for robotic systems, with potential for integration into ROS. However, a key difference between RoCS and AC-ROS is that AC-ROS uses GSN models to guide and assure adaptations at run time. While RoCS and AC-ROS both realize the MAPE-K control loop, AC-ROS ensures that safety requirements are upheld with each executed system adaptation. Alternative methods to automated evaluation of assurance cases have been explored. Lin et al. [138] introduced a method that uses Dempster-Shafer theory to calculate confidence for an assurance case. AC-ROS acquires evidence directly via the evaluation of utility functions, 108 and thus confidence is not explicitly considered when evaluating evidence for an assurance case. Traditionally, assurance cases have focused on certifying functional requirements for the purpose of safety, although recent work has addressed assurance for security requirements [97]. Thus far, AC-ROS has been implemented to manage only functional assurance cases. 6.4 Summary This chapter introduced the AC-ROS framework to support an assurance case driven approach to develop and manage ROS-based adaptive systems. AC-ROS leverages the ROS publish-subscribe communication paradigm to integrate an assurance case (modeled by GSN) into a MAPE-K control loop to enable assured run-time adaptation. Chapter 7 further explores how system requirements models can drive MAPE-K adaptations to maintain safe operation of an LES. Finally, Chapter 8 combines these technologies into a single cohesive service-oriented framework to develop more trustworthy LESs. 109 CHAPTER 7 GOAL MODELS FOR LEARNING-ENABLED SYSTEMS Increasingly, safety-critical systems include AI and machine learning components (i.e., LECs). However, when behavior is learned in a training environment that fails to fully capture real-world phenomena, the response of an LEC to untrained phenomena is uncertain, and therefore cannot be assured as safe. Automated methods are needed for self-assessment and adaptation to decide when learned behavior can be trusted. This work introduces MoDALAS (developed in collaboration with Chan et al. 
[129]), a model-driven framework to manage self-adaptation of an LES to account for run-time contexts for which the learned behavior of LECs cannot be trusted. The MoDALAS framework enables an LES to monitor and evaluate goal models at run time to determine whether or not LECs can be expected to meet functional objectives. Using this framework enables stakeholders to have more confidence that LECs are used only in contexts comparable to those validated at design time. The remainder of this chapter is organized as follows. Section 7.1 overviews and motivates this chapter. Section 7.2 describes how goal models are processed by MoDALAS. Section 7.3 describes how MoDALAS manages LESs at run time to mitigate the impact of uncertain conditions. Section 7.4 presents an implementation of MoDALAS for the EvoRally autonomous platform. Section 7.5 reviews related work. Finally, Section 7.6 provides a concluding summary for this chapter. 7.1 Overview The integration of machine learning into autonomous systems is potentially problematic for high- assurance [30], safety-critical [125] applications, particularly when training coverage is limited and fails to fully represent run-time environments. In addition to meeting functional requirements, safety-critical LESs must account for all possible operating scenarios and guarantee that all sys- tem responses are safe [212]. However, machine learning components, such as DNNs [73], are 110 associated with uncertainties concerning generalizability [103], robustness [230], and interpretabil- ity [112]. A rigorous software assurance [186] process is needed to account for these issues of uncertainty. This chapter proposes a goal-oriented modeling approach to address the assurance of LECs and manage the run-time adaptation of a cyber-physical LES. Although verification of an LEC can include steps to validate learning algorithms offline, additional online steps are needed to provide confidence that an LES will perform reliably and safely at run time [84, 193]. At design time, mathematical proofs can show that convergence criteria of a learning algorithm are satisfied, and empirical testing through cross validation can help estimate the generalizability of a trained LEC. However, when all conceivable situations cannot be included in training/validation data, methods are needed to dynamically monitor and assess the trustworthiness of LECs to determine whether assurance evidence collected at design time remains valid for previously unseen run-time conditions. More importantly, an LES must be able to determine when results from an LEC can, or cannot, be trusted to correctly respond to current conditions. This chapter presents a framework for Model-Driven Assurance for Learning-enabled Au- tonomous Systems (MoDALAS) through the use of goal-oriented models at run time [129]. With run-time self-assessment of its LECs, a goal model-driven LES can adapt to satisfy system require- ments when exposed to environmental uncertainty. MoDALAS supports the run-time monitoring of an LES with respect to functional goal models, assesses the trustworthiness of LECs, and adapts the LES to mitigate the use of LECs in untrusted contexts. Specifically, predictions of LEC behavior under various operating conditions are directly referenced by goal models that include potential conflicts and resolutions relating to different LEC behaviors (including untrusted contexts). 
More- over, MoDALAS enables seamless monitoring and management of both LECs and “traditional” system components (both hardware and software) that do not involve machine learning. MoDALAS establishes online system verification by the run-time monitoring of KAOS goal models [124] for functional requirements [106]. Controlled by a self-adaptive feedback loop [105], MoDALAS includes behavior oracles (introduced in Chapter 5) for each LEC. Analogous to a test 111 oracle [9], a behavior oracle predicts how an LEC will behave in response to given inputs. With MoDALAS, behavior oracles are used to assess the capability of LECs under varying run-time conditions. The resulting self-adaptive LES can then detect when its LECs are operating outside of performance boundaries and adapt accordingly, including possible transitions to fail-safe modes in extreme circumstances. By combining goal models with behavior oracles for an LES, developers can specify requirements concerning the confidence in an LEC and implement alternative strategies to ensure assurance claims are supported. A proof-of-concept prototype of MoDALAS is described for an autonomous rover LES equipped with a camera-based object detector LEC [100]. DNNs play two roles in this work: 1) a DNN provides object detection capabilities for an autonomous rover and 2) a separate DNN acts as a behavior oracle within MoDALAS to assess the object detector’s performance at run time. The object detector has been trained offline by a supervised training dataset, which includes mostly clear-weather examples. However, the autonomous rover must be assured to also function as expected in adverse weather. Without MoDALAS, the object detector would be used regardless of how closely run-time contexts match its trained experience, which could be detrimental for adverse environmental conditions (e.g., haze from a dust plume at a construction site). In contrast, MoDALAS determines when the rover’s object detector is operating outside of training coverage. Furthermore, MoDALAS enables the rover to adapt accordingly by entering a more cautious operating mode or, under extreme conditions, a fail-safe mode. 7.2 Constructing Goal Models for Autonomous Systems This section describes how modeling and specification technologies are applied at design time in MoDALAS to model high-level system goals for an LES. 7.2.1 Developing Assurance Cases MoDALAS assumes that assurance cases and goal models have been constructed and validated at design time through methods such as model checking [41]. In this chapter, assurance cases 112 are modeled using GSN, though alternative methods may also be used to describe an assurance case [186]. A simple example GSN model is shown in Figure 7.1, which claims a rover will navigate its environment safely (claim C1). Strategies are implemented to support claim C1 through offline validation (strategy S1) and run-time analysis (strategy S2). At design time, software testing, model checking, and formal analysis are conducted offline to support assurance claim C2, with results provided as evidence in solutions Sn1-Sn3. At run time, evaluation of a KAOS goal model for the rover provides evidence (solution Sn4) to demonstrate system requirements remain satisfied under changing run-time conditions (claim C3). As such, a GSN model provides context for our work, where evidence generated for assurance solution Sn4 is provided by the evaluation of KAOS goal models at run time. 
Figure 7.1: Example GSN assurance case for design-time and run-time validation of a rover. At design time, validation is supported by formal proofs, test results, and simulation (highlighted in green). At run time, verification is supported by the evaluation of a KAOS goal model (highlighted in blue). 7.2.2 Developing Goal Models This section describes how KAOS [124] goal models are used to hierarchically decompose high- level system goals into leaf-level system requirements in MoDALAS. Whereas the focus of GSN 113 is on software certification, KAOS goal modeling supports a hierarchical decomposition of high- level functional and performance objectives into leaf-level system requirements (i.e., goal-oriented requirements engineering [131]). KAOS goal models enable a formal goal-oriented analysis of how system requirements are interrelated as well as threats to requirements satisfaction. Goals represent atomic objectives of a system at varying levels of abstraction, with sub-goals refining and clarifying higher-level goals. Any event threatening the satisfaction of a specific goal is represented as an obstacle. Resolutions for obstacles can be specified by attaching additional sub-goals with alternative system requirements to the corresponding obstacle. Finally, agents (i.e., system components) are assigned responsibility for each system requirement. KAOS goal models enable developers to decompose the expected behavior of a software system, including information about threats to specific system requirements and how system requirements relate to each system component. Figure 7.2 shows a legend for reading KAOS goal models, while Figure 7.3 shows an example KAOS goal model for a rover that must navigate its environment. In this example, a rover is expected to sense objects in its environment and plan its trajectory around objects according to object type (e.g., when pedestrians are present (G10) or not (G9)). The rover implements a DNN- based object detector that can locate zero or more objects within a camera image and classify the type of each object [100]. The trustworthiness of the object detector depends on how similar the run-time environment is to its training experience. The rover also ensures there is sufficient braking power (G20) using sensor values from the tire pressure monitoring system and friction sensor. In Figure 7.3, utility functions (shown in yellow) are attached to the bottom of each goal. Utility functions enable the KAOS goal model to be evaluated at run time to determine goal satisfaction. Figure 7.2: Legend key for interpreting the KAOS goal model notation. 114 Figure 7.3: Example KAOS goal model. Goals (blue parallelograms) represent system objectives. The top-level goal (G1) is refined by sub-goals down to leaf-level system requirements. Agents (white hexagons) represent entities responsible for accomplishing requirements. Obstacles (red parallelograms) represent threats to the satisfaction of a goal (e.g., O1 and O2). In KAOS notation, any event that can threaten the satisfaction of a goal is represented as a KAOS obstacle. In Figure 7.3, obstacles O1 and O2 represent events in which the object detector is operating in a state not explored during design time. Obstacle O1 represents events where the object detector’s performance is degraded (i.e., statistical performance is less than ideal), and obstacle O2 represents events where the object detector is compromised (i.e., statistical performance is unacceptable). 
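To suggest how such a goal model can be represented for run-time evaluation, the following is a minimal sketch of goals with attached utility functions and AND-refinement. Aside from the utility function for G14 (object_dist >= 0.8), which appears in Figure 7.3, the goal-to-utility assignments and state values are hypothetical, and the sketch is not the MoDALAS parser.

# Minimal sketch of a KAOS goal graph with attached utility functions (illustrative only).
class Goal:
    def __init__(self, name, utility=None, children=None, refinement="AND"):
        self.name = name
        self.utility = utility          # callable over monitored state, or None
        self.children = children or []  # sub-goals refining this goal
        self.refinement = refinement    # "AND" or "OR" decomposition

    def satisfied(self, state):
        """Evaluate this goal bottom-up against the monitored system state."""
        if self.utility is not None:                  # leaf-level requirement
            return self.utility(state)
        results = [child.satisfied(state) for child in self.children]
        return all(results) if self.refinement == "AND" else any(results)

# Leaf goals with utility functions; G14 uses "object_dist >= 0.8" from Figure 7.3,
# while the assignment of "vel_x <= 0.3" to G13 is assumed for illustration.
g14 = Goal("G14", utility=lambda s: s["object_dist"] >= 0.8)
g13 = Goal("G13", utility=lambda s: s["vel_x"] <= 0.3)
g9 = Goal("G9", children=[g13, g14], refinement="AND")

state = {"object_dist": 1.2, "vel_x": 0.25}
print(g9.satisfied(state))  # True: both sub-goal utilities hold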
When the object detector is compromised, goal G16 is given as an obstacle resolution, where the rover is expected to perform a fail-safe procedure (e.g., halt movement). When the object detector is only degraded, goal G17 is given as a resolution, where the rover is expected to slow down and increase its minimum buffer distance from objects encountered in the environment. Additional KAOS obstacles and resolutions can be included, depending on the LEC, targeted behavior categories, and LES requirements. Similar to the GSN-Draw tool introduced in Chapter 6, a graphical tool is needed for developers 115 to construct KAOS models in a format that can be parsed for run-time evaluation. This chapter introduces KAOS-Draw for this purpose. KAOS-Draw has been implemented as a web application to enable the construction of a KAOS model. KAOS elements can be selected and instantiated into a “drawing” pane. Instantiated elements can then be linked together by any supported relationship type to form a connected graph. Constructed KAOS models can be exported into a parsable XML format. To demonstrate, Figure 7.4 shows a KAOS model being constructed with KAOS-Draw. Figure 7.4: Screenshot of the KAOS-Draw user interface. KAOS-Draw enables the creation of standard KAOS models that can be imported into MoDALAS. 7.2.3 Relaxing Goal Models In this chapter, the RELAX language [222] is used to explicitly specify uncertainties affecting an LES. RELAX is a requirements specification language that enables developers to identify, evaluate, and “relax” brittle requirements to address and mitigate uncertainty factors during run time. During requirements engineering, developers may describe system behaviors with strict and highly constrained properties. However, due to the numerous sources of uncertainty potentially impacting an LEC, it may not always be possible to strictly satisfy all requirements. RELAX 116 allows for non-invariant requirements be temporarily unsatisfied due to uncertain environmental and onboard conditions. RELAX operators add flexibility to the conditions for which a given requirement is considered satisfied, thereby adding the notion of degrees of satisfaction (i.e., “satisficement” [124, 135]) in a goal model. For example, RELAX operators such as AS CLOSE AS POSSIBLE can be used to reduce the brittleness of a given goal to three RELAX elements (i.e., ENV, MON, and REL) [222]. Table 7.1 provides examples of RELAX operators, with the names of the operators provided in the first column and corresponding descriptions provided in the second. RELAX also includes traditional temporal logic operators, but they have been mapped to fuzzy logic operators instead [222]. Table 7.1: Example of RELAX operators and uncertainty factors [63, 222] Operator Description Modal operators SHALL Requirement must hold. MAY...OR Requirement specifies one or more alternatives. Temporal operators EVENTUALLY Requirement that must hold eventually. UNTIL Requirement must hold until a future position. BEFORE/AFTER Requirement must hold before or after a particular event. AS EARLY AS POSSIBLE Requirement should hold as soon as possible. AS LATE AS POSSIBLE Requirement should be delayed as long as possible. AS CLOSE AS POSSIBLE TO Requirement happens repeatedly, though frequency may be relaxed. [frequency t] Ordinal operators AS FEW/MANY AS POSSIBLE Requirement specifies a countable quantity, though the exact count may be relaxed. 
AS CLOSE AS POSSIBLE TO Requirement specifies a countable quantity, though the exact count [quantity q] may be relaxed. Uncertainty factors Description ENV Defines a set of properties that define the system’s environment. MON Defines the set of properties that can be monitored by the system. REL Defines the relationship between ENV and MON properties. DEP Defines dependencies between (relaxed and invariant) requirements. 117 RELAX enables developers to create more flexible requirements to ensure robustness against uncertainties. However, modifications to textual requirements require run-time evaluation to verify a threshold of satisfaction. During run time, LESs monitor system values and use utility functions to assess whether system performance and/or configuration satisfy the current goal model. Tradi- tionally, utility functions return a Boolean value (i.e., 0 or 1) based on goal satisfaction. To address run-time uncertainty, RELAX operators are mapped to fuzzy logic semantics [79, 231]. Fuzzy logic enables developers to specify a partially satisfied goal, where a utility function can return real values ranging from 0 (i.e., not satisfied) to 1 (i.e., satisfied). A goal that returns a partially satisfied utility function is known as satisficed [124], permitting system requirements to be verified as satisfied with a degree of fuzziness. Since fuzzy logic allows utility functions to return real numbers, goal refinement (i.e., AND and OR goal decompositions) for parent goal evaluation must be redefined. In the parent goal of RELAX-ed goals, the goal’s utility function is evaluated by applying mathematical operations (e.g., min and max) on sub-goals’ satisficement values. While several popular approaches to fuzzy logic evaluation exist, this work uses Zadeh fuzzy operators [111], a common convention for resolving fuzzy logic (e.g., conjunctions, disjunctions, and implications). Using Zadeh fuzzy operators, when the sub-goals are related by an OR relationship, the maximum value of all sub- goals’ utility functions determine the evaluation of the parent goal. If the sub-goals are part of an AND relationship, then the minimum values determine the parent goal instead. A parent goal may be converted to Boolean satisfaction if the evaluation of the value of the RELAX-ed sub-goals exceeds a specified threshold (e.g., 0.5). To illustrate an example of a RELAX specification, consider a component of an autonomous vehicle that detects obstacles. A traditional requirement may be as follows: S1: The system SHALL detect obstacles within 10 meters. This requirement represents an ideal situation. However, instead of a rigid requirement, a developer may wish to relax the requirement to account for uncertainty factors (e.g., speed variance of two vehicles, the sensitivity of the sensors, etc.). For example, S1 may be modified to the 118 expression S1′ if the vehicle is traveling below 10 meters per second, since there is more time for the system to react to detected obstacles. S1′: The system SHALL detect obstacles AS CLOSE AS POSSIBLE to 10 meters. ENV: location of obstacle MON: obstacle detection system REL: system detects obstacle Using S1′, the system can still handle the requirement of “detect obstacle within 10 meters”, and also support a more flexible requirement should the system detect an obstacle within 8 meters. 
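To make the fuzzy evaluation concrete, the following is a minimal sketch of Zadeh-style combination of RELAX-ed sub-goal satisficement values; the sub-goal values and the 0.5 conversion threshold are illustrative.

# Sketch of Zadeh fuzzy operators for combining RELAX-ed sub-goal satisficement.
def and_refinement(*satisficements):
    """AND decomposition: the parent is only as satisfied as its weakest sub-goal."""
    return min(satisficements)

def or_refinement(*satisficements):
    """OR decomposition: the parent is as satisfied as its strongest sub-goal."""
    return max(satisficements)

def to_boolean(satisficement, threshold=0.5):
    """Convert a fuzzy satisficement value back to Boolean goal satisfaction."""
    return 1.0 if satisficement > threshold else 0.0

# Example: two RELAX-ed sub-goals that are each only partially satisficed.
sub_goal_a = 0.8
sub_goal_b = 0.6
parent = and_refinement(sub_goal_a, sub_goal_b)  # 0.6 under Zadeh conjunction
print(parent, to_boolean(parent))                # 0.6 1.0 (exceeds the 0.5 threshold)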
Specifically, developers can trade-off rigid Boolean utility functions (i.e., the system detected obstacle within 10 meters or not) for fuzzy logic utility functions (e.g., degree of satisficement from 0 to 1) under RELAX. As such, the system can adapt and temporarily trade-off non-critical requirements to maintain the satisfaction of more critical requirements. To address environmental uncertainties, developers may use RELAX to temporarily allow spe- cific requirements to be relaxed within acceptable ranges. During the design step, the developers identify non-invariant requirements that may be relaxed. Next, developers specify specific require- ments in terms of goal model, including various RELAX operators for goals that face uncertainty factors. During this step, developers must also define utility value thresholds for goals that convert fuzzy logic utility function values to Boolean utility function values. Figure 7.5 shows a modified version of the KAOS goal model from Figure 7.3 where RELAX operators are used in goals G20, G21, and G22. In the new goal model, G20 has been modified with the RELAX language to allow partial goal satisfaction. The fuzzy logic utility functions of the RELAX-ed goal models are modified to return a real number ranging from 0 to 1 to represent the degree of satisficement for the specific goal. Specifically, G20 is considered satisfied if the threshold value of both the tire pressure sensor monitor and the friction monitor are satisfied to the degree of 0.5. To demonstrate the use of RELAX with an autonomous rover, consider goal G20 in Figure 7.5, 119 where the rover ensures that there is sufficient braking power for the rover to stop should it detect a potential collision. Prior to being RELAX-ed, G20 returns a Boolean value to parent goals indicating whether there is enough braking power or not. G21 and G22 return Boolean values depending on whether or not the specific sensor values are satisfied. In order to add flexibility and account for environmental uncertainties, G21 and G22 are RELAX-ed to account for uncertainty (e.g., if the visibility is poor and the rover is traveling under 10 m/s). To check if G20 is satisfied, the system ensures that i) the friction sensor detects a friction rate of 2 Newtons, with an acceptable range of −0.5 Newtons and ii) the tire pressure is within 221 kPa (kilopascals), with an acceptable range of ±14 kPa. Fuzzy logic is used to express the degree of satisficement in the RELAX utility functions. For example, if the system detects that the value of a RELAX-ed goal is satisfied (e.g., tire pressure is at 221 kPa), then the corresponding utility function returns 1. The value returned reduces to 0 as the value reaches the lower-bound of the acceptable range. Figure 7.5: A RELAX-ed version of the KAOS goal model shown in Figure 7.3. Goals G20, G21, and G22 (shown as green) have been RELAX-ed and use fuzzy logic to express a degree of goal satisficement. 120 Figure 7.6 shows an example of the range of values returned for G21. The utility function used to evaluate G21 is shown in Figure 7.7. G21 returns 1 if the value of the friction sensor reads 2 Newtons. The returned value linearly decreases as the sensor value reduces to the lower-bound of the acceptable range. In contrast, Figure 7.8 shows an example of the range of values returned in G22. Unlike G21 where the goal has an acceptable range below a set value, G22 allows for satisficement in both directions (i.e., a triangular fuzzy logic function is used). 
The utility function used to derive the return value for G22 is shown in Figure 7.9. It returns 1 when the tire pressure monitor reads 221 kPa. Since the acceptable range of the utility function is defined as 221 ± 14, the value returned by the utility function decreases linearly to 0 as it approaches 207 kPa or 235 kPa, forming a triangular relationship. To obtain the return value of G20's utility function, the system evaluates the utility functions of G21 and G22 in its refinement and compares the combined result to a threshold defined by the domain expert. Figure 7.5 shows the relationship between G20, G21, and G22, where the sub-goals form an AND relationship. The utility function used to evaluate G20's value is defined as:

utility(G20) = 1.0 if min(utility(G21), utility(G22)) > threshold, and 0.0 otherwise.

While this chapter demonstrates the use of RELAX to explicitly specify uncertainty, MoDALAS can also accommodate other types of requirement specification languages and corresponding utility functions used to address uncertainty, such as FLAGS [8], probabilistic utility functions, etc.

7.3 Goal Model-Driven Self-Adaptation of Learning-Enabled Systems

This section describes the proposed MoDALAS framework for goal-based self-adaptation of an LES. Figure 7.10 shows a DFD of the framework. Circles represent computational steps, boxes represent external entities, directed arrows show the flow of data, and persistent data stores are shown within parallel lines. Design-time steps (green) include the construction of an assurance case, a goal-oriented requirements model of the LES, and a behavior oracle for each LEC. Run-time steps (blue) implement a MAPE-K feedback loop driven by the models constructed at design time.

Figure 7.6: Range of values returned by the utility function for evaluating the friction sensor (G21 in Figure 7.5) using the RELAX language with fuzzy logic.

function friction-sensor-utility(measured, desired=2, bounds=0.5)
    if (measured >= desired) then return 1.0
    else if (measured <= desired - bounds) then return 0.0
    else return (measured - (desired - bounds)) / bounds

Figure 7.7: Utility function for evaluating friction sensor satisficement (G21 in Figure 7.5).

Figure 7.8: Range of values returned by the utility function for evaluating the tire pressure monitor system (G22 in Figure 7.5) using the RELAX language with fuzzy logic.

function tire-pressure-monitor-utility(measured, desired=221, bounds=14)
    if ((measured <= desired - bounds) or (measured >= desired + bounds)) then return 0.0
    else if (measured <= desired) then return (measured - (desired - bounds)) / bounds
    else return ((desired + bounds) - measured) / bounds

Figure 7.9: Utility function for evaluating the tire pressure monitor satisficement (G22 in Figure 7.5).

Although MoDALAS is platform-independent, to aid the reader, the following descriptions include an example of an autonomous rover with a learning-enabled object detector. Specific implementation details on how MoDALAS is applied to the autonomous rover are provided in Section 7.4. Each step in Figure 7.10 is described in detail as follows.

Figure 7.10: High-level data flow diagram of MoDALAS. Processes are shown as circles, external entities are shown as boxes, and persistent data stores are shown within parallel lines. Directed lines between entities show the flow of data.

7.3.1 Preparation at Design Time

To prepare the MoDALAS framework for execution, steps need to be taken to properly initialize the framework.
Configuration files must be prepared to instantiate the autonomic manager (Step D1), and behavior oracles must be constructed prior to run time execution (Step D2). Step D1. MAPE Instantiation An autonomic manager (implemented as a MAPE-K loop) is instantiated to manage adaptations of the LES. To determine the system state and evaluate KAOS goal models at run time, the 123 autonomic manager must be configured to monitor the same system attributes referenced by KAOS goal models. KAOS goal models are parsed, and utility functions are extracted from each KAOS element. MoDALAS requires that KAOS goal models have been converted into a machine parsable file format (e.g., XML) that includes attributes for each KAOS element and its associated utility function. A set of utility parameters is then compiled by identifying each unique variable referenced by a utility function. Since utility parameters may refer to abstract concepts, a manual mapping must be made by the user to link each utility parameter to a platform-specific property of the LES. For example, for the utility function object_dist >= 0.8 in Figure 7.3, goal G14, the utility parameter object_dist refers to the buffer distance between the rover and any object in the environment. It is the responsibility of the autonomic manager to link this abstract parameter to a real, platform- specific property of the rover. Configuration files are generated by Step D1 to initialize the monitor processes of the MAPE-K loop with references to the platform-specific properties to observe. Step D2. Constructing Behavior Oracles To monitor and assess the trustworthiness of LECs at run time, MoDALAS leverages behavior oracles generated by Enlil [127] for each individual LEC. Behavior oracles are implemented as DNNs to infer behavior of each LEC when exposed to new forms of environmental uncertainty under simulation. For example, when a rover implements a learning-enabled object detector that has been trained only in clear weather, Enlil can be used to simulate adverse weather conditions and model the capability of the object detector under a variety of adverse conditions. The resulting behavior oracle can then predict different behavior categories for the object detector when presented with sensor data under various weather conditions. These categories are application-specific and must be defined according to the user for the given task and LECs involved. In essence, behavior oracles serve as a utility function for assessing the type of LEC behavior to expect under varying run-time conditions. The KAOS goal model in Figure 7.3 reflects that three different behavior categories have been specified for the behavior oracle of a rover’s object detector. Two of these (behavior_cat == 1 124 and behavior_cat == 2) are attached to obstacles O1 and O2, respectively. The third is the default and not explicitly shown in Figure 7.3 (behavior_cat == 0) . (The number of behavior categories depends on the granularity and spectrum of available behaviors and also the number of alternative resolutions required to satisfy system requirements.) Categories are determined by assessing the object detector’s performance under a variety of adverse environmental contexts in simulation. In this example, the object detector’s recall is measured when a newly-introduced adverse condition is present versus when it is not. The change in recall (𝛿𝑟𝑒𝑐𝑎𝑙𝑙 ) is then used to measure the effect on object detector’s performance. 
The value of 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 is computed statistically by measuring the object detector’s recall for a set of validation images with and without exposure to the given environmental phenomena. Table 7.2 provides a description of each behavior category reflected in Figure 7.3. When 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 < 5%, the given context is considered to have “little impact” on object detection (Category 0). When 5% <= 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 < 10%, the object detector is considered “degraded” (Category 1). Finally, when 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 > 10%, the object detector is considered “compromised” (Category 2). Table 7.2: Behavior categories for an object detector. Category Classification Definition 0 “little impact” 0% ≤ 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 < 5% 1 “degraded” 5% ≤ 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 < 10% 2 “compromised” 10% ≤ 𝛿𝑟𝑒𝑐𝑎𝑙𝑙 Enlil automatically assesses an LEC by generating unique contexts of simulated environmental phenomena (via evolutionary computation [52]) to uncover examples that lead to each targeted behavior category. Figure 7.11 shows a scatter plot generated by Enlil when simulating dust clouds. Each point represents a different dust cloud context with the resulting recall for the object detector LEC. Colors correspond to the observed behavior category for each respective point (i.e., categories 0, 1, and 2 are green, yellow, and red, respectively). Data collected during this assessment phase is used by Enlil to train a behavior oracle that can map LEC inputs to expected behavior categories (i.e., model inference). 125 Behavior oracles created in Step D2 are used at run time to predict the resulting behavior category of an LEC for any given run-time context. For the example object detector, inputs to the behavior oracle are the same camera inputs given to the object detector. Output from the behavior oracle includes a description of the perceived context of the environmental condition and the inferred behavior category for the object detector. Figure 7.12 shows three behavior categories to represent different degrees of impact dust clouds can have on an object detector. Effectively, this information is used to assess the trustworthiness of the object detector. 7.3.2 Self-Adaptation at Run Time A MAPE-K loop autonomic manager is executed at run time to monitor and reconfigure the managed LES with respect to the models constructed at design time. Responsibilities include assessing the current state of the LES, predicting the capability of LECs via behavior oracles, determining when system requirements are not satisfied by referencing KAOS goal models, and planning adaptations to ensure mitigating actions are taken to maintain requirements satisfaction. Step R1. Monitor In order to inform self-adaptations, monitor processes observe and record relevant attributes of the managed LES, which includes executing behavior oracles from Step D2. In KAOS notation, agents indicate which system components are responsible for each system requirement (e.g., A1-A4 in Figure 7.3). Specific attributes of a system component are monitored when referenced by utility functions in the models constructed at design time (Step D1). Monitor processes are responsible for observing functional system components (e.g., controllers, mechanical parts, physical sensors, etc.) as well as behavior oracles for LECs. For example, when using a behavior oracle for a camera- based object detector, the behavior oracle is executed for each new camera input to predict the impact of run-time conditions on object detector performance. 
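As a concrete reading of Table 7.2, the following sketch computes the recall change and maps it to a behavior category, assuming δrecall denotes the drop in recall (in percentage points) caused by the adverse phenomenon; the measured recall values are hypothetical.

# Sketch of the behavior-category thresholds from Table 7.2 (values in percent).
def delta_recall(recall_baseline, recall_adverse):
    """Drop in recall (percentage points) attributable to an adverse phenomenon."""
    return (recall_baseline - recall_adverse) * 100.0

def behavior_category(d_recall):
    """Map a recall drop to the categories used by the behavior oracle."""
    if d_recall < 5.0:
        return 0   # "little impact"
    elif d_recall < 10.0:
        return 1   # "degraded"
    else:
        return 2   # "compromised"

# Hypothetical measurements: recall of 0.91 in clear weather vs. 0.83 in simulated dust.
print(behavior_category(delta_recall(0.91, 0.83)))  # 1 ("degraded": an 8-point drop)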
Through the use of utility functions, MoDALAS enables LECs to be monitored in a similar manner to traditional system components, using behavior oracles to determine whether or not LECs can be trusted in their run-time state. 126 Figure 7.11: Scatter plot of object detector recall when exposed to simulated dust clouds from Enlil. Each point represents a different dust cloud context with the corresponding density and intensity. Green, yellow, and red points correspond to behavior categories 0, 1, and 2, respectively. (a) “light” dust cloud resulting in a category 0 classification (b) “medium” dust cloud resulting in a category 1 classification (c) “heavy” dust cloud resulting in a category 2 classification Figure 7.12: Example behavior oracle input/output for an image-based object detector LEC. Input is identical to the input given to the LEC. Output is a “perceived context” to describe the environmental condition and a “behavior category” to describe the expected LEC behavior. Examples of behavior categories 0, 1, and 2 are shown in (a), (b), and (c), respectively. 127 Step R2. Analyze KAOS goal models of the LES are evaluated in Step R2.a to determine if adaptation is needed to resolve violated system requirements. Utility functions from the KAOS goal model are extracted, and a logical expression is formed via top-down graph traversal of the KAOS goal model. For example, Figure 7.13 shows the logical expression parsed from the KAOS goal model in Figure 7.3. Variables in this expression are substituted with corresponding values recorded by Step R1, and the entire expression is evaluated to determine satisfaction of the KAOS goal model. If the logical expression is satisfied, then no adaptation is needed. However, in the event that the resulting evaluation is unsatisfied, then the type of adaptation is determined based on the set of violated utility functions. planner_status == “enabled” AND gps_status == “enabled” AND has point_cloud AND lidar_status == “enabled” OR camera_status == “enabled” AND has object_types AND AND “pedestrian” ∈ object_types AND object_dist ≥ 0.5 AND vel_x ≤ 0.4 OR “pedestrian” ∉ object_types AND object_dist ≥ 0.8 AND vel_x ≤ 0.3 AND IF behavior_cat == 1 THEN object_dist ≥ 1.0 AND vel_x ≤ 0.2 AND IF behavior_cat == 2 THEN vel_x == 0 Figure 7.13: Logical expression parsed from KAOS model in Figure 7.3. Methods for adaptation are implemented as adaptation tactics [37], which are stored in a repository accessible by the MAPE-K loop (example tactic in Figure 7.14). Each tactic comprises a precondition, post-condition, and set of actions to perform on the managed LES. Preconditions and postconditions for tactics reference the satisfaction of KAOS goals/obstacles, where preconditions are defined by the utility functions for KAOS obstacles and post-conditions are defined by the 128 utility functions associated with the resolution goals for KAOS obstacles. Step R2.b retrieves a tactic from the repository with preconditions that most closely match (e.g., via logical implication) the current evaluation of the KAOS goal model. For example, in the event that O1 is satisfied and goal G17 is not satisfied, the tactic in Figure 7.14 with a precondition matching the utility function for O1 is selected. The post-condition in Figure 7.14 includes the utility functions for G17 and its sub-goals (G18 and G19). The actions associated with the tactic are platform-independent operations required to satisfy the post-condition. 
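The following is a minimal sketch of how Step R2.b might match tactics against the current evaluation. The tactic contents echo Figure 7.14 and the fail-safe behavior described for obstacle O2, but the data layout and first-match selection are simplifying assumptions rather than the MoDALAS implementation.

# Illustrative repository of adaptation tactics keyed by their preconditions.
tactics = [
    {
        "name": "Reconfigure to Cautious Mode",
        "precondition": lambda state: state["behavior_cat"] == 1,    # KAOS O1
        "actions": [("set", "object_dist", 1.0), ("set", "vel_x", 0.2)],
        "postcondition": lambda state: (state["object_dist"] >= 1.0
                                        and state["vel_x"] <= 0.2),  # G17 (G18, G19)
    },
    {
        "name": "Enter Fail-Safe Mode",
        "precondition": lambda state: state["behavior_cat"] == 2,    # KAOS O2
        "actions": [("set", "vel_x", 0.0)],
        "postcondition": lambda state: state["vel_x"] == 0.0,        # G16
    },
]

def select_tactic(state):
    """Return the first tactic whose precondition matches the violated obstacle."""
    for tactic in tactics:
        if tactic["precondition"](state):
            return tactic
    return None  # goal model satisfied; no adaptation needed

current_state = {"behavior_cat": 1, "object_dist": 0.8, "vel_x": 0.3}
selected = select_tactic(current_state)
print(selected["name"] if selected else "no adaptation")  # Reconfigure to Cautious Mode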
When multiple tactics fit the given preconditions, the tactic with higher priority is chosen, where priorities can be manually assigned and/or adjusted based on the success of subsequent goal-model evaluations in future iterations of the MAPE-K loop. tactic “Reconfigure to Cautious Mode” precondition: (KAOS O1) behavior_cat == 1 actions: set object_dist ← 1.0 set vel_x ← 0.2 post-condition: (KAOS G17 ∧ G18 ∧ G19) object_dist ≥ 1.0 AND vel_x ≤ 0.2 Figure 7.14: Example tactic for reconfiguring a rover to a “cautious mode”. Preconditions and post-conditions refer to KAOS elements and utility functions (see Figure 7.3). Actions are abstract operations to achieve the post-condition. Step R3. Plan After a platform-independent adaptation tactic has been selected in Step R2, a platform-specific procedure is generated for implementing the actions associated with the selected tactic. For example, a platform-independent action to turn the autonomous platform 15◦ will be translated into the corresponding operations for a wheeled rover versus a legged-robot. Additionally, actions may be modified to consider the dynamic state of the system during execution of a tactic (e.g., actions may change or be preempted to account for emergent mechanical issues in a rover) [169]. 129 Step R4. Execute After an adaptation procedure has been planned, Step R4 is responsible for interfacing with and reconfiguring the managed LES. Depending on the nature of the adaptation and the current system state, different methods of adaptation may be considered to ensure the managed LES functions correctly while safely transitioning into its new configuration (e.g., one-point, guided, or overlap adaptations) [233]. Since adaptations may not be safe to perform in all states of the LES, Step R4 is responsible for determining quiescent states where the LES can be safely reconfigured (e.g., prevent halting a rover during a high-speed turn) [117]. 7.4 Demonstration With Autonomous Rover To illustrate the operation of MoDALAS, we consider a scenario where an EvoRally autonomous rover (see Section 2.5) is used within a construction site. Compared to autonomous automobiles op- erated on public roads, autonomous construction vehicles operate within relatively tight behavioral constraints and physical areas, leading to rapid growth in this market segment [149]. In addition to large earth-moving vehicles, smaller rovers are used to carry tools and materials for construction workers, periodically record the progress of construction, and provide surveillance of the site past operation hours [146]. For such rovers, detecting and avoiding objects in the environment, includ- ing pedestrians and other vehicles, are safety-critical requirements [217]. Increasingly, machine learning techniques have been used to provide object detection in such applications [49]. However, ensuring requirements satisfaction of learning-enabled autonomous rovers is a challenging task, as transient environmental conditions (i.e., rainfall or dust clouds) can impede object detection and potentially lead to serious accidents and even fatalities. To demonstrate the operation of MoDALAS in the construction site application domain, we have implemented a prototype and integrated it into the software for an autonomous robot in our laboratory. The EvoRally rover’s software infrastructure is based on ROS [176], a popular middleware platform for robotics. 
A ROS implementation comprises a set of processes, called ROS nodes, that communicate with other ROS processes using a publish-subscribe mechanism called ROS topics. 130 Multiple ROS nodes can publish messages on a ROS topic, and multiple ROS nodes can subscribe to the same ROS topic. Commonly, and in our case, ROS is implemented atop the Linux operating system with ROS nodes realized as Linux processes. For a non-trivial robot such as our rover, this design produces an intricate software infrastructure that can be visualized with a ROS graph. The full ROS graph for our rover software comprises over 30 nodes that implement tasks such as processing of sensor data, localization, path planning, and generating the corresponding commands to control the vehicle. Over 200 ROS topics are used to convey raw and preprocessed sensor data, exchange of information among controller nodes, and delivers commands to actuators for throttle control, steering and braking. Figure 7.15 shows an (elided) ROS graph of the MAPE-K loop implemented for the rover. The \knowledge ROS node is a process that manages access to the data stores depicted in Figure 7.10. Data stores for goal models and adaptation tactics are populated at startup time and remain static during operation. However, the managed system state data store is highly dynamic, comprising sensor readings and other state information that are updated continually. The MAPE-K monitor step (Step R1 in Figure 7.10) is implemented as a collection of ROS nodes (e.g., \monitor_lidar, \monitor_wheels, \monitor_camera) that receive raw sensor data collected by hardware-specific ROS nodes. These nodes preprocess data streams and publish results to the \update_state topic in order to modify the managed system state. Examples include direct measurements (e.g., wheel speed), derivative measurements (e.g., rate of battery drain), and operational status of hardware components (e.g., delays in GPS localization reporting). The remaining three MAPE-K steps (Steps R2-R4) are implemented as singleton ROS nodes, respectively, \analyze, \plan, and \execute. KAOS goal model evaluation by the \analyze node is triggered by state changes published on the \state_change ROS topic. If the KAOS goal model is not satisfied and an adaptation is necessary, then the \analyze node determines the type of adaptation needed and relays the adaptation type to the \plan node via the \plan_action topic. The \plan node retrieves actions for the corresponding tactic from the knowledge base and forwards an adaptation procedure to the 131 \execute node. The \execute node directly interfaces with and reconfigures components of the target platform. Figure 7.15: Elided ROS graph for MAPE-K loop in rover software. ROS nodes shown as green ellipses and ROS topics as yellow boxes. Arrows indicate data flow. 7.4.1 Assessing Visual Inputs With Behavior Oracles In our proof-of-concept demonstration, we use images from the mounted cameras atop the rover for object detection [100, 141] and triangulation from stereo vision [83]. A three-dimensional point cloud [188] is generated by fusing stereo camera triangulations and lidar sensor readings. As shown in Figure 7.15, both the \monitor_camera and \behavior_oracle nodes receive raw camera data published from onboard cameras. The \monitor_camera node processes camera data and delivers relevant information (e.g., frame rate, etc.) to the \knowledge node. 
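A minimal sketch of such a monitor node is given below; it estimates the camera frame rate from consecutive image messages and publishes the result toward the knowledge base. The \update_state topic follows Figure 7.15, while the image topic name and message format are assumptions for illustration, not the rover's actual implementation.

#!/usr/bin/env python
# Illustrative sketch of a \monitor_camera-style ROS node.
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String

class CameraMonitor(object):
    def __init__(self):
        self.last_stamp = None
        self.state_pub = rospy.Publisher("/update_state", String, queue_size=10)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image)  # assumed topic

    def on_image(self, msg):
        """Derive the frame rate from consecutive image arrivals and report it."""
        now = rospy.Time.now()
        if self.last_stamp is not None:
            interval = (now - self.last_stamp).to_sec()
            if interval > 0.0:
                frame_rate = 1.0 / interval
                self.state_pub.publish(String(data="camera.frame_rate=%.1f" % frame_rate))
        self.last_stamp = now

if __name__ == "__main__":
    rospy.init_node("monitor_camera")
    CameraMonitor()
    rospy.spin()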
For example, lack of input or a slow frame rate might indicate a problem with one or both cameras, thus necessitating an adaptation. The \behavior_oracle node processes camera images online with the behavior oracle DNN that was trained offline by Enlil for model inference. (The DNN is initialized at startup time from a configuration file.) Specifically, the \behavior_oracle node infers the behavior of the onboard object detector LEC by evaluating input images given to the object detector. The behavior 132 category produced by the \behavior_oracle node is published on the \category ROS topic, which is monitored by the \monitor_oracle node. At run time, if the \monitor_oracle node reports any change in the behavior category, then the \analyze node will execute to address the situation, as follows. 7.4.2 Run-Time Adaptation We consider a scenario in which the behavior oracle triggers run-time adaptations to the rover. At design time, the behavior oracle was created to account for three types of adverse environmental conditions that can impact object detection: rainfall, dust clouds, and lens flares (where a bright light source obscures part of the image). Additional environmental phenomena can be included by introducing them into the simulation environment used by Enlil. Figure 7.16 shows examples of each simulated phenomenon, with different levels of intensity, and the resulting object detector behavior category inferred by the behavior oracle. Referencing the behavior categories in Table 7.2, examples in columns (i), (ii), and (iii) are expected result in Categories 0, 1, and 2, respectively (i.e., “little impact,” ”degraded,” and “compromised”). Scenario 1. Dust Clouds To demonstrate MoDALAS in practice, we explore a scenario where the autonomous rover navigates a construction site to periodically record progress on the project at different locations. The rover begins with behavior_category = 0. As the rover approaches a construction worker, a dust cloud is produced by a dump truck leaving the construction area. When the \behavior_oracle node receives images from the rover’s cameras, the dust-obscured images are evaluated by the behavior oracle DNN, which infers that the object detector will be degraded by the current environment. Thus, the \behavior_oracle node publishes behavior_category = 1 on the \category topic. The \monitor_oracle node forwards this change to the \knowledge node. The state change triggers execution of the \analyze node to evaluate the logical expression (Figure 7.13) of the KAOS goal model depicted in Figure 7.3. Upon evaluation, the \analyze node determines that adaptation is 133 (a) Unaltered Image of Construction Site i. “little impact” ii. “degraded” iii. “compromised” (b) Simulated Rainfall i. “little impact” ii. “degraded” iii. “compromised” (c) Simulated Dust Cloud (Haze) i. “little impact” ii. “degraded” iii. “compromised” (d) Simulated Lens Flare Figure 7.16: Example of object detection at a construction site. A pedestrian is detected by an image-based object detector (a). New environmental phenomena are introduced in simulation, such as (b) rainfall, (c) dust clouds, and (d) lens flares. Enlil explores different contexts to find examples that have (i) little impact, (ii) degrade, or (iii) compromise the object detector’s ability to achieve validated design-time performance. 134 necessary, since the pre-condition associated with KAOS obstacle O1 applies but the resolving goal G17 is not satisfied. 
The tactic in Figure 7.14 is selected and forwarded to \plan, which finds that the tactic’s actions involve reducing the maximum velocity of the rover and increasing the buffer distance between the rover and objects in the environment. The \plan node then maps abstract tactic actions to a platform-specific procedure. Our rover uses a Timed Elastic Band (TEB) [183] planner provided with ROS to compute trajectories around objects in the environment. The abstract actions in Figure 7.14 can be accomplished by setting the \min_obstacle_dist and \max_vel_x parameters of the TEB planner. Finally, the platform-specific procedure is forwarded to the \execute node, which is responsible for executing the reconfiguration of the rover. As a result, the rover moves slower and takes a wider berth around objects in the environment while the dust cloud is present. Eventually, as the dust settles, the behavior oracle determines that the new environmental condition is expected to have little impact on the object detector (i.e., Category 0). Through the same sequence of steps described above, the \analyze node is triggered to execute by the state change. The \analyze node then determines that KAOS obstacle O1 no longer applies and the KAOS goal model is satisfied. The \analyze node publishes a message to notify the \plan node that the selected tactic is no longer applicable. The \plan node then triggers the \execute node so that the previous operating parameters are restored (i.e., reset the minimum object distance and maximum rover velocity to their prior values). Scenario 2. Lens Flare In a second scenario, the rover is navigating around a parked vehicle. Suppose the reflection of the sun on the windshield of the vehicle causes a momentary lens flare that blinds the cameras. The behavior oracle processes the camera image and determines that the impacted images will compromise the ability of the rover’s object detector to perform as validated at design time (i.e., Category 2). The \monitor_oracle node publishes behavior_category = 2 to \knowledge, triggering the \analyze node similar to the dust cloud scenario. The KAOS goal model is evaluated, 135 but this time obstacle O2 applies and its resolving goal G16 is not satisfied. A tactic with a pre- condition matching O2 and post-condition matching G16 is selected and forwarded to the \plan node. The actions associated with the selected tactic are to halt the rover, and the \plan node generates a procedure to set the rover’s maximum velocity to zero. Finally, the adaptation procedure is forwarded to the \execute node to update the rover accordingly, thereby transitioning it to a fail-safe state. When the lens flare eventually disappears (e.g., due to changing angle of the sun or cloud movements), the \monitor_oracle node publishes behavior_category = 0. The change in behavior category triggers the \analyze node to re-evaluate the KAOS goal model. The \analyze node then determines that O2 no longer applies, subsequently triggering the \plan and \execute nodes to reset the selected tactic and restore the rover to its original configuration. Scenario 3. Relaxation of Goals In a third scenario, we explore how RELAX may be used to explicitly deal with uncertainty on the rover. Suppose the rover uses the KAOS goal model in Figure 7.3, where the KAOS goal model is initially not RELAX-ed to address run-time uncertainties. Figure 7.17 shows example utility values published by the rover during operation. 
Table 7.3 shows the resulting goal model evaluation to be unsatisfied, since the original KAOS goal model expects a friction rate of 2 and a tire pressure of 221 kPa (individual goal results of G21 (0.0) and G22 (0.0)). In this instance, an unsatisfactory evaluation of the goal model would trigger MoDALAS to execute an adaptation tactic to mitigate brake failure (e.g., notifying a human supervisor for intervention). However, the rover may be able to operate under the given values, as the deviation is insignificant (e.g., inaccurate readings due to sensor noise). Thus, if we use the RELAX-ed KAOS goal model in Figure 7.5, the system uncertainties may be tolerated, avoiding the need for an immediate mitigation strategy that could negatively impact performance. Table 7.4 shows the resulting evaluation of the same rover configuration from Figure 7.17, but instead using the RELAX-ed goal model from Figure 7.5. Using fuzzy logic, the new model is tolerant to sensor values within an accepted deviation range (individual goal results of G21 (0.799) and G22 (0.95)).

utility_params = {
    'behavior_cat': 0,
    'camera_status': 'enabled',
    'friction_val': 1.9,
    'gps_status': 'enabled',
    'lidar_status': 'enabled',
    'object_dist': 3.0,
    'object_types': ['pedestrian', 'car', 'car'],
    'planner': 'enabled',
    'point_cloud': True,
    'tire_pressure': 31.9,
    'vel_x': 0.1,
}

Figure 7.17: Example set of utility parameter values for Scenario 3.

Table 7.3: Example evaluation of non-RELAX-ed KAOS goal model (Figure 7.3)
Goal #                Evaluation Result
G1                    1.0
G2                    1.0
G3                    1.0
G4                    1.0
G5                    1.0
G6                    1.0
G9                    0.0
G10                   1.0
G12                   1.0
G13                   1.0
G14                   1.0
G16                   0.0
G18                   1.0
G19                   1.0
G21                   0.0
G22                   0.0
O1                    0.0
O2                    0.0
Total Evaluation      0.0
Total Satisficement   0.0

Table 7.4: Example evaluation of RELAX-ed KAOS goal model (Figure 7.5)
Goal #                Evaluation Result
G1                    1.0
G2                    1.0
G3                    1.0
G4                    1.0
G5                    1.0
G6                    1.0
G9                    0.0
G10                   1.0
G12                   1.0
G13                   1.0
G14                   1.0
G16                   0.0
G18                   1.0
G19                   1.0
G21                   0.799
G22                   0.95
O1                    0.0
O2                    0.0
Total Evaluation      0.799
Total Satisficement   1.0

7.5 Related Work

This chapter explores methods for the assurance of cyber-physical LESs via models at run time [43]. Related studies have investigated the verification of robotic systems [90], including construction site applications [77]. Those works apply formal methods for verification but do not explicitly address LECs faced with uncertain conditions. RoCS [179] has been introduced as a self-adaptive framework for robotic systems, but in contrast to MoDALAS, it is not model-driven nor focused on software assurance. To the best of our knowledge, MoDALAS is the first to include run-time assessment of LECs with respect to goal-oriented (i.e., KAOS) models. Self-adaptive frameworks have used different approaches to address assurance. Zhang and Cheng [233] developed a state-based modeling approach for model checking assurance properties of SASs. Weyns and Iftikhar [216] proposed the use of model-based simulation to evaluate system requirements and determine adaptation procedures. ENTRUST [29] supports the development of an SAS driven by GSN assurance cases and verified by probabilistic models at run time. Similarly, AC-ROS (described in Chapter 6) is a GSN model-driven self-adaptive framework for ROS-based applications, which includes self-assessment through utility functions as assurance evidence. In contrast, MoDALAS uses KAOS goal models to assess the satisfaction of system requirements of an LES at run time.
Furthermore, these other approaches do not address uncertainty for LECs. MoDALAS enables an LES to self-adapt to mitigate failure from the use of LECs in untrusted contexts. A number of design-time approaches have addressed how LECs handle uncertainty [175, 182]. Smith et al. [201] have also explored the construction of assurance cases at design time to cate- gorize LEC behavior with respect to hazardous behaviors. However, these methods do not enable self-assessments at run time and have limited applications for handling uncertain environmental conditions. MoDALAS differs from these works by using model inference (via behavior oracles) to assess LEC behavior at run time for known unknown environmental conditions. Another direction in which research has addressed environmental uncertainties for LECs is during requirements modeling and specification. Whittle et al. [221, 222] proposed RELAX as 138 a requirements specification language that allows for the relaxation of requirements to adapt to environmental uncertainties. Fredericks et al. [63] and Ramirez et al. [177] proposed automation of relaxation of goal models and derivation of utility functions using RELAX. Letier et al. [135], Ramirez et al. [177], and Bencomo et al. [14] proposed various utility functions (e.g., probabilistic) used to evaluate and quantify partial satisfaction of a goal. Letier et al. [136] also proposed using Monte Carlo simulation to calculate the consequences of certain uncertainty factors. The MoDALAS framework has been designed to accommodate different requirements specification languages and the corresponding analysis techniques to address uncertainty. Recently, other researchers have explored system assurance for LESs and LECs. Asaadi et al. [5] proposed a probabilistic quantification of LES system confidence based on functional capabilities and dependability attributes. Boursinos et al. [21] proposed a conformal prediction framework, leveraging previous normal values to check for abnormalities. Weyns et al. [220] proposed com- bining MAPE, control theory, and machine learning for better adaptive systems. In contrast, MoDALAS focuses on self-adaptation to mitigate LEC failure through the use of behavior oracles and the run-time evaluation of KAOS goal models with optional RELAX-ed goals. 7.6 Summary This chapter introduced the MoDALAS framework for using requirements models at run time to automatically address the assurance of safety-critical systems with machine learning components. Due to uncertainties about the ability for LECs to generalize to complex environments, methods are needed to assess the capability of LECs at run time and adapt LESs to mitigate the use of LECs in unsafe run-time conditions. MoDALAS assesses the trustworthiness of LECs with behavior oracles and reconfigures an LES to maintain satisfaction of system requirements at run time. MoDALAS addresses uncertainties about the assurance of an autonomous LES when facing uncertain run-time conditions (e.g., known unknown phenomena). This chapter described a proof- of-concept prototype of MoDALAS for an autonomous rover LES with an object detector LEC. MoDALAS adapts the rover to maintain safety requirements under run-time conditions where the 139 object detector is deemed unreliable. This chapter has also demonstrated how MoDALAS can leverage the RELAX language and fuzzy logic run-time evaluation to manage uncertainties in requirements. 
Chapter 8 explores in detail how to incorporate model-driven self-adaptation into a single service-oriented framework to develop more trustworthy LESs, validated on an autonomous rover with vision-based deep learning components.

CHAPTER 8

ADDRESSING ROBUSTNESS AND RESILIENCY AT RUN TIME

Trustworthy artificial intelligence (Trusted AI) is of utmost importance when LECs are used in autonomous, safety-critical systems. When reliant on deep learning, these systems need to address the reliability, robustness, and interpretability of learning models. In addition to developing specific strategies to address each of these concerns, appropriate software architectures are needed to coordinate LECs and ensure they deliver acceptable behavior under uncertain conditions. This work proposes a model-driven framework of loosely-coupled modular services designed to monitor and control LECs with respect to Trusted AI assurance concerns. The proposed framework is composable, deploying independent services to improve the resilience and robustness of AI systems. The overarching objective of this framework is to support software engineering principles such as modularity, composability, and reusability in order to facilitate development and maintenance tasks, while also increasing stakeholder confidence in Trusted AI systems. To demonstrate this framework, it has been implemented to manage the operation of an autonomous rover’s vision-based LEC under uncertain environmental conditions.

The remainder of this chapter is organized as follows. Section 8.1 overviews and motivates this study. Section 8.2 describes the aggregate elements of a proposed service-oriented architecture for Trusted AI. Section 8.3 demonstrates a use case of the proposed framework on an autonomous rover platform. Finally, Section 8.5 provides a concluding summary for this chapter.

8.1 Overview

A critical factor for the use of AI in safety-critical tasks is the need for stakeholders to trust AI systems to perform as intended, despite many uncertainties unique to LECs [224]. Data-driven LECs, such as DNNs [73], are often black boxes and more complex than traditional software components, and their use can require a “leap of faith” from stakeholders [112, 228]. However, the risk of using inadequate AI in safety-critical applications is severe, possibly leading to human injury and casualties (e.g., autonomous driving accidents [164, 165]). Various high-level “Trusted AI” guidelines have been proposed to systematically address AI assurance concerns [92, 94, 151, 161, 190, 200]. As “best practice” guidelines, these frameworks increase awareness of safety issues unique to AI systems by decomposing assurance topics into categories such as reliability, fairness, robustness, interpretability, and uncertainty quantification. However, because the specific techniques used to address each assurance category are highly application-dependent, it can be challenging to generalize solutions across multiple applications. As machine learning software matures from a largely academic, research-focused domain to mainstream software, we need to apply well-established software engineering practices for Trusted AI [31, 150]. This chapter proposes a modular and composable approach to develop Trusted AI in support of the different dimensions of uncertainty recognized by existing guidelines [92, 94, 151, 161, 190, 200].
Current solutions to Trusted AI concerns, such as adversarial robustness and adversarial detection [142], are tightly-coupled to specific problem domains, leading to monolithic applications that are difficult to scale, reuse, and maintain [11, 64]. When “robustifying” DNNs, techniques have been proposed to augment training data, training procedures, or network topologies, with updates interwoven into a single, monolithic learning model [198]. When proposed solutions are tightly-coupled to a base learning model, it is challenging to repurpose them for alternative learning models. Furthermore, when addressing uncertainty for Trusted AI systems, many context-dependent solutions are needed to mitigate the various forms of uncertainty (e.g., robustness to adverse weather effects versus cybersecurity concerns). When solutions are monolithic, any change with respect to a single form of uncertainty can require the entire learning model to be retrained and validated. As new forms of adversarial conditions are uncovered, monolithic solutions require extensive updates to the entire learning model, rather than isolating changes to decomposable functional units. For systems that address multi-dimensional problems, such as Trusted AI, alternative software development and assessment practices should be considered, as current monolithic approaches are difficult to scale and maintain [154].

This chapter proposes a modular, composable approach to address multiple dimensions of uncertainty in Trusted AI. Rather than using monolithic solutions to address all issues of uncertainty for Trusted AI, this chapter proposes a framework comprising loosely-coupled services, each of which is responsible for an individual assurance concern. In contrast to monolithic architectures, microservice architectures (i.e., collections of small, autonomous services within a loosely-coupled service-oriented architecture [155]) realize software as a collection of independently deployable services that interact using a common Application Programming Interface (API) and communication protocol (e.g., TCP/IP) [6, 55]. Microservice architectures are ideal for systems that are both goal-oriented and focused on replaceability [155]. Microservice architectures also facilitate broad code reuse, because each microservice can be implemented with different technologies and programming languages. Furthermore, individual microservices can be executed on separate hardware platforms, enabling more scalable deployments. When realizing Trusted AI systems as microservice architectures, separate reusable services can be deployed to manage the reliability, robustness, and interpretability of an underlying LEC, rather than incorporating the same functionality into a single component. This chapter introduces the Anunnaki framework (named for a group of ancient Sumerian deities [118]) as a collection of microservices to facilitate the development of Trusted AI. While other works have considered AI as a microservice, this chapter is the first to explore Trusted AI as an aggregation of microservices that addresses multiple dimensions of uncertainty [154, 159, 172, 189]. The Anunnaki framework leverages existing techniques, such as Enlil [127] to automatically assess the resiliency of LECs under a variety of adverse phenomena (e.g., rain, fog, etc.) and Enki [129] to automatically generate more robust learning model alternatives trained with synthetically augmented data.
Furthermore, this chapter introduces Utu (named for an ancient Sumerian deity responsible for enforcing divine justice [118]) as a collection of model-driven services within the Anunnaki framework to monitor and control LECs at run time with respect to existing assurance and requirements artifacts (e.g., GSN [3] assurance cases and KAOS [124] requirement models). In contrast to previous work on AC-ROS [34] and MoDALAS [130], Utu is not dependent on the internal functionality of the managed AI platform, which promotes reusability, portability, and extensibility for more flexible run-time monitoring. In execution, Anunnaki microservices run in parallel with, and independently of, managed LECs. As such, the Anunnaki framework enables developers to reuse common services to generate robust alternatives to LECs, detect when LECs have entered untrusted states, and mitigate the use of LECs in untrusted states.

To demonstrate the Anunnaki framework, it has been applied to an autonomous rover with a vision-based obstacle detector LEC. In its default form, the obstacle detector exhibits a reasonable degree of accuracy on known validation data. However, uncertainties arise when new forms of adverse phenomena are considered (e.g., changes in lighting, occluded visibility, etc.). This chapter demonstrates how aggregate services within the Anunnaki framework can be leveraged to reconfigure the rover and mitigate the use of its object detector under conditions deemed untrustworthy. Through the use of independent microservices to assess trustworthiness and enact changes in system behavior, the Anunnaki framework provides a non-monolithic, scalable, and maintainable approach to addressing uncertainty in Trusted AI systems.

8.2 A Service-Oriented Framework for Trusted AI

This section provides a high-level overview of the Anunnaki framework. The microservices provided by the Anunnaki framework collectively serve to manage the operation of LECs in the presence of uncertain conditions and to mitigate faults resulting from their use in untrusted conditions. Figure 8.1 illustrates the major processes within the Anunnaki framework as a data flow diagram (DFD), where processes are depicted as interconnected circles. Rectangles depict systems external to the Anunnaki framework. Labeled arrows show data flow between processes, and persistent data stores are shown within parallel lines. Each process shown in Figure 8.1 is a separate microservice executed in parallel with and independently of the managed AI system. The remainder of this section describes each of these processes.

Figure 8.1: A high-level DFD of the Anunnaki framework. Anunnaki processes are shown as circles, interacting with external systems, such as the managed AI system and a simulator, shown as rectangles. Labeled arrows show data communicated between processes, with persistent data stores shown within parallel lines.

8.2.1 Resiliency Through Predictive Behavior

The Anunnaki framework leverages model inference and behavior models of an LEC for the purpose of adversarial detection. Adverse interference can include any malicious noise or environmental phenomena that result in undesirable behavior from an LEC.
Behavior models are used to predict the impact of adverse conditions absent from existing training/validation data, thus enabling the Anunnaki framework to prevent the use of LECs under conditions in which they would normally perform unreliably (e.g., poor lighting conditions). Enlil constructs behavior models of an LEC by assessing the impact of various environmental phenomena within an external simulator (Figure 8.1, Step 1) [127]. Next, Enlil generates a behavior model that can be executed independently of the LEC as a behavior oracle (Figure 8.1, Step 2). One or more behavior oracles run in parallel to the managed AI system and subscribe to the same sensor data received by managed LECs. As sensor data is received, behavior oracles output behavior assessments, which include both a perceived context for any apparent adversarial noise and an inferred behavior category to summarize the impact of that noise. As microservices, behavior oracles publish behavior assessments to any other subscribing microservice, thus enabling the Anunnaki framework to assess the reliability of a managed AI system under a variety of known unknown environmental conditions and react to adverse run-time phenomena.

8.2.2 Robustifying Learning Models

To address the adversarial robustness of LECs, the Anunnaki framework uses alternate learning models trained on data synthetically augmented to include adverse phenomena (e.g., rain, fog, etc.). Using Enki (Figure 8.1, Step 3), robust learning models are generated by running a simulator to uncover examples of adverse phenomena that lead to a diverse array of behavior patterns for the given LEC [129]. The selected adversarial examples are then used to retrain the default learning model. Once generated, alternative learning models can be swapped in place of the default learning model by a model manager microservice that determines which specific learning model is active at run time (Figure 8.1, Step 4). This microservice approach enables separate learning models to be robustified with respect to specific forms of adverse interference and swapped in place of each other based on run-time contexts. When no adverse interference is detected, the default learning model can be activated. By decoupling the problem of robustification from a single learning model into separate, independent learning models, the Anunnaki framework gives developers more flexibility in deciding which forms of adverse phenomena are addressed by any given implementation of the managed AI system. Furthermore, this approach enables developers to maintain and augment specific context-dependent models without needing to retrain and validate the base learning model. For example, a robust learning model generated for rainy environments can be updated without retraining or revalidating the default learning model. Additional robustified models can also be created for alternative phenomena (e.g., foggy weather, poor lighting, etc.) that are independent of each other and of the default learning model. Thus, the Anunnaki framework provides a more modular and composable solution to robustifying LECs.

8.2.3 Constructing Goal Models

The Anunnaki framework leverages goal models to capture the high-level objectives of the managed AI system. KAOS goal modeling [124] supports a hierarchical decomposition of high-level functional and performance objectives into leaf-level system requirements (i.e., goal-oriented requirements engineering [131]).
KAOS goal models enable a formal goal-oriented analysis of how system requirements are interrelated as well as threats to requirement satisfaction. Goals represent atomic objectives of a system at varying levels of abstraction, with sub-goals refining and clarifying higher-level goals. Any event threatening the satisfaction of a specific goal is represented as an obstacle. Resolutions for obstacles can be specified by attaching additional sub-goals with alternative system requirements to the corresponding obstacle. Finally, agents (i.e., system components) are assigned responsibility for each system requirement. KAOS goal models enable developers to decompose the expected behavior of a software system, including information about threats to specific system requirements and how system requirements relate to each system component.

An example KAOS goal model is shown in Figure 8.2, comprising system objectives for a rover with a vision-based object detector. Blue parallelograms represent system goals (e.g., G12: “Rover warns nearby pedestrians.”) that can be decomposed into sub-goals with AND/OR refinements (shown as connecting lines with overlaying circles). Any potential hazards or obstacles that could prevent the satisfaction of a goal are shown as red parallelograms (e.g., O1: “Object detector is degraded/compromised.”). At the leaf-level, agents are shown as white hexagons to indicate which system components are responsible for achieving associated goals. The Anunnaki framework extends goal models by allowing utility functions to be associated with each goal and obstacle [36, 215]. Utility functions map attributes of the managed system to quantifiable metrics that establish a degree of goal satisfaction (i.e., satisficement) [45, 124]. For example, the utility function “A1.buzzer == true” is attached to goal G14. Thus, when the “buzzer” attribute of agent A1 is set to true, the goal G14 is evaluated as satisfied. Through the use of utility functions, the Anunnaki framework can interpret a KAOS goal model as a logic tree of run-time system checks to determine the satisfaction of high-level system objectives.

Figure 8.2: An example KAOS goal model to graphically depict system requirements of a robot rover as a hierarchy of logically interconnected goals. Blue parallelograms represent system goals and red parallelograms represent potential obstacles to the satisfaction of goals. White hexagons represent system components responsible for achieving leaf-level goals. The Anunnaki framework extends goal models by attaching utility functions to goals/obstacles (shown in yellow ellipses). Agents can be associated with specific message topics (also shown in yellow ellipses) to inform a MAPE-K controller.

[G1]: OR
[G2]: equal(A1.mode, "manual")
[G3]: AND
[G3]: equal(A1.mode, "auto")
[G3]: AND
[G12]: AND
[G12]: OR
[G13]: AND
[G14]: equal(A1.buzzer, True)
[G15]: equal(A1.lights, "#ffff00")
[G16]: equal(A1.detected_human, True)
[G17]: AND
[G18]: equal(A1.lights, "#00ff00")
[G19]: equal(A1.detected_human, False)
[O1]: CONDITIONAL
[O1]: IF
[O1]: greater(A4.category, 0.0)
[O1]: THEN
[G24]: equal(A1.mode, "manual")
[G4]: AND
[G5]: greater_equal(rate(A2), 15)
[G6]: greater_equal(rate(A3), 5)
[G7]: fuzzy_right(A1.n_collisions, 0, 3)

Figure 8.3: A logic tree representation of the KAOS goal model in Figure 8.2. The Anunnaki framework automatically parses and interprets goal models as logic trees of utility functions for run-time evaluation of goal satisfaction.
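The evaluation of such a logic tree can be illustrated with a small sketch. The following minimal Python example evaluates the AND-refined subtree rooted at G4 (as listed above), substituting hypothetical run-time readings for the agent attributes. The helper functions, the min-based AND semantics, the interpretation of fuzzy_right (full satisfaction at or below the lower bound, declining to zero at the upper bound), and the sample values are simplifying assumptions for illustration rather than the framework's actual evaluator; the rate() utilities are passed in as precomputed message rates.

def greater_equal(value, threshold):
    # Crisp utility function: 1.0 when the observed value meets the threshold.
    return 1.0 if value >= threshold else 0.0

def fuzzy_right(value, low, high):
    # Assumed fuzzy utility function: 1.0 at or below `low`, 0.0 at or above
    # `high`, with a linear ramp in between (a shoulder-shaped membership).
    if value <= low:
        return 1.0
    if value >= high:
        return 0.0
    return (high - value) / (high - low)

def and_node(*scores):
    # Assumed AND semantics: a refinement is only as satisfied as its
    # weakest sub-goal (one common fuzzy-logic convention).
    return min(scores)

# Hypothetical run-time readings gathered from the managed system.
readings = {"rate_A2": 20.0, "rate_A3": 6.0, "n_collisions": 1}

g5 = greater_equal(readings["rate_A2"], 15)        # G5 -> 1.0
g6 = greater_equal(readings["rate_A3"], 5)         # G6 -> 1.0
g7 = fuzzy_right(readings["n_collisions"], 0, 3)   # G7 -> ~0.67
g4 = and_node(g5, g6, g7)                          # G4 -> ~0.67

A full evaluator would recurse over the parsed tree, combining OR refinements and conditional obstacle branches in the same manner rather than hard-coding a single subtree.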
For example, Figure 8.3 shows a logic tree interpretation of the KAOS goal model in Figure 8.2. The Anunnaki framework also extends goal models by allowing message channels to be associated with each agent, specifying the channels on which each respective agent publishes state data. For example, the message channel “/utu/oracle/output” is attached to agent A4, indicating that attributes of the behavior oracle can be monitored by observing the corresponding message channel. These extensions enable developers to map the same goal model to different platforms by simply redefining the associated message channels and system attributes.

8.2.4 Model-Driven Monitor and Control

To monitor and control the managed AI system, the Anunnaki framework implements Utu as a goal model-driven MAPE-K controller to monitor, analyze, and reconfigure the use of LECs in response to uncertain environmental conditions. In order to mitigate faults from the use of LECs in untrusted conditions, Utu assesses the run-time state of the managed AI system and issues reconfiguration requests in response to the run-time environment. Utu requires goal models described in either a GSN [3] or KAOS [124] format to specify the expected system requirements. Utu also requires a predefined set of tactics to determine what actions should be taken to mitigate faults resulting from violated goal models [37]. Because goal models and tactics are model-driven rather than hard-coded into Utu, the Anunnaki framework can be deployed with alternative goals and adaptation tactics by simply re-instantiating Utu microservices with new goal models and tactics.

8.2.5 MAPE-K Microservices

To manage adaptation of the managed AI system, Utu comprises five separate microservices that instantiate the MAPE-K loop, each of which is described as follows.

Step 5.a. Utu Knowledge Manager. A Knowledge Manager microservice manages pre-defined goal models and adaptation tactics developed at design time. To enable run-time evaluation of the managed AI system’s goal satisfaction, system variables are parsed from the utility functions associated with the given goal models. The Knowledge Manager acts as a database of the agents referenced by the given goal models and the run-time values of each respective utility variable.

Step 5.b. Utu Monitor. A Monitor microservice dynamically subscribes to outgoing message traffic of the managed AI system in order to track the system’s run-time utility. Subscribed message channels are based on the agents referenced by the active goal model managed by the Knowledge Manager. Once the Monitor subscribes to a message channel, incoming utility data is forwarded back to the Knowledge Manager at a frequency set by the user upon instantiation.

Step 5.c. Utu Analyze. An Analyze microservice determines whether any reconfiguration of the managed AI system is required. When any change in a monitored utility variable is detected by the Knowledge Manager, the active goal model and current utility values are forwarded to the Analyze microservice for evaluation. By treating the goal model as a logic tree of utility functions, the Analyze step can substitute run-time utility values into each respective utility function variable to determine goal satisficement. When the entire goal model is found to be unsatisfied, adaptation tactics with preconditions that match the run-time goal model evaluation are selected and forwarded for planning. An example adaptation tactic is shown in Figure 8.4, where a tactic is defined in an XML format.
Tactics are defined with a set of preconditions, actions, and postconditions. In the given example, a “fail-safe” tactic is defined with a precondition to trigger when G3 in Figure 8.2 is found to be unsatisfied. The tactic defines a set of actions to be executed once triggered. For the example fail-safe tactic, the actions are to (1) request a mode change to “manual” mode for the rover and (2) email a notification to the user. Finally, a postcondition is given in the example to state that goal G3 is expected to be satisfied upon execution of the given actions. Thus, adaptation tactics define specific actions to perform within the Anunnaki framework and the managed system, triggered by specific evaluations of the provided goal model.
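Although Figure 8.4 is not reproduced here, the structure it describes can be sketched. The following minimal Python example parses a hypothetical tactic definition in the spirit of the fail-safe tactic above and checks whether its preconditions match a goal-model evaluation; the XML element and attribute names, the email address, and the matching logic are illustrative assumptions rather than the actual Anunnaki tactic schema.

import xml.etree.ElementTree as ET

# A hypothetical tactic definition in the spirit of the fail-safe tactic
# described above; element and attribute names are illustrative only.
TACTIC_XML = """
<tactic name="fail_safe">
  <precondition goal="G3" satisfied="false"/>
  <action type="set_mode" value="manual"/>
  <action type="notify_email" value="supervisor@example.org"/>
  <postcondition goal="G3" satisfied="true"/>
</tactic>
"""

def tactic_applies(tactic, evaluation):
    # A tactic is selected when every precondition matches the run-time
    # goal-model evaluation (here, a goal counts as satisfied at a score of 1.0).
    for pre in tactic.findall("precondition"):
        expected = pre.get("satisfied") == "true"
        observed = evaluation.get(pre.get("goal"), 0.0) >= 1.0
        if observed != expected:
            return False
    return True

tactic = ET.fromstring(TACTIC_XML)
evaluation = {"G3": 0.0}  # hypothetical Analyze output: G3 unsatisfied
if tactic_applies(tactic, evaluation):
    actions = [(a.get("type"), a.get("value")) for a in tactic.findall("action")]
    print("Forward to planner:", actions)

Tactics selected in this manner are forwarded for planning rather than executed directly, consistent with the Analyze behavior described above.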