RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED CYBER-PHYSICAL SYSTEMS

By

Anik Momtaz

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

This dissertation addresses the problem of runtime verification of distributed cyber-physical systems (CPS) with respect to a given formal specification. Cyber-physical systems are computer systems with integrated software and physical (hardware) components that, in an ideal environment, seamlessly interact with the real world, as well as with each other. Since exhaustively validating the correctness of a distributed CPS is usually infeasible (if not impossible), many modern validation methods involve runtime verification of distributed CPS based on safety properties. Our work focuses on developing time- and resource-efficient verification techniques that can run in parallel with the execution of these systems to ensure reliability.

In this dissertation, we propose different methodologies to reason about the correctness of distributed CPS in real time, depending on the system settings and architecture. We also provide case studies relevant to each approach in order to demonstrate real-world applications. In all our proposed techniques, we assume a partially synchronous setting, where a clock synchronization algorithm guarantees a bound on clock drifts among all signals.

To this end, we first introduce two monitoring methods for distributed systems with discrete events, where a specification in linear temporal logic (LTL) [12] is evaluated on a system using (1) a deterministic finite automaton-based technique, and (2) a progression-based formula rewriting technique.

We then extend this work to detecting violations of predicates over distributed continuous-time and continuous-valued signals in CPS. We introduce a novel retiming technique that allows reasoning about the correctness of predicates among continuous-time signals that do not share a global view of time. In addition, we show that leveraging simple knowledge of physical dynamics allows for further reduction in run time.

Leveraging the previous two methods, we then introduce a monitoring technique for solving the problem of runtime verification for distributed CPS using signal temporal logic (STL) [36]. We employ a formula progression technique, together with a signal retiming method, that enables reasoning about the correctness of formulas among continuous-time and continuous-valued signals in CPS, even when only a partial signal is available.

We also extend our previous work on detecting violations of predicates over distributed signals in CPS from a centralized monitoring setting to a decentralized monitoring setting. We employ a technique that allows us to identify all possible violations, not just one, which in turn allows for the identification and elimination of bugs from distributed systems regardless of the actual clock drift.

Finally, we introduce the notion of monitoring reliability on a network of monitors in a decentralized monitoring setting. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device (or node) in a layered network of producers and consumers. Our model monitors the events in nodes where resource usage occurs, and captures the trade-offs between the reliability of the system and resource usage.
We present an efficient algorithm to determine the optimal selection of processing quality for each node in this producer-consumer network, such that the target system reliability is achieved while respecting the given resource bounds and minimizing resource usage. In addition, we present a lightweight machine learning-based solution to improve our model in terms of run time.

To you, Nuban Mama. I will forever endeavor to illuminate the void created by your absence with the light you have inspired.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere and heartfelt gratitude to my PhD advisor, Dr. Borzoo Bonakdarpour, for his unwavering guidance and invaluable mentorship throughout the journey to complete my degree. His deep expertise, relentless dedication, and insightful feedback have been instrumental in shaping the quality and depth of my research. It is impossible for me to overstate the pivotal role his support has played in my academic growth. I am truly fortunate to have had the privilege of working under his tutelage.

I would also like to express my gratitude to the remaining members of my PhD guidance committee, Dr. Betty Cheng, Dr. Bahare Kiumarsi, and Dr. Sandeep Kulkarni, for their continuous support and indispensable feedback on my research.

Throughout my academic endeavors, I have had the pleasure of closely working with Dr. Houssam Abbas, from Oregon State University. His immeasurable contributions to my work in runtime verification of signals have made it possible to produce multiple high-quality papers, including one on predicate detection for signals that received the Best Paper Award at the 21st International Conference on Runtime Verification.

I have also co-authored several papers with my exceptional colleague and dear friend, Ritam Ganguly. In every paper we co-authored, his contributions were instrumental and second to none. My sincere gratitude is also extended to the rest of my brilliant colleagues, Oyendrila Dobe, Tzu-Han Hsu, and Eshita Zaman, who generously devoted many hours from their busy schedules to review and provide valuable suggestions on numerous aspects of my research.

Words cannot fully express the depth of my gratitude to my family for giving me unconditional love and continuous encouragement from halfway across the globe. I want to specifically extend my heartfelt thanks to my parents, Asfia Sabina (Ammu) and Motazid Momtaz (Abba), my little sister, Monisha Momtaz (Monomono), my aunt, Simin Seury (Khamma), and my grandmother, Banu Tarafdar (Didda). Additionally, I am immensely grateful to my wonderful wife, Tiana Momtaz, for her love, support, and boundless sacrifices throughout this journey, and for always believing in me when, at times, even I could not.

I am also thankful to Sadika Amreen, Reazul Hoque, Balabhadra Khatiwada, Meena Khatiwada, and Emily Mui, for being extraordinary friends and possessing the remarkable ability to make good times better and not-so-good times bearable.

Last, but most certainly not least, I must express my gratitude to Michigan State University, specifically the Department of Computer Science and Engineering, for affording me the opportunity to pursue my dream of obtaining a PhD in a field I am passionate about. I am truly and unequivocally proud to call myself a Spartan.

TABLE OF CONTENTS

LIST OF TABLES  ix
LIST OF FIGURES  x
LIST OF ABBREVIATIONS  xiii
CHAPTER 1  INTRODUCTION  1
1.1 Motivating Examples  2
1.2 Challenges  4
1.3 Thesis Statement  6
1.4 Contributions  6
1.5 Organization  14
CHAPTER 2  PRELIMINARIES  16
2.1 Linear Temporal Logics (LTL)  16
2.2 Distributed Computation  18
2.3 Hybrid Logical Clocks  20
2.4 Signal Model  24
2.5 Signal Temporal Logic (STL)  28
2.6 Producer-Consumer Network  29
CHAPTER 3  RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED DISCRETE-EVENT SYSTEMS  32
3.1 Problem Statement  33
3.2 Formula Progression for LTL  33
3.3 SMT-based Solution  41
3.4 Optimization  46
3.5 Case Studies and Evaluation  51
3.6 Conclusion  62
CHAPTER 4  PREDICATE MONITORING IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS  63
4.1 Signal Transmission to the Monitor  63
4.2 Problem Statement  64
4.3 SMT-based Monitoring Algorithm  65
4.4 Exploiting the Knowledge of System Dynamics  74
4.5 Case Studies and Evaluation  75
4.6 Conclusion  85
CHAPTER 5  MONITORING SIGNAL TEMPORAL LOGIC IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS  86
5.1 Problem Statement  86
5.2 Monitoring Algorithm  87
5.3 SMT-based Monitoring Algorithm  91
5.4 Case Studies and Evaluation  102
5.5 Conclusion  107
CHAPTER 6  DECENTRALIZED PREDICATE DETECTION OVER PARTIALLY SYNCHRONOUS CONTINUOUS-TIME SIGNALS  109
6.1 Problem Statement  110
6.2 The Structure of Satisfying Cuts  112
6.3 The Abstractor Process  116
6.4 The Slicer Process for Detecting Predicates  118
6.5 Case Studies and Evaluation  125
CHAPTER 7  RESOURCE OPTIMIZATION OF STREAM PROCESSING IN LAYERED SENSOR NETWORKS  127
7.1 Producer-Consumer Network with Resource Constraints  127
7.2 Problem Statement  134
7.3 SMT-based Solution  135
7.4 Machine Learning-based Optimization  142
7.5 Case Studies and Evaluation  144
7.6 Conclusion  153
CHAPTER 8  RELATED WORK  154
8.1 Lattice-based Distributed Monitoring  154
8.2 Runtime Monitoring in CPS  154
8.3 Asynchronous Distributed Monitoring  156
8.4 Synchronous Distributed Monitoring  158
8.5 Partially Synchronous Distributed Monitoring  161
8.6 Decentralized Distributed Monitoring  162
8.7 Monitoring Reliability in CPS  163
CHAPTER 9  CONCLUSION  166
9.1 Summary  166
9.2 Ongoing Work  168
9.3 Future Work  169
BIBLIOGRAPHY  171

LIST OF TABLES

Table 4.1 Impact of clock skew in network of cars on verdicts using varying ε.  80
Table 4.2 Impact of clock skew in network of UAVs on verdicts using varying ε.  80
Table 4.3 Impact of clock skew in water tanks on verdicts using varying ε.  85
Table 5.1 Impact of ε.  106
Table 7.1 Nodes v[1,5] resource usage.  133
Table 7.2 Nodes v[6,10] resource usage.  133
Table 7.3 Nodes v[1,9] power consumption (in watts) for different quality levels.  146
Table 7.4 Quality level tables for different nodes.  149

LIST OF FIGURES

Figure 1.1 Hybrid dynamic cooling system with water tanks.  3
Figure 1.2 A distributed CPS composed of autonomous aerial vehicles with drifting clocks. The violation property to be monitored is, for any two aerial vehicles, that the distance along the x axis is within 1 and the distance along the y axis is within 1.7. Asynchronous signals produced by the vehicles must be monitored for predicate violations, while leveraging some knowledge of system dynamics.  5
Figure 1.3 Monitoring automaton for formula φ.  7
Figure 1.4 A distributed computation.  7
Figure 1.5 Progression and segmentation.  8
Figure 2.1 LTL3 monitor for φ = a U b.  17
Figure 2.2 HLC example.  21
Figure 2.3 Two partially synchronous continuous concurrent timelines with ε = 0.5, and corresponding signals x and y. (Solid dot indicates signal value at discontinuity.) C is a consistent cut but C′ is not.  27
Figure 2.4 A trace σ generated by a system.  29
Figure 2.5 A producer-consumer network of 10 nodes.  30
Figure 3.1 Progression example.  37
Figure 3.2 Removing non-loop cycles in an LTL3 Monitor.  41
Figure 3.3 Reachability Matrix for a U b.  50
Figure 3.4 Reachability Tree for a U b.  50
Figure 3.5 Synthetic experiments - impact of different parameters.  54
Figure 3.6 Impact of parallelization on different data.  57
Figure 3.7 Cassandra experiments.  59
Figure 4.1 Predicate violation between two signals x and y measured using partially synchronized clocks t and s.  67
Figure 4.2 Piece-wise interpolations.  72
Figure 4.3 Piece-wise linear signals vs. piece-wise quadratic signals.  72
Figure 4.4 Leveraging dynamics.  73
Figure 4.5 Impact of signal segmentation on run time with varying signal duration (S.D.) and fixed ε = 0.001s.  77
Figure 4.6 Best run time (network of cars) for different signal durations.  77
Figure 4.7 Impact of clock skew on run time. Signal duration = 2s.  79
Figure 4.8 Impact of agents on run time.  81
Figure 4.9 Impact of communication (between two agents) on run time.  82
Figure 4.10 Run time (network of cars) vs. segment count.  82
Figure 4.11 Impact of Algorithm 4.1 on monitoring run time. ε = 0.001s.  83
Figure 4.12 Effect of segment duration and the number of water tanks on run time when ε = 0.05s.  84
Figure 5.1 A valid ccf.  86
Figure 5.2 Conversion of STL syntax trees to their corresponding SMT syntax trees.  93
Figure 5.3 SMT syntax tree of STL formulas ¬φ1 and ¬φ2.  93
Figure 5.4 Examples of partitioned SMT syntax trees of STL formulas ¬φ1 and ¬φ2 at t = 5.  101
Figure 5.5 Effect of number of segments and agents on run time for different flight properties.  104
Figure 5.6 Effect of segment duration and the number of water tanks on run time for φP.  107
Figure 6.1 An example of a continuous-time distributed signal with 3 agents. Three timelines are shown, one per agent. The signals xn are also shown, and the local time intervals over which they are non-negative are solid black. The skew ε is 1. The happened-before relation is illustrated with solid arrows. Some satisfying cuts for the predicate φ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) are shown as dashed arcs, and the extremal cuts as solid arcs. All extremal cuts contain root events, and leftmost cut A also contains non-root events.  109
Figure 6.2 Two satcuts for a pair of agents A1 and A2, shown by the crossed solid lines (s, t′) and (s′, t). Their intersection is (s, t), shown by a dashed arc, and their union is (s′, t′), shown by a dotted arc. For a conjunctive predicate φ, the intersection and union are also satcuts, forming a lattice of satcuts.  111
Figure 6.3 A distributed signal of two agents (top) and the output of the abstractor (bottom). The abstractor marks zero-crossings as discrete root events and creates new events (dark circles) to maintain consistency.  117
Figure 6.4 Example of subsection 6.4.1. Bold intervals are where the local signals are non-negative. The happened-before relation is illustrated with solid arrows. The predicate is φ = (x1 ≥ 0) ∧ (x2 ≥ 0). Solid circles represent discrete events returned by the abstractor; hollow circles are those created by the slicers. The leftmost satcut of this example is [3.5 − ε, 3.5] and the rightmost is [6, 5.8].  123
Figure 6.5 Runtime vs. root rate and N on synthetic data.  125
Figure 6.6 Runtime vs. number of agents.  126
Figure 7.1 Synthetic experiment results.  145
Figure 7.2 A producer-consumer network of 8 nodes.  146
Figure 7.3 A Multi-Layer Network of Raspberry Pi Devices.  148
Figure 7.4 Case study results.  150

LIST OF ABBREVIATIONS

AATC  Automated Air Traffic Control
ANN   Artificial Neural Network
AP    Atomic Proposition
CLA   Cold Leg Accumulator
CPS   Cyber-Physical Systems
CPU   Central Processing Unit
ECCS  Emergency Core Cooling System
FAA   Federal Aviation Administration
HLC   Hybrid Logical Clock
IoT   Internet of Things
LTL   Linear Temporal Logic
MPC   Multi-Party Computation
MTL   Metric Temporal Logic
NTP   Network Time Protocol
PTP   Precision Time Protocol
PVC   Physical Vector Clock
RAM   Random Access Memory
RWST  Refueling Water Storage Tank
SMT   Satisfiability Modulo Theory
STL   Signal Temporal Logic
UAV   Unmanned Aerial Vehicle

CHAPTER 1
INTRODUCTION

Distributed monitoring is the process of analyzing the execution of distributed systems with a centralized or decentralized monitor in relation to a given formal specification. Distributed systems often consist of numerous subsystems that do not share a global clock or memory while attempting to complete a collaborative job. In a distributed database, for example, data is kept in several physical locations, usually spread across a network of interlinked computers. A monitor may want to guarantee that queries to the distributed database fulfill some form of consistency requirements.

A prominent class of distributed systems comprises systems containing both software and physical (hardware) components that interact with the real world as well as with each other. These systems are referred to as cyber-physical systems (CPS) [122]. Our reliance on CPS has grown rapidly over the past decade, as these systems are more and more frequently deployed over networks of agents due to the emergence of the Internet of Things (IoT) and edge applications [27]. Therefore, validating the accuracy of these systems, especially for the class of CPS that is safety-critical, is now of paramount importance. Software applications deployed among networked nodes, referred to as agents, form a critical class of CPS. Examples include autonomous car fleets, sensor networks in infrastructure, health-monitoring wearables, and medical device networks. Because CPS are often safety-sensitive, obtaining assurance regarding their accuracy is vital. CPS are distinguished by three defining characteristics:

• First, because the signals are analog, they include an infinite number of events, rendering traditional reasoning approaches designed for discrete systems ineffective, if not inapplicable in most circumstances. The applications we target, such as those mentioned above, require continuous-time behavior. It is not enough, for example, to assert that a voltage does not spike at sample times. As a result, increasing the signal sample rate does nothing to alleviate the necessity for analog signal reasoning.

• Second, each agent in these CPS has a local clock that drifts from the clocks of other agents. Hence, the concept of time, which is taken for granted in centralized systems, must be revisited, as it is unclear whether events are consecutive or concurrent.
Furthermore, it is unclear how continuous events in various processes respect the happened-before relation [73], and how one may reason about the sequence of occurrence of continuous events.

• Third, CPS signals obey physical laws and dynamics. An understanding of these dynamics may be used to reason about distributed signals and predict their behavior, as well as to improve the efficiency of reasoning.

The characteristics listed above define the concept of distributed signals, and reasoning about them necessitates the establishment of some notion of ordering. Building such an ordering for an infinite number of events from different signals while clock drifts occur at runtime is a difficult undertaking.

1.1 Motivating Examples

We demonstrate the crucial need for monitoring distributed CPS through a critical application in automated air traffic control (AATC). The market for unmanned aerial vehicles (UAVs) is expanding rapidly [61]. In the United States, the Federal Aviation Administration (FAA) envisions a federated framework in which UAVs that contribute to monitoring global air safety parameters are rewarded with faster free-flight pathways to their destinations [39, 43].

To support this federated structure, AATC tower software must monitor analog inputs such as UAV location and velocity to determine if they violate global instantaneous safety characteristics, also known as predicates. These predicates are Boolean expressions defined over the concurrent states of the several CPS agents, such as mutual separation, conditional speed limitations, and minimal energy storage. These predicates must be evaluated on the global state, which is the combined state of all UAVs at the same time. However, in the absence of a perfect shared clock across all UAVs, UAV1's clock may report t = 5 and UAV2's clock may report t = 5.2 at the same physical 'real' time. Equivalently, the same value on two clocks may represent distinct physical moments. If the central AATC monitor relies on these two states to determine if the predicate has been violated, then it may result in false negatives (i.e., missing violations) or false positives (i.e., declaring a violation when none exists).

The UAV example has two characteristics that are shared by many different distributed CPS. First, while perfect continuous-time synchrony is often impossible to achieve, clock synchronization algorithms such as the Network Time Protocol (NTP) [88] ensure that drift among local clocks remains within some bound. Second, the central monitor frequently recognizes certain restrictions on the UAV dynamics, such as velocity limits. In this case, the AATC tower would be aware of the UAVs' speed limitations. In developing our solution, we make use of these two characteristics.

Figure 1.1 Hybrid dynamic cooling system with water tanks.

As another example, consider the water distribution system shown in Figure 1.1, where several tanks deliver water to an offsite location via a common pipe. Water tank outflow rate and pressure are monitored locally using drifting local clocks. If the compounded pressure or flow rate on the pipe is a concern and has to be monitored, correctly measuring these values becomes difficult since the continuous signals indicating the pressure and/or flow rate of the tanks are not synchronized. If the flow rate and pressure must always remain below a given threshold, clock drift among the local clocks may cause values for which the threshold is breached to be missed.
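To make the role of the skew bound concrete, the following minimal Python sketch (purely illustrative; the function names, the fixed ε value, and the simple pairwise check are ours, not the machinery developed in later chapters) asks whether two locally timestamped samples could be concurrent under a skew bound ε, and whether some ε-consistent pairing of samples from two tanks could push an aggregate value over a threshold.

```python
# Illustrative sketch: under partial synchrony with skew bound EPS, two samples
# taken at local times t1 and t2 may correspond to the same physical instant
# whenever |t1 - t2| < EPS.  A central monitor must therefore consider every
# such pairing when it looks for a possible violation of a global predicate.

EPS = 0.3  # assumed skew bound (seconds); hypothetical value

def possibly_concurrent(t1: float, t2: float, eps: float = EPS) -> bool:
    """Local times within eps of each other cannot be ordered."""
    return abs(t1 - t2) < eps

def possible_violation(samples_a, samples_b, threshold: float, eps: float = EPS) -> bool:
    """samples_a, samples_b: lists of (local_time, value) pairs from two agents.
    True if some eps-consistent pairing makes the sum of the two values exceed
    the threshold (a toy aggregate predicate, like compounded pipe pressure)."""
    return any(
        possibly_concurrent(ta, tb, eps) and (va + vb > threshold)
        for ta, va in samples_a
        for tb, vb in samples_b
    )

if __name__ == "__main__":
    tank1 = [(5.0, 40.0), (5.4, 55.0)]
    tank2 = [(5.2, 58.0), (5.9, 30.0)]
    # (5.0, 40.0) and (5.2, 58.0) may be concurrent but sum to 98 < 100;
    # (5.4, 55.0) and (5.2, 58.0) may also be concurrent and sum to 113 > 100.
    print(possible_violation(tank1, tank2, threshold=100.0))  # True
```

A monitor that only compared samples with equal local timestamps would miss this violation, which is exactly the failure mode described above.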
1.2 Challenges

While there are approaches for monitoring temporal logic for distributed discrete-event systems (e.g., [49, 58, 96, 99]), we still lack a good understanding of distributed CPS. Although the literature on distributed computing is decades old, and many important problems have been solved in the context of discrete-event systems, the main challenge with distributed monitoring is that it is not always possible for the monitor to establish the right order of occurrence of events across different agents in the absence of a global clock. Given the non-deterministic nature of distributed programs, a runtime monitor is expected to provide multiple results for the same distributed computation. This leads to a combinatorial explosion of possibilities that the monitor must examine at runtime, making the task computationally costly.

Monitoring and detecting violations of formal specifications is a common and effective technique for reasoning about the health of CPS. Broadly speaking, the state of the art in runtime monitoring focuses on either (1) centralized monitoring for stand-alone applications or multi-agent systems that share a global clock while being blind to system dynamics [1, 5, 34, 35, 33, 83], or (2) decentralized monitoring in pure discrete time for ordering discrete events [10, 16, 26, 28, 31, 46, 47, 49, 55, 64, 96, 99, 58], which is appropriate for pure software, but not for CPS. As a result, solutions for monitoring CPS where analog signals are created by distributed agents that do not share a global clock are currently lacking (see the related work in Chapter 8). Lack of synchronization, in particular, poses substantial issues since the monitor must reason about signal levels at distinct agents' local times, which may result in conflicting monitoring verdicts. This problem is exacerbated by the fact that agents often communicate with one another, imposing extra limits on event ordering. Furthermore, in a distributed system, a central monitor that receives all signals is subject to a single point of failure. That is, if the monitor fails, predicate detection fails altogether.

Figure 1.2 A distributed CPS composed of autonomous aerial vehicles with drifting clocks. The violation property to be monitored is, for any two aerial vehicles, that the distance along the x axis is within 1 and the distance along the y axis is within 1.7. Asynchronous signals produced by the vehicles must be monitored for predicate violations, while leveraging some knowledge of system dynamics.

In decentralized monitoring, the concept of reliability of a network of monitors adds another layer to the list of challenges. To handle trade-offs, most systems use manual controls. Some network applications, for example, enable administrators to alter the quality of malicious activity detection systems based on predicted traffic [92, 134]. This strategy frequently focuses on a subset of resources and lacks the flexibility required by huge dynamic systems. Another strategy is to aggressively over-provision the processing infrastructure in terms of machine capabilities (e.g., CPU, memory, etc.), network bandwidth, and assigned power budget to ensure that no limitations are reached [15]. This is an extremely expensive approach that is frequently not feasible and is not future-proof.
The major problem in resource management and optimization is that monitors in a network often receive data, process it, and then transmit it to succeeding monitors. This results in a quality vs. cost trade-off across distinct monitors, where resources are determined not simply by pairs of consecutively interacting monitors, but by the interaction of all monitors in the network. In other words, lowering the processing quality of a monitor might have an impact on subsequent monitors in the network that receive lower quality data. This means that quality versus resource utilization must be optimized across the entire network, not simply for pairs of monitors communicating with each other. On top of that, it is easy to see that quality and resource utilization are frequently at odds; that is, greater quality and dependability require higher resource usage, making optimization more challenging.

1.3 Thesis Statement

Now that we have provided the challenges and motivation for this dissertation, we state the thesis as follows:

Thesis Statement. It is possible to develop trustworthy verification methodologies under both centralized and decentralized monitoring settings in order to reason about the correctness of safety-critical partially synchronous distributed cyber-physical systems in real time.

1.4 Contributions

In this dissertation, we take steps toward rigorous, automated reasoning about distributed CPS, the accuracy and integrity of which is critical to ensuring the safety of the environment in which they function. Based on the proposed verification approaches, our contributions are grouped into five primary segments. These techniques differ in terms of (1) system architecture (i.e., discrete events vs. continuous time), (2) monitor architecture (i.e., centralized vs. decentralized), and (3) specification language (i.e., LTL vs. STL).

1.4.1 Monitoring Discrete-Event Systems using LTL

First, we present two sound and complete solutions to the problem of distributed runtime verification (RV) with regard to LTL formulas. Both approaches employ a fault-proof central monitor, and to address the explosion of various interleavings, we propose a practical assumption, namely, a bounded skew ε between the local clocks of each pair of processes, which is guaranteed by a fault-proof clock synchronization mechanism (e.g., NTP [88]). This implies that time instants from multiple local clocks within ε are deemed concurrent, i.e., their order of occurrence cannot be determined. This is a partial synchrony setting that does not presume a global clock but restricts the impact of asynchrony to within the clock drifts.

Figure 1.3 Monitoring automaton for formula φ.

Figure 1.4 A distributed computation.

Our first approach is based on constructing the LTL3 [12] monitor automaton of an LTL formula and constructing multiple Satisfiability Modulo Theory (SMT) [6] queries to determine which states of the monitor automaton are reachable for a given distributed computation. For example, Figure 1.3 shows the monitor automaton for the formula φ mentioned earlier, and one has to construct 4 different SMT queries to determine the set of all possible reachable states at the end of the computation in Figure 1.4. We transform our monitoring decision problem into an SMT solving problem.
The SMT instance includes constraints that encode (1) our monitoring algorithm based on the 3-valued semantics of LTL, (2) the behavior of communicating processes and their local state changes in terms of a distributed computation, and (3) the happened-before relation subject to the ε clock skew assumption. Afterwards, it attempts to concretize an uninterpreted function whose evaluation provides the possible verdicts of the monitor with respect to the given computation. We divide a computation into multiple segments to make the verification problem tractable, significantly reducing the search space of each SMT query. Thus, the result of monitoring each segment (the possible LTL3 states) should be carried to the next segment. Furthermore, because distributed applications are now operated on large cloud services, we extend our method to a parallel monitoring algorithm to take advantage of the available computational resources and gain greater scalability.

The intuition behind our second monitoring technique is that, since running SMT queries to test whether each state of the LTL3 monitor automaton is reachable (as in the first approach) is excessive, it should be sufficient to test whether temporal sub-formulas of an LTL formula hold in a distributed computation. Similar to the first approach, we utilize segmentation to break down the problem size. In the second approach, to carry the result of monitoring from one segment to the next, we also develop a formula progression technique. Specifically, given a finite trace α and an LTL formula φ, we define a function Pr such that Pr(α, φ) characterizes the progression of φ over α. Progression is defined as the rewritten formula for future extensions of α, which yields true, false, or an LTL formula based on what has been seen thus far. We emphasize that a fundamental distinction between our approach and the standard rewriting technique [59] is that the function Pr accepts a finite trace as input, whereas the algorithm in [59] rewrites the input LTL formula in a state-by-state manner. This suggests that rewriting based on the fixed-point representation of temporal operators is not possible in our context. Our motivation stems from the fact that when a given distributed computation is divided into a number of segments, a state-by-state rewriting approach will generate too many SMT queries, rendering it unscalable.

Figure 1.5 Progression and segmentation.

For example, in Figure 1.5 (which is the computation in Figure 1.4 chopped into two segments), our progression-based approach needs the same 4 SMT queries for seg1 (2 for each of the temporal sub-formulas of φ) as compared to [49]. The evaluation yields the corresponding progressed formulas as the possible formulas for seg2, and as a result we only need to build 4 SMT queries, compared to 5 for the automata-based approach in [49].

We make a detailed comparison between the proposed approaches through not only a set of rigorous synthetic experiments, but also by monitoring the same set of consistency conditions in Cassandra. We also put our approach to the test using a real-time airspace monitoring dataset (RACE) from NASA [85]. Our experiments show that the progression-based approach has 35% reduced overhead as compared to the automata-based approach.
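To make the notion of progression concrete, the following is a minimal, generic sketch of state-by-state LTL rewriting in Python. It is only meant to convey how a residual obligation is handed from one segment to the next; it is not the trace-level function Pr of Chapter 3, and all class and function names are our own.

```python
# A minimal, generic sketch of LTL formula progression (state-by-state
# rewriting), illustrating how the obligation left over from one segment can be
# handed to the next.  This is NOT the trace-level function Pr of Chapter 3,
# which deliberately avoids per-state rewriting.
from dataclasses import dataclass

class F:
    """Base class for formulas."""

@dataclass(frozen=True)
class Atom(F):
    name: str

@dataclass(frozen=True)
class Not(F):
    sub: F

@dataclass(frozen=True)
class And(F):
    left: F
    right: F

@dataclass(frozen=True)
class Or(F):
    left: F
    right: F

@dataclass(frozen=True)
class Next(F):
    sub: F

@dataclass(frozen=True)
class Until(F):
    left: F
    right: F

TRUE, FALSE = Atom("true"), Atom("false")

def simplify(phi: F) -> F:
    if isinstance(phi, Not):
        s = simplify(phi.sub)
        return FALSE if s == TRUE else TRUE if s == FALSE else Not(s)
    if isinstance(phi, And):
        l, r = simplify(phi.left), simplify(phi.right)
        if FALSE in (l, r): return FALSE
        return r if l == TRUE else l if r == TRUE else And(l, r)
    if isinstance(phi, Or):
        l, r = simplify(phi.left), simplify(phi.right)
        if TRUE in (l, r): return TRUE
        return r if l == FALSE else l if r == FALSE else Or(l, r)
    return phi

def progress(phi: F, state: frozenset) -> F:
    """Rewrite phi after observing one state (a set of atomic propositions)."""
    if phi in (TRUE, FALSE):  return phi
    if isinstance(phi, Atom): return TRUE if phi.name in state else FALSE
    if isinstance(phi, Not):  return simplify(Not(progress(phi.sub, state)))
    if isinstance(phi, And):  return simplify(And(progress(phi.left, state),
                                                  progress(phi.right, state)))
    if isinstance(phi, Or):   return simplify(Or(progress(phi.left, state),
                                                 progress(phi.right, state)))
    if isinstance(phi, Next): return phi.sub
    if isinstance(phi, Until):  # phi1 U phi2 = phi2 or (phi1 and X(phi1 U phi2))
        return simplify(Or(progress(phi.right, state),
                           And(progress(phi.left, state), phi)))
    raise ValueError(phi)

def progress_segment(phi: F, segment) -> F:
    """Consume a finite segment (a list of states) and return the residual formula."""
    for state in segment:
        phi = progress(phi, state)
    return phi

if __name__ == "__main__":
    phi = Until(Atom("a"), Atom("b"))                 # a U b
    seg1 = [frozenset({"a"}), frozenset({"a"})]       # b has not been seen yet
    print(progress_segment(phi, seg1))                # obligation is still a U b
```

Running the example leaves the obligation a U b unchanged after a segment in which b has not yet occurred, which is exactly the kind of residual information that must be carried into the next segment.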
1.4.2 Monitoring Predicates on CPS

We provide a sound and complete solution to the problem of predicate monitoring for distributed systems when extended to CPS. Our system, which employs a central monitor to receive distributed signals, may be characterized as follows. We assume a clock synchronization mechanism guarantees a limited skew ε between all local clocks. That is, time instants from separate clocks within ε are regarded as concurrent, i.e., their sequence of occurrence cannot be determined below a resolution of ε. The limited skew assumption is used to supplement the classic happened-before relation [73]. We introduce a retiming technique that leverages the concept of retiming functions from stochastic processes to make the monitor align the locally timed agent signals. A retiming function aligns the supports of two signals while taking into account the order, the ε-skew, and arbitrary message exchanges between agents. Our monitoring decision problem is transformed into a Satisfiability Modulo Theory (SMT) problem that seeks a retiming function that observes a predicate violation. We show how to simplify the general SMT problem of searching for arbitrary retiming functions to the considerably simpler problem of looking for piece-wise linear retimings. Furthermore, knowledge about agent dynamics constraints may be used to decrease monitoring overhead. The following are our contributions:

1. An SMT-based algorithm for centralized monitoring of distributed analog signals for predicate violations, supplemented with a clock synchronization algorithm that ensures finite skew between all local clocks, employing the classic happened-before relation [73];

2. A signal retiming approach based on the concept of retiming functions as used in stochastic processes to address the challenges presented by time asynchrony;

3. A lightweight approach for adding system dynamics constraints in order to decrease monitoring overhead;

4. An analysis of the relationship between the monitoring overhead's sensitivity to the skew bound and the quantity of communication between agents; and

5. A method for parallelizing the monitoring algorithm in order to improve scalability.

We have fully implemented our methodologies and provide the results of experiments on monitoring a network of autonomous ground vehicles (in the real world), aerial vehicles (in simulation), and a water distribution system (in simulation). It should be noted that systems with a central monitor are inherently vulnerable to a single point of failure. Our work is concerned with establishing the suggested theory and does not take into consideration fault tolerance. The following are our observations. First, while our solution is based on SMT solving, it may be used for online monitoring if the monitor is run at an acceptable frequency (i.e., the monitoring overhead does not exceed the system's regular operating time). Second, adding knowledge of system dynamics is hugely beneficial in decreasing monitoring overhead. In some cases, the speedup (as compared to when the information is not used) can be an order of magnitude. Third, when practical clock synchronization protocols (e.g., NTP and PTP) are used, monitoring overhead is independent of clock skews.
Finally, we notice that communication between agents does not always reduce monitoring overhead in the continuous-time context; this contradicts the popular perception in the discrete-time setting, where communication event orderings are thought to make automated reasoning more efficient.

1.4.3 Monitoring CPS using STL

We expand our approach from monitoring just Boolean predicates across distributed signals to full signal temporal logic (STL) [36]. To this end, we start with a partially synchronous scenario, in which a clock synchronization mechanism ensures a maximum bound ε on clock drifts across all signals. This can be ensured by off-the-shelf algorithms such as NTP [88]. We use the signal retiming approach presented in [95] to align continuous-time signals that do not share a global sense of time. Assuming the bound ε, the decision problem is to find a retiming function that violates an input STL formula. If no such function exists, then the distributed signals have not yet violated the formula (they may or may not in the future).

To reduce a distributed signal to more manageable smaller problems, we break the original signal into smaller signals known as segments. The problem here is that the outcome of monitoring one segment should be carried over to the next. For example, consider the STL formula φ = □[0,5] p (which means proposition p should hold at all times in the time interval [0, 5]) and a current segment of signals that ends at time 3. If p holds in the interval [0, 3], then the formula has to be rewritten to φ′ = □[0,2] p for the second segment. Of course, such rewriting can become challenging when the formulas have multiple nested temporal operators with relative time intervals. To this end, we propose a formula progression technique that takes as inputs an STL formula φ and a finite-time distributed signal σ and returns an STL formula φ′ such that for any extension σ′, we have σσ′ |= φ if and only if σ′ |= φ′. We encode the resulting problem as an SMT problem that searches for a retiming function given the constraints of the current segment and the STL formula. We provide approaches for solving the SMT encoding efficiently. We should highlight that we are not concerned in this dissertation with problems such as monitoring fault tolerance (i.e., we assume a flawless centralized monitor with no noise or communication failures).

We have fully implemented our approach on two distributed CPS applications: monitoring of (1) a network of aerial vehicles for a set of properties such as mutual separation and formation, and (2) a water distribution system for the property in which the outflow pressure exceeds the threshold pressure. The results indicate that in some circumstances, a distributed CPS can be monitored fast enough for online deployment.
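The rewriting in the □[0,5] p example above can be sketched as follows for a single Boolean signal and a top-level bounded 'always' only; this restriction is purely for illustration, and the actual progression of Chapter 5 handles arbitrary nesting and distributed, partially synchronous signals.

```python
# Simplified sketch: progress an STL formula of the shape  G[a,b] p  over an
# observed segment [0, d] of a single Boolean signal.  Only this one pattern is
# handled here; all names are illustrative.

def progress_always(a, b, holds, d, step=0.01):
    """holds(t) -> bool is the truth of p at time t in the observed prefix [0, d].
    Returns 'violated', 'satisfied', or a pair (a2, b2) meaning the obligation
    G[a2, b2] p remains for the next segment (times relative to its start)."""
    t = max(a, 0.0)
    while t <= min(b, d):            # check the part of [a, b] we have observed
        if not holds(t):
            return "violated"
        t += step                    # a real monitor would use the piece-wise
                                     # signal representation, not fixed sampling
    if b <= d:
        return "satisfied"           # the whole interval [a, b] was observed
    return (max(a - d, 0.0), b - d)  # unobserved tail, shifted left by d

if __name__ == "__main__":
    p = lambda t: True               # p held throughout the observed segment
    print(progress_always(0.0, 5.0, p, d=3.0))   # -> (0.0, 2.0), i.e. G[0,2] p
```

For the segment ending at time 3, the sketch returns the shifted interval [0, 2], matching the rewritten obligation φ′ = □[0,2] p discussed above.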
1.4.4 Decentralized Monitoring of Predicates on CPS

In order to address the issue of a single point of failure in a distributed system, we also expand our approach of centralized predicate detection for distributed CPS with drifting clocks under partial synchrony to a decentralized monitoring approach. To this end, our contributions are as follows:

1. A fully decentralized monitoring approach, where each agent only has access to its own signal, and exchanges a limited amount of information with other agents;

2. A detection technique that identifies all violating predicates, not just one;

3. An online algorithm for a class of global properties that are conjunctions of local propositions, which can be executed in parallel with the tasks carried out by the agents;

4. A novel physical vector clock that orders continuous-time events in a distributed computation without a shared clock; and

5. A method to deploy our algorithm on existing infrastructure. Specifically, our algorithm includes a modified version of the classical detector described in [26] that can be deployed on top of existing infrastructure.

Our methodologies are fully implemented, and we provide the results of experiments on two synthetically generated signal datasets.

1.4.5 Monitoring Reliability in a Multi-Layered CPS

Finally, we introduce the notion of monitoring reliability on a network of monitors in a decentralized monitoring setting. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device or a node in a layered network of producers and consumers. Assuming a layered producer-consumer network with stream processing, each node in the network faces a trade-off between processing quality and resource utilization. An abstract model of stream processing applications is presented. The processing nodes, in particular, are modeled as a network of producers-consumers, which is a directed acyclic graph in which a node can be a producer, a consumer, or both, based on its incoming/outgoing edges. Each node in the network consumes data that flows through its incoming edges and produces data that flows through its outgoing edges. The processing of data consumed/produced by a node can be done at various quality levels. The quantity of resources utilized by the node is determined by the processing quality level. Power, energy, RAM, disk, or network bandwidth are all examples of resources. In addition to these resources, we represent reliability as a nonrenewable resource that flows across the network and is partially depleted based on the quality levels of the nodes through which it flows. Individual and collective resource limits and bounds apply to nodes. Lower quality leads to more error, which propagates across the network and has the potential to affect the quality of subsequent nodes as well as overall reliability. Our goal is to provide an efficient framework for modeling a system in such a way that resource bounds are respected and a designer-specified goal is optimized. This goal is supplemented with optimization objectives such as maximizing reliability and minimizing energy or other resource usage in the system.

To answer the above-mentioned multi-objective optimization problem, we reduce it to the satisfiability problem for satisfiability modulo theories (SMT). SMT-solving technology has advanced dramatically over the last two decades [77], and we use its improvements to solve our problem. To that end, we represent (1) the elements of the producers-consumers graph, as well as the concepts of data rates, quality, reliability, and resource consumption, as SMT entities (e.g., variables, functions, constants, and so on), (2) the resource constraints and bounds as a set of SMT constraints, (3) the pillars of our original optimization problem as additional SMT constraints that are checked and searched over using a binary search algorithm to find the optimal solution, and (4) a machine learning-based model that aims to further optimize the problem in terms of execution time at the cost of minimal loss in accuracy. The SMT aspects of our technique are implemented using the SMT solver Z3 [32], and the machine learning aspects are implemented using the machine learning toolkit Scikit-learn [101] and the Keras [65] artificial neural network interface.
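The overall search pattern (encode the constraints once, then binary-search the objective bound with repeated satisfiability checks) can be sketched with Z3's Python API as follows. The toy network, the per-level power and reliability numbers, and the additive reliability aggregation are placeholder assumptions for illustration, not the encoding of Chapter 7.

```python
# Hedged sketch of the "encode as SMT, then binary-search the objective" idea
# using Z3's Python API.  The tiny network below and its numbers are made up;
# only the search pattern is the point.
from z3 import And, If, Int, Solver, Sum, sat

POWER        = [[1, 3, 6], [2, 4, 7], [1, 2, 5]]   # per node, per quality level
RELIABILITY  = [[2, 5, 9], [1, 6, 8], [3, 4, 9]]
POWER_BUDGET = 12

def lookup(table, i, level):
    """table[i][level] for a symbolic level, expressed as a nested If."""
    expr = table[i][-1]
    for lvl in reversed(range(len(table[i]) - 1)):
        expr = If(level == lvl, table[i][lvl], expr)
    return expr

def feasible(min_reliability):
    s = Solver()
    q = [Int(f"q_{i}") for i in range(len(POWER))]       # chosen quality levels
    for i, qi in enumerate(q):
        s.add(And(qi >= 0, qi < len(POWER[i])))
    s.add(Sum([lookup(POWER, i, qi) for i, qi in enumerate(q)]) <= POWER_BUDGET)
    s.add(Sum([lookup(RELIABILITY, i, qi) for i, qi in enumerate(q)]) >= min_reliability)
    if s.check() == sat:
        m = s.model()
        return [m[qi].as_long() for qi in q]
    return None

def best_reliability():
    lo, hi, best = 0, sum(max(r) for r in RELIABILITY), None
    while lo <= hi:                    # binary search on the reliability target
        mid = (lo + hi) // 2
        levels = feasible(mid)
        if levels is not None:
            best, lo = (mid, levels), mid + 1
        else:
            hi = mid - 1
    return best

if __name__ == "__main__":
    print(best_reliability())   # (achievable target, quality level per node)
```

Each feasibility query is an ordinary SMT check, so the number of solver calls grows only logarithmically in the range of the objective bound.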
The SMT aspects of our technique is implemented using the SMT-solver Z3 [32] and the machine learning aspects of our technique is implemented using the machine learning toolkit Scikit-learn [101] and Keras [65] artificial neural network interface. Our model aims to optimize reliability and resource consumption trade-offs. We explore these trade-offs through detailed synthetic experiments. We also apply our techniques on a real-world case study, where we optimize a network of embedded streaming devices, so that the network (1) delivers the best possible performance using the available resources, or it (2) uses the minimal amount of a certain resource while meeting a given performance goal. 1.5 Organization This chapter (Chapter 1) provided an overview of the motivation, challenges and con- tributions of this dissertation. The remainder is organized as follows. Chapter 2 discusses the background for our work. Chapter 3 provides details on our runtime verification of distributed systems using automata-based and progression-based techniques. Chapter 4 ex- tends this work to CPS and Boolean predicate detection. Chapter 5 further extends this work from Boolean predicates to STL, whereas, Chapter 6 extends this from a centralized monitoring setting to a decentralized monitoring setting. Chapter 7 introduces the notion of 14 reliability and provides a resource optimization technique. Chapter 8 elaborates on related work, and finally Chapter 9 summarizes the findings, discusses ongoing work and suggests avenues for further research. 15 CHAPTER 2 PRELIMINARIES In this Chapter, we present the background concepts of our work. We start with the formal specification languages we use in our approaches, and then introduce other crucial back- ground components of our work. 2.1 Linear Temporal Logics (LTL) Let AP be a set of atomic propositions and Σ = 2AP be the set of all possible states. A trace is a sequence s0s1 . . ., where si Σ for every i ∈ ≥ 0. We denote by Σ∗ (resp., Σω) the set of all finite (resp., infinite) traces. For a finite trace α = s0s1 . . . sk, denotes its length, α | | k + 1. Also, for α = s0s1 . . . sk, by αi, we mean trace sisi+1 . . . sk of α. 2.1.1 Infinite-trace Semantics of LTL The syntax and semantics of the linear temporal logic (LTL) [104] are defined for infinite traces. The syntax is defined by the following grammar: φ ::= p φ | ¬ φ | ∨ φ | φ φ | U φ where p ∈ AP, and where and U are the ‘next’ and ‘until’ temporal operators respectively. Other propositional and temporal operators are considered as abbreviations, that is, true = p p, false = ∨ ¬ φ), and φ = ¬ true, φ ψ = ( ¬ ¬ φ (always φ). We denote the set of all LTL formulas by ΦLTL. φ = true ψ = ∨ ¬ → ¬ ∨ ∧ φ φ U ψ, φ ψ), φ (eventually ¬ ¬ The infinite-trace semantics of LTL is defined as follows. Let σ = s0s1s2 Σω, i 0, ≥ · · · ∈ and let = denote the satisfaction relation: | (σ, i) (σ, i) (σ, i) (σ, i) (σ, i) φ = p | = | ¬ = φ | ∨ = φ | = φ | U iff iff iff iff iff ψ ψ si p ∈ (σ, i) (σ, i) = φ ̸| = φ or (σ, i) | = ψ | (σ, i + 1) = φ | i : (σ, k) k ∃ ≥ = ψ and | j ∀ ∈ [i, k) : (σ, j) = φ | 16 a } { q0 {} q⊥ a, b } { , { b } q⊤ true true Figure 2.1 LTL3 monitor for φ = a b. U 2.1.2 Finite-trace Semantics of LTL In the context of RV, the 3-valued LTL (LTL3 for short) [12] evaluates LTL formulas for finite traces, but with an eye on possible future extensions, whereas the finite LTL, or FLTL [80] solely considers the present trace with no regard for the future. In LTL3, the set of truth values is B3 = , {⊤ , where , ? 
} ⊥ ⊤ (resp., ⊥ ) denotes that the formula is permanently satisfied (resp., violated), regardless of how far the current finite trace extends, and ‘ ?’ denotes an unknown verdict, i.e., there exists an extension that can violate the formula, and another extension that can satisfy the formula. Let α Σ∗ be a non-empty finite trace. ∈ The truth value of an LTL3 formula φ with respect to α, denoted by [α =3 φ], is defined as | follows: [α =3 φ] = |    ⊤ ⊥ ? if Σω : ασ σ ∀ ∈ if σ Σω : ασ ∀ ∈ otherwise. = φ | = φ ̸| Definition 1. The LTL3 monitor for a formula φ is the unique deterministic finite state machine M φ = (Σ, Q, q0, δ, λ), where Q is the set of states, q0 is the initial state, δ : Q B3 is a function such that λ(cid:0)δ(q0, α)(cid:1) = [α Q is the transition function, and λ : Q → Σ × → =3 φ], | for every finite trace α Σ∗. ■ ∈ As an example, Figure 2.1, shows the monitor automaton for formula φ = a U has the same syntax as LTL, and its semantics is based on the truth values B2 = b. FLTL , {⊤ , ⊥} 17 where ⊤ (resp., ) denotes that the formula is satisfied (resp., violated) given the current ⊥ finite trace. For atomic propositions and Boolean operators, the semantics of FLTL is iden- tical to those of LTL. Let φ, φ1, and φ2 be LTL formulas, α = s0s1 . . . sn be a non-empty finite trace, and =F denote the satisfaction relation in FLTL. The semantics of FLTL for | the temporal operators are as follows: [α =F | φ] = [α =F φ1 | U φ2] =    ⊤ ⊥    if [α1 =F φ] | if α1 = ε ⊥ otherwise. [0, n] : ([αk k ∃ ∈ =F φ2] = | ) ∧ ⊤ [0, k) : ([αl ∈ l ∀ otherwise. =F φ1] = | ) ⊤ Consider the formula φ = p, and a finite trace α = s0s1 sn to further illustrate the difference between LTL and FLTL and LTL3. If p ̸∈ that is, the formula is permanently violated and so is the case in FLTL where, [α · · · si for some i Now, consider formula φ = p. If p si for all i ̸∈ ∈ [0, n], then [α there exist infinite extensions to α that can satisfy or violate φ in the infinite semantics of LTL. But, this is not the case in FLTL where [α =F φ] = | ⊥ as it did not observe any p in the observed finite trace. 2.2 Distributed Computation We assume a loosely coupled asynchronous message passing system, consisting of n re- liable processes (that do not fail), denoted by A = A1, A2, . . . , An { } , without any shared memory or global clock. Channels are assumed to be First In, First Out (FIFO), and loss- less. In our model, each local state change is considered an event, and every message activity (send or receive) is also represented by a new event. Message transmission does not change the local state of processes and the content of a message is immaterial to our purposes. We will need to refer to some global clock that acts as a ‘real’ timekeeper. It is to be understood, 18 ∈ [0, n], then [α , =3 φ] = ⊥ | . =F φ] = ⊥ | =3 φ] =?. This is because | ̸ however, that this global clock is a theoretical object used in definitions, and is not available to the processes. We make a practical assumption, known as partial synchrony [40]. The local clock (or time) of a process Ai, where i ∈ [1, n], can be represented as an increasing function ci : R≥0 → R≥0, where ci(χ) is the value of the local clock at global time χ. Therefore, for any two processes Ai and Aj, we have: R≥0. χ ∀ ∈ ci(χ) | − cj(χ) | < ε with ε > 0 being the maximum clock skew. The value ε is assumed to be fixed and known by the monitor in the rest of this dissertation. 
2.2 Distributed Computation

We assume a loosely coupled asynchronous message passing system, consisting of n reliable processes (that do not fail), denoted by A = {A1, A2, ..., An}, without any shared memory or global clock. Channels are assumed to be First In, First Out (FIFO) and lossless. In our model, each local state change is considered an event, and every message activity (send or receive) is also represented by a new event. Message transmission does not change the local state of processes, and the content of a message is immaterial to our purposes. We will need to refer to some global clock that acts as a 'real' timekeeper. It is to be understood, however, that this global clock is a theoretical object used in definitions, and is not available to the processes.

We make a practical assumption, known as partial synchrony [40]. The local clock (or time) of a process Ai, where i ∈ [1, n], can be represented as an increasing function ci : R≥0 → R≥0, where ci(χ) is the value of the local clock at global time χ. Therefore, for any two processes Ai and Aj, we have:

∀χ ∈ R≥0. |ci(χ) − cj(χ)| < ε,

with ε > 0 being the maximum clock skew. The value ε is assumed to be fixed and known by the monitor in the rest of this dissertation. In the sequel, we make it explicit when we refer to 'local' or 'global' time. This assumption is met by using a clock synchronization algorithm, like NTP [88], to ensure bounded clock skew among all processes.

An event in process Ai is of the form e^i_{τ,σ}, where σ is a logical time (i.e., a natural number) and τ is the local time at global time χ, that is, τ = ci(χ). We assume that for every two events e^i_{τ,σ} and e^i_{τ′,σ′}, we have (τ < τ′) ⇔ (σ < σ′).

Definition 2. A distributed computation on N processes is a tuple (E, ⇝), where E is a set of events partially ordered by Lamport's happened-before (⇝) relation [73], subject to the partial synchrony assumption:

• In every process Ai, 1 ≤ i ≤ N, all events are totally ordered, that is, ∀τ, τ′ ∈ R+. ∀σ, σ′ ∈ Z≥0. (σ < σ′) → (e^i_{τ,σ} ⇝ e^i_{τ′,σ′}).
• If e is a message send event in a process, and f is the corresponding receive event by another process, then we have e ⇝ f.
• For any two processes Ai and Aj, and any two events e^i_{τ,σ}, e^j_{τ′,σ′} ∈ E, if τ + ε < τ′, then e^i_{τ,σ} ⇝ e^j_{τ′,σ′}, where ε is the maximum clock skew.
• If e ⇝ f and f ⇝ g, then e ⇝ g. ■

Definition 3. Given a distributed computation (E, ⇝), a subset of events C ⊆ E is said to form a consistent cut iff when C contains an event e, then it contains all events that happened-before e. Formally, ∀e, f ∈ E. (e ∈ C) ∧ (f ⇝ e) → (f ∈ C). ■

The frontier of a consistent cut C, denoted front(C), is the set of events that happen last in the cut. front(C) is a set of events e^i_last, one for each i ∈ [1, N] with e^i_last ∈ C. We denote by e^i_last the last event of process Pi in C, that is, ∀e^i_{τ,σ} ∈ C. (e^i_{τ,σ} ≠ e^i_last) → (e^i_{τ,σ} ⇝ e^i_last).

2.3 Hybrid Logical Clocks

A hybrid logical clock (HLC) [71] is a tuple (τ, σ, ω) for detecting one-way causality, where τ is the local time, σ ensures the order of send and receive events between two processes, and ω indicates causality between events. Thus, in the sequel, we denote an event by e^i_{τ,σ,ω}. More specifically, for a set of events E:

• τ is the local clock value of events, where for any process Ai and two events e^i_{τ,σ,ω}, e^i_{τ′,σ′,ω′} ∈ E, we have τ < τ′ iff e^i_{τ,σ,ω} ⇝ e^i_{τ′,σ′,ω′}.
• σ stipulates the logical time, where:
  – For any process Ai and any event e^i_{τ,σ,ω} ∈ E, τ never exceeds σ, and their difference is bounded by ε (i.e., σ − τ ≤ ε).
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, where event e^i_{τ,σ,ω} receives a message sent by event e^j_{τ′,σ′,ω′}, σ is updated to max{σ, σ′, τ}. The maximum of the three values is chosen to ensure that σ remains updated with the largest τ observed so far. Observe that σ has similar behavior to τ, except that the communication between processes has no impact on the value of τ for an event.
• ω : E → Z≥0 is a function that maps each event in E to the causality updates, where:
  – For any process Ai and a send or local event e^i_{τ,σ,ω} ∈ E, if τ < σ, then ω is incremented. Otherwise, ω is reset to 0.
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, where event e^i_{τ,σ,ω} receives a message sent by event e^j_{τ′,σ′,ω′}, ω(e^i_{τ,σ,ω}) is updated based on max{σ, σ′, τ}.
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, (τ = τ′) ∧ (ω < ω′) → e^i_{τ,σ,ω} ⇝ e^j_{τ′,σ′,ω′}.

Figure 2.2 HLC example.

We presume that HLC is fault-proof in our implementation.
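The update rules above can be summarized operationally as in the following simplified Python sketch. It is an illustration only, not the implementation used in our experiments; the physical_clock callback, class layout, and method names are assumptions made for presentation, and the tuple (τ, σ, ω) mirrors the notation of this section.

# Illustrative HLC sketch: σ tracks the largest physical time observed, ω breaks ties.
class HybridLogicalClock:
    def __init__(self, physical_clock):
        self.now = physical_clock   # returns the local physical time τ
        self.sigma = 0.0            # logical time σ
        self.omega = 0              # causality counter ω

    def local_or_send(self):
        tau = self.now()
        if tau >= self.sigma:       # physical clock caught up: reset ω
            self.sigma, self.omega = tau, 0
        else:                       # τ < σ: increment ω
            self.omega += 1
        return (tau, self.sigma, self.omega)

    def receive(self, sender_timestamp):
        tau = self.now()
        _, sigma_m, omega_m = sender_timestamp
        new_sigma = max(self.sigma, sigma_m, tau)   # σ := max{σ, σ', τ}
        if new_sigma == self.sigma == sigma_m:
            self.omega = max(self.omega, omega_m) + 1
        elif new_sigma == self.sigma:
            self.omega += 1
        elif new_sigma == sigma_m:
            self.omega = omega_m + 1
        else:                       # σ taken from the local physical clock
            self.omega = 0
        self.sigma = new_sigma
        return (tau, self.sigma, self.omega)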
Figure 2.2 depicts an HLC with partially synchronous concurrent timelines of three processes with ε = 10. Note that the local times of all events in front(C1) are bounded by ε. As a result, C1 is a consistent cut, but C0 and C2 are not. 2.3.1 Physical Vector Clocks We first define Physical Vector Clocks (PVCs), which generalize vector clocks [81] from countable to uncountable sets of events. They are used by the abstractor process (next section) to track the happened-before relation. A PVC captures one agent’s knowledge, at appropriate local times, of events at other agents. Definition 4. Given a distributed signal (E, ⇝) on N agents, a Physical Vector Clock, or PVC, is a set of N -dimensional timestamp vectors vt n ∈ RN + , where vector vt n is defined by the following: 1. Initialization: v0 n[i] = 0, i ∀ 1, . . . , N ∈ { } 21 2. Timestamps store the local time of their agent: vt n[n] = t for all t > 0. 3. Timestamps keep a consistent view of time: Let V t n be the set of all timestamps vs m s.t. es m happened-before et n in E. Then: vt n[i] = max m∈V t n vs (vs m[i]), i ∀ ∈ [N ] n , t > 0 } \ { PVCs are partially ordered: vt n < vt′ m iff vt n ̸ = vt′ m and vt n[i] vt′ m[i] [N ]. ■ ≤ . The detection algorithm can now know the happened-before ∈ i ∀ We say vt n is assigned to et n relation by comparing PVCs. Lemma 1. Let n = m and t, t′ = 0. Then (et n ⇝ et′ m) iff (vt′ m[n] t). ≥ Proof. We split the bidirectional implication into its two directions: 1. (et n ⇝ et′ Since vt (vt′ m[n] m) = ≥ n[n] = t by Definition 4 2 and et ⇒ t) n ⇝ et′ m , then by Definition 4 3, vt′ m[n] t. ≥ 2. (et n ⇝ et′ m) = (vt′ m[n] ⇐ t) ≥ a) Case (vt′ m[n] = t) = ⇒ (et n ⇝ et′ m): Besides initialization, the only case in Definition 4 where a value is assigned which did not come from another timestamp is Definition 4 2. Consider an event et n . The timestamp of this event at index n is t, by Definition 4 2. At the point in time when this event is created (local time t on agent An), no other timestamp has the value t at index n. All other vt′ m which have the value t at index n must be assigned by Definition 4 3. This means that they have the relation et n ⇝ et′ m , due to the transitive property of the happened-before relation. b) Case (vt′ m[n] > t) = ⇒ Consider a t′′ where vt′ (et n ⇝ et′ m): m[n] = t′′ and t′′ > t. Then by the previous case, et′′ n ⇝ et′ m . Since by the happened-before relation all events on an agent are totally ordered (Definition 7 2), et n ⇝ et′′ n . By the transitive property of the happened-before relation (Definition 7 2), et n ⇝ et′ m . 22 ̸ ̸ ■ Theorem 1. Given a distributed signal (E, ⇝), let V be the corresponding set of PVC timestamps. Then (V, <) and (E, ⇝) are order isomorphic, i.e., there is a bijective mapping between V and E s.t. et n ⇝ et′ m iff vt n < vt′ m . Proof. Since each PVC timestamp corresponds to exactly one event and all events have a timestamp, there is clearly a bijective mapping. To show it preserves order, we need to confirm that (et n ⇝ et′ m) ⇐⇒ (vt n < vt′ m). 1. et n m = ⇝ et′ n < vt′ vt By Definition 4 3, each element of vt n ⇒ m must be less than or equal to the corresponding element of vt′ m . So then we need to show that vt n ̸ n[m] = t′ then et′ m vt′ m[m] = t′. By Theorem 1 if vt the happened-before order relation, then vt = vt′ m . Definition 4 2 indicates that ⇝ et n ; but there cannot be cycles in n[m] < t′. This implies that vt n < vt′ m . 2. 
(et n ⇝ et′ m) = (vt n < vt′ m) ⇐ m means that vt n < vt′ vt By Definition 4 2, vt vt′ m[i], ≤ n[n] = t, so vt′ n[i] i ∀ m[n] [N ]. Consider index n, where vt vt′ m[n]. t. Then Theorem 1 states that this implies n[n] ≤ ∈ ≥ et n ⇝ et′ m . ■ Definition 4 is not quite a constructive definition. We need a way to actually compute PVCs. This is enabled by the next theorem. Theorem 2. The assignment vt n =   [0, . . . , 0, t, 0, . . . , 0], t < ϵ  [t − ϵ, . . . , t ϵ, t, t − − ϵ, . . . , t ϵ], t ϵ ≥ − where the t is in the nth position in both cases, satisfies the conditions of PVC in Definition 4. 23 Proof. Consider Definition 7 2. This indicates that all events et−ϵ i happened-before et n , n . Therefore, if these events directly happened-before et n (there is no et′ m i ∀ ∈ where [N ] et−ϵ i \ { } ⇝ et′ m and et′ m ⇝ et n ), then this vector is a correct assignment. By looking at each point in Definition 7, we can see that the only case where one event happened-before another on a different process is when there is at least ϵ difference, Def- inition 2. While an event may have happened-before et n by indirectly following Defini- tion 2 by way of 2 and 2, we do not need to consider this event because there is not a direct happened-before relation with et n (no event in between). Therefore, the assignment ϵ, . . . , t [t − ϵ, t, t − − ϵ, . . . , t − ϵ] is suitable for timestamp vt n . ■ 2.4 Signal Model In this section, we introduce our signal model, i.e., our model of the output signal of an agent. To this end, first, we set some notations. The set of reals is R, the set of non-negative reals is R+, and the set of positive reals is R∗ + . The set of integers 1, . . . , N { } is abbreviated as [N ]. Global time values, kept track of by a hypothetical global clock are denoted by χ, χ′, etc., while the letters t, t′, t1, t2, s, s′, s1, s2, etc. denote corresponding local clock values particular to individual signals/agents, which are always clear from the context. Definition 5. An output signal (of some agent A) is a function x : [a, b] Rd, which is → right-continuous, left-limited, and is not Zeno. Here, [a, b] is an interval in R+, and will be referred to as the timeline of the signal. ■ Definition 6. A root is an event et n where xn(t) = 0 or a discontinuity at which the signal changes sign: sgn(xn(t)) = sgn(lims→t− xn(s)). A left root et n is a root preceded by negative values: there exists a positive real δ s.t. xn(t − α) < 0 for all 0 < α δ. A right root et n is a root followed by negative values: xn(t + α) < 0 for all 0 < α ≤ ≤ δ. ■ We assume that x is one-dimensional, i.e., d = 1. Therefore, Right-continuity implies that for each t in its support, lims→t+ x(s) = x(t). The function is Left-limitedness if it has . Not being Zeno means that x a finite left-limit at every t in its support: lims→t− x(s) < ∞ 24 ̸ has a finite number of discontinuities in any bounded interval in its support. This prevents the signal from jumping indefinitely many times in a finite length of time. A discontinuity ) can be caused by a discrete event within agent A (such as a variable updated in a signal x( · by software), or to a message transmitted to or received from another agent A′. We assume a loosely linked system with N reliable agents that never fail, denoted by A1, . . . , AN , without any shared memory or global clock. The output signal of agent An { } is denoted by xn, for 1 N . We refer to some global clock which acts as a ‘real’ time- n ≤ ≤ keeper. 
However, this global clock is a hypothetical object used in definitions and theorems, and is not available to the agents. We make two assumptions: • (A1) Partial synchrony. The local clock (or time) of an agent An can be represented as an increasing function cn : R+ → R+, where cn(χ) is the value of the local clock at global time χ. Then, for any two agents An and Am, where m, n [N ], we have: ∈ χ ∀ ∈ R+. cn(χ) | − cm(χ) < ε | where the maximum clock skew presumed fixed and known by the monitor is ε > 0. When we refer to ‘local’ or ‘global’ time in the sequel, we make it clear. • (A2) Deadlock-freedom. The agents being analyzed do not enter a deadlock state. Assumption (A1) is met by using a clock synchronization algorithm, like NTP [88], to ensure bounded clock skew across all agents. An event in the discrete-time setting is a change in value of an agent’s variables. We now update this definition for the continuous-time setting of this work. Specifically, in an agent An, an event is either a (i) a pair (t, xn(t)), where t is the local time (i.e., returned by function cn); (ii) a message transmission, or (iii) a message reception. The communications that the agents transmit to each other are free of assumptions. Messages that are sent to the monitor are timestamped by their respective local clocks. Since the agents evolve in continuous time and their output signals are defined for all local times t, a message transmission or reception always coincides with a signal value; i.e., if An receives a message at local time t, its signal 25 has value xn(t) at that time. Thus, without loss of generality, every event will be represented as a (local time, value) pair (t, xn(t)), often abbreviated as en t (n and t will be omitted when irrelevant). A distributed signal is modeled as a set of signals, where events in each signal are partially ordered by a variation of the happened-before (⇝) relation [73], extended by our assumption (A1) on bounded clock skew among all agents. The following defines a continuous-time/value distributed signal under partial synchrony. Definition 7. A distributed signal on N agents is a pair (E, ⇝), where E = (x1, . . . , xN ) is a vector of signals, the set In is a bounded nonempty interval, and the relation ⇝ is a relation between events in signals such that: 1. In every signal xn, all events are totally ordered, that is, for all n [N ], for any ∈ t, t′ ∈ In, if t < t′, then (t, xn(t)) ⇝ (t′, xn(t′)). That is, n ∀ ∈ [N ]. t, t′ ∀ ∈ (cid:16) t < t′(cid:17) In. ⇒ (cid:16) (cid:17) (t, xn(t)) ⇝ (t′, xn(t′)) , where the set In is a bounded nonempty interval. 2. If the time between any two events is more than the maximum clock skew ε, then the events are totally ordered, that is, for all m, n [N ], for any t, t′ In, if t + ε < t′, ∈ ∈ then (t, xn(t)) ⇝ (t′, xn(t′)). That is, m, n ∀ ∈ [N ]. t, t′ ∀ ∈ (cid:16) In. t + ε < t′(cid:17) ⇒ (cid:16) (cid:17) (t, xm(t)) ⇝ (t′, xn(t′)) . 3. If e is a message send event in an agent and f is the corresponding receive event by another agent, then we have e ⇝ f . 4. For any three events e, f , and g, if e ⇝ f and f ⇝ g, then e ⇝ g. ■ Setting ε = ∞ yields the classic instance of total asynchrony. The constraints on In (bounded and non-empty) are required in the continuous-time context and will be discussed more in the next section. Because the agents are synchronized within ε, it is not possible to 26 Figure 2.3 Two partially synchronous continuous concurrent timelines with ε = 0.5, and corresponding signals x and y. 
(Solid dot indicates signal value at discontinuity. C is a consistent cut but C′ is not.)

analyze all signals in global time simultaneously. The following definition of consistent cut captures plausible global states, that is, states that might be legitimate global states. Figure 2.3 shows two partially synchronous concurrent timelines generated by two agents. Every moment in each timeline corresponds to an event (t, xn(t)), n ∈ [2]. Thus, the following hold: (1, x1(1)) ⇝ (2.3, x1(2.3)), (2.3, x1(2.3)) ⇝ (2.94, x2(2.94)), (1, x2(1)) ⇝ (2.94, x2(2.94)), and (2.94, x2(2.94)) ⇝ (3, x1(3)).

Definition 8. Let (E, ⇝) be a distributed signal over N agents and S be the set of all events, defined as follows:

S = {(t, xn(t)) | xn ∈ E ∧ t ∈ In ∧ In ⊆ R+}.

A consistent cut C is a subset of S if and only if when C contains an event e, then it contains all events that happened before e. Formally, ∀e, f ∈ S. (e ∈ C) ∧ (f ⇝ e) ⇒ (f ∈ C). ■

From this definition and Definition 7 it follows that if (t′, xn(t′)) is in C, then C also contains every event (t, xm(t)) s.t. t + ε < t′. Note that due to time asynchrony, there exists an infinite number of consistent cuts, represented by C(χ), at any global time χ ∈ R+. This is due to the fact that there is an infinite number of time instances between any two local time instances t1 and t2 on some signal x. As a result, an infinite number of consistent cuts can be created.

A consistent cut C can be represented by its frontier front(C) = {(t1, x1(t1)), ..., (tN, xN(tN))}, in which each (tn, xn(tn)), where 1 ≤ n ≤ N, is the last event of agent An appearing in C. Formally:

∀n ∈ [N]. (tn, xn(tn)) ∈ C and tn = max{t ∈ In | ∃(t, xn(t)) ∈ C}.

Example. Assuming ε = 0.1 in Figure 2.3, it follows that all events below (thus, before) the solid arc form a consistent cut C with frontier front(C) = {(3, x1(3)), (2.94, x2(2.94))}. On the other hand, all events below the dashed arc do not form a consistent cut, since (2.3, x1(2.3)) ⇝ (3.1, x2(3.1)) and (3.1, x2(3.1)) is in the set C′, but (2.3, x1(2.3)) is not in C′.
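Restricted to cuts described by their frontiers, Definition 8 yields a simple membership test: in particular, the frontier times of any two agents may differ by at most ε, and every message received inside the cut must have its send event inside the cut as well. The small Python sketch below illustrates this test on the cuts C and C′ of Figure 2.3; the data layout, function name, and message list are illustrative assumptions, not part of our monitoring algorithm.

# Illustrative consistent-cut test for a cut given by its frontier (one local time per agent).
EPS = 0.1

def is_consistent(frontier, messages, eps=EPS):
    """frontier: dict agent -> last local time in the cut;
       messages: list of (sender, send_time, receiver, receive_time)."""
    times = list(frontier.values())
    # clock-skew rule: an event more than eps older than a frontier event
    # must already lie inside the cut on every agent
    if max(times) - min(times) > eps:
        return False
    # message rule: a receive event inside the cut requires its send event inside the cut
    for sender, t_send, receiver, t_recv in messages:
        if t_recv <= frontier[receiver] and t_send > frontier[sender]:
            return False
    return True

messages = [("A1", 2.3, "A2", 3.1)]
print(is_consistent({"A1": 3.0, "A2": 2.94}, messages))   # True  (cut C)
print(is_consistent({"A1": 2.2, "A2": 3.1}, messages))    # False (cut C': both rules violated)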
2.5 Signal Temporal Logic (STL)

Let AP be a set of atomic propositions. The syntax of signal temporal logic (STL) [79] is defined for infinite traces using the following grammar:

φ ::= p | ¬φ | φ ∧ φ | φ U_[a,b] φ

where p ∈ AP and U is the 'until' temporal operator. We view other propositional and temporal operators as abbreviations, that is, ⊤ = p ∨ ¬p (true), ⊥ = ¬⊤ (false), ◇_[a,b] φ = ⊤ U_[a,b] φ (eventually or F), and □_[a,b] φ = ¬◇_[a,b] ¬φ (always or G). We denote the set of all STL formulas by Φ_STL.

Let a trace σ = (x1, ..., xN) be a vector of N continuous-time and continuous-valued signals. In the context of STL, we express p as f(x1[t], ..., xn[t]) > 0, where (x1[t], ..., xn[t]) ∈ R^n is a vector of signal values at time t, and f : R^n → R is a function that evaluates a vector of signal values.

Figure 2.4 A trace σ generated by a system.

The infinite-trace semantics of STL is defined as follows. Let ⊨ be the satisfaction relation, and the satisfaction of formula φ by a trace σ at time t be:

(σ, t) ⊨ p             iff  f(x1[t], ..., xn[t]) > 0
(σ, t) ⊨ ¬φ            iff  ¬((σ, t) ⊨ φ)
(σ, t) ⊨ φ ∧ ψ         iff  (σ, t) ⊨ φ and (σ, t) ⊨ ψ
(σ, t) ⊨ φ U_[a,b] ψ   iff  ∃t′ ∈ [t + a, t + b] : (σ, t′) ⊨ ψ and ∀t′′ ∈ [t, t′] : (σ, t′′) ⊨ φ

For the sake of simplification, from this point onward, we write σ ⊨ φ if and only if (σ, 0) ⊨ φ holds. As an example of STL, given the trace σ shown in Figure 2.4, the STL formula φ = p U_[4,6.5] q holds at time 0, that is, σ ⊨ φ. However, φ does not hold after time 2, as in that case, q must hold after time 2 + 4 and before 2 + 6.5, which does not happen.

The STL semantics are defined over infinite signals, whereas a distributed signal E has a fixed duration (In is bounded), which is suited for online monitoring. Given a (completely synchronous) finite-duration signal x, we say it satisfies/violates φ iff every extension x.y, where y is an infinite signal, satisfies/violates φ. Otherwise, Unknown is returned by the monitor. The dot '.' here represents time concatenation.

2.6 Producer-Consumer Network

A producer-consumer network is a directed acyclic graph (DAG) G = (V, E), in which each vertex v ∈ V (a node) may be either a producer, a consumer, or both, based on its incoming/outgoing edges. A producer node only has outgoing edges, a consumer node only has incoming edges, and a producer/consumer node has both incoming and outgoing edges. Let Pred(v) denote the finite set of predecessor nodes from which v receives data, and Succ(v) denote the finite set of successor nodes which receive data from v. The set E of edges is represented as ordered pairs of vertices such that:

E = {(u, v) | v ∈ Succ(u)}.

An edge from u to v represents a stream of items flowing from u to v, in which case u is a producer (potentially also a consumer) and v is a consumer. A node v ∈ V where Pred(v) = ∅ is called a source, and a node u ∈ V where Succ(u) = ∅ is called a sink.

Figure 2.5 A producer-consumer network of 10 nodes.

Figure 2.5 depicts a producer-consumer network. The network represents a hierarchical monitoring system, in which v[1,4] are producers of events that are consumed and manipulated by nodes v[5,8]. Nodes v[5,8] then transmit the manipulated events to v9.

A producer (respectively, consumer) node v ∈ V may receive (respectively, emit) data at a set of possible input rates denoted by IRate(v) (respectively, possible output rates ORate(v)). Let Out(u, v) denote the outgoing data rate from node u into node v. For example, in Figure 2.5, the incoming data for v1 is received from vs, and the outgoing data is sent to v5 and v6. For every node v ∈ V, we define In(v) such that

In(v) = Σ_{u ∈ Pred(v)} Out(u, v).
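For illustration, the incoming rate of a node can be computed directly from the outgoing rates of its predecessors, as in the following small Python sketch. The edge rates and node names below are illustrative (a reduced version of the topology of Figure 2.5), not data from our case studies.

# Illustrative producer-consumer network: Out[(u, v)] is the data rate on edge u -> v.
Out = {("vs", "v1"): 100, ("vs", "v2"): 80, ("v1", "v5"): 60,
       ("v2", "v5"): 40, ("v5", "v9"): 90}

def predecessors(v):
    return {u for (u, w) in Out if w == v}

def in_rate(v):
    # In(v) = sum of Out(u, v) over u in Pred(v)
    return sum(rate for (u, w), rate in Out.items() if w == v)

print(predecessors("v5"), in_rate("v5"))   # {'v1', 'v2'} 100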
CHAPTER 3
RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED DISCRETE-EVENT SYSTEMS

In this chapter, we present two sound and complete solutions to the distributed runtime verification (RV) problem with respect to LTL formulas. In order to address the explosion of different interleavings, we adopt a practical assumption, namely, a finite skew between the local clocks of each pair of processes, which is ensured by a fault-proof clock synchronization system, such as NTP [88]. Both approaches utilize a fault-proof central monitor. To this end, we consider discrete-event systems [20], where the discrete states in the said systems are transitioned via events. The events can be message send events, message receive events, or local processing events. As stated in Chapter 1, the agents in these systems do not share a global clock or memory while attempting to perform a joint task. However, a clock synchronization algorithm (see Section 2.3) guarantees a maximum clock skew among the agents, thus allowing partial synchrony. In other words, we make the following assumptions:

• The systems under observation are discrete-event systems. That is, for every agent, within any time period, there is a finite number of event executions. These events can be internal to agents (e.g., variable updates), message send events, or message receive events.
• There is a bounded skew ε between the local clocks of every pair of processes, guaranteed by a fault-proof clock synchronization algorithm (e.g., NTP). This means that time instants from different local clocks within ε are considered concurrent, i.e., it is not possible to determine their order of occurrence. This setting constitutes partial synchrony, which does not assume a global clock but limits the impact of asynchrony within clock drifts.

In the following sections, we elaborate on our runtime verification approach for partially synchronous distributed systems using an automata-based technique and a progression-based formula rewriting technique.

3.1 Problem Statement

Given a distributed computation (E, ⇝), as defined in Definition 2, and an LTL formula φ, we say (E, ⇝) satisfies φ iff there exists a trace α, defined by a sequence of frontiers in (E, ⇝), that satisfies φ. Formally, the evaluation of the LTL formula φ with respect to (E, ⇝) in the finite semantics is the following:

Problem Statement (Monitoring of Distributed Systems). Given a distributed computation (E, ⇝), a valid sequence of consistent cuts is of the form C0 C1 C2 ···, where for all i ≥ 0, we have (1) Ci ⊂ Ci+1, and (2) |Ci| + 1 = |Ci+1|. Let C denote the set of all valid sequences of consistent cuts. We define the set of all traces of (E, ⇝) as follows:

{front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}.

The evaluation of the LTL formula φ with respect to (E, ⇝) in the finite semantics is the following:

[(E, ⇝) ⊨3 φ] = {[α ⊨3 φ] | α ∈ {front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}}

and

[(E, ⇝) ⊨F φ] = {[α ⊨F φ] | α ∈ {front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}}.

This means that evaluating a distributed computation against a formula yields a set of verdicts, because a computation may contain multiple traces. It should be noted that throughout this chapter, (E, ⇝) is used to denote a distributed computation.
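To make the problem statement concrete, the following brute-force Python sketch — purely illustrative, and not the SMT-based procedure developed in this chapter — enumerates every valid sequence of consistent cuts of a tiny two-process computation, builds the corresponding frontier traces, and collects the set of verdicts of a simple ordering property. The event data, helper names, and the property checked are all hypothetical.

# Two concurrent events (ε = 1.5 makes them unordered), one per process.
from itertools import permutations

events = {
    "e1": (1, 1.0, {"a"}),   # (process, local_time, propositions)
    "e2": (2, 1.2, {"b"}),
}
EPS = 1.5

def happened_before(x, y):
    px, tx, _ = events[x]
    py, ty, _ = events[y]
    return tx < ty if px == py else tx + EPS < ty   # rules of Definition 2

def valid_sequences():
    """Yield event orderings whose every prefix is a consistent cut."""
    for order in permutations(events):
        if all(not happened_before(order[j], order[i])
               for i in range(len(order)) for j in range(i + 1, len(order))):
            yield order

def frontier_trace(order):
    """Map each prefix (consistent cut) to the propositions on its frontier."""
    trace, latest = [], {}
    for e in order:
        proc, _, labels = events[e]
        latest[proc] = labels
        trace.append(set().union(*latest.values()))
    return trace

def a_no_later_than_b(trace):
    """Illustrative property: a is observed no later than b."""
    for state in trace:
        if "b" in state and "a" not in state:
            return False
        if "a" in state:
            return True
    return False

print({a_no_later_than_b(frontier_trace(o)) for o in valid_sequences()})
# -> {True, False}: the two interleavings of the concurrent events disagree.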
3.2 Formula Progression for LTL

Because of the existence of a total ordering of events in a synchronous system, verification of a computation may be accomplished in a state-by-state manner [10]. However, in a partially synchronous system, such an event ordering is not possible. A distributed computation (E, ⇝) may have different event orderings governed by different event interleavings. As a result, multiple verdicts might be obtained from the same distributed computation (E, ⇝). To explore these verdicts, we present a formula progression-based monitoring approach that, if possible, partially evaluates a formula on the current computation and, depending on the verdict, provides a rewritten formula to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored to be φ = ◇a → ◇b. Now, if in some trace in a computation the monitor observes a, then for the extensions of the computation it is enough to monitor the rewritten formula φ′ = ◇b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formula progression.

Definition 9. A progression function Pr : Σ* × Φ_LTL → Φ_LTL is one where for all finite traces α ∈ Σ*, infinite traces σ ∈ Σ^ω, and formulas φ ∈ Φ_LTL, we have: ασ ⊨ φ if and only if σ ⊨ Pr(α, φ). ■

Our method and the traditional rewriting method [59] differ primarily in that our function Pr accepts finite traces as input, whereas the algorithm in [59] rewrites the input LTL formula in a state-by-state manner. As a result, it is not feasible to rewrite using the fixed-point representation of temporal operators. The motivation for our method is that a given distributed computation is divided into a number of segments, so that an SMT query is used to verify each segment. A state-by-state approach would generate excessive amounts of SMT queries, rendering the approach inefficient and unscalable.

Remark 1. It is straightforward to see that for any α ∈ Σ* and φ ∈ Φ_LTL, if the progression function returns a non-trivial formula, which we denote by Pr(α, φ) = φ′ for some φ′ ∈ Φ_LTL, then the verdict of monitoring is unknown.

Atomic propositions. Let φ = p for some p ∈ AP. The verdict is provided depending upon whether or not p ∈ α(0). This is the only case where the output of Pr cannot be a rewritten formula; the possible verdicts are either true or false:

Pr(α, φ) =
  true   if p ∈ α(0)
  false  if p ∉ α(0)

Negation. Let φ = ¬ϕ. We have Pr(α, φ) = ¬Pr(α, ϕ).

Disjunction. Let φ = φ1 ∨ φ2. If either sub-formula φ1 or φ2 evaluates to false, then the progression of φ becomes the other sub-formula φ2 or φ1, respectively, since that will be the only sub-formula responsible for the verdict of all future computations:

Pr(α, φ) =
  true       if Pr(α, φ1) = true ∨ Pr(α, φ2) = true
  false      if Pr(α, φ1) = false ∧ Pr(α, φ2) = false
  φ′2        if Pr(α, φ1) = false ∧ Pr(α, φ2) = φ′2
  φ′1        if Pr(α, φ2) = false ∧ Pr(α, φ1) = φ′1
  φ′1 ∨ φ′2  if Pr(α, φ1) = φ′1 ∧ Pr(α, φ2) = φ′2

Next operator. Let φ = ○ϕ. The verdicts true, false, and ϕ′ can only be reached if α^1 ≠ ε. Otherwise, i.e., if we are at the last event in the trace, the progression of φ becomes ϕ, implying that ϕ must hold at the beginning of the future extension:

Pr(α, φ) =
  true   if α^1 ≠ ε ∧ Pr(α^1, ϕ) = true
  false  if α^1 ≠ ε ∧ Pr(α^1, ϕ) = false
  ϕ′     if α^1 ≠ ε ∧ Pr(α^1, ϕ) = ϕ′
  ϕ      if α^1 = ε

Always and eventually operators. Progression of the temporal operator 'always', □ (resp. 'eventually', ◇), may yield false (resp. true) or remain unchanged:

Pr(α, □ϕ) =
  false  if [α ⊨F □ϕ] = ⊥
  □ϕ     otherwise

Pr(α, ◇ϕ) =
  true   if [α ⊨F ◇ϕ] = ⊤
  ◇ϕ     otherwise

Note that the semantics of FLTL is not frequently used, due to LTL3 being generally more expressive, as shown in [11]. However, LTL3 cannot be used to construct the progression rules. To be more precise, the '?' (unknown) verdict in LTL3 semantics would raise additional and unnecessary complications in the progression rules, as this verdict does not provide any additional information as far as our progression-based approach is concerned. In fact, if progression results in a formula, it represents the '?' verdict in LTL3. Therefore, we use FLTL for specifying the progression rules without any loss of generality, as shown later in the proof of Lemma 2.
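The rules above translate directly into a recursive rewriting procedure, sketched below in Python for the cases introduced so far; the until operator, whose rule is given next, is handled by an analogous case analysis. This is an illustrative sketch only: it assumes the tuple encoding of formulas and the fltl_eval helper sketched in Section 2.1.2, and the sentinel values TRUE and FALSE stand for the verdicts true and false.

# Illustrative progression function Pr(α, φ) over a finite trace (list of sets of propositions).
TRUE, FALSE = ("true",), ("false",)

def progress(trace, phi):
    """Return TRUE, FALSE, or a rewritten formula to monitor on the extension."""
    kind = phi[0]
    if kind == "ap":
        return TRUE if phi[1] in trace[0] else FALSE
    if kind == "not":
        sub = progress(trace, phi[1])
        if sub == TRUE:  return FALSE
        if sub == FALSE: return TRUE
        return ("not", sub)
    if kind == "or":
        left, right = progress(trace, phi[1]), progress(trace, phi[2])
        if TRUE in (left, right): return TRUE
        if left == FALSE:         return right     # covers the both-false case as well
        if right == FALSE:        return left
        return ("or", left, right)
    if kind == "next":
        rest = trace[1:]
        # at the last event, defer the obligation to the next segment
        return phi[1] if not rest else progress(rest, phi[1])
    if kind == "always":
        return FALSE if not fltl_eval(trace, phi) else phi
    if kind == "eventually":
        return TRUE if fltl_eval(trace, phi) else phi
    raise ValueError(f"unsupported operator {kind}")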
Until operator. Let φ = φ1 φ2. Recall that φ1 φ2 = φ2 (φ1 ∨ ∧ (φ1 U U φ2)). We U divide the U formula into two parts, one with globally ( φ1) and the other eventuality ( φ2). These sub-formulas are evaluated independently, and the verdicts of each are used to establish the progression for the U operator. However, for the case when both φ1 and φ2 occur in the same computation, we cannot reach a verdict without taking the order of occurrence of these sub-formulas into account. That is, on a given finite trace α, if φ2 holds in α(i) (denoted iφ2) and φ1 holds throughout in all states from α(0) to α(i i−1φ1), then the progression of φ becomes true. If this is not the case, and φ1 does not 1) (denoted − hold in α, the progression of φ becomes false, since this signifies a break from the streak of φ1 required for φ to hold. The progression of φ remains unchanged if φ1 holds throughout α, but φ2 does not hold anywhere: 36 α α′ α′′ ∅ ∅ ∅ ∅ r ∅ ∅ q p Figure 3.1 Progression example. Pr(α, φ) =    true false Pr(α, φ1) Pr(α, φ1) U Pr(α, φ2) if i ∃ ∧ if [α ∧ if [α ∧ if [α ∈ [α [0, α | | − 1].[α =F | iPr(α, φ2)] = ⊤ =F | i−1Pr(α, φ1)] = ⊤ Pr(α, φ1)] = =F | not the first case ⊥ Pr(α, φ2)] = =F | not the second case ⊤ =F | Pr(α, φ1)] = ⊤ [α =F | ∧ Pr(α, φ2)] = ⊥ Example. Consider the formula φ = r p ( ¬ U → q), which can be broken into sub- formulas φs = { r, q, q, p } , according to our progression rules. Consider the trace in Figure 3.1 divided into three segments. In the first segment α, neither p, q nor r are present, and as far as the laws of the progression function defined above, φ remains unchanged for the next segment; i.e., Pr(α, φ) = φ. In the second segment α′, proposition r is observed, this satisfies sub-formula r the progressed formula becomes p ¬ U q; i.e., Pr(α′, φ) = q. In p ¬ U the next segment α′′, proposition q occurs before p. This falls under the first case of the until progression operator. Since q happens after a streak of p, we arrive at the verdict true; ¬ i.e., Pr(α′′, p ¬ U q) = true. Put it another way, Pr(αα′α′′, φ) = true. Lemma 2. Given an LTL formula φ, and a finite and infinite trace α tively, trace ασ satisfies φ if and only if σ satisfies Pr(α, φ). Formally, Σ∗, σ ∈ ∈ Σω respec- Proof. We distinguish the following cases: [ασ =F φ] | ⇐⇒ [σ =F Pr(α, φ)] | 37 Case 1: First, we consider the base case of this proof, where the formula is an atomic proposition, that is, φ = p. ) Let us first consider that p is observed on the first state of ασ. This implies, [ασ ( ⇒ =F φ] yields true, and Pr(α, φ) yields | true. . Therefore, [σ ⊤ =F Pr(α, φ)] must also yield | Now, let us consider that p is not observed on the first state of ασ. This implies, [ασ φ] yields false, and Pr(α, φ) yields . Therefore, [σ ) Let us first consider that [σ =F | =F Pr(α, φ)] must also yield false. | ⊥ =F Pr(α, φ)] yields true. This implies, Pr(α, φ) yields | =F φ] yields true. Therefore, p must have been observed on the first state of | ( ⇐ , and [ασ ⊤ ασ. Now, let us consider that [σ , ⊥ =F φ] yields false. Therefore, p must not have been observed on the first state of | =F Pr(α, φ)] yields false. This implies, Pr(α, φ) yields | and [ασ ασ. Case 2: Assume that the proof has been established for the case when the formula is φ = ϕ. Now, we consider the case where the formula is φ = ϕ. ¬ We can say [ασ =F | ¬ ϕ] is equivalent to [ασ =F ϕ] according to the finite-trace se- | mantics of LTL. 
We can also say [σ =F | Pr(α, ϕ) is defined as a progression rule. Furthermore, [σ ϕ)] is equivalent to [σ ¬ ¬ =F Pr(α, | Pr(α, ϕ)] since Pr(α, ϕ)] is ¬ ¬ =F | Pr(α, ϕ) = ¬ equivalent to ¬ [σ ¬ =F Pr(α, ϕ)] according to the finite-trace semantics of LTL. | Based on our assumption, the proof has already been established for [ασ ⇐⇒ =F Pr(α, ϕ)], and by extension, | =F ϕ] | [σ =F Pr(α, ϕ)]. Therefore, | [ασ ϕ] [σ =F | ¬ ⇐⇒ =F Pr(α, | =F ϕ] | [σ ⇐⇒ ¬ [ασ ¬ ϕ)] ¬ Case 3: Assume that the proof has been established for the case when the formula is φ = ϕ. Now, we consider the case where the formula is φ = ϕ. Let us first consider the case where the length of the trace α is 1, that is, α1 = 0. In this particular case, [ασ | | Pr(α, ϕ) = ϕ; which implies, [σ =F | ϕ] is equivalent to [σ | =F ϕ]. Furthermore, | =F Pr(α, ϕ)] is equivalent to [σ | =F ϕ]. Therefore, | α = 1 and | 38 [ασ ϕ] =F | Now, let us consider the case where the length of the trace α is longer than 1, that =F Pr(α, ϕ)]. | ⇐⇒ [σ is, [σ α 1 and α1 1. In this case, [ασ |≥ | | =F Pr(α, ϕ)] is equivalent to [σ | |≥ =F Pr(α1, ϕ)]. | =F | ϕ] is equivalent to [α1σ =F ϕ], and | Based on our assumption, the proof has already been established for [α1σ =F ϕ] | ⇐⇒ [σ =F Pr(α1, ϕ)]. Therefore, [ασ | =F | ϕ] ⇐⇒ Case 4: Assume that the proof has been established for the cases when the formulas are [σ =F Pr(α, ϕ)]. | φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∨ Based on our assumption, the proof has already been established for [ασ φ2. ⇐⇒ =F Pr(α, ϕ2)]. Therefore, we can derive the | =F ϕ1] | [σ =F Pr(α, ϕ1)] and [ασ | =F ϕ2] | ⇐⇒ [σ following: [ασ =F (φ1 | ∨ φ2)] ⇐⇒ ⇐⇒ ⇐⇒ [ασ [σ [σ [ασ ∨ =F φ1] | =F Pr(α, φ1)] | =F Pr(φ1 | ∨ ∨ φ2)]. =F φ2] | [σ =F Pr(α, φ2)] | Case 5: Now, we consider the case where the formula is φ = φ1 φ2. We prove this by U induction: Base Case: α = 0. | | [ασ =F φ] | ⇐⇒ ⇐⇒ [σ [σ =F Pr(α, φ)] | =F φ] | 39 Hypothesis Step: α = k. | | [ασ =F φ1 | [ασ ⇐⇒ φ2] (cid:16) U =F | (cid:0)φ1 φ2 ∨ ∧ (φ1 (cid:16) φ1 U ∧ φ2)(cid:1)(cid:17) ] (φ1 U (cid:17) φ2) ] ⇐⇒ [ασ =F φ2] | ∨ ⇐⇒ [ασ =F φ2] | ∨ [ασ =F | [ασ =F φ1] | ∧ [α1σ =F φ1 | (cid:17) ] φ2 U ⇐⇒ [ασ =F φ2] | ∨ [ασ =F φ1] | ∧ [α1σ (cid:16) φ2 =F | =F φ2](cid:1) | ∨ ∨ (cid:0)φ1 . . . ∨ (cid:17) ∧ (cid:16) φ2)(cid:1)(cid:17) (cid:19) ] (φ1 U [ασ =F φ1] | ∧ ⇐⇒ [α1σ (cid:16) [ασ [ασ =F φ2] | ∨ =F φ1] | ∧ . . . ∧ (cid:0)[ασ =F φ1] | [αk−2σ [α1σ ∧ =F φ1] | [(αk−1σ ∧ =F φ2] | =F φ1] | ∧ . . . [αk−1σ =F φ1] | [αkσ =F φ1 | U φ2] ∨ (cid:17) [ασ =F φ2] | [αk−1σ ⇐⇒ . . . ∧ =F φ1] | [σ =F φ1 | U ∧ φ2] ∧ [α1σ (cid:17) [ασ =F φ1] | ∧ ∨ (cid:17) =F φ2] | . . . ∨ ∨ (cid:16) [ασ =F φ1] | ∧ (cid:16) (cid:18) ∧ (cid:16) Inductive Step: α = k + 1 Trivially expanded from the above expansion. | | [ασ =F φ1 | U φ2] ⇐⇒ (cid:16) [ασ [ασ =F φ2] | ∨ =F φ1] | ∧ . . . ∧ [ασ =F φ1] | [αk−1σ ∧ =F φ1] | ∧ [α1σ =F φ2] | [αkσ =F φ1] | [σ =F φ1 | U ∧ (cid:17) φ2] (cid:17) . . . ∨ ∨ (cid:16) Now, in order for [ασ φ2] to yield true, there must be a k . . . φ1 ∧ ∧ αk−1σ =F φ1 | =F φ1 | ∧ U =F φ2], that is, αkσ | 1 such that [ασ =F | ≥ [ασ =F φ1 | U φ2] ⇐⇒ [ k ∃ αkσ =F φ1 | ∧ . . . ∧ αk−1σ =F φ1 | ∧ 1 . α0σ ≥ =F φ2] | ⇐⇒ [ k ∃ ≥ 1 . ασ =F | kφ2 ασ =F | ∧ k−1φ1] Note that the above recursive definition of Until allows us to evaluate any until for- mula, and by extension, any always ( φ = φ ) and eventually ( φ = U ⊥ φ) formula. ⊤ U Therefore, we can evaluate any sub-formula using this fixed point representation of until. 
■ 40 a1 q2 a3 q3 q0 q1 a2 a4 a5 q4 qr (a) a1a2a3 a3 q3 a3a1a2 a1 q2 a2a3a1 q0 q1 a2 a4 a5 q4 qr (b) a1a2a3 q3 a3a1a2 a1 q2 a2a3a1 q0 q1 a2 a4 a5 q4 qr (c) Figure 3.2 Removing non-loop cycles in an LTL3 Monitor. 3.3 SMT-based Solution In this section, we go into further detail about our approach to distributed monitoring utilizing the two previously discussed monitoring techniques: (1) automata-based approach, and (2) progression-based approach. 3.3.1 Overall Idea Automata-based approach. Recall from Figure 1.5 that monitoring a distributed com- putation may result in multiple verdicts depending upon different ordering of events. In other words, given a distributed computation ( E , ⇝) and an LTL formula φ, different ordering of events may reach different states in the monitor automaton φ = (Σ, Q, q0, δ, λ) (as defined M in Definition 1). In order to ensure that all possible verdicts are explored, we generate an SMT instance for (1) the distributed computation ( E , ⇝), and (2) each possible path in the LTL3 monitor. Thus, the corresponding decision problem is the following: given ( E , ⇝) and a monitor path q0q1 qm in an LTL3 monitor, can ( E · · · , ⇝) reach qm? If the SMT instance is satisfiable, then λ(qm) is a possible verdict. For example, for the monitor in Figure 2.1, we consider two paths q∗ 0q⊥ and q∗ 0q⊤ (and, hence, two SMT instances). Thus, if both instances turn out to be unsatisfiable, then the resulting monitor state is q0, where λ(q0) =?. We note that LTL3 monitors may contain non-self-loop cycles. In order to simplify the SMT instance creation process (for each possible path in the LTL3 monitor), we collapse each 41 M φ = (Σ, Q, q0, δ, λ) φ = (Σ, Q, q0, δ′, λ) ′ Data: Result: Let CP be the set of all possible paths containing cycles δ′ δ foreach q M ← Q do ∈ foreach q sm δ′(q, sm end CP do sn q −→ q ← ∈ −→ · · · sn) · · · qi sk −→ qj | q sm −→ · · · qi sk −→ qj · · · sn −→ q CP } ∈ do ∈ { end foreach qm s qn −→ if m > n then δ′(qm, s) ← ∅ end end return φ M Algorithm 3.1 Non-Self Loop Cycle Removal Algorithm non-self-loop cycle into one state with a self-loop labeled by the sequence of events in the cycle using Algorithm 3.1. As an example, in Figure 3.2, Algorithm 3.1 first takes an LTL3 monitor (Figure 3.2a) and adds the necessary self-loops (Figure 3.2b). Then it eliminates all non-self-loop cycles by removing transitions from states with higher identifiers to states with lower identifiers in cycles (Figure 3.2c). The non-deterministic nature of the final automata ensure that all the transitions and the accepting language of the automata are preserved. Lemma 3. Let M φ = (Σ, Q, q0, δ, λ) be the monitor automaton for LTL formula, φ, and ′ φ = (Σ, Q, q0, δ′, λ) be the monitor automaton with no non-self loop cycles, obtained from an and a initial state, φ. Given a finite trace, α = a1a2 M applying Algorithm 3.1 on M · · · Q, we prove that λ(δ(q, α)) = λ(δ′(q, α)). q ∈ Proof. We distinguish the following cases: Case 1: First we show, λ(δ(q, α)) λ(δ′(q, α)) Let α = a1a2 an, where · · · → i ∀ λ(δ′(q, α)), that is, Q . λ(δ(q, α)) = ⇒ Σ. Algorithm 3.1 removes non-self loop ∈ ∀ α, q ∀ [1, n].ai ∈ ∈ cycles by removing a transition such that the corresponding transition of δ(q, ai), δ′(q, ai), where i ∈ [1, m] does not exist. This is such that k ∃ ∈ [1, i] . q′ ai−k −−→ · · · q ai −→ q′. This 42 transition is same as δ′(q′, ai−k · · · ai) = q′ which was one of the added self-loops. The rest of the transitions are maintained such that δ(q, ai) = δ′(q, ai), where q Q and i [1, m]. 
∈ ∈ Case 2: Now, we show, λ(δ′(q, α)) λ(δ(q, α)) Let α = a1a1 · · · by i ∃ q ai −→ [1, n], k [1, n ∈ q′ ai+1 −−→ · · · ∃ ∈ ai+k −−→ − q in → i ∀ an, where ∈ i] . δ′(q, aiai+1 λ(δ(q, α)), that is, α, ∀ q ∀ ∈ [1, n].ai ∈ Σ. A self-loop in ′ φ M Q . λ(δ′(q, α)) = ⇒ can be represented ai+k) = q. In another words, there exists a path · · · φ. The rest of the non-self loop transitions are the same, such M that δ′(q, ai) = δ(q, ai), where q Q and i [1, m]. ∈ ∈ ■ Progression-based approach. Due to the existence of a total ordering of events in a synchronous system, verification on a computation may be carried out using a state-by-state methodology [10]. A partially synchronous system, however, makes such an ordering of events impossible. Varying interleavings of events can lead to different orderings of events in a distributed computation ( E , ⇝). Therefore, it is possible to obtain multiple verdicts on the same distributed computation ( E , ⇝). To explore these verdicts, we provide a formula progression monitoring approach that, if feasible, partially evaluates a formula on the current computation and, in response to the verdict, offers a rewritten formula that is to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored as, φ = (a → b). Now, if in some trace in a computation, the monitor observes a, then for the extensions of computations, it is enough to monitor the rewritten formula, φ′ = b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formula Progression, which we discuss in length later on. In the next two subsections, we present the SMT entities and constraints with respect to one monitor path and a distributed computation. 3.3.2 SMT Entities SMT entities represent the sub-formulas of an LTL formula and a distributed computa- tion. After the verdicts from all the sub-formulas are generated, we construct our rewritten 43 formula by attaching the said verdicts to their corresponding parent formulas in the parse tree and then performing an in-order traversal starting from the root of the parse tree. At the end of the traversal, the resulting formula is, in fact, the progression for the next computation. We now introduce the entities that represent a path in an LTL3 monitor φ = (Σ, Q, q0, δ, λ) for LTL formula φ and distributed computation ( E M noted that the SMT entities in this subsection are used in both the automata-based and the , ⇝). It should be progression-based approaches. Monitor automaton. Let q0 s0 −→ q1 s1 −→ · · · (qj sj −→ qj)∗ sm−1 −−−→ · · · qm be a path of monitor φ, which may or may not include a self-loop. We include a non-negative integer variable M ki for each transition qi sj −→ self-loop qj si −→ qi+1, where i [0, m − ∈ 1] and si ∈ Σ. This is also true for the qj, for which we include a non-negative interger kj. Distributed computation. In our SMT encoding, the set of events, are represented by a E bit vector, where each bit corresponds to an individual event in the distributed computation, ( E an , ⇝). We conduct a pre-processing of the distributed computation, during which we create matrix, hbSet to incorporate the additional happen-before relations obtained by E × E the clock-synchronization algorithm. Afterwards, we populate the hbSet with 0’s and 1’s, such that hbSet[i][j] = 1 if [i] ⇝ [j], and hbSet[i][j] = 0 otherwise. 
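For illustration, this pre-processing step can be sketched as follows; the event layout, message representation, and function name are assumptions made for presentation only. The matrix is seeded with program order, the message edges, and the ε-rule of Definition 2, and is then closed transitively.

# Illustrative construction of the hbSet happened-before matrix.
def build_hbset(events, messages, eps):
    """events: list of (process, local_time); messages: list of (send_idx, recv_idx)."""
    n = len(events)
    hb = [[0] * n for _ in range(n)]
    for i, (pi, ti) in enumerate(events):
        for j, (pj, tj) in enumerate(events):
            if i == j:
                continue
            if pi == pj and ti < tj:      # program order within a process
                hb[i][j] = 1
            elif ti + eps < tj:           # partial-synchrony rule
                hb[i][j] = 1
    for s, r in messages:                 # a send happened-before its receive
        hb[s][r] = 1
    for k in range(n):                    # transitive closure (Floyd-Warshall style)
        for i in range(n):
            for j in range(n):
                if hb[i][k] and hb[k][j]:
                    hb[i][j] = 1
    return hb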
We introduce a function µ : AP E × → { E E true, false } in order to establish a relation between each event and the atomic propositions in it. In the event that other variables or constants are used in defining the predicates (e.g. x1 + x2 an uninterpreted function ρ : Z≥0 ≥ → 2), µ is constructed accordingly. Finally, we introduce 2E that identifies a sequence of consistent cuts from to for reaching a verdict, while satisfying a number of given constraints explained in {} {E} Subsection 3.3.3. 3.3.3 SMT Constraints We next go on to the SMT constraints after defining the requisite SMT entities. The SMT constraints for consistent cuts that are enforced on both the automata-based and the 44 progression-based approaches are first defined. Afterwards we define the SMT constraints that are more dependant on the methodology. Consistent cut constraints over ρ. In order to ensure that the uninterpreted function ρ identifies a sequence of consistent cuts, we enforce certain consistent cut constraints. The first constraint enforces that each element in the range of ρ is in fact a consistent cut: i ∀ ∈ [0, m]. e, e′ ∀ ∈ E (cid:16) . (e′ ⇝ e) (e ∧ ∈ (cid:17) ρ(i)) (cid:16) e′ ∈ → (cid:17) ρ(i) Next, we enforce that the sequence of consistent cuts identified by ρ start from an empty set of events, and each successor cut of the sequence contains one more new event than its predecessor. | Finally, we ensure that each successive consistent cut is immediately reachable in ( ∈ | i ∀ [0, m]. ρ(i + 1) = ρ(i) | | + 1 , ⇝) by E enforcing a subset relation: i ∀ ∈ [0, m]. ρ(i) ρ(i + 1) ⊆ We determine if a series of consistent cuts conforms to the specification after it has been created. This is done using (1) progression-based approach, where the LTL formula is rep- resented by a SMT constrain and (2) LTL3 automata-based approach, where a path on the automata is represented as an SMT constraint. This is repeated for all sub-formulas of the original LTL formula and all paths in the LTL3 automata respectively as discussed below. Let C represent for the conjunction of the aforementioned constraints. Recall that there is only one valid path that is relevant to this conjunction C. Since there may be multiple paths in the monitor, we replicate the above constraints for each such path. Suppose there are n such paths and let C1, C2, . . . , Cn be the corresponding SMT constraints for these n paths. We include the following constraint: This means that if the SMT instance above satisfiable, then a valid path exists. C1 C2 ∨ ∨ C3 ∨ · · · ∨ Cn 45 Constraints for LTL progression over ρ. Given a distributed computation ( , ⇝ E ), the aforementioned constraints may provide a valid series of consistent cuts that may result in multiple verdicts depending on how the concurrent events are ordered. Therefore, while evaluating an LTL formula on ( E , ⇝), all potential outcomes are investigated in order to prevent false positives. To achieve this, we examine the sequence of consistent cuts C0C1C2 · · · Cm interpreted by the uninterpreted function ρ(m), looking for both satisfaction and violation. Note that applying our progression rules to monitor any LTL formula will cause it to eventually monitor sub-formulas that only include atomic propositions, globally, and eventually temporal operators: φ = p φ = ϕ φ = ϕ front(ρi) AP (satisfaction, i.e., = p, for p | [0, m]. front(ρi) ∈ ∈ [0, m]. 
front(ρi) i ∃ i ∃ ∈ = ϕ (violation, i.e., ̸| = ϕ (satisfaction, i.e., | ) ⊤ ) ⊥ ) ⊤ Situations to the contrary will lead to a rewritten formula that will go on to the following segment. In general, the verdict for any LTL formula will be derived using our progression rules in Section 3.2. 3.4 Optimization We employ several optimization techniques in our implementation to speed up and im- prove the monitoring process. In this section, we discuss two crucial optimization techniques, as well as their impact on run time. 3.4.1 Segmentation of Distributed Computation RV is known to be an NP-complete problem in the number of processes in a distributed setting [53]. The complexity exhibits even more exponential blowup during verifying for- mulas with nested temporal operators. In order to cope with this complexity, we divide our computation into smaller segments, (seg1, ⇝)(seg2, ⇝) albeit more SMT problems. Given a distributed computation ( (segl/g, ⇝) to create smaller, , ⇝) of length l, we divide · · · E it into l g smaller segments length g. The set of events in segment j, where j [1, l g ], is the ∈ 46 following: segj = (cid:110) en τ,σ,ω | σ 0, (j [max { ∈ 1) g , j ε } − × − × g] ∧ n ∈ (cid:111) [1, N ] Note that each segment (barring seg0 the previous segments ending point. This creates an overlap of ε time units between each ) has to be constructed starting at ε time units before pair of adjacent segments. Doing so ensures that no pair of possible concurrent become non-concurrent due to the splits caused by segmentation. Therefore, dividing the actual computation into segments does not have any effect on the final verdict of the said computa- tion. We also use parallelization to make our algorithm perform faster, while utilizing most of the computation power modern processors are capable of handling. Lemma 4. A distributed computation, ( , ⇝), of length l satisfies an LTL formula, φ, if and only if the distributed computation, ( E E , ⇝), is divided into l g segments of length g satisfies φ using the automata-based approach. That is, Given a distributed computation ( , ⇝) of E length l divided into l g segments of length g, the evaluation of the LTL formula φ on, by the automata-based approach is equal, i.e., [(seg1.seg2. .seg l g , ⇝) · · · ⇐⇒ = [(seg1.seg2. .seg l g , ⇝) =3 φ] | =3 φ], that is, | α { =3 φ | | [( , ⇝) =3 φ] | , ⇝) =3 φ] | E Proof. Let us assume [( E =3 φ | α = , ⇝) Tr( E ∈ ) Let Ck be a consistent cut such that Ck is in Tr( E Tr(seg1.seg2. α { · · · } ̸ α | , ⇝) } ∈ ( ⇒ ) for some k · · · .seg l g [0, ]. This implies that the frontier of Ck, front(Ck) seg1 and front(Ck) ̸⊆ . However, this is not possible, as according to the seg- ̸⊆ , ⇝), but not in Tr(seg1.seg2. .seg l g , ⇝ · · · ∈ |E| and front(Ck) and · · · seg2 mentation construction, there must be a segj Therefore, such Ck cannot exist, and seg l g ̸⊆ α =3 φ | . By extension, [( E } , ⇝) { α ∈ | =3 φ] | ⇒ Tr(seg1.seg2. .seg l g , ⇝) · · · ( ⇐ , ⇝) for some k ) Let Ck be a consistent cut such that Ck is in Tr(seg1.seg2. [0, ]. This implies, front(Ck) E [1, l ⊆ g ]. However, this is not possible due to the fact that |E| ∈ Tr( j ∈ where 1 j ≤ such that front(Ck) segj . ⊆ l g ≤ Tr( E , ⇝) } ⊆ { [(seg1.seg2. · · · .seg l g · · · α =3 φ | .seg l g | , ⇝) α ∈ =3 φ] | , ⇝), but not in segj and front(Ck) for some ̸⊆ E j ∀ ∈ [1, l g ] . segj ⊆ E . Therefore, 47 ̸ such Ck cannot exist, and | . By extension, [(seg1.seg2. { α =3 φ | α Tr(seg1.seg2. , ⇝) ∈ .seg l g · · · =3 φ] | .seg l g , ⇝) , ⇝) [( E ⇒ α α =3 φ | } ⊆ { ∈ =3 φ]. 
Therefore, | | · · · Tr( [( E E , ⇝) , ⇝) } =3 φ] | [(seg1.seg2. .seg l g , ⇝) =3 φ]. ■ | · · · ⇐⇒ Lemma 5. A distributed computation ( E , ⇝) of length l satisfies an LTL formula φ if and only if the distributed computation, ( E , ⇝), is divided into l g segments of length g satisfies φ using the progression-based approach. That is, , ⇝) [( E =F φ] | ⇐⇒ [(seg1.seg2. .seg l g , ⇝) =F φ] | · · · 3.4.2 Parallelized Monitoring Clusters of computers with several processing cores and processors are used by many cloud services. They can now create high-performance parallel/distributed applications and handle huge data rates as a result. Utilizing the extensive infrastructure should also be possible for monitoring such applications. In light of this, we will now talk about parallelizing our SMT-based monitoring technique. Let G be a sequence of g segments G = seg1seg2 · · · segg . For each computer core that is available, a task queue will be established.The segments will then be distributed evenly among all of the queues so that each core may independently monitor its queue. However, merely dividing up all the segments across cores will not guarantee a reliable outcome. For example, consider formula φ = a U b and two segments, seg1 and seg2 across two cores, Cr1 and Cr2, respectively. The monitor operating on Cr2 must be aware of the outcome of the monitor operating on Cr1 in order to render the proper verdict. In a scenario, where Cr1 observes one or more a in seg1 ¬ , a violation must be reported even if Cr2 does not observe b and no a. Generally speaking, the temporal order of events makes independent evaluation ¬ of segments impossible for LTL formulas. Of course, some formulas such as safety (e.g., p) and co-safety (e.g., q) properties are exceptions. For our automata-based approach, we address this problem in two steps. Let φ = M (Σ, Q, q0, δ, λ) be an LTL3 monitor. Our first step is to create a 3-dimensional reachability 48 matrix RM by solving the following SMT decision problem: given a current monitor state qj ∈ and j, k Q and segment segi , can this segment reach monitor state qk Q, for all i [1, g], ∈ ∈ [0, Q | | − ∈ 1]. If the answer to the problem is affirmative, then we mark RM [i][j][k] with true, otherwise with false. This is illustrated in Figure 3.3 for the monitor shown in Figure 2.1, where the grey cells are filled arbitrarily with the answer to the SMT prob- lem. This step can be made embarrassingly parallel, where each element of RM can be computed independently by a different computing core. One can optimize the construc- tion of RM by omitting redundant SMT executions. For example, if RM [i][j][ ⊤ ] = true, then RM [i′][ ][ ] = true for all i′ ⊤ ⊤ ] = true for all i′ RM [i′][ ][ ⊥ ⊥ ∈ [i, Q | 1]. [i, Q | | − ∈ 1]. Likewise, if RM [i][j][ ⊥ ] = true, then | − The second step is to generate a verdict reachability tree from RM . The goal of the tree is to check if a monitor state qm ∈ Q can be reached from the initial monitor state q0. This is achieved by setting q0 as the root and generating all possible paths from q0 using RM . That is, if RM [i][k][j] = true, then we create a tree node with label qj and add it as a child of the node with the label qk. Once the tree is generated, if qm is one of the leaves, only then we can say qm is reachable from q0. In general, all leaves of the tree are possible monitoring verdicts. Note that creation of the tree is achieved using a sequential algorithm. 
For example, Figure 3.4 shows the verdict reachability tree generated from the matrix in Figure 3.3.

For our progression-based approach, we adhere to a similar technique for parallelized monitoring as in our automata-based approach. The key difference is that the progression-based approach uses subformulas, whereas the automata-based approach uses the different monitor states. As an example, the previous formula φ = a U b will be broken into two subformulas, φ1 = □a and φ2 = ◇b, before creating the reachability matrix and then generating the verdicts for both of these subformulas.

Figure 3.3 Reachability Matrix for a U b.

Figure 3.4 Reachability Tree for a U b.
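The set of leaves of the verdict reachability tree — and hence the set of possible verdicts — can be computed level by level from RM, as in the following Python sketch. The state names, labels, and the RM entries shown here are illustrative, not the ones produced in our experiments.

# Illustrative computation of the possible verdicts from a reachability matrix RM.
def possible_verdicts(RM, q0, labels):
    """RM: list over segments of dict state -> dict state -> bool."""
    frontier = {q0}
    for seg in RM:
        frontier = {qk for qj in frontier
                    for qk, reachable in seg[qj].items() if reachable}
    return {labels[q] for q in frontier}

# two hypothetical segments for the monitor of Figure 2.1
RM = [
    {"q0": {"q0": True, "qT": True, "qF": False},
     "qT": {"qT": True}, "qF": {"qF": True}},
    {"q0": {"q0": False, "qT": True, "qF": True},
     "qT": {"qT": True}, "qF": {"qF": True}},
]
labels = {"q0": "?", "qT": "TRUE", "qF": "FALSE"}
print(possible_verdicts(RM, "q0", labels))   # {'TRUE', 'FALSE'}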
For the purpose of generating data, we create a synthetic program that at random generates a distributed computation (i.e., the behaviors of a set of programs in terms of their inter-process communication and local calculations). Generating synthetic experimental data offer benefits that enable us to draw comparison between differ- ent parameters and their effect on the approach. For example, generating data for different values of ε is beneficial to study its effect on the runtime and the number of false warning 51 verdicts of our approach. When developing the synthetic distributed system as part of our experiment, we ensure a partially-synchronous setting by including an HLC implementation. We use a uniform distribution (0, 2) to define the type of event (local computation, send and receive message) and a flip-coin distribution for computing the atomic propositions that are true at each local computation event. Although the events in our synthetic experiments in Section 3.5.2 are uniformly distributed over the length of the trace, the event distribution as part of the Cassandra experiments in Section 3.5.3 are affected by the network latency and other external factors. In addition, we assume that that there is an external data collection program which keeps track of the data/states of the system under verification. It generates the trace logs which is used by the monitoring program to verify against the given LTL specifications mentioned in Figure 3.5b. For data verification, we consider the following parameters: (1) number of processes (N ), (2) computation duration (l secs), (3) segment length (g), (4) event rate (r events/pro- cess/sec), (5) maximum clock skew (ϵ), and (6) number of nested temporal operators ( ) ϕ | | for the LTL formula under monitoring. The primary metric is to calculate the SMT solving runtime for each parameter configuration. In all of the charts shown in this section, the time axis is displayed in log scale. By keeping the values of all the other parameters at sensible fixed values, we can study the impact of changing one parameter. In all the graphs, we com- pare the runtime of our automata-based approach against the progression based approach. We use a MacBook Pro with Intel i7-7567U(3.5Ghz) processor, 16GB RAM, 512 SSD and g++ Apple clang version 12.0.5 (clang-1205.0.22.9) interface to the Z3 SMT-solver [97] to generate the traces. To evaluate our parallel algorithm, we use a server with 2x Intel Xeon Platinum 8180 (2.5GHz) processor, 768GB RAM, 112 vcores and g++(GCC) 9.3.1 interface to the Z3 SMT-solver [97]. 52 3.5.2 Analysis of Results – Synthetic Experiments In this series of experiments, we examine every parameter that is available and record how it impacts SMT solution. To investigate how each parameter affects runtime, we test each one separately. Since the created synthetic data is independent of any outside influences, we include a delay to both reduce the amount of events occurring at each time unit and to ensure that events are distributed equally across the execution of each process. We assign a value to each local computation event in each process using a uniform distribution (0, ). Σ | | The findings of the following experiments only make use of one CPU core. Overall, we notice an improvement of around 35% when the progression based technique is compared to the other automata based approach. 
Overall, we notice an improvement of around 35% when the progression-based technique is compared to the automata-based approach. This improvement in performance owes to two main reasons: (1) compared to the automata-based approach, the LTL constraints in our progression-based approach are less demanding in terms of computational complexity, since each sub-formula consists of mostly one atomic proposition, as opposed to multiple atomic propositions in each path of the automaton, which in turn speeds up the overall verification process; and (2) the total number of SMT instances needed is smaller, due to the smaller number of sub-formulas compared to automaton paths for the same specification. We now analyze the results in detail.

Impact of predicate structure. In this experiment (Figure 3.5a), we consider different predicate distributions over AP for the formula φ1, i.e., how many processes are involved with a particular predicate. We consider different predicate structures, O(1), O(n), O(n²), and O(n³), which signify the order of the number of SMT encodings that need to be generated for the given distribution of predicates. As can be seen, the progression-based technique outperforms the automata-based technique overall by 35% on average.

Figure 3.5 Synthetic experiments - impact of different parameters: (a) predicate structure, (b) LTL formula, (c) epsilon, (d) event rate, (e) segment length, (f) computation duration.

Having said that, during our experiments, when comparing the runtime of our monitoring approach for an increasing number of sub-formulas, we observe a slight decrease in the overall runtime efficiency of the progression-based approach compared to the automata-based approach. Since the progression-based approach is based on evaluating each sub-formula, there exists an LTL formula where the number of sub-formulas is larger than the number of paths in the corresponding automaton, and thus the progression-based approach might not be as efficient as the automata-based approach in such a scenario. For example, consider a formula φ = a ∨ b ∨ c, where the automaton has two states, which makes the number of paths 2.
However, the progression involves 3 sub-formulas, which makes the progression-based approach less efficient than its automata-based counterpart. We would like to point out that the formula can be rewritten as the single predicate (a ∨ b ∨ c), which makes both approaches yield similar results. Thus, we hypothesize that for all LTL formulas, the progression-based approach will be at least as efficient as the automata-based approach.

Impact of LTL formula. Given an LTL formula, the depth of nested temporal operators plays an important role, as suggested by Figure 3.5b. We experiment with six LTL formulas, φ1 through φ6, built from the atomic propositions p, q, r, s, and t using the until operator, negation, and Boolean connectives, with nesting depths d = 2, 3, 4, 5, 6, and 7 and sizes |φ| of 1, 2, 3, 8, 8, and 9, respectively. The progression-based technique achieved an average improvement of 32.8% compared to the automata-based one.

Impact of partial synchrony. Figure 3.5c depicts the anticipated outcome, wherein an exponential rise in the number of concurrent events across processes leads to longer runtime as the clock skew ε grows. When compared with the automata-based approach, the progression-based technique yields an improvement of 33.36%.

Impact of event rate. Figure 3.5d shows that our approach breaks even with the computation duration for N = 3 at an event rate of 5 events/process/sec. However, increasing the event rate increases the search space for the SMT solver. Overall, we improve by 34.4% by using the progression-based technique compared to the automata-based technique.

Impact of segment count. The number of events to be handled grows as the segment length rises, exponentially lengthening the time our method takes to operate. Since there are not enough occurrences to have an effect, N = 1, 2 do not show significant improvement in Figure 3.5e. For a greater number of processes, we see improved performance with shorter segments. It should be noted, however, that the runtime rises again for extremely short segment lengths, because the time required to construct a greater number of SMT encodings outweighs the performance benefit from smaller segments. Here too, we notice an improvement of 32.6% for the progression-based technique over the automata-based technique.

Impact of computation duration. In Figure 3.5f, we lengthen the computation and monitor the impact on runtime. The number of segments required to verify the lengthier computation grows as the duration of the computation rises, leading to a linear increase in runtime. The progression-based approach improves the runtime by 33.1% when compared to the automata-based approach.

Impact of parallelization. The technique performs significantly better when the verification is distributed over many cores. Figure 3.6a illustrates the dramatic improvement in performance that occurs when the number of cores is increased from 1 to 10. However, raising it further makes little progress, since the time required to generate the SMT encodings begins to take precedence over the time required to solve them. An improvement of 33.8% is achieved for the progression-based approach when compared to the automata-based approach.

Impact of ε on false warnings. As discussed in Section 2.3, since the monitor does not have access to the global clock, it can report events as concurrent when, in reality, one happened before the other in the system under observation.
However, during this experiment, we keep track of the global clock values separately, which gives us full knowledge of the total ordering of all events, allowing us to study and report the real verdicts alongside the reported verdicts. We observe that the monitor sometimes reports false warnings, that is, it reports both verdicts (satisfaction and violation) when, in reality, only one has occurred. Note that the monitor never fails to report real verdicts; however, it may report false warnings alongside real verdicts on some occasions. Although this does not change the correctness of the approach, it may still include false warnings as part of the set of evaluated results.

Figure 3.6 Impact of parallelization on different data: (a) synthetic data, (b) SBS data, (c) Google data, (d) false warnings.

In Figure 3.6d, we observe that as the maximum clock skew ε increases, the number of false warnings increases. The increase in false warnings is attributed to the fact that as the value of ε increases, so does the number of events considered concurrent by the monitor. Additionally, we observe that the number of false warnings is greatly influenced by the predicate structure of the LTL formula, as evident from Figure 3.6d. For O(n) conjunctive satisfaction formula monitoring and O(n) disjunctive violation formula monitoring, false warnings might occur if any one of the n sub-formulas is violated or satisfied, respectively; therefore, we see a higher number of false warnings. Similarly, for O(n) disjunctive satisfaction formula monitoring and O(n) conjunctive violation formula monitoring, false warnings might occur only if all of the n sub-formulas are violated or satisfied, respectively; therefore, we see a lower number of false warnings.

3.5.3 Case Study 1: Cassandra
In this case study, we observe read/write irregularities of a NoSQL distributed database management system called Cassandra [19, 72]. One node from each cluster serves as the seed node in our simulation of a distributed database with two data centers: one cluster with four nodes and the other with three. Each node in both clusters replicates all of the data. Each node runs on the Red Hat OpenStack Platform using 4 VCPUs, 4GB RAM, Ubuntu 18.04, Cassandra 3.11.6, and Java 1.8.0_252. Additionally, we have simulated a system with numerous processes, each of which is in charge of the fundamental database operations (read, write, and update). These processes are also capable of inter-process communication, which enables them to alert other processes in the event that they create a new database record. We compared our system's latency against that of Google Cloud, Microsoft Azure, and Amazon Web Services in order to make our simulated database as realistic as possible.
The quickest response was timed at 41ms, compared to our system's 100ms. The sluggish bandwidth and different infrastructure are to blame for the significant latency when compared to the industry norm. In all of our experiments, we therefore take a delay of 100ms into account. Each of the processes is capable of reading, writing, or updating the database entries, given the way the processes are designed. We choose the kind of operation that will be carried out by a process using a uniform distribution over (0, 2). The other processes are informed of any additions made by the write operation using inter-process communication. We assume no messages are lost during transmission, and as soon as a message is received, the receiving process reads it.

Figure 3.7 Cassandra experiments: (a) segment length, (b) computation duration.

The consistency level helps a database maintain the bare minimum number of replications required for an activity to be deemed successfully completed. In order to eliminate any potential read or write anomaly in the database, Cassandra recommends that the sum of the read and write consistency levels be greater than the replication factor. Using runtime monitoring, we want to detect read/write irregularities in the database. The corresponding LTL specification becomes:

φrw = ⋀_{i=0}^{n} ( write(i) → read(i) )

where n is the number of read/write operations.

One of the drawbacks of utilizing a distributed database like Cassandra is the absence of database normalization features. As a result, we intend to monitor both write and delete reference checks. We present two tables:

Student(id, name)
Enrollment(id, course)

On these tables, we enforce the write and delete reference checks. A write in the Enrollment table must always be preceded by a write in the Student table with the same id. Similarly, a delete from the Student table should always be preceded by a delete from the Enrollment table with the same id. This ensures no insertion and deletion anomalies, resulting in the following LTL specifications:

φwrc = ¬( ¬write(Student.id) U write(Enrollment.id) )
φdrc = ¬( ¬delete(Enrollment.id) U delete(Student.id) )

Extreme load scenario. Figures 3.7b and 3.7a depict runtime versus computation duration and runtime versus segmentation frequency under our network's maximum read/write load. These results are slightly noisier than the results of the synthetic experiments. This is because the events in the synthetic experiments were uniformly distributed across the whole computation length, but they are not uniform here.
Database operations requiring network communication (read, write, and update) require an average of 100ms, whereas sending and receiving messages involve inter-process communication and take roughly 10ms-15ms, resulting in a non-uniform event distribution. When comparing with the automata-based approach, we do not see much improvement when monitoring φwrc or φdrc using the progression-based approach. However, when monitoring φrw, we observe an average improvement of 55.53%.

Moderate load scenario. In Figure 3.7b, we were able to break even with as few as 2 processes. To find a real-world example with modest database activity, consider the Google Sheets API, which allows a maximum of 500 requests per 100s per project and a maximum of 100 requests per 100s per user, i.e., on average 5 events/sec per project and 1 event/sec per user. To see how our technique operates in such a scenario, we increase the number of processes and cores available to monitor such a system in order to investigate the time required to verify the trace created by such a system. In Figure 3.6c, we see that we break even at an event rate of 3 events/sec/user when using the progression-based strategy. Our algorithm operates effectively when the number of processes is 7, 8, or 9, which is far higher than Google allows. This allows us to be confident that our technique can be implemented online in real-world scenarios.

3.5.4 Case Study 2: RACE
In this case study, we monitor a mutual separation property between multiple aircraft. The dataset for this case study was generated using the Runtime for Airspace Concept Evaluation (RACE) [85] framework developed by NASA (https://github.com/NASARace/race-data). RACE is a framework for creating an event-based, reactive airspace simulation. This dataset consists of three data sets obtained on three distinct days. Each was captured at around 37°N latitude and 121°W longitude. The dataset contains all eight types of messages sent by the SBS unit when a Telnet application is used to listen to port 30003, but we only use the messages with ID MSG-3, which is the Airborne Position Message; it includes a flight's latitude, longitude, and altitude and is used to verify the mutual separation of all pairs of aircraft. We found that the time gap between the time a message was created and the time it was recorded was generally less than a second, thus we consider an ε = 1s over the time the message was generated. Furthermore, calculating the distance between two locations is computationally intensive, since we must account for characteristics such as the earth's curvature. To speed up distance computations, at the expense of a minuscule error margin, we use the constants 111.2km per degree of latitude and 87.62km per degree of longitude, multiply them by the difference in latitude and longitude, and factor in the altitude to get the distance between two aircraft. We verify mutual separation by assuming a minimum separation of 500m between each pair of aircraft. According to the dataset, each aircraft generates a message at least once per second. There are three distinct datasets: sbs-1 has 293 aircraft and 168,283 messages spread over 3 hours, 28 minutes, and 58 seconds; sbs-2 has 110 aircraft and 64,218 messages spread over 1 hour, 1 minute, and 46 seconds; and sbs-3 has 97 aircraft and 64,162 messages spread over 49 minutes and 42 seconds.
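As an illustration of the flat-earth distance approximation just described, here is a minimal sketch. The aircraft states and helper names are assumptions made for this example; the per-degree constants and the 500m threshold are the values used above.

```python
# Minimal sketch of the flat-earth distance approximation described above.
# Aircraft states are illustrative tuples (latitude_deg, longitude_deg, altitude_m).
KM_PER_DEG_LAT = 111.2
KM_PER_DEG_LON = 87.62
MIN_SEPARATION_M = 500.0

def separation_m(a, b):
    """Approximate 3D distance in meters between two aircraft states."""
    dlat_m = (a[0] - b[0]) * KM_PER_DEG_LAT * 1000.0
    dlon_m = (a[1] - b[1]) * KM_PER_DEG_LON * 1000.0
    dalt_m = a[2] - b[2]
    return (dlat_m**2 + dlon_m**2 + dalt_m**2) ** 0.5

def violates_mutual_separation(a, b):
    return separation_m(a, b) < MIN_SEPARATION_M

# Example: two aircraft about 0.003 degrees apart in latitude at the same altitude.
print(violates_mutual_separation((37.000, -121.000, 900.0), (37.003, -121.000, 900.0)))  # True
```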
In Figure 3.6b, we compare our obtained runtime on the three RACE datasets (labelled sbs-1, sbs-2, and sbs-3). We monitor the data in real time, with segments of 10s and an ε of 1s. We put our approach to the test by increasing the number of cores used on the processor and utilizing all available cores, as described in Section 3.4.2. Our results break even at 4 cores. This makes our approach desirable for aircraft monitoring and similar systems, such as IoT.

3.6 Conclusion
We elected to start our work with discrete-event systems (as opposed to continuous-time systems) because monitoring discrete-event systems is intuitively less expensive in terms of runtime and computational complexity than monitoring similar continuous-time systems. Both of our proposed techniques take an LTL formula and a distributed computation as input and, assuming a bounded clock skew among all processes, chop the computation into multiple segments before applying either the automata-based monitoring algorithm or the progression-based monitoring algorithm, implemented as an SMT decision problem, to verify the formula's correctness. In Section 3.5, we carried out extensive simulated experiments, as well as case studies on monitoring consistency conditions in Cassandra and a NASA air traffic control dataset. Our experiments demonstrate up to 35% improvement in performance of our progression-based algorithm over our automata-based algorithm. Furthermore, based on these experiments, we conclude that online monitoring is indeed possible with our techniques when distributed computations are properly segmented and parallelized. A natural course of action now is to carry over and apply the relevant aspects of this approach to monitoring continuous-valued systems, in other words, distributed CPS. We take the first steps toward monitoring distributed CPS in the next chapter.

CHAPTER 4
PREDICATE MONITORING IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS

In this chapter, we take first steps towards rigorously monitoring distributed CPS. To this end, we propose a monitoring technique to detect Boolean predicates over the analog (i.e., continuous-time and continuous-valued) signals generated by the agents in a distributed CPS. Similar to our approach described in Chapter 3, a clock synchronization algorithm guarantees a maximum clock skew across all signals generated by the agents. In the following sections, we first define the analog signal transmission and sampling method based on our signal model defined in Chapter 2. We then elaborate on our predicate detection approach for partially synchronous distributed CPS using a signal retiming technique.

4.1 Signal Transmission to the Monitor
Communication between nodes requires sampling the analog signal, sending the samples, and reconstructing the signal at the receiving node. Our goal is to monitor the reconstructed analog signals. This is not the same as monitoring a discrete-time signal composed of samples; the applications we are addressing are concerned with the value of the signal between samples and the possible violations revealed by it. Signal transmission methods, such as sampling and reconstruction, are common in communication theory. Errors caused by sampling and reconstruction (for example, owing to bandwidth constraints) can be accounted for by tightening the STL formula using the methods of [45]. The reconstruction algorithm is chosen based on the application and domain knowledge.
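For instance, a piece-wise linear reconstruction, the default choice in the remainder of this chapter, can be sketched in a few lines; the sample times and values below are illustrative, not taken from our experiments.

```python
# Minimal sketch (illustrative samples) of piece-wise linear reconstruction of an
# agent's output signal from the samples received for one segment, evaluated at
# an arbitrary local time.
def reconstruct_pwl(samples):
    """samples: list of (local_time, value) pairs, sorted by time."""
    def x(t):
        if t <= samples[0][0]:
            return samples[0][1]
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)   # linear interpolation
        return samples[-1][1]
    return x

x1 = reconstruct_pwl([(0.00, 1.0), (0.05, 1.4), (0.10, 1.2)])  # illustrative 20 Hz samples
print(x1(0.075))   # value between the 2nd and 3rd sample -> 1.3
```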
For the sake of simplicity and generality, we assume that every output signal xn is rebuilt as piece-wise linear between the samples, except in one experiment where we reconstruct a signal as both piece-wise linear and piece-wise quadratic to study the trade-offs. Other signal constructions, such as cubic splines, can also be employed with easy modifications to our algorithms, at the cost of increased run time, given that the choice of signal structure is orthogonal to our methodology and the aims of this work. Since we assume the agents do not deadlock, this transmission happens in segments of length T: at the kth transmission, agent An transmits xn|[(k−1)T,kT], the restriction of its output signal to the interval [(k − 1)T, kT] as measured by its local clock. The remainder of the work refers solely to the signal fragments received by the monitor during a specific transmission.

We now return to the constraint imposed on In in Definition 7, namely that it is a non-empty bounded interval. Non-emptiness models the absence of deadlock in computing; that is, an interval In expresses that no events are missed, or equivalently, that signal reconstruction is perfect at the monitor. The restriction that it be bounded models the above monitoring setup: the monitor is only ever dealing with bounded signal fragments xn|[(k−1)T,kT]; therefore,

In = [(k − 1)T, kT],   (4.1)

for every agent at the kth transmission, measured in local time. By the bounded skew assumption, we have:

Lemma 7. For any two agents An, Am, |min In − min Im| ≤ ε and |max In − max Im| ≤ ε.

4.2 Problem Statement
Predicates are frequently used to encapsulate several system requirements (e.g., invariants). A predicate φ is a global Boolean-valued function over the signal values of the agents. For instance, φ(x1, x2) = (x1 > 0) ∧ (ln(x2) < 3) is a predicate on two signals that evaluates to true when x1 > 0 and ln(x2) < 3, and to false otherwise. Because the agents are partially synchronized to within an ε, it is not possible to actually evaluate all signals at the same moment in global time. However, as noted above, the frontier of a consistent cut gives us a possible global state. Therefore, the monitoring problem can be worded as follows: given a distributed signal (E, ⇝) over N agents, as defined in Definition 7, and a Boolean predicate φ, (E, ⇝) satisfies φ iff there exists a frontier of a consistent cut in (E, ⇝) where φ is satisfied. It should be noted that throughout this chapter, (E, ⇝) is used to denote distributed signals. We now define distributed satisfaction below.

Definition 10. [Distributed satisfaction] Given a distributed signal (E, ⇝) over N agents, and a predicate φ over the N agents, we say that (E, ⇝) satisfies φ iff for all consistent cuts C ⊆ E with

front(C) = ( (t1, x1(t1)), . . . , (tN, xN(tN)) )

we have φ(x1(t1), x2(t2), . . . , xN(tN)) = true. We write this as (E, ⇝) ⊨ φ. ■

Thus, we formally define the problem as follows.

Problem Statement: Continuous-Time Monitoring of Distributed CPS. Given a distributed signal (E, ⇝) and a predicate φ over N agents, determine whether (E, ⇝) ⊨ φ.

When a distributed signal (E, ⇝) does not satisfy a predicate φ, we say that (E, ⇝) violates φ and write (E, ⇝) ⊭ φ. In this dissertation, we want to detect whether there exists a consistent cut C ⊆ E such that (E, ⇝) ⊭ φ. The main challenge in monitoring distributed signals is that the monitor has to reason about signals that are subject to time asynchrony.
For instance, consider two signals x1 and x2 and the case where x1(2) = 5, x2(3) = 1, φ(x1, x2) = (x1 > 4) ∧ (x2 < 0), and ε = 2, so that time points 2 and 3 form a consistent cut. In this case, since the above signal values occur at local times within the possible clock skew, one has to (conservatively) consider that the predicate is violated. In the next section, we present our solution to the problem.

4.3 SMT-based Monitoring Algorithm
In a nutshell, our solution has the following features:
• Central monitor. We assume that there is a central monitor that solves, at regular intervals, the monitoring problem described in Section 4.2.
• Signal retiming. As signals are measured using their local clocks, the monitor must somehow align them to detect possible violations of the predicate. To this end, we propose a retiming technique that establishes the happened-before relation in the continuous-time setting, and stretches or compacts signals to align them with each other within the ε clock skew bound.
• SMT encoding. We transform the monitoring decision problem into an SMT-solving problem, whose components (such as the input signals and the happened-before relation) are modeled as SMT entities and constraints.

4.3.1 Retiming Functions
Our signal model is continuous-time, that is, the signals are maps from R+ to R+. Therefore, to model the approximate re-synchronizing action of the monitor, we use a retiming function.

Definition 11. [Retiming functions] A retiming function, or simply retiming, is an increasing function ρ : R+ → R+. An ε-retiming is a retiming such that ∀t ∈ R+ : |t − ρ(t)| < ε. Given a distributed signal (E, ⇝) over N agents and any two distinct agents Ai, Aj, where i, j ∈ [N], a retiming ρ from Aj to Ai respects ⇝ if we have ((t, xi(t)) ⇝ (t′, xj(t′))) ⇒ (t < ρ(t′)) for any two events (t, xi(t)), (t′, xj(t′)) ∈ E. An ε-retiming that respects ⇝ is a valid retiming. ■

Figure 4.1 shows examples of retimings and how they relate to predicate monitoring. To detect a predicate violation, we must first retime y to the t axis via a retiming map ρ. Panel (c) shows three different retimings, including the identity. Panels (d)-(e) show the retimed y. For the predicate x > y, (e)-(f) show no violations, but (d) does. The conservative monitoring answer is that the predicate is violated.

Figure 4.1 Predicate violation between two signals x and y measured using partially synchronized clocks t and s.

An ε-retiming ρ maps R+ to itself, but it is easy to see that the restriction of ρ to a bounded interval I is an increasing function from I to ρ(I) that respects the constraint |t − ρ(t)| < ε for all t ∈ I. Thus, in what follows, we restrict our attention to the action of ε-retimings on bounded intervals. We now state and prove the main technical result of this chapter, which relates the existence of consistent cuts in distributed signals to the existence of retimings between the agents' local clocks.

Theorem 3. Given a predicate φ and a distributed signal (E, ⇝) over N agents, there exists a consistent cut C ⊆ E that violates φ if and only if there exist a finite A1-local clock value t and N − 1 ε-retimings ρn : In → I1 that respect ⇝, 2 ≤ n ≤ N, such that

φ( x1(t), x2 ∘ ρ2⁻¹(t), . . . , xN ∘ ρN⁻¹(t) ) = false   (4.2)

and such that ρm⁻¹ ∘ ρn : In → Im is an ε-retiming for all n ≠ m. Here, '∘' denotes the function composition operator.
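Before turning to the proof, a small check may help make Definition 11 concrete. The breakpoint representation and the helper below are hypothetical (they are not part of our algorithm); the check mirrors the three requirements: monotonicity, the ε bound, and respecting ⇝ for one message.

```python
# Hypothetical helper (not part of the monitoring algorithm): checking the three
# requirements of a valid retiming on a finite breakpoint representation.
# 'breakpoints' lists pairs (s, rho(s)) of a candidate piece-wise linear retiming
# from agent A2's timeline to A1's; t_send/t_recv are the local send/receive
# times of a single message from A2 to A1, if any.
def is_valid_retiming(breakpoints, epsilon, t_send=None, t_recv=None):
    pts = sorted(breakpoints)                                   # sort by local time s
    increasing = all(r0 < r1 for (_, r0), (_, r1) in zip(pts, pts[1:]))
    within_eps = all(abs(s - r) < epsilon for s, r in pts)      # epsilon-retiming condition
    respects_hb = True
    if t_send is not None and t_recv is not None:
        # the retimed send moment must precede the reception moment: rho(t_send) < t_recv
        rho_send = min((r for s, r in pts if s >= t_send), default=float("inf"))
        respects_hb = rho_send < t_recv
    return increasing and within_eps and respects_hb

# Example: three breakpoints of a candidate retiming with epsilon = 0.5.
print(is_valid_retiming([(0.0, 0.3), (1.0, 1.2), (2.0, 1.9)], epsilon=0.5))   # True
```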
Proof. We distinguish the following cases.

Case 1: Suppose that such retimings exist. Define the local time values t1 := t and tn := ρn⁻¹(t) for 2 ≤ n ≤ N, and the set C = { eⁿ_t | t ≤ tn }. By the construction of C and the fact that the retimings respect ⇝, it holds that if e ∈ C and f ⇝ e, then f ∈ C. For every n, m ≥ 2 with n ≠ m, it holds that tm = ρm⁻¹(ρn(tn)), so |tn − tm| ≤ ε. Thus C is a consistent cut with frontier (eⁿ_{tn}), n = 1, . . . , N, that witnesses the violation of φ.

Case 2: Suppose now that there exists a consistent cut C with frontier

front(C) = ( (t1, x1(t1)), . . . , (tN, xN(tN)) )

that witnesses the violation of φ. We need the following facts.

Fact 1. For every two events eⁿ_{tn} and eᵐ_{tm} in front(C), we have |tn − tm| ≤ ε. Indeed, since eⁿ_{tn} is in the frontier of a consistent cut, we have eᵐ_s ∈ C for all s s.t. s + ε ≤ tn. Thus tm ≥ s for all such s, and so tm ≥ tn − ε. By symmetry of the argument, tn ≥ tm − ε holds as well.

Fact 2. Given intervals [a, b] and [c, d] s.t. |c − a| ≤ ε and |d − b| ≤ ε, the map L : [a, b] → [c, d] defined by L(t) = c + ((d − c)/(b − a))(t − a) is a linear ε-retiming. This is immediate.

Suppose first that there are no message exchanges. For 2 ≤ n ≤ N, we define the retiming ρn : In → I1 in two pieces. First, set ρn(tn) = t1; by Fact 1, |tn − t1| ≤ ε. Write I1 = [a, b] and In = [c, d] for notational simplicity in this proof. Call a pair of intervals that satisfies the hypothesis of Fact 2 an admissible pair. Then, the following pairs are clearly admissible by Lemma 7 and Fact 1: [c, tn] and [a, t1], and [tn, d] and [t1, b]. Thus, there exist two linear retimings Ln : [c, tn] → [a, t1] and L′n : [tn, d] → [t1, b], and we can define a piece-wise ρn: ρn(t) = Ln(t) on c ≤ t ≤ tn and ρn(t) = L′n(t) on tn ≤ t ≤ d. It is easy to establish that ρn is an ε-retiming.

It remains to show that ρn⁻¹ ∘ ρm : Im = [f, g] → In = [c, d] is also an ε-retiming. This too can be established in parts, first over [f, tm] and then over [tm, g], using the same arguments as above and exploiting the linearity of these retimings. For instance, if we write αn and αm for the slopes of Ln and Lm, respectively, then over [f, tm],

ρn⁻¹(ρm(s)) = Ln⁻¹(Lm(s)) = Ln⁻¹( a + αm(s − f) ) = c + (αm/αn)(s − f) = c + ((tn − c)/(tm − f))(s − f),

which is a linear ε-retiming by Fact 2.

If there are message exchanges, the above argument still applies, but over a more fine-grained division of the timelines In obtained by partitioning each timeline at message transmission times. For the admissible pair I1 = [a, b] and In = [c, d], suppose the first message is sent from An to A1 at local time s < tn and is received at local time r < t1. Define t(s) := min(s + ε, r). Then the pair [a, t(s)] and [c, s] is admissible. Upon repeating this process for all messages, a collection of admissible pairs is obtained that can be retimed to each other, as above, without violating the ⇝ relation. These are concatenated to yield the desired retiming ρn. ■

Thus, finding a consistent cut that violates the predicate can be achieved by finding such retimings. The proof of Theorem 3 further shows that the retimings can always be chosen as piece-wise linear (rather than arbitrary increasing functions), which yields significant runtime savings in the SMT encoding in the next section.

Remark 2. An interesting consequence of Fact 2 in the proof is that it is enough to use piece-wise linear retimings.
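As a quick numerical check of Fact 2 (with illustrative interval endpoints), the affine map between an admissible pair stays within ε of the identity, since the gap |L(t) − t| is affine in t and is therefore maximized at an endpoint:

```python
# Numerical check of Fact 2 with illustrative endpoints: the affine map between
# an admissible pair of intervals is a linear epsilon-retiming.
def linear_retiming(a, b, c, d):
    return lambda t: c + (d - c) / (b - a) * (t - a)

a, b, c, d, eps = 0.0, 2.0, 0.4, 2.3, 0.5   # |c - a| = 0.4 <= eps, |d - b| = 0.3 <= eps
L = linear_retiming(a, b, c, d)
worst = max(abs(L(t) - t) for t in (a, b))   # |L(t) - t| is affine, so endpoints suffice
print(L(1.0), worst < eps)                   # 1.35 True
```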
This results in the following concrete problem.

Concrete Problem Statement. Given ε > 0, a distributed signal (E, ⇝) over N agents, and a predicate φ over the N agents, find N − 1 piece-wise linear ε-retiming functions ρ2, . . . , ρN that satisfy the hypotheses of Theorem 3 and such that

φ( x1(t1), x2(ρ2⁻¹(t1)), . . . , xN(ρN⁻¹(t1)) ) = false   (4.3)

4.3.2 SMT Formulation
We solve the monitoring problem by transforming it into an instance of satisfiability modulo theories (SMT) [6]. Specifically, we ask whether there exist N − 1 retimings such that (4.3) holds; equivalently, whether there exists a consistent cut that witnesses satisfaction of ¬φ.

Without loss of generality, we start with our encoding for two agents, A1 and A2 (shown in Figure 2.3). A1 outputs signal x supported over the bounded timeline I1, which is discretized to D1 ⊂ I1 and sent to the monitor. Similarly, A2 outputs signal y supported over the bounded timeline I2, which is discretized to D2 ⊂ I2 and sent to the monitor. D1 and D2 are finite. Let δk > 0 be the sampling period of agent Ak, so two consecutive elements of Dk differ by δk, k ∈ {1, 2}. Consider further that A2 transmits a message at local time t1 which is received by A1 at local time t2, and that A1 sends a message at local time t3 which is received by A2 at local time t4. The distributed signal violates the predicate iff the following SMT problem returns SAT.

SMT entities. In our encoding, the entities are the retimings ρn, included as uninterpreted functions (which the solver will interpret), the signals x and y, the intervals I1 and I2, and the real numbers t, s, s′, t1, t2, t3, and t4. All these entities have been defined in the previous sections. The following quantities are all constants in the encoding, since they are known to the monitor: the sampling time sets Dk and sampling periods δk, the sampled values {x(ti) | ti ∈ D1} and {y(si) | si ∈ D2}, and the message transmission and reception local times.

SMT constraints. The encoding is a conjunction of the following constraints:

• (Predicate violation) The first constraint 'finds' local times t and s at which predicate φ is violated (up to ε-synchrony):

∃t ∈ I1. ∃s ∈ I2. ∃t⁻ ∈ D1. ∃s⁻ ∈ D2.   (4.4a)
( t⁻ ≤ t ≤ t⁻ + δ1 ) ∧   (4.4b)
( s⁻ ≤ s ≤ s⁻ + δ2 ) ∧   (4.4c)
( ρ(s) = t ) ∧   (4.4d)
¬φ( x(t⁻), y(s⁻) )   (4.4e)

Eq. (4.4b) finds the time sample t⁻ such that x(t) = x(t⁻); this is the result of our assumption that signals are piece-wise constant. Eq. (4.4c) does the same for y. Eq. (4.4d) specifies that s is retimed to t; this is what guarantees that (x(t), y(s)) is a possible global state as per Theorem 3. Eq. (4.4e) checks violation of the predicate at (x(t), y(s)) = (x(t⁻), y(s⁻)).

• (Valid retiming) Eq. (4.5) ensures that ρ is a valid ε-retiming from I2 to I1:

∀s ∈ I2. ∃t ∈ I1. (ρ(s) = t) ∧ (|t − s| < ε)   (4.5)

and Eq. (4.6) ensures that the retiming function is increasing:

∀s ∈ I2. ∀s′ ∈ I2. ( s < s′ ⇒ ρ(s) < ρ(s′) )   (4.6)

• (Happened-before) Eq. (4.7) enforces the happened-before relation for message transmissions:

( ρ(t1) < t2 ) ∧ ( t3 < ρ(t4) )   (4.7)

• (Inverse retiming) When there are more than 2 agents, we must also encode the constraint that for all n ≠ m, ρm⁻¹ ∘ ρn is an ε-retiming. Thus, for all n ≠ m, letting fm be the uninterpreted function that represents the inverse of the uninterpreted ρm, we add

∀t ∈ In. fm(ρn(t)) = t   (4.8)

in addition to the analogs of Eqs. (4.6) and (4.5) for fm ∘ ρn.
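The following is a minimal z3py sketch of the two-agent encoding above. The signal samples, sampling periods, intervals, message times, and the example predicate are all illustrative assumptions; the actual implementation uses a C++ interface to Z3, so treat this as a transliteration of Eqs. (4.4)-(4.7) rather than the tool itself.

```python
# Minimal z3py sketch of the two-agent SMT encoding (Eqs. 4.4-4.7).
# All concrete values (samples, periods, intervals, message times, predicate)
# are illustrative assumptions for this sketch.
from z3 import And, ForAll, Function, Implies, Not, Or, Real, RealSort, Solver, sat

eps, d1, d2 = 0.002, 0.05, 0.05                      # clock skew bound and sampling periods
D1 = {0.00: 1.0, 0.05: 5.0, 0.10: 4.0}               # sampled x values (piece-wise constant)
D2 = {0.00: 2.0, 0.05: -1.0, 0.10: 1.0}              # sampled y values
I1, I2 = (0.0, 0.10), (0.0, 0.10)                    # bounded timelines
t1, t2, t3, t4 = 0.02, 0.03, 0.06, 0.07              # message send/receive local times

rho = Function('rho', RealSort(), RealSort())        # uninterpreted retiming I2 -> I1
t, s, u, v = Real('t'), Real('s'), Real('u'), Real('v')

def phi(xv, yv):                                     # example predicate: (x > 4) and (y < 0)
    return And(xv > 4, yv < 0)

solver = Solver()

# (4.4): there exist t in I1, s in I2 and samples t-, s- with rho(s) = t such that
#        the piece-wise constant values at t-, s- violate phi.
solver.add(I1[0] <= t, t <= I1[1], I2[0] <= s, s <= I2[1])
solver.add(Or([And(tm <= t, t <= tm + d1,
                   sm <= s, s <= sm + d2,
                   rho(s) == t,
                   Not(phi(xv, yv)))
               for tm, xv in D1.items() for sm, yv in D2.items()]))

# (4.5) + (4.6): rho is an increasing epsilon-retiming from I2 into I1.
solver.add(ForAll([u], Implies(And(I2[0] <= u, u <= I2[1]),
                               And(I1[0] <= rho(u), rho(u) <= I1[1],
                                   u - eps < rho(u), rho(u) < u + eps))))
solver.add(ForAll([u, v], Implies(And(I2[0] <= u, u < v, v <= I2[1]), rho(u) < rho(v))))

# (4.7): happened-before constraints for the two messages.
solver.add(rho(t1) < t2, t3 < rho(t4))

print("violation witnessed" if solver.check() == sat else "no violation found / unknown")
```

Note that, because of the quantifiers over an uninterpreted function, Z3 may return unknown on such a direct encoding; the restriction to piece-wise linear retimings discussed below is what keeps the search tractable in practice.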
Other signal models. If output signals were piece-wise linear, say, Eq. (4.4e) would be modified accordingly:

φ( x(t⁻) + ((x(t⁻ + δ1) − x(t⁻))/δ1)(t − t⁻),  y(s⁻) + ((y(s⁻ + δ2) − y(s⁻))/δ2)(s − s⁻) ) = false   (4.9)

Similarly, if output signals were piece-wise quadratic, Eq. (4.4e) would be modified as follows:

φ( x(t), y(s) ) = false   (4.10)

Figure 4.2 Piece-wise interpolations: (a) piece-wise linear signals, (b) piece-wise quadratic signals.
Figure 4.3 Piece-wise linear signals vs. piece-wise quadratic signals.

In some systems, piece-wise quadratic signals may be used to represent signals more accurately. For example, Figure 4.3 shows two piece-wise quadratic constructions having the same value at some point in time, whereas their piece-wise linear counterpart signals do not. Our choice of signal models is limited by the SMT solver: it must be able to handle the corresponding interpolation equations, like the piece-wise linear interpolation in Eq. (4.9).

As an example, in Figure 4.2a, let x and y be two signals, where the violating predicate φ to be monitored is x(t) = y(s). Let ρ be a retiming of y on x, such that ρ(s⁻) = t⁻ and ρ(s⁻ + δ2) = t⁻ + δ1. It can be observed that although the discretized signal samples do not violate φ, because the signals are piece-wise linear it is easy to identify a violation at times t and s on signals x and y, respectively, where x(t) = 3, y(s) = 3, and ρ(s) = t. Another example is demonstrated in Figure 4.2b, where x and y are two signals expressed by their corresponding quadratic formulas. The violating predicate φ to be monitored is d(x(t), y(s)) ≤ 2, where d is a function that yields the distance between any two points. Let ρ be a retiming of y on x, such that ρ(s⁻) = t⁻ and ρ(s⁻ + δ2) = t⁻ + δ1. Furthermore, let the evaluation of d(x(t⁻), y(s⁻)) be 3 and the evaluation of d(x(t⁻ + δ1), y(s⁻ + δ2)) be 3. It can be observed that although the discretized signal samples do not violate φ, because the signals are piece-wise quadratic it is easy to identify a violation at times t and s on signals x and y, respectively, where d(x(t), y(s)) ≤ 2 and ρ(s) = t.

It is worth mentioning that restricting the SMT search to piece-wise linear retimings results in a significant decrease in run time, compared to the approach where the SMT solver is tasked with determining a general retiming. For example, for two UAVs with ε = 1ms over 5s-long signals, at segment count 5, the search for a general retiming requires 3.42s, whereas searching for a piece-wise linear retiming requires only 1.01s. Since, by Remark 2, there is no loss of generality in this restriction, from this point on, all the reported experiments are obtained using the piece-wise linear retiming approach.

Remark 3. (i) ρm⁻¹ ∘ ρn respects ⇝ automatically, so it is not necessary to encode that explicitly. (ii) Because we can restrict the SMT search to piece-wise linear retimings (see the remark following the proof of Theorem 3), constraint (4.8) can be simplified; namely, the expression for the inverse can be hard-coded. We do not show this to maintain clarity of exposition.
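To illustrate how the piece-wise linear variant of the violation constraint (Eq. (4.9)) plugs into the encoding sketched earlier, here is a small helper; the sample dictionaries and the predicate passed in are illustrative assumptions.

```python
# Sketch of the piece-wise linear variant of the violation constraint (Eq. 4.9):
# the predicate is evaluated on linearly interpolated signal values instead of
# the piece-wise constant samples. Inputs are illustrative; 'phi' is a function
# returning a z3 Boolean over two real-valued expressions.
from z3 import And, Not, Or, Real

def interp(v0, v1, delta, tau, tau0):
    # value of the linear segment that starts at (tau0, v0) and ends delta later at v1
    return v0 + (v1 - v0) / delta * (tau - tau0)

def pwl_violation(t, s, D1, D2, d1, d2, phi):
    clauses = []
    ts1, ts2 = sorted(D1), sorted(D2)
    for ta, tb in zip(ts1, ts1[1:]):
        for sa, sb in zip(ts2, ts2[1:]):
            xv = interp(D1[ta], D1[tb], d1, t, ta)
            yv = interp(D2[sa], D2[sb], d2, s, sa)
            clauses.append(And(ta <= t, t <= tb, sa <= s, s <= sb, Not(phi(xv, yv))))
    return Or(clauses)

# Example usage with Figure 4.2a-style samples; here phi(x, y) := (x != y), so a
# 'violation' is a pair of retimed moments where the interpolated signals coincide.
t, s = Real('t'), Real('s')
constraint = pwl_violation(t, s, {0.0: 1.0, 0.05: 5/3}, {0.0: 2.0, 0.05: 4/3},
                           0.05, 0.05, lambda x, y: x != y)
```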
4.4 Exploiting the Knowledge of System Dynamics
Physical processes in a CPS follow the laws of physics. A runtime monitor can leverage this knowledge of the CPS dynamics to make monitoring more efficient. We explain our idea with the following example (see Figure 4.4). From knowing the rate bound |ẋ| ≤ 1 (shown by a dashed line), the monitor concludes that the earliest x can satisfy the atom x ≥ 3 is τ1. Similarly for y. Given that τ1 > τ2, the monitor discards, roughly speaking, the fragment [0, τ1] from each signal and monitors the remaining pieces.

Figure 4.4 Leveraging dynamics.

Note that x(0) = 1 and y(0) = 2. Consider the predicate φ = ¬(a ∨ b), where a := x ≥ 3 and b := y ≤ 0.5, and let a and b be the atoms of predicate φ. There are 3 Boolean assignments to atoms a and b that falsify the predicate. Let us fix one such assignment, a = b = true. If the monitor knows a uniform bound on the rate of change ẋ of x, say |ẋ(t)| ≤ 1 for all t, then it can infer that a = true cannot hold before τ1 = 2 (local time). Similarly, if the monitor knows that |ẏ| ≤ 3, then b = true cannot hold before τ2 = 0.5 (local time). Taking into account the ε-synchrony, the monitor can limit itself to monitoring x|[2,T] (the restriction of x to [2, T]) and y|[2−ε,T+ε].

Now, if this yields UNSAT in the SMT instance, we select the next Boolean assignment (in terms of atoms a and b) that falsifies predicate φ (e.g., a = false and b = true), derive the useful portion of signals x and y, and repeat the same procedure until the answer to the SMT instance is affirmative or all falsifying Boolean assignments are exhausted. Of course, this requires exploring all such assignments to the atoms of the predicate, but since we expect the number of atoms in realistic predicates to be relatively small, the exhaustive nature of falsifying Boolean assignments will not be a bottleneck.

We generalize this idea to N agents and arbitrary predicates in Algorithm 4.1. We assume without loss of generality that every atom a that appears in φ is of the form xn ≥ va for some n and va ∈ R. A Boolean assignment is a map σ from atoms to {false, true}, and a violating assignment is one that makes the predicate false. Thus, given a violating assignment σ, for every atom a, a = σ(a) iff xn ≥ va (if σ(a) = true) or xn < va (if σ(a) = false). An obvious modification to Algorithm 4.1 allows the monitor to take advantage of knowing different rate bounds at different points along the signals.

Algorithm 4.1 Dynamics-aware monitoring.
Data: Distributed signal (E, ⇝), ε, predicate φ, bounds |ẋn| ≤ bn, n ∈ [N]
Result: (E, ⇝) ⊨ φ
Set tn = min In, n ∈ [N]
while not done do
    Get next violating assignment σ to the atoms of φ
    if there are no more violating assignments then
        done
    else
        for every atom a in φ do
            if σ(a) = true then
                τn = min{ τ | xn(tn + τ) ≥ va }, n ∈ [N]
            else
                τn = min{ τ | xn(tn + τ) < va }, n ∈ [N]
        end
        Set τ = maxn τn and m = argmaxn τn
        SMT-monitor the distributed signal Eσ made of the restrictions xn|[tn+τ−ε, max In], n ≠ m, and xm|[tm+τ, max Im]
        If SAT, done.
    end
end
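As a small numeric sketch of the pruning step in Algorithm 4.1, using the example above (the helper name is ours, not part of the algorithm), the earliest offset at which an atom can reach its required truth value follows directly from the current value and the rate bound:

```python
# Minimal numeric sketch of the pruning step of Algorithm 4.1 for the example
# above (phi = not(a or b) with a := x >= 3, b := y <= 0.5, x(0) = 1, y(0) = 2).
def earliest_crossing(current_value, rate_bound, threshold):
    """Earliest local-time offset at which a signal currently at current_value,
    with rate bounded by rate_bound, can reach the value threshold."""
    return abs(threshold - current_value) / rate_bound

tau_1 = earliest_crossing(1.0, 1.0, 3.0)   # a = true (x >= 3) cannot hold before 2.0
tau_2 = earliest_crossing(2.0, 3.0, 0.5)   # b = true (y <= 0.5) cannot hold before 0.5
tau = max(tau_1, tau_2)                    # monitor x|[tau, T] and y|[tau - eps, T + eps]
print(tau_1, tau_2, tau)                   # 2.0 0.5 2.0
```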
4.5 Case Studies and Evaluation
In this section, we evaluate our technique using case studies on networks of autonomous ground and aerial vehicles, as well as a water distribution system.

4.5.1 Case Study 1: Network of Ground Autonomous Vehicles
We collected data from two 1/10th-scale autonomous cars competing in a race around a closed track. Each car carries a LiDAR for perceiving the world, and uses Wi-Fi antennas to communicate with the central monitor. Each car runs a model predictive controller to track its racing line and RRT to adjust its path. The trajectory data is sampled at 25Hz. In this application, the useful signal length to monitor is 1-2s, as this is the control horizon (i.e., the controller repeatedly plans the next 1-2s). Thus, in Eq. (4.1), T = 1-2s. A reasonable range for ε is the interval [1, 5]ms, guaranteed by ROS clock synchronization based on NTP. Unless otherwise indicated, we monitor the predicate d(x1, x2) > δ ∧ d(x1, x2) ≤ ∆.

4.5.2 Case Study 2: Network of UAVs
We use Fly-by-Logic [100], a path planner software for UAVs, to simulate the operation of two UAVs performing various reach-avoid missions. In a reach-avoid mission, each UAV must reach a goal within a deadline, and must avoid static obstacles as well as other UAVs. The path planner uses a temporal logic robustness optimizer to find the most robust trajectory. The trajectories are sampled at 20Hz. In this application, the useful signal length to monitor is around 2s, as this is the UAV's 'reaction time' (depending on current speed). Thus, in Eq. (4.1), T ≈ 2s. A reasonable range for ε is again 1-5ms, guaranteed by ROS. Unless otherwise indicated, we monitor the predicate d(x1, x2) ≥ δ.

4.5.3 Case Study 3: Water Distribution System
We use a model of a hybrid dynamic high-pressure water distribution system consisting of two water tanks. Each water tank has an inlet pipe connected to an external water source, and an outlet pipe with a valve that can be used to regulate high-pressure water outflow from the tank. A controller on each water tank operates its valve, and samples the outflow pressure at 20Hz using its local clock. We model such a system in Simulink as a simplified emulation of the Refueling Water Storage Tanks (RWST) module of an Emergency Core Cooling System (ECCS) of a Pressurized Water Reactor Plant [118], as shown in Figure 1.1. The ECCS is tasked with providing core cooling to minimize fuel damage following a 'loss of coolant' accident by administering high-pressure water injection from the RWST. The water tanks, and by extension their controllers, operate even when the supply of power to the plant is lost. As a failsafe, the ECCS also incorporates Cold Leg Accumulators that do not require power to operate. These tanks contain large amounts of borated water with a pressurized nitrogen gas bubble at the top. If the outflow pressure drops below a certain threshold, the nitrogen forces the borated water out of the tank and into the reactor coolant system. A reasonable range for ε is 5ms-500ms [13], depending on how often the local clocks of the water tanks are synced with global time. In this case study, we monitor the property that the cumulative pressure of the RWSTs always remains above a certain threshold.

Figure 4.5 Impact of signal segmentation on run time with varying signal duration (S.D.) and fixed ε = 0.001s: (a) network of cars, (b) network of UAVs.
Figure 4.6 Best run time (network of cars) for different signal durations.
Note that the SMT solver's effort is mostly spent on finding a retiming rather than on the predicate's complexity. Thus, we pick simple predicates for our experiments.

4.5.4 Experimental Setup
In our experiments, we choose the following parameters: (1) signal duration, (2) maximum clock skew ε, and (3) distribution of communication among agents. We measure the monitor run time. All experiments are replicated to exhibit a 95% confidence interval to provide statistical significance. The experimental platform is a CentOS server with 112 Intel(R) Xeon(R) Platinum 8180 CPUs @ 3.80GHz and 754GB of RAM. Our implementation invokes the SMT solver Z3 [97] to solve the problem described in Section 4.3.

4.5.5 Analysis of Results
Impact of signal segmentation. Given a signal to be monitored, we have a choice of either passing the entire signal to the monitor, or chopping it into segments and monitoring each segment separately (while accounting for ε-synchrony). Monitoring a signal in one shot is computationally more expensive than monitoring a number of shorter segments. Figure 4.5 shows the results supporting this claim. Note that all curves are plotted in log2 scale to provide more clarity. As can be seen, for any signal duration, chopping the signal and invoking the monitor for the shorter segments reduces the run time significantly. For example, in the case of the UAV network (Figure 4.5b), for a signal duration of 2s, it takes 4.5s to monitor the signal in one shot, but only 0.55s if the monitor is invoked 20 times over the signal duration. We observe the same behavior in Figure 4.5a. This is due to the SMT solver having to deal with much smaller search spaces in each invocation.

Figure 4.6 shows the best achievable run time for different signal durations by searching over segment counts in the range [1, 25]. For example, a segment count of 4 yields the minimum run time of 0.17s for a 1s signal, while a segment count of 18 yields the minimum run time of 0.72s for a 5s signal. The best run time shown is achieved by distributing the monitoring tasks across all the available cores (4) on the monitoring device. Notice that our predicate detection algorithm can be parallelized trivially, by assigning one segment or a pool of segments to each core. An important consequence of segmentation is that it enables us to monitor signals in real time, since for 3 or more segments the run time of the monitor is less than the signal duration. For this reason, in all remaining experiments, the signal to be monitored is chopped into 20 segments and each segment is monitored separately. Cumulative run times (of monitoring all 20 segments) are reported.
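The segment-level parallelization mentioned above can be sketched as follows; the per-segment check is a stand-in placeholder for the actual SMT invocation.

```python
# Minimal sketch of segment-level parallelization: segments are independent, so
# they can be dispatched to a pool of workers. monitor_segment is a placeholder
# for the per-segment SMT check, not the actual implementation.
from multiprocessing import Pool

def monitor_segment(segment):
    """Return True iff a violation is found in this segment
    (stand-in logic: flag any pair of samples that are suspiciously close)."""
    events, epsilon = segment
    return any(abs(x - y) < 0.5 for x, y in events)

def monitor(segments, workers=4):
    with Pool(workers) as pool:
        return any(pool.map(monitor_segment, segments))

if __name__ == "__main__":
    segments = [([(1.0, 2.0), (1.4, 1.6)], 0.001), ([(0.2, 3.0)], 0.001)]
    print(monitor(segments))   # True: the first segment triggers the stand-in check
```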
Impact of clock skew. We now study the impact of different choices of ε on the monitoring run time. We choose realistic values for ε with millisecond resolution. Figure 4.7 shows the monitoring run time for a 2s signal chopped into 1-20 segments. Both Figures 4.7a and 4.7b show that high-resolution clock synchronization results in very stable execution time for the monitor.

Figure 4.7 Impact of clock skew on run time (signal duration = 2s): (a) network of cars, (b) network of UAVs.

This is a positive result, showing that for practical clock synchronization algorithms, the actual value of ε does not have an impact on the monitoring overhead. However, ε naturally has an impact on the number of violations detected, specifically false positives. To demonstrate this, we model the paths of a pair of UAVs and a pair of cars, where the agents periodically reside within the given mutual separation threshold and violate the mutual separation property. Tables 4.1 and 4.2 show the results for two cars and two UAVs, respectively, in operation for half an hour. The experiments report (1) the number of True Violations, a baseline that was reverse-calculated from the introduced clock drift ε; (2) the number of Detected Violations using our method; and (3) the number of False Positives, which is the difference between the detected violations and the true violations. Note that there were no False Negatives.

Table 4.1 Impact of clock skew in the network of cars on verdicts using varying ε.
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      6     3    2     4    1     4    4     5    3     3
Detected Violations:  13    19   29    41   46    52   60    70   80    89
False Positives:      7     16   27    37   45    48   56    65   77    86

Table 4.2 Impact of clock skew in the network of UAVs on verdicts using varying ε.
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      6     6    8     4    2     1    7     2    5     6
Detected Violations:  11    20   30    39   46    48   62    66   76    84
False Positives:      5     14   22    35   44    47   55    64   71    78

Furthermore, as the maximum clock skew is increased from 0.05s to 0.5s, the number of False Positives naturally increases as well.

Impact of number of agents. We now observe the impact of the number of UAVs on the monitor. Figure 4.8a shows the effect on run time of increasing the number of agents from 2 to 10 with ε = 1ms over 5s-long signals. As each segment of a signal can be monitored independently, we improve our run time by distributing the monitoring tasks across all available cores on the monitoring device. Observe that initially the run time improves drastically as more segments are used. However, eventually the improvement becomes negligible, due to the run time being dominated by non-SMT tasks, such as creating job queues, allocating jobs to cores, and so on. We refer to this run time as the best run time. Figure 4.8b shows the best run times for different numbers of agents with ε = 1ms over 5s-long signals.

Figure 4.8 Impact of agents on run time: (a) signal duration = 5s and ε = 0.001s, (b) signal duration = 5s and ε = 0.001s.

Impact of communication. We examine whether the number of messages exchanged between agents has a significant impact on monitor run time. Two opposing mechanisms exist: on the one hand, messages impose an order between the send and receive moments and so reduce concurrency; in the discrete-time setting this normally reduces the asynchronous monitoring complexity. On the other hand, messages result in extra constraints in the SMT encoding via Eq. (4.7), which could increase SMT run time. Figure 4.9 shows the results. In (a) we use ε = 1ms and a 1s-long signal. Run time varies with no clear trend, suggesting that neither of the above two opposing mechanisms dominates. In (b), we use ε = 2s for a 2s-long signal, i.e., all events are concurrent.
One can see that the order introduced by messages slightly increases the runtime, instead of decreasing it. No firm conclusion can be drawn, and future work should study this more closely.

Figure 4.9 Impact of communication (between two agents) on run time: (a) signal duration = 1s and ε = 1ms, (b) signal duration = 2s and ε = 2s.

Impact of piece-wise quadratic signals. We now compare the effect on run time of piece-wise quadratic signals against piece-wise linear signals. To this end, we consider 1s-long signals for each signal model, generated by the network of cars with ε = 0.001s. For quadratic signals, each formula is constructed by the corresponding agent with the help of an SMT solver, using the signal value at the current local time and the signal values of the last two samples. This formula is then sent to the monitor. The formulas are constructed at their corresponding agents instead of at the monitor because solving quadratic equations for all agents at each sample point can become an expensive task for the monitor, especially for a higher number of agents. In Figure 4.10 we observe the runtime for varying segment counts for both signal models. The runtime for piece-wise quadratic signals is generally higher than for their piece-wise linear counterparts, because quadratic segments are described by three discrete sample points (and are therefore longer) whereas linear segments are described by two (and are therefore shorter). In exchange for this cost in runtime, we achieve better accuracy, as a piece-wise quadratic signal model is a more accurate signal representation than its piece-wise linear counterpart (recall Figure 4.3).

Figure 4.10 Run time (network of cars) vs. segment count for piece-wise linear and piece-wise quadratic signals.

Impact of knowledge of dynamics bounds. Here the predicate of interest is φ = (v1 > 1.6) ∨ (v2 > 1.3), where vi is the velocity of the ith car. The acceleration limit from the system dynamics is 1m/s². The monitor samples the received signals (Figure 4.11a) at 0.25s intervals and applies the acceleration bounds as explained in Section 4.4 to discard irrelevant pieces of the signal. As shown in Figure 4.11, applying Algorithm 4.1 clearly reduces the monitor run time. In general, of course, the exact run time reduction varies. For instance, while the speedup is 10x for 3s-long signals, it is 15x for 2s-long signals.

Figure 4.11 Impact of Algorithm 4.1 on monitoring run time (ε = 0.001s): (a) velocity profile of two cars, (b) run time vs. signal duration.

Impact of segment duration and number of water tanks. Let P1 and P2 denote the outflow pressures indicated by the respective valve controllers attached to each water tank. For simplicity, we assume all the pipes have the same diameter. Therefore, the pressure exerted on the Cold Leg Accumulators is P1 + P2.
In this experiment, we monitor the property φP that, during an emergency, the outflow pressure remains above the threshold pressure of 600psig [117], that is, φP = P1 + P2 > 600psig.

Figure 4.12 Effect of segment duration and the number of water tanks on runtime when ε = 0.05s.

Figure 4.12 shows the effect on runtime of increasing the number of water tanks from 2 to 4 with ε = 0.05s, over segment durations ranging from 1s to 5s. As expected, both the segment duration and the number of water tanks contribute to driving up the runtime. We note that even when the monitor receives the distributed signals sent by the water tanks at a reasonable 1s interval, the monitor is still able to verify the property in under about half a second for four water tanks.

Impact of clock skew. We now study the impact of different choices of ε on the monitoring verdicts. To this end, we model two Refueling Water Storage Tanks with intentional 'faults', where the outflow pressure of either water tank can drop below the threshold pressure of the Cold Leg Accumulators. Therefore, if at some moment in time both tanks' pressures fall simultaneously, the Cold Leg Accumulators get triggered. We also introduce a clock drift in the valve controller of one of the water tanks. We choose realistic values for the clock drift with millisecond resolution.

Table 4.3 shows the results for two water tanks that were active for an hour. During this operation time, Tank 1 reported low pressures for a total of 35.5 seconds, and Tank 2 reported low pressures for a total of 36.1 seconds. The experiment reports the number of True Violations as a baseline that was reverse-calculated from the introduced clock drift ε, the number of Detected Violations using our method, and the number of False Positives, which is the difference between the Detected Violations and the True Violations. Note that there were no False Negatives. Furthermore, as the maximum clock skew is increased from 0.05s to 0.5s, the number of False Positives naturally increases as well.

Table 4.3 Impact of clock skew in water tanks on verdicts using varying ε (Tank 1 total low-pressure duration: 35.5s and Tank 2: 36.1s in every row).
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      9     4    12    11   4     7    5     7    10    7
Detected Violations:  25    42   65    80   86    99   112   127  145   160
False Positives:      16    38   53    69   82    92   107   120  135   153

4.6 Conclusion
In this chapter, we demonstrated a new approach to online predicate detection for distributed signals that do not share a global clock. To make the problem tractable, we use causality analysis between real-valued signals, a reasonable assumption of a maximum clock skew among local clocks, and some knowledge of the system dynamics. We also studied the influence of signal dynamics information on monitoring efficiency. By experimenting on a real network of autonomous cars, a simulated network of UAVs, and a simulated water distribution system in Section 4.5, we found that, under certain circumstances, our method can be used to successfully monitor a distributed CPS in an online setting. However, this approach only considers Boolean predicates over distributed CPS, and by extension does not capture more complex specifications, such as nested and/or temporal properties.
In the next chapter, we explore the avenue of monitoring temporal specifications in distributed CPS.

CHAPTER 5
MONITORING SIGNAL TEMPORAL LOGIC IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS

In this chapter, we explore a runtime verification approach for partially synchronous distributed CPS, where we make use of the signal retiming mechanism from the predicate detection technique demonstrated in Chapter 4 and the idea of the progression-based formula rewriting technique demonstrated in Chapter 3. In Chapter 4, we proposed an online predicate monitoring approach for distributed CPS. As mentioned before, that approach can only detect Boolean predicates and, therefore, cannot handle richer formal specification languages.

5.1 Problem Statement

As the distributed agents are partially synchronized within an ε clock skew, a monitoring algorithm must explore all (infinitely many) possible reachable consistent cuts. We call the propagation of consistent cuts with respect to time a consistent cut flow. Our objective is to determine whether there exists some flow of moments that are within ε of each other for which at least one reachable consistent cut results in a violation of a given STL formula. This intuition is formalized below, starting with the notion of a consistent cut flow.

Definition 12. [Consistent cut flow] Let (E, ⇝) be a distributed signal over N agents with time interval [a, b], and S be the set of all events over E. A consistent cut flow is a function ccf : [a, b] → 2^S that maps each time χ ∈ [a, b] to the frontier of a consistent cut at time χ; i.e., ccf(χ) ∈ {front(C) | C ∈ C(χ)}. For each time χ′ ∈ [a, b] and for each n ∈ [N], if χ < χ′, then for all events (cn(χ), xn(cn(χ))) ∈ ccf(χ) and all events (cn(χ′), xn(cn(χ′))) ∈ ccf(χ′), (cn(χ), xn(cn(χ))) ⇝ (cn(χ′), xn(cn(χ′))) holds. ■

Figure 5.1 A valid ccf over three signals x1, x2, x3, with frontiers ccf(0), ccf(1.5), and ccf(3).

Notice that a consistent cut flow induces a vector of N signals that are fully synchronized and thus can be verified against an STL formula φ at time t as (ccf, t) |= φ using the semantics described in Section 2.5. That is, for a consistent cut flow ccf on (E, ⇝), individual signals (x′1, . . . , x′N) can be constructed such that, for all 1 ≤ i ≤ N and for all χ ∈ [a, b], if (ci(χ), xi(ci(χ))) ∈ ccf(χ), then x′i(χ) = xi(ci(χ)). For example, let (E, ⇝) be a distributed signal consisting of signals x1, x2, and x3 as shown in Figure 5.1. For the STL formula G[0,3](x1 + x2 + x3 ≤ 10), ccf is a valid consistent cut flow on (E, ⇝). Note that a distributed signal (E, ⇝) encodes uncountably many consistent cut flows. Let us denote the set of all consistent cut flows by CCF. Our decision problem consists of determining whether there is a violation of a given STL formula by some consistent cut flow.

Definition 13. [Distributed satisfaction] Let φ be an STL formula, (E, ⇝) be a distributed signal over N agents, and CCF be the set of all induced consistent cut flows. We say that ((E, ⇝), 0), or simply (E, ⇝), satisfies φ iff for each σ ∈ CCF, we have σ |= φ. ■

Problem Statement: Given a maximum clock skew ε > 0, a distributed signal (E, ⇝) over N agents, and an STL formula φ, decide whether there exists a consistent cut flow σ ∈ CCF where σ ̸|= φ.

5.2 Monitoring Algorithm

In this chapter, we assume the monitor receives the output signals xn as piece-wise linear signals (this is by choice, and other forms of discretization do not change the core monitoring algorithm).
This transmission happens in segments of length T: at the kth transmission, agent An transmits xn|[(k−1)T, kT], the restriction of its output signal to the interval [(k − 1)T, kT] as measured by its local clock. In the rest of this chapter, we refer exclusively to the signal fragments received by the monitor in a given transmission.

We now revisit the restriction placed on In in Definition 7, namely, that the monitor only deals with non-empty bounded signal fragments xn|[(k−1)T, kT]; therefore, In = [(k − 1)T, kT] for every agent at the kth transmission, measured in local time. By the bounded skew assumption, we have:

Lemma 8. [Bounded skew lemma] For any two agents An, Am with intervals In = [min In, max In] and Im = [min Im, max Im], |min In − min Im| ≤ ε and |max In − max Im| ≤ ε.

Proof. Assume |min In − min Im| > ε. However, both min In and min Im are lower bounds of In and Im, respectively, at the kth transmission. Therefore, by the definition of partial synchrony, the difference of their values must not exceed the maximum clock skew ε, so our assumption is not possible. Thus, |min In − min Im| ≤ ε. Similarly, we can show that |max In − max Im| ≤ ε. ■

Since online monitoring happens in segments, at the end of each segment the monitor either returns ⊤ (formula already satisfied), ⊥ (already violated), or unknown, and the next segment is processed. For simplicity, our solution employs a central monitor. Our monitoring algorithm involves three key ideas: (1) formula progression, (2) signal retiming, and (3) an SMT-based implementation, explained in the following sections.

5.2.1 Formula Progression

Let φ be an STL formula and (E, ⇝) be a distributed signal. Without loss of generality, let this signal be split into two segments: a prefix (E1, ⇝) and a suffix (E2, ⇝). That is, (E, ⇝) = (E1E2, ⇝). The monitor first evaluates φ on (E1, ⇝). If the verdict yields true or false, then this verdict is returned and monitoring for (E, ⇝) is already complete. Otherwise, the monitor computes a new progressed formula φ′ which will be evaluated on the segment (E2, ⇝).

Definition 14. [Formula progression] Let (E1, ⇝) be a finite distributed signal starting at time 0 whose duration is denoted by |(E1, ⇝)|, and (E2, ⇝) be a finite or infinite extension of (E1, ⇝). We say STL formula φ′ is a progression of STL formula φ for (E1, ⇝) if and only if: ((E1E2, ⇝), 0) |= φ ⇔ ((E2, ⇝), 0) |= φ′. ■

It stands to reason that if ((E1, ⇝), 0) |= φ (resp., ((E1, ⇝), 0) ̸|= φ), then the progression of φ is trivially φ′ = ⊤ (resp., φ′ = ⊥).

5.2.2 Signal Retiming

Recall that signals are measured using their local clocks. Since the signals in our setting are partially synchronized within ε, it is not possible to evaluate all signals at the same moment in global time. Rather, the best a monitor can do is explore all valid alignments of the concurrent local moments (i.e., those moments that are within ε of each other) and determine whether at least one such alignment violates the formula. This intuition is formalized below, starting with the notion of a retiming function, borrowed from Chapter 4, that establishes the happened-before relation in the continuous-time setting and stretches or compresses signals to align them with each other within the ε clock skew bound. A valid retiming formalizes the notion of alignment of timelines: given two ε-synchronous timelines (on two agents), we treat moments t and s = ρ(t) as being simultaneous.
Thus, the signal x(t) = [x1(t), x2(ρ(t))] is now a fully synchronous signal. An ε-retiming ρ maps R+ to itself, but the restriction of ρ to a bounded interval I is an increasing function from I to ρ(I) that respects the constraint |t − ρ(t)| < ε for all t ∈ I. Thus, we restrict our attention to ε-retimings on bounded intervals. Between 2 agents, we need one retiming ρ : I2 → I1, and between N agents, we need N − 1 retimings ρn : In → I1. In general there are infinitely many valid retimings, any of which might reveal a potential violation. The next theorem establishes the fundamental condition relating ε-retimings among agents and violation of an STL formula.

Theorem 4. Given a distributed signal (E, ⇝) over N agents and an STL formula φ with time interval [a, b], there exists a violation at time t ∈ R+ if and only if there exist N − 1 ε-retimings ρn : In → I1 that respect ⇝, where 2 ≤ n ≤ N, such that:

((x1, x2 ∘ ρ2⁻¹, . . . , xN ∘ ρN⁻¹), t) ̸|= φ    (5.1)

Here, ρm⁻¹ ∘ ρn : In → Im is an ε-retiming for all n ≠ m, and '∘' denotes the function composition operator, where given two functions f and g, h = g ∘ f is such that h(x) = g(f(x)).

Proof. We distinguish the following cases:

Case 1: Suppose that such retimings exist. We define local time values for each time χ ∈ [t + a, t + b] for agents A1, A2, . . ., AN respectively as tχ1 = c1(χ), tχ2 = ρ2⁻¹(c1(χ)), . . ., tχN = ρN⁻¹(c1(χ)). In other words, tχ1, tχ2, . . ., tχN are the local times of agents A1, A2, . . ., AN, respectively, at global time χ. Furthermore, define Cχ = {(tn, xn(tn)) | tn ≤ tχn, n ≤ N}. By the construction of Cχ and the fact that the retimings respect ⇝, it holds that if e ∈ Cχ and f ⇝ e, then f ∈ Cχ. For every n, m ≥ 2 with n ≠ m, it holds that tχm = ρm⁻¹(ρn(tχn)), so |tχn − tχm| ≤ ε. Thus, Cχ is a consistent cut, and the flow of frontiers front(Cχ), where χ ∈ R+, is a consistent cut flow σ ∈ CCF that witnesses the violation of φ.

Case 2: Suppose σ ∈ CCF is a consistent cut flow that violates φ. By definition, there must be consistent cuts in σ that violate φ. Let Cχ denote such consistent cuts, and let front(Cχ) denote their frontiers. Consider any two events (tn, xn(tn)) and (tm, xm(tm)) in front(Cχ). Since (tn, xn(tn)) ∈ front(Cχ), we have (s, xm(s)) ∈ Cχ for all s such that s + ε ≤ tn. Thus, tm ≥ s for all such s, and so tm ≥ tn − ε. By symmetry of the argument, tn ≥ tm − ε holds as well, implying that a retiming indeed exists. ■

5.2.3 SMT Encoding

We solve the monitoring problem by transforming it into an instance of satisfiability modulo theories (SMT). Specifically, we ask whether there exist N − 1 retimings such that (5.1) holds; equivalently, whether there exists a consistent cut flow that witnesses satisfaction of ¬φ. That is, the distributed signal violates φ iff the corresponding SMT problem is satisfiable. This transformation to SMT solving is the focus of the next section.

5.3 SMT-based Monitoring Algorithm

The SMT formulation part of our solution is constructed by encoding both formula progression and signal retiming into a single SMT-solving problem and then solving it with an SMT solver. First, we define the SMT entities and constraints; we then demonstrate our monitoring approach with two complete examples.
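As a brief aside before the encoding, the two conditions that Theorem 4 places on an ε-retiming can be illustrated directly. The following is a minimal Python sketch (the function name and the sampled representation of ρ are hypothetical): a candidate retiming, given by finitely many (t, ρ(t)) samples, must be increasing and must never move a point by ε or more.

    # Illustrative check of the eps-retiming conditions (not the monitoring code).
    def is_valid_retiming(pairs, eps):
        """pairs: (t, rho(t)) samples of a candidate retiming, sorted by t."""
        for (t, rt), (t2, rt2) in zip(pairs, pairs[1:]):
            if not (t < t2 and rt < rt2):                    # increasing
                return False
        return all(abs(t - rt) < eps for t, rt in pairs)     # |t - rho(t)| < eps

    print(is_valid_retiming([(0.0, 0.4), (1.0, 1.2), (2.0, 2.5)], eps=1.0))  # True
    print(is_valid_retiming([(0.0, 1.5), (1.0, 1.2)], eps=1.0))              # False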
In both examples, we consider a distributed signal (E, ⇝) comprised of two individual 10-time-unit-long signals x1 and x2, generated by agents A1 and A2 respectively, with a clock skew bound ε = 1. Our running examples involve monitoring the formulas φ1 = ¬F[0,10] p and φ2 = ¬F[0,10](p ∧ G[0,5] ¬q).

5.3.1 SMT Entities

In our encoding, the N signals and time intervals are defined in the same fashion as in the mathematical representation of the previous sections. We also include the retiming functions ρn, where 2 ≤ n ≤ N, a consistent cut flow function ccf as an uninterpreted function, and real numbers t, s, and χ. Identifying interpretations of these functions is the output of SMT solving and, hence, the verdict of monitoring. The sampled signal values are constants in the encoding that are known to the monitor: {xn(tn) | tn ∈ In}.

5.3.2 SMT Constraints

Recall from Section 5.1 that (ccf, t) |= p denotes that the consistent cut flow at time t on signals (x′1, . . . , x′N) satisfies the atom p. To express this as an SMT problem, we encode (ccf, t) |= p as f(x′1[t], . . . , x′N[t]) > 0, where (x′1[t], . . . , x′N[t]) ∈ R^N is the vector of signal values at time t, and f : R^N → R is a function that evaluates a vector of signal values. The SMT constraints are primarily comprised of (1) a set of constraints that ensures a valid consistent cut flow, (2) a set of constraints that finds a violation, and (3) a set of constraints that enforces valid retimings under a given clock skew.

Consistent cut flow constraints. In order to ensure that ccf identifies a valid consistent cut flow on (E, ⇝) over the time interval [a, b], we first define the happened-before relation (⇝) in SMT according to Definition 7, and ensure that the events in the consistent cuts mapped by ccf respect the happened-before relation:

SMT_flow1 = ∀χ ∈ [a, b]. ∀(tn, xn(tn)), (t′n, xn(t′n)) ∈ E. (((t′n, xn(t′n)) ⇝ (tn, xn(tn))) ∧ ((tn, xn(tn)) ∈ ccf(χ))) ⇒ ((t′n, xn(t′n)) ∈ ccf(χ)).

We also require that the consistent cuts mapped by ccf always increase and never intersect:

SMT_flow2 = ∀χ, χ′ ∈ [a, b]. ∀n ∈ [N]. (χ < χ′ ⇒ cn(χ) < cn(χ′)).

Thus, the SMT constraint for a consistent cut flow is the following: SMT_flow = SMT_flow1 ∧ SMT_flow2.

Retiming constraints over ccf. We ensure

SMT_retime1 = ∀χ ∈ [a, b]. ∀c2(χ) ∈ I2. ∃c1(χ) ∈ I1. (ρ(c2(χ)) = c1(χ)) ∧ (|c1(χ) − c2(χ)| < ε),

and that ρ is always increasing:

SMT_retime2 = ∀χ, χ′ ∈ [a, b]. ∀c2(χ), c2(χ′) ∈ I2. (c2(χ) < c2(χ′) ⇒ ρ(c2(χ)) < ρ(c2(χ′))).

When there are more than 2 agents, we must also encode the constraint that for all n ≠ m, ρm⁻¹ ∘ ρn is an ε-retiming. Therefore, for all n ≠ m, denoting by fm the uninterpreted function that represents the inverse of the uninterpreted ρm:

SMT_retime3 = ∀t ∈ In. fm(ρn(t)) = t.

Thus, the SMT constraint for signal retiming is the following: SMT_retime = SMT_retime1 ∧ SMT_retime2 ∧ SMT_retime3.

Figure 5.2 Conversion of STL syntax trees to their corresponding SMT syntax trees: (a) φ U[a,b] ψ, (b) φ R[a,b] ψ, (c) G[a,b] φ, (d) F[a,b] φ, (e) ¬p.

Figure 5.3 SMT syntax trees of the STL formulas ¬φ1 and ¬φ2: (a) ¬φ1 = F[0,10] p, (b) ¬φ2 = F[0,10](p ∧ G[0,5] ¬q).
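To give a flavor of how the retiming constraints are handed to a solver, the following is a small sketch using the z3-solver Python bindings (our implementation also relies on Z3). It is a simplification: the retiming is constrained only at finitely many sample times of the segment rather than over the dense interval, and the consistent-cut-flow and violation constraints are omitted.

    # Sketch of SMT_retime1 and SMT_retime2 at finitely many sample times (Python z3).
    from z3 import Real, Solver, And, sat

    eps = 1.0
    samples = [0.0, 2.5, 5.0, 7.5, 10.0]                      # local times of A2 in I2
    rho = [Real(f'rho_{i}') for i in range(len(samples))]     # rho(samples[i]) in I1

    s = Solver()
    for i, t in enumerate(samples):
        # SMT_retime1: the retimed point stays within eps of the original time
        s.add(And(rho[i] - t < eps, t - rho[i] < eps))
    for i in range(len(samples) - 1):
        # SMT_retime2: rho is increasing
        s.add(rho[i] < rho[i + 1])

    assert s.check() == sat
    print(s.model())     # one admissible alignment of A2's timeline onto A1's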
Violation constraints over (E, ⇝). Let γφ be the syntax tree representation of an STL formula φ, where each internal node represents an operator and each leaf node represents an atomic proposition. We convert γφ to its SMT syntax tree representation τφ. An SMT syntax tree τφ is a tree obtained from an STL syntax tree γφ by replacing each temporal operator in the non-leaf nodes of γφ with its corresponding SMT encoding. In τφ as well, each leaf represents an atomic proposition. The purpose of converting an STL formula φ to its SMT syntax tree representation τφ is to be able to easily manipulate the syntax tree and parse its corresponding SMT encoding. Figure 5.2 shows the process of converting all five subtrees with STL operators to their corresponding SMT syntax tree representations. For nested formulas, this process is done for every formula in the STL syntax tree, starting from the root of the tree.

For example, Figures 5.3a and 5.3b show the SMT syntax trees created for ¬φ1 and ¬φ2 using the technique shown in Figure 5.2. Let τ¬φ1 (resp., τ¬φ2) be the SMT syntax tree created from ¬φ1 (resp., ¬φ2). Let us first consider the case where the monitor has the whole distributed signal (E, ⇝) (i.e., no segmentation); the case of a segmented signal will be handled by the formula progression explained in Section 5.3.3. Thus, we keep the SMT syntax trees unchanged and denote the corresponding SMT constraint by SMT_τφ. From Figure 5.3a, for ¬φ1, the distributed signal (E, ⇝), and the SMT syntax tree τ¬φ1, we have:

SMT_τ¬φ1 = ∃i ∈ [0, 10]. ((ccf, i) |= p).

Recall from the beginning of this section that '(ccf, i) |= p' is replaced with f(·) > 0 in the SMT constraint. For ¬φ2, we have:

SMT_τ¬φ2 = ∃i ∈ [0, 10]. (((ccf, i) |= p) ∧ (∀j ∈ [0 + i, 5 + i]. ((ccf, j) |= ¬q))).

Putting everything together. The final SMT constraint is the following:

FinalSMT = SMT_flow ∧ SMT_retime ∧ SMT_τ¬φ.

Since there is a logical equivalence between an STL formula φ and its corresponding SMT encoding SMT_τφ, for any given distributed signal (E, ⇝) over N agents, we have (E, ⇝) ̸|= φ if and only if FinalSMT is satisfiable (assuming all time intervals of temporal operators are within [0, |(E, ⇝)|]).

5.3.3 Formula Progression

We now consider the case where the monitor does not have the entire distributed signal and receives it in segments, or where the time intervals of some temporal operators are not within [0, |(E, ⇝)|]. Given a segment (E, ⇝) and a formula φ, our goal is to obtain a progressed formula φ′ such that any (finite or infinite) extension (E′, ⇝) will be evaluated against φ′.

  Data: SMT syntax tree τφ, partition time t
  Result: SMT syntax tree τ′φ
  Let rootτ be the root node of τφ and nτ be a node
  Function PartitionTree(nτ):
    if nτ has a quantifier with range '[a, b]' then
      if a < t ≤ b then
        Let n′τ be an empty node
        if nτ has quantifier '∀' then label n′τ as '∧'
        if nτ has quantifier '∃' then label n′τ as '∨'
        n′τ.leftchild ← copy of the subtree rooted at nτ, with '[a, min(b, t))' as its quantifier range
        n′τ.rightchild ← copy of the subtree rooted at nτ, with '[max(a, t), b]' as its quantifier range
        if nτ ≠ rootτ then nτ.parent.child ← n′τ else rootτ ← n′τ
      end
    end
    foreach child of nτ do PartitionTree(child)
    return
  PartitionTree(rootτ)
Algorithm 5.1 Function Λ.
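For concreteness, the partitioning idea of Algorithm 5.1 can be sketched in a few lines of Python. The representation below is hypothetical (a quantifier node is a small dictionary rather than a full SMT syntax tree), and it ignores the open/half-open endpoint bookkeeping of the algorithm; it only shows how a quantifier whose range straddles the partition time is split into a Boolean node over the two sub-ranges.

    # Hypothetical sketch of the tree partitioning performed by Function Lambda.
    def partition(node, t):
        """node: either a string (atomic proposition) or a dict with keys 'op'
        ('exists', 'forall', 'and', 'or'), bounds 'lo'/'hi' for quantifiers, and
        'children' (a list of sub-nodes)."""
        if isinstance(node, str):
            return node
        children = [partition(c, t) for c in node['children']]
        new = dict(node, children=children)
        if node['op'] in ('exists', 'forall') and node['lo'] < t <= node['hi']:
            joiner = 'or' if node['op'] == 'exists' else 'and'
            left = dict(new, lo=node['lo'], hi=min(node['hi'], t))
            right = dict(new, lo=max(node['lo'], t), hi=node['hi'])
            return {'op': joiner, 'children': [left, right]}
        return new

    # Example: the tree for F[0,10] p, i.e. 'exists i in [0,10]. p', split at t = 5
    tree = {'op': 'exists', 'lo': 0, 'hi': 10, 'children': ['p']}
    print(partition(tree, 5))
    # -> an 'or' node over 'exists i in [0,5]. p' and 'exists i in [5,10]. p'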
We define the function Λ that takes as input an SMT syntax tree τφ and a segment duration |(E, ⇝)| and returns as output (see Algorithm 5.1) an SMT syntax tree τ′φ = Λ(τφ, |(E, ⇝)|). We construct an SMT syntax tree τ′φµ from τ′φ such that the following properties hold:

• The root of τ′φµ is the topmost (and leftmost, if there are two) node of τ′φ which has a quantifier label.

• For every subsequent node in τ′φµ, if the node n has the label ∧ or ∨ with children labelled with quantifiers, remove the node and only keep the left child by setting n.parent = n.leftchild.

As examples, let us partition the SMT syntax trees τ¬φ1 (Figure 5.3a) and τ¬φ2 (Figure 5.3b) at time t = 5 using Algorithm 5.1. For τ¬φ1, since the starting node nτ, which is the root node in this case, is labelled '∃i ∈ [0, 10]', we create a node n′τ and label it '∨'. We then create two copies of the tree rooted at nτ, change their ranges to '[0, 5)' and '[5, 10]' respectively, and attach them as the left and right children of n′τ; n′τ is our new nτ. Now, we repeat the process for each child of nτ. However, as none of the children nodes are labelled with quantifiers, τ′¬φ1 = nτ is our desired partitioned tree from τ¬φ1 at time t = 5, shown in Figure 5.4a. Following the same process, we get τ′¬φ2 as our partitioned tree from τ¬φ2 at time t = 5, shown in Figure 5.4b.

Lemma 9. [SMT partition tree lemma] Let (E, ⇝) be a distributed signal and φ be an STL formula. FinalSMT for (E, ⇝) and τφ is satisfiable if and only if FinalSMT for (E, ⇝) and Λ(τφ, |(E, ⇝)|) is satisfiable.

Proof. We distinguish the following cases:

Case 1: First, we consider the base case of this proof, where the formula is an atomic proposition, that is, φ = p.
(⇒) The SMT encoding for (E, ⇝) and τp is: (ccf, 0) |= p. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy p. Now, as the SMT syntax tree for p does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for p remains unchanged, and the SMT encoding using E and τ′φ = Λ(τφ, |E|) is: (ccf, 0) |= p.
(⇐) Trivial.

Case 2: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∧ φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1∧φ2 is: (ccf, 0) |= φ1 ∧ φ2. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy φ1 ∧ φ2. Now, as the SMT syntax tree for φ does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for φ remains unchanged, and the SMT encoding using E and τ′φ1∧φ2 = Λ(τφ1∧φ2, t′) is: (ccf, 0) |= (φ1 ∧ φ2) ∧ true.
(⇐) Trivial.

Case 3: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∨ φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1∨φ2 is: (ccf, 0) |= φ1 ∨ φ2. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy φ1 ∨ φ2. Now, as the SMT syntax tree for φ does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for φ remains unchanged, and the SMT encoding for (E, ⇝) and τ′φ1∨φ2 = Λ(τφ1∨φ2, t′) is: (ccf, 0) |= φ1 ∨ φ2.
(⇐) Trivial.
Case 4: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. We consider the case where the formula is φ = φ1 U[a,b] φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1 U[a,b] φ2 is:

∃i ∈ [a, b]. ((ccf, i) |= φ2 ∧ ∀j ∈ [0, i). ((ccf, j) |= φ1)).

If the above encoding is SAT, then both ∃i ∈ [a, b]. ((ccf, i) |= φ2) and ∀j ∈ [0, i). ((ccf, j) |= φ1) are SAT. For a < |E| ≤ b, this can be written as:

(∃i1 ∈ [a, |E|). ((ccf, i1) |= φ2 ∧ ∀j1 ∈ [0, i1]. ((ccf, j1) |= φ1))) ∨ (∃i2 ∈ [|E|, b]. ((ccf, i2) |= φ2 ∧ ∀j2 ∈ [|E|, b]. ((ccf, j2) |= φ1))).

Note that this is the SMT encoding for (E, ⇝) and τ′φ1 U[a,b] φ2 = Λ(τφ1 U[a,b] φ2, |E|) when a < |E| ≤ b. For any other value of |E|, the SMT syntax tree remains unchanged. When the SMT encoding of τφ1 U[a,b] φ2 is SAT, either (1) φ1 U[a,|E|] φ2 is satisfied, or (2) φ1 is satisfied throughout [0, |E|) and φ1 U[|E|,b] φ2 is satisfied. If φ1 U[a,|E|] φ2 is satisfied, then the first part of the SMT encoding of τ′φ1 U[a,b] φ2 becomes SAT; and if φ1 is satisfied throughout [0, |E|) and φ1 U[|E|,b] φ2 is satisfied, then the second part of the SMT encoding of τ′φ1 U[a,b] φ2 becomes SAT. Therefore, in all possible cases, if the SMT encoding of τφ1 U[a,b] φ2 yields SAT, then the SMT encoding of τ′φ1 U[a,b] φ2 also yields SAT.
(⇐) Trivial.

Case 5: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Finally, we consider the case where the formula is φ = φ1 R[a,b] φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1 R[a,b] φ2 is:

∃i ∈ [a, b]. ((ccf, i) |= φ1 ∧ ∀j ∈ [0, i). ((ccf, j) |= φ2)).

If the above encoding is SAT, then both ∃i ∈ [a, b]. ((ccf, i) |= φ1) and ∀j ∈ [0, i). ((ccf, j) |= φ2) are SAT. For a < |E| ≤ b, this can be written as:

(∃i1 ∈ [a, |E|). ((ccf, i1) |= φ1 ∧ ∀j1 ∈ [0, i1]. ((ccf, j1) |= φ2))) ∨ (∃i2 ∈ [|E|, b]. ((ccf, i2) |= φ1 ∧ ∀j2 ∈ [|E|, b]. ((ccf, j2) |= φ2))).

Note that this is the SMT encoding for (E, ⇝) and τ′φ1 R[a,b] φ2 = Λ(τφ1 R[a,b] φ2, |E|) when a < |E| ≤ b. For any other value of |E|, the SMT syntax tree remains unchanged. When the SMT encoding of τφ1 R[a,b] φ2 is SAT, either (1) φ1 R[a,|E|] φ2 is satisfied, or (2) φ2 is satisfied throughout [0, |E|) and φ1 R[|E|,b] φ2 is satisfied. If φ1 R[a,|E|] φ2 is satisfied, then the first part of the SMT encoding of τ′φ1 R[a,b] φ2 becomes SAT; and if φ2 is satisfied throughout [0, |E|) and φ1 R[|E|,b] φ2 is satisfied, then the second part of the SMT encoding of τ′φ1 R[a,b] φ2 becomes SAT. Therefore, in all possible cases, if the SMT encoding of τφ1 R[a,b] φ2 yields SAT, then the SMT encoding of τ′φ1 R[a,b] φ2 also yields SAT.
(⇐) Trivial. ■

Given a distributed signal (E′, ⇝) and an STL formula φ, the following theorem shows that the subtree τ′φµ of Λ(τ¬φ, |(E, ⇝)|) allows computing the progressed formula by discharging τ′φµ.

Theorem 5. [Partial evaluation theorem] Let (E, ⇝) be a distributed signal and φ be an STL formula. It is the case that (E, ⇝) |= φµ if and only if FinalSMT for (E, ⇝) and τ′φµ is satisfiable.

Proof. Let us assume that τ′φ = Λ(τφ, |E|), that (E, ⇝) |= φµ, and that FinalSMT for (E, ⇝) and τ′φµ is not satisfiable.
This implies that τ′φµ has at least one subtree where the root node is the nth nested quantifier with an interval [αn, βn] and βn > |E|. However, while constructing τ′φµ, only the left child is kept for any node that has the label ∧ or ∨ with children labelled with quantifiers (see Section 5.3.3). Furthermore, in Algorithm 5.1, the maximum range of the quantifier labelled on the left child is min(βn, |E|). Therefore, βn > |E| is not possible, such a subtree cannot exist, and by extension such a τ′φµ cannot exist. Thus, (E, ⇝) |= φµ if and only if FinalSMT for (E, ⇝) and τ′φµ is satisfiable. ■

Simply evaluating FinalSMT for (E, ⇝) and τ′φµ is not enough, as we must ensure that there is no loss of information when modifying τ′φ using the said evaluation results. For example, in Figure 5.4b, since (σ, j2) |= ¬q cannot be evaluated on the first segment, finding only one value of i1 in this segment may lead to a loss of information, as this may ignore other valid values of i1 that are required to evaluate (σ, j2) |= ¬q on the next segment. Note that any modification to τ′φ would naturally occur only in its τ′φµ subtree. To this end, we define a function υ that takes as inputs an SMT syntax tree τ′φµ and a distributed signal (E, ⇝), and returns an SMT syntax tree τ′φυ such that, upon replacing τ′φµ with τ′φυ in τ′φ, τ′φ can sufficiently evaluate (E′, ⇝). In other words, the STL representation of τ′φ becomes the desired progression of φ on (E, ⇝). Before defining υ, we specify the following shorthand notations used throughout its definition:

• 'τφ = p': the root of the tree τφ is labelled p ∈ AP.
• 'τφ = τφ1 X τφ2', where X ∈ {∧, ∨}: the root of the tree τφ is labelled X, and it has two children τφ1 and τφ2.
• 'τφ = G[a,b] τψ': the root of the tree τφ carries the label ∀i ∈ [a, b], and it has a child τψ.
• 'τφ = F[a,b] τψ': the root of the tree τφ carries the label ∃i ∈ [a, b], and it has a child τψ.
• '((E, ⇝), t) |= τφ': at time instance t, FinalSMT for (E, ⇝) and τφ is satisfiable.

Now we define υ case by case for the relevant STL operators:

Atomic propositions. Let τφµ = p for some p ∈ AP. We have:
υ((E, ⇝), τφµ) = ⊤ if ((E, ⇝), 0) |= p, and ⊥ otherwise.

Conjunction. Let τφµ = τφµ1 ∧ τφµ2. We have:
υ((E, ⇝), τφµ) = υ((E, ⇝), τφµ1) ∧ υ((E, ⇝), τφµ2).

Disjunction. Let τφµ = τφµ1 ∨ τφµ2. We have:
υ((E, ⇝), τφµ) = υ((E, ⇝), τφµ1) ∨ υ((E, ⇝), τφµ2).

Figure 5.4 Examples of partitioned SMT syntax trees of the STL formulas ¬φ1 and ¬φ2 at t = 5: (a) partitioned SMT syntax tree for τ′¬φ1; (b) partitioned SMT syntax tree for τ′¬φ2.

Always operator. Let τφµ = G[a,b] τφ′µ. In this case, the transformation of τφµ is fairly straightforward:
υ((E, ⇝), τφµ) = G[a,b] τφ′µ if ∀k ∈ [a, b]. ((E, ⇝), k) |= τφ′µ, and ⊥ if ∃k ∈ [a, b]. ((E, ⇝), k) ̸|= τφ′µ.

Eventually operator. Let τφµ = F[a,b] τφ′µ.
In this case, instead of finding a single time instance where FinalSMT for (E, ⇝) and τφ′µ is satisfiable, a valid range [k, b] must be identified, where k ∈ [a, b] is the earliest time instance at which FinalSMT for (E, ⇝) and τφ′µ is satisfiable:

υ((E, ⇝), τφµ) = F[k,b] τφ′µ with k = argmin over k ∈ [a, b] such that ((E, ⇝), k) |= τφ′µ, and ⊥ if ∀k ∈ [a, b]. ((E, ⇝), k) ̸|= τφ′µ.

Remark 4. Since the Until (Figure 5.2a) and Release (Figure 5.2b) operators are expressed using existential and universal quantifiers in SMT syntax trees, the definition of υ does not need cases for them.

Now that we have defined υ, we state the steps required to compute the progression of an STL formula φ on a distributed signal (E, ⇝) as follows:

• First, we create the SMT syntax tree τφ that corresponds to the STL formula φ using the methods detailed in Figure 5.2. As examples, let us consider the SMT syntax trees for the STL formulas ¬φ1 = F[0,10] p (Figure 5.3a) and ¬φ2 = F[0,10](p ∧ G[0,5] ¬q) (Figure 5.3b).

• Next, we partition τφ at time |(E, ⇝)| using Algorithm 5.1 and obtain τ′φ = Λ(τφ, |(E, ⇝)|), such that τ′φµ is the subtree in τ′φ that can be evaluated on (E, ⇝). In our example, we consider the case where the monitor only has the first 5 time units, that is, |(E, ⇝)| = 5. Figure 5.4a (resp., Figure 5.4b) shows the partitioned SMT syntax tree for Figure 5.3a (resp., Figure 5.3b) at time instance |(E, ⇝)| = 5 with the subtree τ′¬φ1µ (resp., τ′¬φ2µ) that can be evaluated on (E, ⇝).

• Finally, we partially evaluate φ on (E, ⇝) by transforming τ′φµ into τ′φυ = υ((E, ⇝), τ′φµ). The STL representation of this new SMT syntax tree τ′φ is our desired progression of φ on the extension of (E, ⇝).

In our first example, let us assume that p is never true in (E, ⇝). In that case, according to the rules specified for υ, the label of the root of τ′¬φ1µ stays unchanged and its child becomes false. Therefore, the progression becomes (F[0,5) false) ∨ (F[5,10] p), which is F[5,10] p upon simplification. In our second example, let us assume that the minimum i for which ∃i ∈ [0, 5). (((E, ⇝), i) |= p ∧ ∀j ∈ [i + 0, min(i + 5, 5)]. (((E, ⇝), j) |= ¬q)) is satisfied is at time 3.5. In that case, according to the rules specified for υ, the label of the root of τ′¬φ2µ is changed to ∃i1 ∈ [3.5, 5). Therefore, the progression becomes (F[3.5,5)(p ∧ G[0,5] ¬q)) ∨ (F[5,10](p ∧ G[0,5] ¬q)).

5.4 Case Studies and Evaluation

In this section, we evaluate our algorithm for monitoring STL specifications on distributed signals using two case studies.

5.4.1 Case Study 1: Network of UAVs

In a similar manner as in Section 4.5, we use the Fly-by-Logic framework [100], a path planner software for UAVs, to simulate the flight paths of two UAVs that take off after 1.5s, hover, and then land after 4.5s. The trajectories are sampled at 20Hz as xn, yn, and zn coordinates for each UAV An, with ε ranging between 1 and 5 ms.

5.4.2 Case Study 2: Water Distribution System

We use the same model of a hybrid dynamic high-pressure water distribution system consisting of two water tanks that we used in Section 4.5. Therefore, the specifications of the water tank model are identical to those mentioned above. We use an ε range of 5 to 500 ms. However, despite using the same model, we verify the system against STL and observe different results from what we witnessed in Chapter 4.
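Before turning to the experimental setup, the progression computed in the first running example can be summarized by a small self-contained sketch. The representation below is hypothetical and deliberately specialized to F[a,b] p (it is not the SMT-based implementation): it only shows how the verdict on the observed segment selects between 'already satisfied', 'unchanged', and the residual formula F[|E|, b] p.

    # Progression of F[a,b] p after a segment of length seg_len (illustrative only).
    def progress_eventually(a, b, seg_len, p_holds_in_segment):
        """p_holds_in_segment: True iff p held at some time of the segment within [a, b]."""
        if not (a < seg_len <= b):
            return ('F', a, b, 'p')        # nothing evaluable yet
        if p_holds_in_segment:
            return True                    # left disjunct already satisfied
        return ('F', seg_len, b, 'p')      # residual formula F[seg_len, b] p

    print(progress_eventually(0, 10, 5, p_holds_in_segment=False))   # ('F', 5, 10, 'p')
    print(progress_eventually(0, 10, 5, p_holds_in_segment=True))    # True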
5.4.3 Experimental Setup

In our UAV-related experiments, we monitor three STL properties: (1) the mutual separation between UAVs never falls below a threshold; (2) all UAVs take off simultaneously from the standby state and hover at the same altitude; and (3) all UAVs eventually land simultaneously. The monitor receives a distributed signal every second, and we measure its execution time for each formula progression to verify the truthfulness of the given formulas. In our water-tank-related experiments, we simulate a plant failure where the RWST in the ECCS is triggered upon receiving an emergency actuation signal. The monitor receives a distributed signal at varying time intervals from multiple water tanks. Our goal is to find possible violations caused by clock drift, where the water pressure falls below the threshold required to keep the failsafe CLA from triggering. In other words, we want to monitor the property that, during an emergency, the outflow pressure reaches above the threshold pressure and remains above it forever. All experiments are replicated to exhibit 95% confidence intervals to provide statistical significance. The experimental platform is a CentOS server with an Intel(R) Xeon(R) Platinum 8180 CPU at a 3.80GHz clock rate and 754GB of RAM. Our implementation invokes the SMT solver Z3 [97] to solve the problem described in Section 5.3.

Figure 5.5 Effect of the number of segments and agents (2, 3, and 4 agents) on run time for different flight properties: (a) φms (mutual separation), (b) φeh (eventually hover), (c) φel (eventually land).

5.4.4 Analysis of Results

Mutual separation. This property states that the distance between every pair of UAVs in the fleet always remains above a given threshold δ. The corresponding STL formula φms is:

φms = ⋀_{i,j ∈ [N], i ≠ j} G[0,∞] (√((xi − xj)² + (yi − yj)² + (zi − zj)²) > δ).

Figure 5.5a shows the run time of each segment for the evaluation of φms on the distributed signal. In each segment the progression formula remains unchanged. However, the first segment shows minimal run time because the UAVs are stationary throughout the entirety of that segment and therefore require very few 'unique' distance calculations. The run times for the second segment and the last segment are only slightly higher than that of the first segment for the same reason: the UAVs are partially grounded throughout these two segments. Note that despite φms seemingly being a simple STL formula, the average run time per segment is relatively high (compared to the run time of the other formulas) due to the quadratic equations that must be solved.

Eventually hover. This property states that the UAVs in the fleet are eventually (within 2s) airborne and hover within a λ height margin. Formally, the corresponding STL formula φeh is:

φeh = ⋀_{i,j ∈ [N], i ≠ j} (F[0,2] (zi, zj > 0) ⇒ G[0,∞] (|zi − zj| < λ)).

Figure 5.5b shows the run time of each segment for the evaluation of φeh on the distributed signal. The first segment has the lowest run time as the UAVs are stationary. The second segment has a higher run time because (zi, zj > 0) is observed and progression is needed for the following segments, where the progressed formula simply becomes G[0,∞] (|zi − zj| < λ).
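The atomic predicates in these properties are evaluated following the f(·) > 0 convention of Section 5.3.2. As an illustration, the following hypothetical helper computes the mutual-separation atom for a pair of UAV positions and checks it over a whole (already retimed) position vector; it is a sketch, not the evaluation code used in the experiments.

    # Mutual-separation atom following the f(.) > 0 convention (illustrative).
    import math

    def separation_margin(pos_i, pos_j, delta):
        """pos_i, pos_j: (x, y, z) coordinates of two UAVs; delta: threshold."""
        return math.dist(pos_i, pos_j) - delta     # > 0 iff sufficiently separated

    def mutual_separation_holds(positions, delta):
        """positions: list of (x, y, z) for all UAVs at one retimed time point."""
        return all(separation_margin(p, q, delta) > 0
                   for i, p in enumerate(positions)
                   for q in positions[i + 1:])

    print(mutual_separation_holds([(0, 0, 0), (3, 4, 0), (0, 0, 10)], delta=4.0))  # True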
Eventually land. This property states that the UAVs in the fleet eventually land on the ground simultaneously. Formally, the corresponding STL formula φel is:

φel = ⋀_{i,j ∈ [N], i ≠ j} F[2,∞] (zi = 0 ∧ zj = 0).

Figure 5.5c shows the run time of each segment for the evaluation of φel on the distributed signal. The temporal interval of φel is intentionally [2, ∞] instead of [0, ∞], since the UAVs are on the ground at the start of the distributed signal. The run-time behavior shown in this figure is the opposite of what we witnessed in Figure 5.5b. In segments 3 and 4, the UAVs are airborne, and therefore the search space of the SMT problem is exhaustively traversed. However, in segment 5, φel is satisfied and the progression becomes true.

Impact of segment duration and number of water tanks. Let P1, P2, . . . , PN denote the outflow pressures of N water tanks. For simplicity, we assume all the pipes are of the same diameter. Thus, the pressure exerted on the CLA is P1 + P2 + . . . + PN. We monitor the property stating that the outflow pressure remains above the threshold pressure of 600 psig [117] indefinitely. The corresponding STL formula φP is:

φP = G[0,∞] (Σ_{n=1}^{N} Pn ≥ 600).

Figure 5.6 Effect of segment duration and the number of water tanks on run time for φP.

Figure 5.6 shows the effect on run time of increasing the number of tanks from 2 to 4 with ε = 0.05s, over segment durations ranging from 1s to 5s. As expected, both segment duration and the number of tanks drive up the run time. We note that even when the monitor receives the distributed signals sent by the water tanks at reasonable 1s intervals, it is still able to verify the property online in around half a second for four tanks.

Table 5.1 Impact of ε: (a) water tanks, (b) UAVs.

  (a) Water tanks
  Clock Skew (s)   True Violations   Detected Violations   False Positives   False +ve Percentage
  0.05                    9                  11? — see data below
  0.05                    9                  25                  16            64%
  0.10                    4                  42                  38            90.48%
  0.15                   12                  65                  53            81.54%
  0.20                   11                  80                  69            86.25%
  0.25                    4                  86                  82            95.35%
  0.30                    7                  99                  92            92.93%
  0.35                    5                 112                 107            95.54%
  0.40                    7                 127                 120            94.49%
  0.45                   10                 145                 135            93.1%
  0.50                    7                 160                 153            95.63%

  (b) UAVs
  Clock Skew (s)   True Violations   Detected Violations   False Positives   False +ve Percentage
  0.05                    6                  11                   5            45.45%
  0.10                    6                  20                  14            70%
  0.15                    8                  30                  22            73.33%
  0.20                    4                  39                  35            89.74%
  0.25                    2                  46                  44            95.65%
  0.30                    1                  48                  47            97.92%
  0.35                    7                  62                  55            88.71%
  0.40                    2                  66                  64            96.97%
  0.45                    5                  76                  71            93.42%
  0.50                    6                  84                  78            92.86%

Impact of clock skew. In order to study the impact of ε on monitoring verdicts, we model two RWST modules with intentional 'faults', where the outflow pressure of either tank can drop below the threshold pressure of the CLA. Thus, if both tanks' pressures fall simultaneously, the CLA gets triggered. We also introduce a clock drift in the valve controller of one of the tanks. Table 5.1a shows the results for two tanks that were active for an hour. During this time, Tank 1 and Tank 2 reported low pressures for a total of 35.5s and 36.1s, respectively. Although generally we are interested in finding a single violation, in order to demonstrate the effect of clock skew, in this experiment we find multiple violation instances by tallying up the pairs of piece-wise linear interpolations between samples where violations are detected.
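The tallying just described can be sketched as follows. The helper below is hypothetical and coarser than the actual procedure (which interpolates the piece-wise linear signals): it simply counts a detected violation whenever a low-pressure interval of one tank can be made to overlap a low-pressure interval of the other by shifting one timeline by at most ε.

    # Rough sketch of tallying violation instances under a clock-skew bound eps.
    def count_detected_violations(low1, low2, eps):
        """low1, low2: lists of (start, end) low-pressure intervals in local time."""
        count = 0
        for a1, b1 in low1:
            for a2, b2 in low2:
                # the two intervals can be aligned to overlap iff they are within eps
                if a1 < b2 + eps and a2 < b1 + eps:
                    count += 1
        return count

    tank1 = [(10.0, 10.4), (25.0, 25.3)]
    tank2 = [(10.5, 10.9), (40.0, 40.2)]
    print(count_detected_violations(tank1, tank2, eps=0.2))   # 1 (0 if eps were 0)

With ε = 0 the two episodes around t = 10 do not overlap, so the extra detection is precisely the kind of skew-induced false positive reported below.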
We report the number of true violations as a baseline that was reverse-calculated from the introduced clock drift ε, the number of detected violations using our method, and the number of false positives, which is the difference between the detected violations and the true violations.¹ Note that there are no false negatives. Furthermore, as the clock drift is increased from 0.05s to 0.5s, the number of false positives increases as well. Similarly, we model a path for a pair of UAVs, where the agents periodically come within the given mutual separation threshold, thereby violating the mutual separation property. Table 5.1b shows the results for two UAVs in operation for half an hour. We again report the number of true violations, detected violations, and false positives.

¹We emphasize that due to the uncertainty caused by asynchrony, the existence of false positives is inevitable, as there is no global clock to ensure a total order of events.

5.5 Conclusion

In this chapter, we developed a mechanism for monitoring requirements defined in signal temporal logic (STL) for distributed CPS, where the continuous-time and continuous-valued signals from a group of agents do not share a global clock. Our method relies on an off-the-shelf clock synchronization algorithm, such as NTP, to ensure a maximum constrained clock skew across all agents in the system. We also presented a signal retiming approach, borrowed from Chapter 4, that effectively aligns continuous signals in order to detect potential STL specification breaches. To address the complexity, we reduce our runtime monitoring problem to an SMT-solving problem and cut the distributed signals into a series of smaller parts. To that purpose, similar to Chapter 3, we presented a formula progression approach that takes a distributed signal and an STL formula as input and outputs another STL formula that captures the formula's progress over the signals. Furthermore, in Section 5.4, we presented experimental results from the monitoring of an unmanned aerial vehicle (UAV) fleet and a water distribution system. These experiments indicate that, in certain cases, it is indeed possible to monitor STL formulas online on a distributed signal using our technique.

CHAPTER 6
DECENTRALIZED PREDICATE DETECTION OVER PARTIALLY SYNCHRONOUS CONTINUOUS-TIME SIGNALS

In this chapter, we set our sights toward decentralized monitoring of distributed CPS. The natural first step is decentralized monitoring of predicate violations over continuous-time and continuous-valued signals under partial synchrony. To this end, we propose a decentralized monitoring algorithm to detect all Boolean predicates over the analog (i.e., continuous-time and continuous-valued) signals generated by the agents in a distributed CPS. Similar to the approaches described in previous chapters, a clock synchronization algorithm (see Subsection 2.3) guarantees a maximum clock skew across all signals generated by the agents. It is helpful to overview our algorithm and key notions via an example before delving into the technical details.

An example is shown in Figure 6.1. Three agents produce three signals x1, x2, x3. The decentralized detector consists of three local detectors D1, D2, D3, one on each agent. Each xn is observed by the corresponding Dn. The predicate ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) is being detected. It is true over the intervals shown with solid black bars; their endpoints are measured on the local clocks.
The detector only knows that the maximal clock skew is ϵ = 1, but not the actual value, which might be time-varying.

Figure 6.1 An example of a continuous-time distributed signal with 3 agents. Three timelines are shown, one per agent. The signals xn are also shown, and the local time intervals over which they are non-negative are solid black. The skew ϵ is 1. The happened-before relation is illustrated with solid arrows, e.g. e^1_1 ⇝ e^2_2 and e^4_3 ⇝ e^5_2. Some satisfying cuts for the predicate ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) are shown as dashed arcs, and the extremal cuts as solid arcs. All extremal cuts contain root events, and the leftmost cut A also contains non-root events.

Because of clock skew, any two local times within ϵ of each other must be considered potentially concurrent, i.e., they might be measured at a truly synchronous moment. For example, the consistent cut at local times [4, 4.5, 3.6] might have been measured at global time 4, in case the true skews were 0, 0.5, and −0.4, respectively. The detector's task is to find all consistent cuts that satisfy the predicate. In continuous time, there can be uncountably many, as in Figure 6.1; the dashed lines show two satisfying consistent cuts, or satcuts for short. In this example, the detector outputs two satcuts, [1.5, 2, 2.5] and [4, 5, 4], shown as thin solid lines. These two have the special property (shown in this chapter) that every satcut lies between them, and every cut between them is a satcut. For this reason we call them extremal satcuts, a notion that is formally defined later. Thus these two satcuts are a finite representation of the uncountable set of satcuts, and they encode all the ways in which the predicate might be satisfied. We note three further things: the extremal satcuts are not just the endpoints of the intervals; simply inflating each interval by ϵ and intersecting them does not yield the satcuts; and each local detector must somehow learn of the relevant events (and only those) on other agents, to determine whether they constitute extremal satcuts.

In the following sections, we state the necessary technical definitions, establish fundamental properties of the uncountable set of events satisfying the predicate, present a methodology for computing a finite representation of this uncountable set, provide a complexity analysis, and finally describe our implementation and experiments.

6.1 Problem Statement

Before stating the problem, we define the class of predicates we monitor using our decentralized monitoring algorithm. This chapter focuses on specifications expressible as conjunctive predicates φ, which are conjunctions of N linear inequalities:

φ := (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ . . . ∧ (xN ≥ 0).    (6.1)

Figure 6.2 Two satcuts for a pair of agents A1 and A2, shown by the crossed solid lines (s, t′) and (s′, t). Their intersection is (s, t), shown by a dashed arc, and their union is (s′, t′), shown by a dotted arc. For a conjunctive predicate φ, the intersection and union are also satcuts, forming a lattice of satcuts.

These predicates model the simultaneous co-occurrence, in global time, of events of interest, like 'all drones are dangerously close to each other'. Equation 6.1 also captures the cases where some conjuncts are of the form xn ≤ 0 or xn = 0. If N numbers (an) satisfy predicate φ (i.e., are all non-negative), we write this as (a1, . . . , aN) |= φ. Henceforth, we say 'predicate' to mean a conjunctive predicate in this chapter. Definition 15.
[Distributed Satisfaction; SE] Given a predicate φ, a distributed signal (E, ⇝) over N agents, and a consistent cut C of E with frontier front(C) = ((t1, x1(t1)), . . . , (tN, xN(tN))), we say that C satisfies φ iff (x1(t1), x2(t2), . . . , xN(tN)) |= φ. We write this as C |= φ, and say that C is a satcut. The set of all satcuts in E is written SE. ■

6.1.1 Decentralized Predicate Detection

As stated before, our algorithm seeks to find all possible global states that satisfy a given predicate, i.e., all satcuts in SE. In general, SE is uncountable.

Architecture. The system consists of N agents with partially synchronous clocks whose drift is bounded by a known ϵ, generating a continuous-time distributed signal (E, ⇝). Agents communicate in a FIFO manner.

Problem Statement: Given (E, ⇝) and a conjunctive predicate φ, find a decentralized detection algorithm that computes a finite representation of SE.

The detector is decentralized, meaning that it consists of N local detectors, one on each agent, with access only to the local signal xn (measured against the local clock) and to messages received from other agents' detectors. By computing a representation of all of SE (and not some subset), we account for asynchrony and the unknown orderings of events within ϵ of each other. One might be tempted to propose something like the following algorithm: detect all roots on all agents, then see if any N of them are within ϵ of each other. This quickly runs into difficulties: first, a satisfying cut is not necessarily made up of roots; some or all of its events can be interior to the intervals where the xn's are positive (see Figure 6.2). Second, the relation between roots and satcuts must be established: it is not clear, for example, whether even satcuts made only of roots are enough to characterize all satcuts (it turns out they are not). Third, we must carefully control how much information is shared between agents, to avoid the detector degenerating into a centralized solution where everyone shares everything with everyone else.

6.2 The Structure of Satisfying Cuts

We establish fundamental properties of satcuts. In the rest of this chapter we exclude the trivial case C = E. Proposition 1 mirrors a discrete-time result [26].

Proposition 1. The set of satcuts for a conjunctive predicate is a lattice where the join and meet are the union and intersection operations, respectively.

Proof. Define the intersection I = C ∩ C′ and let e be an element of I. Then, by the definition of a cut, every event that happened-before e is in C and in C′, and therefore is in their intersection, so I is a cut. The frontier of I is made of events (tn, xn(tn)) such that tn = max{t | (t, xn(t)) ∈ C ∩ C′}. In words, (tn, xn(tn)) is the last event on signal xn belonging to both satcuts, which implies it is the last event on at least one of the cuts, say C. Therefore (tn, xn(tn)) is on the frontier of C, and so xn(tn) ≥ 0 by the definition of a conjunctive predicate. Since this is true for every n in [N], the frontier of I is a consistent state that satisfies the predicate, and so I |= φ. The union C ∪ C′ is also a satcut by similar arguments, so the set of satcuts is a lattice. ■

We show that the set of satcuts is characterized by special elements, which we call the leftmost and rightmost cuts.

Definition 16. [Extremal cuts] Let SE be the set of all satcuts in a given distributed signal (E, ⇝).
For an arbitrary C ∈ SE with frontier (e^{tn}_n)_n and a positive real α, define C − α to be the set of cuts whose frontiers are given by (e^{t1−δ1}_1, e^{t2−δ2}_2, . . . , e^{tN−δN}_N) such that for all n: 0 ≤ δn ≤ α, and ∃n. δn > 0. A leftmost satcut is a satcut C ∈ SE for which there exists a positive real α such that C − α and SE do not intersect. A rightmost cut C (not necessarily satisfying) is one for which there exists a positive real α such that C + α and SE do not intersect, and C − α ⊂ SE. We refer to leftmost and rightmost (sat)cuts as extremal cuts. ■

Intuitively, C − α is the set of all cuts one obtains by slightly moving the frontier of C to the left by amounts less than α. If doing so always yields non-satisfying cuts, then C is a leftmost satcut. Analogous intuition applies to rightmost cuts. If the signals xn are all continuous, then rightmost cuts are all satisfying as well. In a signal, there are multiple extremal cuts. Figure 6.2 suggests, and Lemma 10 proves, that all satcuts live between a leftmost satcut and a rightmost cut.

Lemma 10. [Satcut intervals] Every satcut of a conjunctive predicate lies in-between a leftmost satcut and a rightmost cut, and there are no non-satisfying cuts between a leftmost satcut and the first rightmost cut that is greater than it.

Proof. Let C be a satcut, so that xn(tn) ≥ 0 for every (tn, xn(tn)) in its frontier. Let sn be the biggest shift backwards in time preserving positivity:

sn := sup{s | s ≥ 0 and ∀ 0 ≤ σ ≤ s. xn(tn − σ) ≥ 0}.    (6.2)

By the starvation-freedom assumption derived from Assumption 2.2, sn is finite, and by the right-continuity of xn, xn(tn − sn) ≥ 0. Now the cut with frontier (e^{tn−sn}_n)_n satisfies the predicate, but might not be consistent because it could be that |tn − sn − (tm − sm)| > ϵ for some n, m. Suppose without loss of generality that t1 − s1 is the largest of all the tn − sn. Define bn = max(tn − sn, t1 − s1 − ϵ) for all n > 1. Note that bn ≤ tn because C is consistent (t1 − tn ≤ ϵ ≤ ϵ + s1, and so bn = t1 − s1 − ϵ ≤ tn, whereas the other case, bn = tn − sn ≤ tn, is immediate), and xn(bn) ≥ 0. Then the cut L with frontier (e^{t1−s1}_1, e^{b2}_2, . . . , e^{bN}_N) is consistent and satisfies the predicate. It is also leftmost by the construction of s1. Therefore L is a leftmost satcut.

The reasoning for rightmost cuts follows the above lines, except for predicate satisfaction. Namely, let sn now be the biggest shift forwards in time preserving positivity:

sn := sup{s | s ≥ 0 and ∀ 0 ≤ σ ≤ s. xn(tn + σ) ≥ 0}.    (6.3)

By the starvation-freedom assumption derived from Assumption 2.2, sn is finite. Now the cut with frontier (e^{tn+sn}_n)_n might not be consistent because it could be that |tn + sn − (tm + sm)| > ϵ for some n, m. Suppose without loss of generality that t1 + s1 is the smallest of all the tn + sn. Define bn = min(tn + sn, t1 + s1 + ϵ) for all n > 1. Note that bn ≥ tn because C is consistent. Then the cut R with frontier (e^{t1+s1}_1, e^{b2}_2, . . . , e^{bN}_N) is consistent, but does not necessarily satisfy the predicate because of possible discontinuities (namely, if tn + sn is a point of discontinuity for xn, then possibly xn(tn + sn) < 0). R is also rightmost by the construction of s1 and the bn. Therefore R is a rightmost cut. Thus every satcut is between a leftmost satcut and a rightmost cut. Also, by the construction of L and R (specifically, Equations 6.2 and 6.3), there is no cut in-between that does not satisfy the predicate. That is, there is no C such that L ⊑ C ⊑ R and C ̸|= φ. (Here ⊑ is the ordering relation on the lattice of cuts.) ■
Thus we may visualize satcuts as forming N-dimensional intervals with endpoints given by the extremal cuts. The main result of this section states that there are finitely many extremal satcuts in any bounded time interval, so the extremal satcuts are the finite representation we seek for SE.

Theorem 6. A distributed signal has finitely many extremal satcuts in any bounded time interval.

In order to prove the above theorem, we will need the following definitions: the leftmost event of a cut C is an event e^t_n ∈ front(C) where t ≤ t′ for all other events e^{t′}_m ∈ front(C). With β a real number, an event e^{t′}_m is said to be β-offset from e^t_n if and only if t′ = t + β. We will need the following three lemmas.

Lemma 11. The leftmost event of a rightmost cut is a right root.

Proof. Consider the leftmost event e^t_n of a rightmost cut C. Because C is rightmost, xn(t − δ) ≥ 0 for all sufficiently small positive δ. Assume for a contradiction that e^t_n is not a right root, so xn(t + α) ≥ 0 for all sufficiently small α ≥ 0, say all α strictly less than some ᾱ. Since e^t_n is leftmost, we can add the events e^{t+α}_n, 0 ≤ α ≤ min{ϵ/2, ᾱ/2, γ}, to C to form a new cut C′. Choosing γ small enough guarantees that C′ is consistent. C′ is also satisfying because we only added events such that xn(t + α) ≥ 0. This shows C is not a rightmost cut, which contradicts our choice of C. ■

Lemma 12. All events of the frontier of a rightmost cut are either right roots or ϵ-offset from a right root.

Proof. Let e^t_n be the leftmost event in the frontier of a rightmost cut C. By Lemma 11 this event is a right root. Now consider any other event e^{t′}_m ∈ front(C) which is not a right root, and assume for contradiction that t′ ≠ t + ϵ. Then xm(t′) ≥ 0 and (as in the proof of Lemma 11) xm(t′ + α) ≥ 0 for all sufficiently small α. If t′ < t + ϵ, then it is possible to add the events {e^{t′+α}_m | α ∈ [0, γ)} to C, with γ small enough, to obtain a satcut to its immediate right, which contradicts C being rightmost. On the other hand, if t′ > t + ϵ, this contradicts that e^t_n and e^{t′}_m are part of the same frontier. Thus t′ = t + ϵ. ■

The next lemma (and its proof) parallels Lemmas 11 and 12, but for leftmost satcuts.

Lemma 13. The rightmost event of a leftmost satcut is a left root. Moreover, every event of the frontier of a leftmost satcut is either a left root or is (−ϵ)-offset from a left root.

Thus every extremal satcut has a left root or a right root as one of its constituent events. Since there are only finitely many roots in any bounded interval, this gives us the desired conclusion. Therefore, it is conceivably possible to algorithmically recover the extremal satcuts, and therefore all satcuts by Lemma 10. The rest of this chapter shows how.

6.3 The Abstractor Process

Having captured the structure of satcuts, we now define the distributed abstractor process that turns our continuous-time problem into a discrete-time one, amenable to further processing by our modified version of the slicer algorithm of [26]. This abstractor also has the task of creating a happened-before relation. We first note a few complicating factors. First, this will not simply be a matter of sampling the roots of each signal. That is because extremal satcuts can contain non-root events, as shown in Figure 6.1. Thus the abstractor must somehow find and sample these non-root events as part of its operation.
Second, as in the discrete case, we need a kind of clock that allows the local slicer to know the happened-before relation between events; the local timestamp of an event, and existing clock notions, are not adequate for this. Third, to establish the happened-before relation, there is a need to exchange event information between the processes without degenerating everything into a centralized process (by sharing everything with everyone). This complicates the operation of the local abstractors, but allows us to cut the number of messages in half.

6.3.1 Abstractor Description

The abstractor is described in Algorithm 6.1. Its output is a stream of discrete-time events, their correct PVC values, and the relation ⇝ between them, i.e., a discrete-time distributed signal. This signal is processed by the local slicers as it is being produced by the abstractor.

Figure 6.3 A distributed signal of two agents (top) and the output of the abstractor (bottom). The abstractor marks zero-crossings as discrete root events and creates new events (dark circles) to maintain consistency.

  Data: Signal of agent An
  Result: A stream of discrete events which are roots or ϵ-offset from roots
  trigger (found a root e^t_n at local time t):
    add the info of e^t_n (n, t, PVC, left or right root) to the local buffer
    if e^t_n is a right root: for each agent m ≠ n, send the info of e^t_n to agent m
  trigger (received a message about a right root e^t_m from agent Am):
    set t′ := t + ϵ, where ϵ is the maximum clock skew
    create a local event e^{t′}_n (setting the PVC for e^{t′}_n appropriately)
    create the relation e^t_m ⇝ e^{t′}_n
    add the info of e^{t′}_n (n, t′, PVC, left or right root) to the local buffer
    /* Visit events in the buffer, forwarding ones that are ready to the slicer.
       Ready events are those whose PVCs will not be updated anymore; see text for details. */
    for each event e^s_n in the local buffer:
      if An has received at least one message about a right root e^{tk}_k from every other agent Ak such that tk ≥ s:
        set v^s_n[n] = s and v^s_n[k] = s − ϵ for all k ≠ n
        remove e^s_n from the buffer and send it to the local slicer
Algorithm 6.1 Local abstractor for agent An.

The abstractor runs as follows. It is decentralized, meaning that there is a local abstractor running on each agent. Agent An's local abstractor maintains a buffer of discrete events and consists of two trigger processes. The first is triggered when a root is detected (by a local zero-finding algorithm). It stores the root's information in a local buffer (for future processing). If it is a right root, it also sends it to the other agents. The second trigger process fires when the agent receives right-root information from some other process, at which point it does three things: it creates a local discrete event and a corresponding relation ⇝ between events, it updates events in its local buffer to see which ones can be sent to the local slicer process (described later), and then it sends them. It is clear, by construction, that ⇝ is a happened-before relation: it is the subset of ⇝ needed for detection purposes. Before an event e^t_n is sent to the slicer, it must have a PVC that correctly reflects the happened-before relation. This means that all events that happened-before e^t_n must be known to agent An, which uses them to update the PVC timestamps. This happens when events have reached agent An from every other agent, with timestamps that place them after e^t_n. This is guaranteed to happen by the starvation-free assumption.
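The two triggers of Algorithm 6.1 can be sketched as a small Python class. The data structures and the send callback below are hypothetical and simplified (in particular, PVC maintenance and the readiness scan of the buffer are only indicated by a comment); the sketch only shows the buffering of local roots, the broadcast of right roots, and the creation of the ϵ-offset event when a right-root message arrives.

    # Simplified sketch of a local abstractor (hypothetical structures, no real I/O).
    class LocalAbstractor:
        def __init__(self, agent_id, num_agents, eps, send):
            self.n, self.N, self.eps = agent_id, num_agents, eps
            self.send = send      # send(m, payload): deliver payload to agent m
            self.buffer = []      # buffered discrete events awaiting their final PVC
            self.hb = []          # happened-before edges created locally

        def on_root(self, t, kind):                # trigger 1: root found at local time t
            self.buffer.append((self.n, t, kind))
            if kind == 'right':
                for m in range(self.N):
                    if m != self.n:
                        self.send(m, (self.n, t))

        def on_right_root_msg(self, sender, t):    # trigger 2: remote right root received
            t_new = t + self.eps                   # eps-offset event on this agent
            self.buffer.append((self.n, t_new, 'offset'))
            self.hb.append(((sender, t), (self.n, t_new)))
            # a real abstractor would now scan the buffer and forward events whose
            # PVCs can no longer change to the local slicer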
The output of a local abstractor is a stream of discrete events, so that the output of the decentralized abstractor as a whole is a distributed discrete-time signal. See Figure 6.3. Given that all right roots are assigned discrete events by the first trigger, and given that ϵ-offset events are also created from them by the second trigger, we obtain the following.

Theorem 7. All events in rightmost cuts are returned by the abstractor. Moreover, a rightmost cut of E is also a cut of the discrete signal returned by the abstractor.

Thus the slicer process can find the rightmost satcuts when it processes the discrete signal. The leftmost satcuts will be handled by the slicer using the PVCs, as will be shown in the next section. Doing it this way relieves the abstractor from having to communicate the left roots between processes, thus saving on messages and the corresponding wait times.

6.4 The Slicer Process for Detecting Predicates

The second process in our detector is a decentralized slicer process, so-called to keep with the common terminology in discrete distributed systems [53]. The slicer is decentralized: it consists of N local slicers S_n, one per agent. The slicer runs in parallel with the abstractor and processes the abstractor's output as it is produced. Recall that the abstractor's output consists of a stream of discrete events coming from the N agents. These events are either roots or ϵ-offset from roots. If an event is a left root or ϵ-offset from a left root, we will call it a left event; we define right events similarly. We will write F_n for those events, output by the abstractor, that occurred on A_n.

Every slicer S_n maintains a token T_n, which is a constant-size data structure used to keep track of satcuts that contain A_n events. Specifically, for every event e_n^t in F_n, the token T_n is forwarded between the agents, collecting information to determine whether there exists a satcut that contains e_n^t. We say the slicer is trying to complete e_n^t. The token's updates are such that it will find that satcut if it exists, or determine that none exists; either way, it is then reset and sent back to its parent process A_n to handle the next event in F_n.

Let e_n^t be an event that the slicer is currently trying to complete. The token's updates vary depending on whether it is currently completing a left event or a right event. If T_n is completing a right event, the token is updated as follows. The token currently has a cut whose frontier contains e_n^t, which is either a satcut or not. If it is, the token has successfully completed the event and is returned to A_n to handle the next event in F_n. If not, then by the property of regular predicates [26], there exists a forbidden event e_m^s on the frontier of the cut which either prevents the cut from being consistent or from satisfying the predicate. T_n is sent to the process A_m containing this forbidden event. T_n's so-called target event, whose inclusion may give T_n a satcut, is the event on A_m following the forbidden e_m^s. If the token does not find a next event following e_m^s, then the token is kept by S_m until it receives the next event from the abstractor (which is guaranteed to happen under the starvation-free assumption). After the token retrieves the next event, the updates to the token and the progression of S_n follow the CGNM slicer [26]. Space limitations make it impossible to describe the CGNM slicer here, and we refer the reader to the detailed description in [26].

If handling a left event, the token is updated as follows.
First, as before, T_n is sent to the process A_m which generates the forbidden e_m^s — i.e., which prevents T_n from completing e_n^t. T_n's target event may not be the next event on that process following e_m^s: this is because, if e_n^t is a left root, there may exist a left event e_m^{t−ϵ} on A_m which is part of a continuous-time leftmost satcut (by Definition 7), but which was not created by the abstractor. In this case, if the token were to follow the updates for a right event, it would skip a potential satcut. Instead, the slicer S_m will create this event: namely, if S_m sees a new event e_m^{s′} where s′ > t − ϵ, it knows that e_m^{t−ϵ} has not and will not show up (will not be produced by the abstractor), because messages are FIFO. The slicer at this point creates the new event e_m^{t−ϵ}. This is valid since, in continuous time, by definition every moment has a corresponding event on every agent. Once the token retrieves this created e_m^{t−ϵ} as its new target, the updates to the token and the progression of S_n follow the CGNM slicer [26], similarly to the right-event scenario.

Correctness of S. We will show that all extremal cuts of the continuous-time signal are included in the discrete lattice. Since the CGNM slicer computes the discrete lattice, this means in particular that it computes the extremal cuts that are in it. From these extremal cuts, we can then recover the continuous-time satcuts by Lemma 10.

Lemma 14. For all events e_n^t that are left roots, the token T_n incorporates all e_m^{t−ϵ} for all m ≠ n.

Proof. For a left root e_n^t, by Theorem 2 its PVC is v_n^t = [t − ϵ, . . . , t − ϵ, t, t − ϵ, . . . , t − ϵ]. Since token T_n is tasked with identifying consistent cuts, for each m ≠ n it must incorporate the earliest event on A_m which can form a consistent cut with e_n^t. The PVC identifies this event as e_m^{t−ϵ}. Therefore, T_n incorporates all e_m^{t−ϵ} events where e_n^t is a left root on A_n. ■

Lemma 15. The modified slicer processes all events of a leftmost satcut.

Proof. By Lemma 13, all events of a leftmost satcut are either at time t or t − ϵ, where t is the time of a left root. Since by Lemma 14 every token T_n will visit the t − ϵ event for any left root at t on A_n, every t − ϵ event will be processed for any left root. Thus, all events of a leftmost satcut will be processed. ■

Theorem 8. Our slicer returns all extremal cuts.

Proof. The abstractor creates discrete events for all roots, as well as ϵ-offsets from right roots. By Lemma 15, the slicer creates all events of a leftmost satcut. This means that all events of leftmost and rightmost satcuts are processed by the slicer. Therefore, since the modified slicer returns a lattice of satcuts, the extremal satcuts are included. ■
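To make the PVC characterization used in Lemma 14 concrete, here is a minimal illustration (not the authors' code; the concrete numbers are assumptions for the example) of the vector timestamp of a root event: the event's own coordinate carries its local time t, and every other coordinate carries t − ϵ.

    # Minimal illustration of the PVC timestamps appearing in Lemmas 14-15.
    # EPS stands for the maximum clock skew; the values are made up for the example.

    EPS = 0.05

    def pvc_of_root(agent: int, t: float, num_agents: int) -> list[float]:
        """PVC of a root event on `agent` at local time t: [t-eps, ..., t, ..., t-eps]."""
        return [t if k == agent else t - EPS for k in range(num_agents)]

    # A left root on A_2 at local time 3.5 in a 2-agent system (cf. the worked example):
    print(pvc_of_root(agent=1, t=3.5, num_agents=2))   # [3.45, 3.5]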
We give the space and time complexity of the overall detector. Since this is an online detector which runs forever (as long as the system is alive), we must fix a time interval for the analysis.

Theorem 9. The time complexity for each agent is O(2RN), where R is the number of right roots in the given analysis interval. The detector consumes O(N^3) memory to store the tokens. If roots are uniformly distributed, then the local buffers of the abstractor and slicer grow at most to size O(N^2).

Proof. We distinguish the following cases.

Time complexity. The calculations in our algorithm come from the abstractor and from the modification to the CGNM slicer. Finding a root of a signal x_n takes constant time in the system parameters. The abstractor has every process send right-root information to every other process, for a complexity of N − 1 per right root and a total complexity of (N − 1)R, where R is the number of right roots in the system in a given bounded window of time.

Consider slicer S_n, which is hosting token T_m. The slicer creates a new event for every target event of T_m that was not produced by the abstractor of A_m. Event creation is O(N), since it requires the creation of a size-N PVC assigned to the event. Event storage takes constant time if the new event is simply appended at the end of the local buffer, or O(k) if the event is inserted in order into the sorted local buffer of size k. Either one works: the first is cheaper, but an unsorted buffer costs more to search; the latter is more expensive up front, but the sorted buffer can be searched faster. Either way, the slicer modification costs a total of O(N · M) in a given bounded window of time with M missed events in the system. Now, the number of target events requiring creation is on the order of the number of right roots, since they result from left roots, and there are equal numbers of left and right roots. Thus M = O(R). Therefore, the total complexity for our algorithm in a given bounded window of time is O(R(N − 1 + N)). This is then added to the complexity of running the modified slicer, which is O(N^2 D), where D is the number of events in the discrete-time signal; at most, there are 2R events. So finally the total time complexity is O(R(N − 1 + N) + 2N^2 R), or O(R(2N + 2N^2)/N) = O(2RN) per agent.

Space complexity. A PVC timestamp has size O(N) (since it is an N-dimensional vector); this is in fact the optimal complexity for characterizing causality [23]. One token stores N PVCs at all times, and token updates replace old PVC values by new ones. Therefore one token has size O(N^2), and all N tokens (one per agent) require O(N^3) space. How long events stay in the abstractor's local buffers depends on message transmission times, since events are removed from the buffers after the appropriate messages are received (see Algorithm 6.1). It also depends on the distribution of events within the interval of analysis, not just their rate 1/R. E.g., if roots are uniformly distributed in the analysis interval, then the nth abstractor's local buffer grows at most to size O(N^2), as it receives roots from the other N − 1 agents and stores the O(N) PVC timestamp for each root; event removal then starts as A_n receives target events. Similar considerations apply to the slicer's local buffers. In such a case the detector's total space complexity is O(N^3 + 2N^2). ■

Finally, there is no bound on detection delay, since we do not assume any bounds on message transmission time. Assuming some bound on transmission delay easily yields a corresponding bound on detection delay.
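Throughout the worked example that follows, the token repeatedly asks two questions about a candidate frontier: is it consistent, and does it satisfy the predicate? The sketch below is an illustration only (not the detector's code); it assumes the pairwise reading of consistency for partially synchronous cuts — two frontier events are treated as concurrent when their local times differ by less than the clock skew ϵ — and a conjunctive predicate of the form "every x_n ≥ 0", as in the example.

    # Illustrative check of the two questions the token answers for a candidate
    # frontier: consistency and predicate satisfaction. Names and numbers are
    # assumptions for the example.

    from itertools import combinations

    def is_consistent(frontier: dict[int, float], eps: float) -> bool:
        """frontier maps agent id -> local time of its frontier event."""
        return all(abs(t1 - t2) < eps
                   for (_, t1), (_, t2) in combinations(frontier.items(), 2))

    def is_satcut(frontier: dict[int, float], signals: dict[int, callable],
                  eps: float) -> bool:
        """Satisfying cut for the conjunctive predicate 'every x_n >= 0'."""
        return (is_consistent(frontier, eps) and
                all(signals[n](t) >= 0.0 for n, t in frontier.items()))

    # Tiny usage example with made-up signals x1, x2 and a symbolic eps of 0.3:
    signals = {1: lambda t: 6.0 - t, 2: lambda t: t - 3.3}
    print(is_satcut({1: 5.9, 2: 5.8}, signals, eps=0.3))   # True
    print(is_satcut({1: 5.9, 2: 3.0}, signals, eps=0.3))   # False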
6.4.1 Worked-out example

We now work through an example execution of the detector on Figure 6.4. We focus on agent A2, its abstractor A_2, its slicer S_2, and its token T2.

Figure 6.4 Example of subsection 6.4.1. Bold intervals are where the local signals are non-negative. The happened-before relation is illustrated with solid arrows. The predicate is ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0). Solid circles represent discrete events returned by the abstractor; hollow circles are those created by the slicers. The leftmost satcut of this example is [3.5 − ϵ, 3.5] and the rightmost is [6, 5.8].

1. Agent A2 encounters a left root in the signal at local time 3.5. This information is forwarded to the abstractor.
2. The abstractor A_2 adds the new root to its buffer with a PVC of [3.5 − ϵ, 3.5].
3. A2 finds a right root in the signal at local time 5.8 and forwards it to A_2.
4. The abstractor sends the root information to agent A1. It then adds this root to its buffer with a PVC timestamp of [5.8 − ϵ, 5.8].
5. Abstractor A_2 receives a message from A1 about a right root at A1's local time 6. Note that this is the first knowledge A2 has about anything that is occurring on A1, even though A1 has already found a left root.
6. A_2 uses A1's message to create a new local event at 6 + ϵ with PVC [6, 6 + ϵ].
7. A_2 also adds this new local event to its buffer. Since all messages are FIFO, A2 knows that there will be no new messages which will create events before 6 + ϵ. Thus, it can remove both of the events 3.5 and 5.8 from the buffer and forward them to its local slicer S_2. At this point both of A1's events have been forwarded to its slicer, although A2 has no knowledge of this.
8. The slicer S_2 receives an event with a PVC of [3.5 − ϵ, 3.5]. Token T2 is waiting for the next event, so it adds this event to its potential cut.
9. The token is processed with its new potential cut. The cut is found to be inconsistent, since T2 has no information about any A1 events.
10. The token's target is set to 3.5 − ϵ on A1 and the token is sent to A1.
11. A1 receives T2. It walks through its local events 2 and 6 and determines that T2's target event is between the two.
12. S_1 creates a new event e_1^{3.5−ϵ} and notes that x_1(3.5 − ϵ) ≥ 0.
13. Token T2 incorporates the new event into its potential cut. The new potential cut is consistent and satisfies the predicate. It is then sent back to A2.
14. A2 receives T2. T2 indicates a satisfying cut, which the agent outputs as a result. It then advances T2 to its next event at time 5.8.
15. T2 now has the current cut [3.5 − ϵ, 5.8]. This is not consistent, so it is given the target 5.8 − ϵ on A1. It is then sent to A1.
16. A1 receives the token. S_1 walks through its local events and finds that the token's target is between the left root and the right root.
17. S_1 creates a new event at 5.8 − ϵ and notes that x_1(5.8 − ϵ) ≥ 0.
18. The token adds the event to its potential cut. It finds that its new potential cut is consistent and satisfies the predicate. It is then sent back to A2.
19. A2 receives T2 and outputs the satcut. The algorithm then continues with new events as they occur.

Through this example, agent A2 discovered the satcuts [3.5 − ϵ, 3.5] and [5.8 − ϵ, 5.8]. The first is the leftmost satcut of the interval of satcuts. A1 discovered an additional satcut [6, 6 − ϵ]. Joining this satcut with A2's second satcut returns the result [6, 5.8], which is the rightmost satcut of the interval of satcuts.

Figure 6.5 Runtime vs root rate and N on synthetic data. (a) Runtime vs root rate on 4 synthetic signals (N = 4). (b) Online monitoring: the red horizontal plane indicates the runtime threshold (namely, 5 s) below which it is possible to do online detection.

6.5 Case Studies and Evaluation

We implemented our detection algorithm and ran experiments to (1) illustrate its operation, and (2) observe runtime scaling with the number of agents and with the average rate of events.
The detector was implemented in Julia for ease of prototyping; future versions will be in C for speed. Each experiment is replicated so that we can report 95% confidence intervals. Experiments were run on a single thread of an Ubuntu machine powered by an AMD Ryzen 7 5800X CPU @ 3.80GHz.

We consider two sources of data. The first is a set of N synthetically generated signals, N = 1, ..., 6. Each signal has a 5 s duration and is generated randomly while ensuring an average root rate of µ_n; that is, on average, µ_n roots exist in every second of signal x_n. For the second source of data, we use the Fly-by-Logic toolbox [100] to control up to 6 simulated UAVs (i.e., drones) performing various reach-avoid missions. Their 3-dimensional trajectories are recorded over 6 seconds. We monitor the predicate "All UAVs are at a height of at least 10 m simultaneously". The maximum clock skew ϵ is set to 0.05 s.

Effect of root rate (µ_n) on run time. We use 4 synthetic signals of 5 s duration and measure the detection runtime as the root rate for all signals is varied between 10 roots/s and 50 roots/s. Figure 6.5a shows the results. Naturally, as µ_n increases, so does the run time, due to having to process more tokens.

Online detection. We want to identify when it is possible for us to perform online detection with the Julia implementation, i.e., such that the detector finishes before the end of the signal being processed. To this end, we use the synthetic signals of duration 5 s and vary both the root rate and the number of agents. Figure 6.5b shows the results: all combinations of root rates and numbers of agents with runtimes under the threshold of 5 s can be performed online.

Effect of number of agents on run time. Figure 6.6 shows the effect of the number of agents N on runtime. As expected, the runtime increases with N.

Figure 6.6 Runtime vs number of agents. (a) Detection of synthetic signals at 50 roots/s. (b) Detection of UAV signals.

CHAPTER 7
RESOURCE OPTIMIZATION OF STREAM PROCESSING IN LAYERED SENSOR NETWORKS

In this chapter, we set our sights on monitoring reliability by optimizing resource consumption in a generalized class of CPS. In Chapters 3, 4, 5, and 6 we proposed different monitoring techniques for distributed systems with respect to different specifications, under both centralized and decentralized monitoring settings. However, solely monitoring a formal specification on a distributed CPS is not enough to guarantee its functionality. For example, in a decentralized monitoring setup, if one or more monitors start reporting erroneous results, then it is possible to reach false positive and/or false negative verdicts on the distributed CPS against some specification. Therefore, it is imperative that we ensure the reliability of all monitors, and by extension, the reliability of the distributed CPS — that is, the network of monitors or agents — as a whole.

However, determining the reliability of a distributed CPS depends on an array of factors, including the type of agents in the network. For example, the method for computing the reliability of a network of UAVs will vastly differ from the method for computing the reliability of a network of medical equipment. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device or an agent in a layered network of producers and consumers.
We elaborate our technique for monitoring reliability of layered stream processing networks, while optimizing for minimal resource consumption by its nodes. 7.1 Producer-Consumer Network with Resource Constraints Before talking about our problem statement, we present our model that is used to capture a layered network of nodes tasked with stream processing jobs subject to resource constraints, flows, and target reliability. 7.1.1 Resource Bounds We first present the notions of reusable and consumable resources in our model: 127 • Reusable resources are not depleted when an item is processed. Examples of reusable resources are CPU, power, memory, network bandwidth, and quality. These resources are instantly reclaimed once an item is processed. We denote the finite set of reusable resources in the system as follows: R = { R1, R2, . . . , Rn , } 1. for some n ≥ • Consumable resources are depleted once an item is processed. Examples of consumable resources are energy, time, and reliability. For instance, once error is encountered during the processing of an item along its path in the network, it cannot be reclaimed. We denote the finite set of consumable (depletable) resources as follows: D = { D1, D2, . . . , Dm } for some m 1. ≥ Our model supports bounding resources on both nodes and edges. Let G = (V, E) be a producer-consumer network. A bound on a resource res R D for a subset of nodes V (respectively, a subset of edges E V ⊆ also set bres V = lb, ub ⟩ ⟨ (respectively, bres E = ∪ E) is denoted by bres V ∈ ⊆ ) as a pair that implies the sum of resource lb, ub ⟩ ⟨ (respectively, bres E ). We res R ∪ ∈ D unit (e.g., power) consumed by all nodes (respectively, edges) in V (respectively, E) must reside within the lower bound lb and the upper bound ub. Finally, we denote the set of all resource bounds for all resources in R D and for any subsets of nodes and edged ∪ by B. For instance, if a node v has 8 cores, using the conventional notation of multi-core systems, the maximum CPU usage is 800%. In this case, a bound bCPU {v} = is applied. Another 0, 800 ⟩ ⟨ example is applying power bounds to a cluster of nodes. This implies that the sum of power consumed by all nodes in the cluster should not exceed a specific value. A bound bPWR {v1,v2,v3} = exceed 500 watts. 0, 500 ⟩ ⟨ denotes the total power consumption of nodes v1, v2 and v3 should not 128 Let µres v (respectively, µres e ) be the amount of resource res R D unit consumed by node v ∈ V (respectively, edge e ∈ E). Formally, a bound bres ∈ V = ∪ lb, ub ⟩ ⟨ on vertices V V ⊆ and a resource res enforces the following: lb (cid:88) ≤ v∈V µres v ≤ ub. Likewise, a bound bres E = lb, ub ⟩ ⟨ on edges E ⊆ E and a resource res enforces the following: lb ≤ (cid:88) e∈E µres e ≤ ub. 7.1.2 Configurations There are various configuration parameters that impact the resource usage of a node and the reliability of its output. • Sampling rate. Some systems depend on sampling from continuous-time and continuous- valued signals and the amount of resources consumed by a node is proportional to the sampling rate [18]. Lower sampling rate is usually associated with reduced reliability or confidence. Hence, sampling rate is a configuration parameter that controls the tradeoff between resource usage and reliability. • Outgoing data rate. The outgoing data rate of a node impacts the resource usage of subsequent nodes [18, 113]. If subsequent nodes decide to sample this data, then reliability is negatively impacted. • Precision. 
Some algorithms support controllable precision. For instance, image processing may be accomplished with high or low precision [86]. The work in [84] demonstrates how configurable precision impacts accuracy and resource usage.

• Algorithm alternatives. In some systems, there are different algorithms that can be used to process the data, with varying degrees of resource usage and reliability [70, 76]. For instance, data loss prevention (DLP) systems employ different classifiers for malicious activity that are designed to have different processing costs [93].

To simplify our model, we abstract all the above parameters into a single quality symbol. This symbol encompasses sampling, buffering, precision, and algorithmic alternatives. Given a producer-consumer network G = (V, E), let us associate each node v ∈ V with a finite set of quality levels:

    Q_v = { Qual_1(v), Qual_2(v), . . . , Qual_k(v) }

where the number of levels k can be different for each node v. A node v can use each quality level Qual_i(v), where 1 ≤ i ≤ k, to process items that are being received at some input data rate in IRate(v) and being produced at some outgoing data rate in ORate(v). Part of our stream optimization (see Section 7.2) is to find the best quality for the possible input/output data rates. To this end, for each node v ∈ V, let

    ϑ_v : IRate(v) × Q_v → ORate(v)

be a function that maps an incoming data rate and a quality level to an outgoing data rate. That is, we have ORate_v = ϑ(IRate(v), Qual_i(v)), where Qual_i(v) is the ith quality level of node v.

7.1.3 Reliability

Quantifying reliability is generally a challenging task. The reliability of each node depends not only on its quality level, but on other environmental factors as well. For example, the reliability of a node that captures video streams may vary based on the time of day and the surrounding lighting conditions. Another example would be the case where a node becomes less reliable once it nears the end of its average life cycle. Let us assume each node v ∈ V is influenced by m_v environmental factors. We denote by U_v^j ∈ [0, 1], where 1 ≤ j ≤ m_v, the jth environmental factor of node v. An environmental factor of 1 indicates the best possible reliability when the other factors (as well as the quality) remain unchanged, whereas an environmental factor of 0 indicates the worst. All intermediary values are determined by the node's architecture. In a similar manner, we denote by U_v^Qual ∈ [0, 1] the quality factor of node v. A quality factor of 1 maps to the highest quality level Qual_max(v) supported by the implementation of the node's code. This could be a configuration where a computationally intensive algorithm is used, input data is not sampled or buffered, and numerical precision is set to the maximum supported precision. On the other hand, a quality factor of 0 maps to the lowest quality level Qual_min(v) supported by the node; this should be a configuration below which the system becomes unusable. Quality factors for the remaining quality levels in Q_v − {Qual_max(v), Qual_min(v)} are determined by the system design. Now that we have defined the quality factor U_v^Qual and the environmental factors U_v^1, U_v^2, . . . , U_v^{m_v} for a node v, we are ready to define its reliability α_v ∈ [0, 1] as follows:

    α_v = (U_v^Qual + W_v^1 · U_v^1 + W_v^2 · U_v^2 + . . . + W_v^{m_v} · U_v^{m_v}) / (1 + W_v^1 + W_v^2 + . . . + W_v^{m_v})

where W_v = {W_v^1, W_v^2, . . . , W_v^{m_v}} are the respective weights of the environmental factors U_v = {U_v^1, U_v^2, . . . , U_v^{m_v}}.
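As a concrete illustration of the definition above, the following minimal sketch computes α_v as the weighted combination of the quality factor and the environmental factors; the numeric values are made-up examples, not values from the dissertation.

    # Illustrative computation of a node's reliability alpha_v from its quality
    # factor and weighted environmental factors, per the definition above.

    def reliability(quality_factor: float, env_factors: list[float],
                    weights: list[float]) -> float:
        """alpha_v = (U^Qual + sum_j W_j * U_j) / (1 + sum_j W_j)."""
        assert len(env_factors) == len(weights)
        numerator = quality_factor + sum(w * u for w, u in zip(weights, env_factors))
        denominator = 1.0 + sum(weights)
        return numerator / denominator

    # A hypothetical camera node at 80% quality, with lighting and device-age factors:
    print(reliability(0.8, env_factors=[0.9, 0.95], weights=[0.5, 0.25]))  # 0.85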
Note that it is difficult to determine the discrete quality levels of a node and to map the said quality levels to numerical quality factor values. This is mainly because different nodes in a producer-consumer network carry out different tasks and, therefore, require their own method of quality level determination. For example, the quality level of a node that is tasked with capturing and streaming video can be determined by its current video resolution. In other words, the maximum operational resolution can be considered the highest quality and mapped to a quality factor of 1, the minimum operational resolution can be considered the lowest quality and mapped to a quality factor of 0, and every other operational resolution in between can be mapped between 0 and 1 based on its pixel count. However, this method will clearly fail to determine the quality levels of a node that is tasked with detecting motion, where the polling interval could be a better representation of quality levels for the said node. Note that we do not attempt to provide an absolute method for determining quality levels and mapping them to appropriate quality factor values; we merely propose an abstraction that allows tweaking the system into yielding desirable results.

7.1.4 Relationship between Configurations and Resources

We now define the relationships between configurations and resources. Let CRate(res) denote the set of possible rates of consumption of resource res. Also, let

    φ_v^res : IRate(v) × Q_v → CRate(res)

be a function that maps the rate of incoming data and the quality level of node v ∈ V to a possible consumption rate value in CRate(res). For example, for a node v with a quality level of Qual(v) that is receiving data at the rate of IRate(v), we determine the rate at which resource PWR is consumed on the said node using φ_v^PWR(IRate(v), Qual(v)). Recall that resource res can be either reusable or consumable. Hence, each node defines a set of functions whose elements are the functions φ_v^res for all resources res ∈ R ∪ D, as follows:

    Φ_v = { φ_v^res | res ∈ R ∪ D − {REL} }

where REL is the reliability resource. While reliability depends on the quality level of a node like other resources, it also depends on the reliability of incoming data, as well as on the environmental factors; therefore, we exclude reliability, since it is defined differently. We incorporate this notion of reliability to model systems where error is compounded, i.e., receiving erroneous data may impact the reliability of produced data differently even at the same quality level and under the same environmental factors. This behavior is common in precision-based quality levels, where rounding error is compounded as more mathematical operations are performed along a data path. Thus, we introduce the following recursive function ψ_v to determine the compounded reliability of node v ∈ V:

    ψ_v(Qual(v)) = α_v                                                            if Pred(v) = ∅
    ψ_v(Qual(v)) = comp(Qual(v), U_v, W_v, { ψ_u(Qual(u)) | u ∈ Pred(v) })        if Pred(v) ≠ ∅

where α_v is the reliability of v when it is a source node (by system design), and comp denotes a function that computes the reliability of a node given its quality, environmental factors and their weights, and the reliability of its predecessors. For instance, comp could be instantiated with a function that computes the average (or maximum) reliability of all predecessors times the quality level of the node.
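To make the recursion concrete, here is a small illustrative sketch (not the dissertation's code) that instantiates comp as suggested above — the average reliability of the predecessors times the node's quality factor; the tiny example network and all values are invented.

    # Illustrative sketch of the compounded reliability psi_v with comp =
    # "average predecessor reliability times the node's quality factor".

    def psi(node: str, pred: dict[str, list[str]], alpha: dict[str, float],
            quality_factor: dict[str, float]) -> float:
        """Compounded reliability: alpha for source nodes, comp(...) otherwise."""
        if not pred.get(node):                      # source (producer-only) node
            return alpha[node]
        upstream = [psi(u, pred, alpha, quality_factor) for u in pred[node]]
        return quality_factor[node] * sum(upstream) / len(upstream)   # comp(...)

    # v1, v2 are sources feeding v3, which feeds the sink v4:
    pred = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"]}
    alpha = {"v1": 0.95, "v2": 0.90}
    quality_factor = {"v3": 0.9, "v4": 1.0}
    print(psi("v4", pred, alpha, quality_factor))   # 0.9 * (0.95 + 0.90)/2 = 0.8325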
For example, we introduce the characteristics of the nodes in the network of Figure 7.2. Table 7.1 lists the production rate (ORate) and the usage of the resources power (PWR) and response time (TIME) of nodes v1–v5 at different quality levels; Table 7.2 does the same for nodes v6–v10, together with their reliabilities α_v. We abbreviate the quality level Qual_i as q_i. The quality level for these nodes is designated by the sampling rate; thus, the highest quality level q1 has the highest rate of outgoing items, versus the lowest quality level q3.

Table 7.1 Nodes v[1,5] resource usage.
        ORate               PWR              TIME
        q1    q2    q3      q1   q2   q3     q1    q2    q3
  v1    100   75    50      80   65   40     10    13.3  20
  v2    90    60    30      75   55   35     11.1  16.6  33.3
  v3    100   70    40      80   65   40     10    14.3  25
  v4    120   90    60      85   70   40     8.3   11.1  16.6
  v5    110   80    70      85   75   36     10.3  15.1  21.1

Table 7.2 Nodes v[6,10] resource usage.
        PWR               TIME                 αv
        q1    q2   q3     q1    q2    q3       q1    q2   q3
  v6    85    70   55     16    20.1  25       100   90   82
  v7    80    65   50     18.2  18.2  38.3     100   92   84
  v8    88    55   50     15    17.7  29       100   88   79
  v9    90    70   55     13    14.5  19       100   93   83
  v10   100   80   60     17    22    45       −     −    −

7.1.5 Revised Definitions

Based on the definitions introduced in the previous subsections, we now redefine a node as follows:

    v = ⟨ Pred(v), Succ(v), Qual(v), ORate(v), Φ_v, ψ_v ⟩

Thus, the node now includes a set of quality levels (i.e., Qual(v)), a function that determines the rate of outgoing data (i.e., ORate(v)), a set of functions that determine resource usage (i.e., Φ_v), and a function that determines the reliability of the node's output (i.e., ψ_v). Finally, we redefine the graph as follows:

    G = ⟨ V, E, R, D, B ⟩

Thus, the graph now defines a set D of consumable resources, a set R of reusable resources, and a set B of bounds on resources.

7.2 Problem Statement

First, observe that in the model proposed in Section 7.1, quality levels may affect the following:

1. Node reliability ψ_v, which is a function of the quality level, the environmental factors and their weights, and the incoming reliability values of all predecessors.
2. Resource consumption φ_v^res, which is a function of the quality level and the incoming data rate.
3. Production rate ORate(v), which is a function of the quality level and the incoming data rate.

The majority of stream processing systems benefit greatly from knowing how to answer one or both of the following two questions: (1) how can the usage of available resources be optimized to reach maximum reliability, and (2) how can the usage of available resources be minimized while ensuring that reliability is maintained above a target threshold? Thus, roughly speaking, our multi-objective problem statement is as follows. Given (1) a producer-consumer network on which a set of bounds is defined, and (2) a target reliability for all consumer-only nodes:

• Quality Maximization. Our first objective is to identify a single quality level for every node, such that the reliability of the consumer-only nodes is maximized, while satisfying all bounds. For example, maximizing the efficiency of each device in a sensor network of smart home devices while not exceeding the specified renewable resources such as power, CPU, and bandwidth.

• Resource Usage Minimization. Our second objective is to minimize the consumption of a given resource for all nodes while achieving a target reliability. For example, minimizing the power usage of a producer-consumer network that does not demand maximum reliability from its nodes.
Formally, our optimization problem is as follows:

Problem Statement. Given a producer-consumer network G = ⟨V, E, R, D, B⟩ and a resource res ∈ R ∪ D, identify quality levels Qual(u) for all u ∈ V subject to:

    ∀v ∈ {v′ | Succ(v′) = ∅}.  max ( ψ_v(Qual(v)) )
    ∀v ∈ V.                    min ( φ_v^res(IRate(v), Qual(v)) )

In the next section, we will present our solution to the above optimization problem.

7.3 SMT-based Solution

In this section, we present our solution to the multi-objective optimization problem presented in Section 7.2. Our solution is based on a reduction to the satisfiability problem for SMT. Practically, we utilize an SMT solver in order to optimize reliability and resource consumption tradeoffs. The SMT problem is solved on a remote network monitor that polls each sensor node in the network at a fixed interval in order to keep track of available resources, as well as to control the quality levels of the said node. Each SMT instance is described in terms of (1) SMT entities (e.g., variables, functions, constants, etc.) and (2) SMT constraints (e.g., Boolean conditions over first-order predicates).

7.3.1 SMT Entities

We now introduce the entities that are used to represent the components of our producer-consumer network G = (V, E). In some SMT entity definitions, we use free variables that the SMT solver can manipulate in order to provide a satisfaction verdict.

Nodes. In our SMT encoding, we represent the set of nodes V as a set of integers {1, 2, . . . , |V|}, where each element represents a node in V.

Edges. We store the information of the edges in E in the form of a |V| × |V| Boolean array edge such that

    ⋀_{i=1}^{|V|} ⋀_{j=1}^{|V|}   edge[i][j] = true if (i, j) ∈ E, and false if (i, j) ∉ E

where edge[i][j] implies there exists an edge from node v_i to node v_j in G.

Successor Nodes. We encode the function Succ as an SMT function succ that maps a node to its set of successor nodes:

    ⋀_{i=1}^{|V|}   succ(i) = { j | (i, j) ∈ E }

Predecessor Nodes. We encode the function Pred as an SMT function pred that maps a node to its set of predecessor nodes:

    ⋀_{i=1}^{|V|}   pred(i) = { j | (j, i) ∈ E }

Node Resource Consumption. We define the function rcon that maps a resource and a node to a free variable that denotes the resource consumption of the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   rcon(res, i) = µ_i^res

Edge Resource Consumption. We define the function rcoe that maps a resource and an edge to a free variable that denotes the resource consumption of the said edge:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   rcoe(res, (i, j)) = µ_(i,j)^res

Resource Inflow. We define the function iflo that maps a resource and a node to a free variable that denotes the inflow of the resource to the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   iflo(res, i) = ζ_i^res

where ζ_v^res denotes the inflow of resource res into node v.

Resource Outflow. We define the function oflo that maps a resource and a node to a free variable that denotes the outflow of the resource from the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) = ξ_i^res

where ξ_v^res denotes the outflow of resource res from node v.
Edge Flow. We define the function eflo that maps a resource and an edge to a free variable that denotes the amount of flow going through the said edge:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   eflo(res, (i, j)) = ν_(i,j)^res

where ν_(i,j)^res denotes the flow of resource res through edge (i, j).

Quality. We encode the function Qual as an SMT function qual that maps a node to all of its abstracted quality levels in disjunction (see Section 7.1):

    ⋀_{i=1}^{|V|}   ( qual(i) = ⋁_{j=1}^{|Q|} Qual_j(v_i) )

Reliability. We encode the function ψ as an SMT function rel that maps the quality of a node to its reliability:

    rel(qual(i)) = α_i                                                          if pred(i) = ∅
    rel(qual(i)) = eval(qual(i), U_v, W_v, { rel(qual(j)) | j ∈ pred(i) })      if pred(i) ≠ ∅

for all 1 ≤ i ≤ |V|. Note that when pred(i) = ∅, node i is a source (producer-only) node in the producer-consumer network, and therefore its reliability α_i is known.
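As one concrete (and purely illustrative) way to realize these entities, the sketch below declares a tiny instance with the Z3 SMT solver's Python bindings (package z3-solver); the dissertation does not prescribe a particular solver, and the three-node chain, resource set, and all numbers are invented. It also previews the inflow and conservation constraints defined in the next subsection.

    # Hedged illustration, not the dissertation's implementation: encoding a tiny
    # producer-consumer network's SMT entities with the Z3 Python API.

    from z3 import Int, Real, Optimize, And

    NODES = [1, 2, 3]
    EDGES = [(1, 2), (2, 3)]
    RESOURCES = ["PWR"]                       # one reusable resource for the example

    opt = Optimize()

    # Free variables: per-node quality level, resource consumption, inflow/outflow, edge flow.
    qual = {i: Int(f"qual_{i}") for i in NODES}
    rcon = {(r, i): Real(f"rcon_{r}_{i}") for r in RESOURCES for i in NODES}
    iflo = {(r, i): Real(f"iflo_{r}_{i}") for r in RESOURCES for i in NODES}
    oflo = {(r, i): Real(f"oflo_{r}_{i}") for r in RESOURCES for i in NODES}
    eflo = {(r, e): Real(f"eflo_{r}_{e[0]}_{e[1]}") for r in RESOURCES for e in EDGES}

    # Each node must pick one of its (here: three) quality levels.
    for i in NODES:
        opt.add(And(qual[i] >= 1, qual[i] <= 3))

    # Inflow of a resource is the sum of the flows on incoming edges.
    for r in RESOURCES:
        for i in NODES:
            incoming = [eflo[(r, e)] for e in EDGES if e[1] == i]
            opt.add(iflo[(r, i)] == sum(incoming))      # sum([]) == 0 for source nodes

    # A reusable resource is conserved: outflow equals inflow.
    for r in RESOURCES:
        for i in NODES:
            opt.add(oflo[(r, i)] == iflo[(r, i)])

    print(opt.check())   # 'sat' for this unconstrained skeleton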
7.3.2 SMT Constraints

We now introduce the constraints that address our problem statement using the SMT entities we defined in the previous section.

Resource Inflow Constraint. For resources res ∈ R ∪ D, the amount of a resource flowing into a node depends on the amount of flow carried over all its incoming edges. Traditionally, the inflow is the sum of all flows on incoming edges. That is,

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   ζ_i^res = Σ { eflo(res, (j, i)) | j ∈ pred(i) }

For instance, power is a resource that can be summed over incoming edges.

Resource Outflow Constraint. For resources res ∈ R ∪ D, the amount of a resource flowing out from a node is traditionally equal to the amount of the resource flowing in, which is the conservation-of-flow principle. This models renewable resources efficiently, yet does not capture non-renewable resources. We generalize the resource outflow constraint using a function Ξ:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) = Ξ_i^res(iflo(res, i), rcon(res, i))

For instance, power is a renewable resource, and thus

    ⋀_{i=1}^{|V|}   Ξ_i^PWR(iflo(PWR, i), rcon(PWR, i)) = iflo(PWR, i)

However, energy is depletable, and therefore

    ⋀_{i=1}^{|V|}   Ξ_i^EGY(iflo(EGY, i), rcon(EGY, i)) = iflo(EGY, i) − rcon(EGY, i)

Resource Bound Constraint on Nodes. For all resources res ∈ R ∪ D, we enforce the given upper bound and lower bound on nodes as follows:

    ⋀_{res ∈ R∪D}   lb ≤ Σ_{i=1}^{|V|} rcon(res, i) ≤ ub

Resource Bound Constraint on Edges. We use bounds on edges to control the distribution of resources res ∈ R ∪ D across outgoing edges of a node. We identify two main methods of assigning flow to outgoing edges: broadcast and distribution.

Broadcast. In this case, nodes broadcast their outflow to all outgoing edges. Reliability is broadcast, since all outgoing edges of a node carry data with the same reliability level that the node produces. We can enforce resources res ∈ R ∪ D to be broadcast using the following constraint:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   oflo(res, i) ≤ rcoe(res, (i, j)) ≤ oflo(res, i)

Upon simplification, we have:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   rcoe(res, (i, j)) = oflo(res, i)

Distribution. In this case, the outflow is distributed across all outgoing edges. For instance, in a multiple-consumer setting, any one of a set of receiving nodes can process items; the outgoing data flow of the producer is then distributed among all consumer nodes. The objective here is to determine the fraction of data flowing to each consumer such that resource bounds are respected and the usage of a specific resource is optimized. We can enforce resources res ∈ R ∪ D to be distributed using the following constraint:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) ≤ Σ_{j ∈ succ(i)} rcoe(res, (i, j)) ≤ oflo(res, i)

Upon simplification, we have:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   Σ_{j ∈ succ(i)} rcoe(res, (i, j)) = oflo(res, i)

Data Flow Constraint. Data outflow can be either broadcast or distributed. We encode the function Out as an SMT function out that maps an edge to the outgoing data rate of that edge.

Broadcast. In case the data outflow is broadcast, we enforce the following constraint on all edges:

    ⋀_{(i,j) ∈ E}   out((i, j)) = ORate(v_i)

Distribution. In case the data outflow is distributed, we enforce the following constraint on all edges:

    ⋀_{i=1}^{|V|}   Σ_{j ∈ succ(i)} out((i, j)) = ORate(v_i)

Reliability Maximization Constraint. Finally, let C denote the conjunction of all the above constraints. The constraint for maximizing reliability on sink (consumer-only) nodes is as follows:

    C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  max( rel(qual(i)) ) )

Resource Optimization Constraint. If we want to minimize the total consumption of some res ∈ R ∪ D across all nodes, while ensuring that the reliability of sink (consumer-only) nodes remains above a given threshold α, then we enforce the following constraint instead:

    C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  rel(qual(i)) ≥ α ) ∧ min( Σ_{i=1}^{|V|} rcon(res, i) )

7.3.3 Solver Optimization

Solving the Reliability Maximization Constraint and the Resource Optimization Constraint both require a significant amount of computation power and time (see Figure 7.1a). This is mostly because C is a conjunction of a large set of constraints, coupled with the fact that it is a minimization or maximization problem. This means there is only one solution for which the value of the objective is maximized or minimized, which in turn means our SMT solver has to explore a large search space. To this end, we employ some optimization techniques in our model in order to reduce the run time for the SMT solver. In this subsection, we show one such technique and report the improvement it yields in terms of run time over the naive method.

Binary Probing. First, let A_α be an SMT constraint such that

    A_α = C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  rel(qual(i)) ≥ α )

We say A_α |= G iff A_α is satisfied for α, and otherwise A_α ̸|= G. Let solve be a function that, given G and A_α, returns some value β such that α ≤ β ≤ 1 if A_α |= G, and otherwise returns β = −1. Formally,

    solve(A_α, G) = β ∈ [α, 1]    if A_α |= G
    solve(A_α, G) = −1            otherwise

    Data: Producer-consumer network G, SMT constraint A_α, target reliability α, error margin ϵ
    Result: Estimated best reliability α′
    α_min ← 0; α_max ← 1; pivot ← 0.5; α′ ← 0; found ← false
    while ¬found do
        if |α_max − α_min| ≤ ϵ then found ← true end
        β ← solve(A_pivot, G)
        if β ≠ −1 then
            if β > α′ then α′ ← β end
            α_min ← pivot
        else
            α_max ← pivot
        end
        if found then break end
        pivot ← (α_min + α_max)/2
    end
    return α′
    Algorithm 7.1 Best Reliability Estimation Algorithm

Now, using the Binary Probing technique described in Algorithm 7.1, we can invoke our SMT solver in a pattern similar to traditional binary search, and find an estimate α′ that is sufficiently close to — that is, within the error margin ϵ of — the real best reliability.
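A compact sketch of the probing loop follows. It only illustrates the bisection pattern: the solve call is stubbed out with a made-up ground truth, whereas Algorithm 7.1 invokes the SMT solver on A_α.

    # Illustrative sketch of binary probing (Algorithm 7.1), with the SMT call stubbed.

    TRUE_BEST = 0.857   # pretend this is the (unknown) best achievable reliability

    def solve(alpha: float) -> float:
        """Stub for solve(A_alpha, G): a feasible beta in [alpha, 1], or -1 if unsat."""
        return TRUE_BEST if alpha <= TRUE_BEST else -1.0

    def estimate_best_reliability(eps: float = 0.03125) -> float:
        lo, hi, pivot, best = 0.0, 1.0, 0.5, 0.0
        while hi - lo > eps:
            beta = solve(pivot)
            if beta != -1.0:
                best = max(best, beta)
                lo = pivot          # feasible: probe for something even higher
            else:
                hi = pivot          # infeasible: back off
            pivot = (lo + hi) / 2.0
        return best

    print(estimate_best_reliability())   # converges to 0.857 in a handful of probes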
Using a similar technique, we can also find the estimated minimum of any res ∈ R ∪ D. It should be mentioned that even in the worst case, after just five SMT invocations, the error from binary probing will only be ≈ 3.125%, which usually falls within the acceptable error margins for devices in sensor networks, as far as reliability and other resources are concerned.

7.4 Machine Learning-based Optimization

On top of our SMT-based solution described in Section 7.3, we employ a machine learning-based optimization technique to further improve our solution in terms of run time, at the cost of negligible (details in the next section) accuracy.

7.4.1 Artificial Neural Network

We first create an Artificial Neural Network (ANN) [2] whose output-layer neurons denote the resources that need to be optimized, and whose input-layer neurons denote the remaining resources. For example, when solving the Quality Maximization problem, each neuron in the output layer represents the quality level of a node v ∈ V; that is, the number of neurons in the output layer is l_o = |V|. Each neuron in the input layer represents one of the remaining resources, such as power, CPU, memory, and bandwidth; that is, l_i = |R ∪ D − {QUAL}|, where QUAL ∈ R is the quality resource. For determining the number of neurons in the hidden layer, we chose the method proposed by the authors in [129]; that is, the number of neurons in the hidden layer is

    l_h = (2/3) × |R ∪ D − {QUAL}| + |V|

As an example, let us consider the producer-consumer network in Figure 2.5. If we want to maximize the quality of the network with respect to power (PWR), CPU (CPU), memory (MEM) and bandwidth (BW), then the corresponding ANN should have l_o = 9, l_i = 4, and l_h = 12.

7.4.2 Training Dataset

Our training dataset for the ANN is generated using the SMT-based solution detailed in Section 7.3. For example, in order to generate the training dataset for the Quality Maximization problem, we find the best qualities for all nodes for resources with random values, and populate the dataset with the results. This allows us to carry out the machine learning process in an unsupervised manner.

During training, while the traditional approach is to split the dataset into two subsets (i.e., a training dataset and a testing dataset), for smaller datasets this may introduce biased estimates [56]. As our model must be applicable to both small and large datasets, in order to reduce statistical bias, we employ k-fold cross validation [123] to train our ANN. The process of k-fold cross validation is as follows:

1. Split the dataset into k randomized groups of equal size: g1, g2, . . . , gk.
2. For i ∈ [1, k] do:
   • Assign group gi as the test dataset.
   • Assign groups g1, g2, . . . , g(i−1), g(i+1), . . . , gk as the training dataset.
   • Train the model on the training set and evaluate it on the test dataset.

Using k-fold cross validation ensures that each group is used as the testing dataset once, and as part of the training dataset k − 1 times. There are various ways of selecting the value of k; in our work, we assign k = 10, as this has been shown to generally yield minimal statistical bias [51, 69]. A sketch of this procedure is given below. Note that for generating our dataset, we normalize the sample values to avoid unwanted weights. However, when we report the experimental results, we use the actual values for ease of comparison and understandability.
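The following minimal sketch illustrates the k = 10 split described above; it is illustrative only, with a placeholder "model" in place of the ANN and randomly generated samples in place of the SMT-generated dataset.

    # Minimal k-fold cross validation sketch for the training procedure above.
    # The model step is a placeholder; the real pipeline trains the ANN on
    # SMT-generated (resource values -> best quality levels) samples.

    import numpy as np

    def k_fold_indices(num_samples: int, k: int = 10, seed: int = 0):
        """Split sample indices into k randomized, (nearly) equal-sized groups."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_samples), k)

    def cross_validate(samples: np.ndarray, labels: np.ndarray, k: int = 10):
        scores = []
        folds = k_fold_indices(len(samples), k)
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            # Placeholder "model": predict the mean training label, score by MSE.
            prediction = labels[train_idx].mean(axis=0)
            scores.append(float(((labels[test_idx] - prediction) ** 2).mean()))
        return scores

    # 500 synthetic samples: 4 resource inputs -> 10 quality-level outputs (made up).
    X = np.random.rand(500, 4)
    y = np.random.rand(500, 10)
    print(np.mean(cross_validate(X, y)))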
7.4.3 Model Accuracy

We determine the accuracy of our trained model by directly comparing its results with the results from the SMT-based solution. For the Quality Maximization problem, we define the accuracy acc_q of the model as follows:

    acc_q = 1 − ( Σ_{v∈V} |SMT_qv − ML_qv| ) / ( Σ_{v∈V} |Q_v| )

where the quality level reported by the SMT-based solution is the SMT_qv-th quality level of node v, the quality level reported by the machine learning model is the ML_qv-th quality level of node v, and Q_v is the set of quality levels of node v. For the Resource Minimization problem, we define the accuracy acc_r of the model as follows:

    acc_r = 1 − ( Σ_{res∈R∪D} |SMT_res,v − ML_res,v| / (MAX_res,v − MIN_res,v) ) / |R ∪ D|

where SMT_res,v is the value of resource res at node v reported by the SMT-based solution, ML_res,v is the value reported by the machine learning model, and MAX_res,v (resp. MIN_res,v) denotes the upper bound (resp. lower bound) of resource res observed in the dataset. Note that 0 ≤ acc_q, acc_r ≤ 1, where 0 indicates the worst accuracy and 1 indicates the best accuracy.

7.5 Case Studies and Evaluation

In this section, we evaluate our technique for resource optimization using synthetic data generated from a simulated layered network of nodes, as well as real-world data collected from a network of embedded devices, where the nodes in the network are Raspberry Pi devices tasked with specific streaming objectives.

7.5.1 Synthetic Experiments

In this subsection, we introduce our synthetic experiments to demonstrate how our proposed model can be used to optimize various resources.

7.5.1.1 Experimental Setup

We construct our producer-consumer network using 8 nodes, V = {v1, v2, ..., v8}, with v[1,7] being the producer nodes and v[2,8] being the consumer nodes. We add two placeholder nodes to the network, vin and vout, along with two edges (vin, v1) and (v8, vout). Figure 7.2 shows our producer-consumer network. We use edge (vin, v1) to regulate bounds on resources, and we use node vout to compute network reliability.

7.5.1.2 Resource Bounds

In this experiment, we consider the two resources power (PWR) and reliability (REL), where PWR ∈ R and REL ∈ D. We assign three possible quality levels to each node. Table 7.3 shows different power consumption values for all possible quality levels in each node.

Figure 7.1 Synthetic experiment results. (a) Naive vs. optimized algorithm. (b) Reliability vs. power. (c) Run time vs. power. (d) Run time vs. reliability.

Figure 7.2 A producer-consumer network of 8 nodes (vin, v1, ..., v10, vout).

Table 7.3 Nodes v[1,9] power consumption (in watts) for different quality levels.
          q1    q2    q3    q4    q5
  vin     -     -     -     -     -
  v1      200   195   190   185   180
  v2      185   180   175   170   165
  v3      190   185   180   175   170
  v4      195   190   185   180   175
  v5      180   175   170   165   160
  v6      195   190   185   180   175
  v7      185   180   175   170   165
  v8      180   175   170   165   160
  v9      190   185   180   175   170
  v10     200   195   190   185   180
  vout    -     -     -     -     -

7.5.1.3 Machine Learning Setup

For our machine learning dataset, we generate 500 data samples using our SMT-based solution;
each sample contains a randomly selected PWR value, and the optimized quality levels for the 10 nodes in Figure 7.2. We train our ANN with this dataset for 10 epochs (full iterations) using the k-fold cross validation method described in Section 7.4. 7.5.1.4 Experimental Results We now run a variety of experiments on the setup described above and report our findings below. 146 Naive Model vs. Optimized Model. vs. Machine Learning Model. First we run the model to find the best possible reliability, given bounds on other resources in the producer-consumer network. We want to observe the improvement in run time between a model that performs regular constraint solving, and a model that performs constraint solving using the binary probing technique shown in Algorithm 7.1. To this end, we assign the PWR resource bPWR V to 0, 1835 ⟩ ⟨ and run our solvers. As shown in Figure 7.1a, using the naive (brute force) technique, we get a best reliability value of 0.855 within a run time of 18.855 seconds, whereas using the binary probing technique, we get a best reliability value of 0.857 within a run time of 3.185 seconds. Using our machine learning model, we get a best reliability value of 0.839 in under 0.036 seconds. Reliability vs. Power. We now observe the tradeoff between reliability and power. We start off by assigning the PWR resource bPWR V to . We can observe in Figure 7.1b 0, 1900 ⟩ ⟨ the burndown of reliability as we tighten the power bound by 5 watts on each iteration. We stop at 0, 1700 ⟩ ⟨ when the network is no longer functional due to the given power being lower than the minimal requirement. In this scenario, our machine learning-based model report a more uniform reliability drop than our SMT-based model. Figure 7.1c demonstrates the run time measurements for the same experiment. As the available power is reduced, the search space for the SMT solver gets smaller as well due to having to check for fewer valid configurations. Which is why a gradual decline in run time can be observed for the SMT-based model. However, the machine learning-based model reports the results through inference, and therefore show very little variation in run time. Run time vs. Node availability. In this experiment, we observe the effect of node availability, and by extension overall reliability on run time. To this end, we assign the PWR resource bPWR V to 0, 1835 ⟩ ⟨ and reduce reliability of the nodes v5, v2, v3, v8, v7 and v9 to 0 one node at a time in the given order. Figure 7.1d shows the run time of this experiment. As expected, as more nodes become inactive, the overall run time for the solver decreases for the SMT-based model. However, similar to the previous observation, for the ML-based 147 Figure 7.3 A Multi-Layer Network of Raspberry Pi Devices. model, the run time remains steady. Note that removing any more nodes from the network will render the network inactive, as the solver will fail to find any valid path from vin to vout. 7.5.2 Case Study In this subsection, we introduce our case study on a real world layered sensor network, where each sensor is tasked with a streaming or processing job, and is operated with a Raspberry Pi device. 7.5.2.1 Experimental Setup We construct our layered sensor network with five nodes as shown in Figure 7.3, where v0, v1 and v2 are producer nodes, and v1, v2, v3 and v4 are consumer nodes in V. Just like before, we add two place holder nodes vin and vout such that, vin has an outgoing edge to v0, and vout has two incoming edges from v3 and v4. 
Below we explain the streaming tasks of each device, along with the resources and quality levels.

Motion Sensor Node (v0). This node comprises four motion sensors that are able to detect objects and movement. When any one of these motion sensors is activated, v0 sends an activation signal to its subsequent nodes. Table 7.4a shows resource consumption for node v0 under different quality levels. In this case, the quality levels simply indicate the number of active motion sensors. Motion sensors are fairly reliable under normal operational circumstances, which is why v0 has high reliability as long as at least one sensor is active.

Table 7.4 Quality level tables for different nodes.

(a) Quality levels and resource usage of v0.
        Active Sensors   Reliability   Power Usage
  q1    4                1             470
  q2    3                0.98          465
  q3    2                0.96          460
  q4    1                0.94          455
  q5    0                0             450

(b) Quality level and resource usage of v1.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    2500        566
  q2    1280 x 720     275         512
  q3    854 x 480      125         500
  q4    640 x 360      50          488
  q5    426 x 240      35          476
  q6    256 x 144      15          446

(c) Quality level and resource usage of v2.
        Illuminance   Brightness   Power Usage
  q1    0             100          542
  q2    12000         90           535
  q3    24000         80           528
  q4    36000         70           521
  q5    48000         60           514
  q6    60000         50           507
  q7    72000         40           500
  q8    84000         30           493
  q9    96000         20           486
  q10   108000        10           479
  q11   120000        0            472

(d) Quality level and resource usage of v3.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    2500        488
  q2    1280 x 720     275         482
  q3    854 x 480      125         476
  q4    640 x 360      50          470
  q5    426 x 240      35          464
  q6    256 x 144      15          458

(e) Quality level and resource usage of v4.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    5000        452
  q2    1280 x 720     550         452
  q3    854 x 480      250         452
  q4    640 x 360      100         452
  q5    426 x 240      70          452
  q6    256 x 144      30          452

(f) Quality level change due to power.
  Power   MOTION   CAM   LIGHT   LOCAL   CLOUD
  1410    q1       q1    q1      q1      q1
  1405    q2       q1    q1      q1      q1
  1400    q3       q1    q1      q1      q1
  1395    q4       q1    q1      q1      q1
  1390    q4       q1    q1      q2      q1
  1385    q4       q1    q1      q3      q1
  1380    q4       q1    q1      q4      q1
  1375    q4       q1    q1      q5      q1
  1370    q4       q1    q1      q6      q1
  1365    q4       q1    q1      q6      q1
  1360    q4       q1    q1      q6      q3
  1355    q4       q1    q1      q6      q6

Figure 7.4 Case study results. (a) Resolution vs. power. (b) Local vs. cloud processing. (c) Reliability vs. power. (d) Run time vs. power.

Camera Node (v1). This node is tasked with operating a 5MP camera at varying resolution/bitrate, and is activated upon receiving an activation signal from v0. The resolution/bitrate of the captured stream is changed at runtime, which allows us to map the different resolution/bitrate modes to the quality levels of node v1. The node consumes two other resources: bandwidth and power. At the highest quality level, this node uses ≈ 2.5 Mbit/s and ≈ 566 mAh. Table 7.4b shows resource consumption for node v1 under different quality levels, and Figure 7.4a shows the tradeoff between stream resolution vs. power consumption and bandwidth usage. Note that a change in stream resolution/bitrate is not always proportional to a change in bandwidth, due to video compression standards (e.g., H.264).

Illuminance Detection Node (v2). This node is connected to an illuminance detector and a smart light bulb with adjustable brightness.
As both of these devices are powered and operated by node v2 with minimal data transfer and delay, we consider them to be a part of node v2 itself. Similar to v1, this node is activated upon receiving an activation signal from v0. Depending on how dark or bright the general area that is being streamed by the camera on node v1 is, node v2 adjusts the brightness of the smart bulb accordingly. We keep the illuminance detector some distance away from the network, in order to prevent light flickering due to feedback loop. Table 7.4c shows resource consumption for node v1 under different quality levels. Local Storage Node (v3). This node compresses, checksum verifies, and stores the video stream received through edge (v1, v3) into a secure external hard disk drive. Table 7.4d shows resource consumption for v1 under different quality levels. Cloud Storage Node (v4). Similar to node v3, node v4 is also tasked with storing the video stream received through edge (v1, v3). However, instead of storing it locally, the stream is uploaded directly to a cloud storage. Offloading the compression and checksum verifica- tion task to the cloud allows node v4 to consume minimal power, at the cost of additional bandwidth. Table 7.4e shows resource consumption for node v4 under different quality levels. 151 Note that the bandwidth requirement for node v4 is double the amount in comparison to that of v3 at the same quality level. This is due to the fact that when v4 receives a video stream from node v1, it uploads the same stream to the cloud, effectively doubling the required bandwidth. Furthermore, we assume that while maintaining similar qualities and stream, node v4 generally more reliable than node v3 for being able to store and backup data in the cloud as shown in Figure 7.4b. 7.5.2.2 Machine Learning Setup For our machine learning dataset, similar to our synthetic experiments, we generate 500 data samples using our SMT-based solution. each sample contains a randomly selected PWR value, and the optimized quality levels for the 5 nodes in the Pi network shown in Figure 7.3. We train our ANN with this dataset for 10 epochs (full iterations) using the k-fold cross validation method described in Section 7.4. 7.5.2.3 Experimental Results We run a variety of experiments on the setup described above and report our findings below. Reliability vs. Power. We now observe the tradeoff between reliability and power in our multi-layer network of nodes. We start off by assigning the PWR resource bPWR V to as shown in Figure 7.4c. From this point we measure the best obtainable reliability 0, 1410 ⟨ ⟩ and tighten the bound by 5 mAh on each iteration in a similar manner as our synthetic experiment. Figure 7.4d shows the run time for the same experiment. The run time is low at the beginning due to the system having adequate power flow to operate all nodes at a near maximum reliability, and therefore having very few SMT constraints to solve. Observe that our machine learning-based model exhibits similar behavior as seen during our synthetic experiments. The reliability drop is a steady decline, whereas the run time does not very to a great degree. However, for bother models this changes when bPWR V is 0, 1410 ⟩ ⟨ and onward, as the available power is no longer sufficient for all nodes to operate at maximum quality 152 and reliability. We now observe the changes in quality levels for which the burndown in reliability has occurred. 
We now observe the changes in quality levels for which the drop in reliability has occurred. From Table 7.4f, we can see that the model gradually lowered the quality level of v0 to q4 first, due to the small difference in reliability between its quality levels. Afterwards, the quality level of v3 (local node) is lowered instead of v4 (cloud node), due to v3 consuming more power than v4. Once the quality level of v3 has reached the lowest point, lowering the available power further finally caused the model to lower the quality level of v4. Finally, below 900 mAh, the available power was not sufficient for keeping the nodes running, even at the lowest quality levels, and therefore the network was shut down.

We run the same experiment again with available bandwidth as the tightening resource. In this case, we observe the exact opposite behavior, where the quality level of v4 (cloud node) is lowered first, and then v3 (local node). This is due to the fact that v4 requires double the bandwidth when compared to v3, as shown in Table 7.4e.

7.6 Conclusion
In this chapter, we developed a generalized model of a streaming sensor network CPS as a network of producers and consumers. Our approach incorporates tradeoffs between output quality and resource utilization. These tradeoffs were articulated as a multi-objective optimization problem with the goal of minimizing resource consumption while maximizing the reliability (and quality) of devices (or tasks) in a network. To tackle the aforementioned optimization challenge, we provided an efficient technique based on constraint solving utilizing SMT solvers to identify the ideal processing quality selection for each node in the network while respecting resource limitations and minimizing error. We further improve this work by incorporating machine learning, dramatically speeding up the resource optimization. This is a significant problem, since sensor network applications frequently require stream processing, which entails a complicated network of processing nodes where data is collected, analyzed, and then communicated to succeeding nodes. We put our work into practice, and provide the results of an experiment using an IoT device network.

CHAPTER 8
RELATED WORK
In this chapter, we summarize a portion of the vast body of work in the field of distributed CPS that has influenced this dissertation, going as far back as the origins of distributed monitoring.

8.1 Lattice-based Distributed Monitoring
Early work on distributed monitoring shows that predicate detection in distributed computing [53, 90, 115] is, in general, NP-complete. To obtain a more efficient solution, a computation slicing technique [91] is used to reduce the computation size, which in turn results in a smaller search space for predicate detection, as far as state space is concerned. This work is later extended to an online distributed monitoring algorithm [26].

Lattice-based distributed monitoring solutions generally suffer from two shortcomings: (1) having to handle an enormous number of concurrent states, and (2) lacking methods to handle temporal properties. A methodology for detecting Basic Temporal Logic [99] partially addresses the latter issue by providing methodologies for detecting a subset of temporal operators in distributed systems. A bound-based monitoring approach [130] addresses the state-space problem, and is later extended to a more efficient technique [116] that utilizes SAT solvers.
In our work, we avoid using any lattice-based distributed monitoring approaches and, by extension, avoid having to handle an unmanageable number of concurrent states.

8.2 Runtime Monitoring in CPS
Accurate time-keeping for CPS was thoroughly investigated by the Roseline project [98]. The Roseline team addresses the problem that local clocks have little, if any, knowledge of the quality of time needed by the software, nor any ability to adapt to it. They achieve this by rethinking and re-engineering how the knowledge of time is handled across a computing system's hardware and software, and by driving accurate timing information deep into the software system.

Assuming perfect time synchrony, an offline toolbox called S-TALIRO [5] searches for falsifying trajectories based on robust MTL [67] semantics. S-TALIRO can analyze arbitrary Simulink models or user-defined functions that model the system, and operates using randomized testing based on stochastic optimization techniques such as Monte Carlo methods [87] and ant colony optimization [37].

An online monitoring technique [35] for STL [36] over continuous and hybrid systems employs an efficient algorithm for computing the robustness degree with which a piecewise-continuous signal satisfies or violates an STL formula.

An efficient monitoring solution [34] utilizes dynamic programming algorithms for online monitoring of the state robustness of MTL [67] specifications with past-time operators. The authors provide an approach for predictive monitoring by computing the robustness of MTL with unbounded past and bounded future temporal operators over sampled traces of CPS. However, in order to do so, prior knowledge of the full dynamical model of the system is required.

Our work relating to predicate detection is closer to the online monitoring approach [33], where robust online monitoring of partial traces is formalized. However, this work assumes worst-case a priori bounds on signal values, without factoring in system dynamics.

Differential dynamic game logic [105] aims to demonstrate how the satisfaction of a temporal property is affected by imperfect implementations. This work is similar to a conformance testing framework [1], where the authors quantify the closeness between two systems via a distance measure between their outputs, and study how the satisfaction of a temporal property is affected by timing inaccuracies.

A control-theoretic software monitoring solution [83] is proposed for coordinating time predictability and memory utilization in runtime monitoring of systems that interact with the physical world. This method maximizes memory utilization while employing a minimally intrusive monitoring tactic.

A tool called Brace [133] allows users to attempt to minimize false positives and false negatives while trying to stay under a given threshold for the computation overhead of CPS. The authors do not provide any guarantees of completely eliminating false positives or false negatives, only of minimizing them. This aspect differs from our approach to monitoring distributed signals in CPS, as our approach guarantees no false positives.

Another tool, called ModelPlex [89], is introduced as a method for ensuring that verification results established for models carry over to the running CPS.
ModelPlex also allows the said models to account for the effect of environmental variances and disturbances on a CPS, while considering only the relevant part of the surrounding physics.

In the medical field, a specification language called DRTV [63] is introduced in order to specify vital real-time data sampled by medical devices. DRTV also allows for runtime monitoring of temporal properties originating from clinical guidelines.

A hybrid approach to runtime monitoring in CPS called Extended Hidden Markov Systems [112] is explored, where the systems under inspection comprise both integer-valued and real-valued variables.

While the above works propose various techniques and tools for monitoring CPS, they account for neither partial synchrony nor system dynamics in their monitoring methodologies.

8.3 Asynchronous Distributed Monitoring
The notion of computation slicing [91] introduces the ability to monitor distributed systems in an asynchronous setting. In this approach, the slice of a computation with respect to a predicate is a sub-computation with the least number of consistent cuts that contains all consistent cuts of the computation satisfying the predicate. This work is later extended to a distributed setting [26], where a distributed algorithm is presented for computing the slice of a distributed computation with respect to a regular predicate.

The work on distributed monitoring of concurrent and asynchronous systems [14] investigates the problem of distributed monitoring under time asynchrony, with application to distributed fault management in telecommunication networks. To this end, the authors combine compositional unfoldings to handle concurrency with a variant of graphical algorithms and belief propagation, originating from statistics and information theory. This work is later further extended [44], where the authors study the diagnosis of distributed asynchronous systems with concurrency. In this work, diagnosis is performed by a peer-to-peer distributed architecture of supervisors. This approach relies on Petri net [103] unfoldings and event structures as a means to manipulate trajectories of systems with concurrency.

A tool called DIANA [109] is introduced in order to monitor temporal properties of distributed systems. The authors use past-time distributed temporal logic (a variant of past-time linear temporal logic) as the specification language. In this approach, the notion of a knowledge vector is introduced, where each process is kept aware of other processes' local states. This approach, however, suffers from producing false negatives.

A decentralized runtime verification technique for LTL specifications [96] demonstrates a method for runtime verification of asynchronous distributed programs against the 3-valued semantics of LTL specifications. This approach, however, also suffers from false negative results. On the other hand, a temporal logic predicate detection approach [99] introduces the concept of a compact representation of all global cuts that satisfy a predicate.

The approaches mentioned above all operate within a fully asynchronous setting. In contrast to these approaches, we leverage a practical assumption and employ an off-the-shelf clock synchronization algorithm to limit the time window of asynchrony.

A method for designing parallel algorithms [54] is proposed to solve constrained combinatorial optimization problems such as the marriage problem, the shortest path problem, the market clearing price problem, and so on.
The authors achieve this by transforming these problems into a search problem, in which an element satisfying an appropriate predicate in a distributive lattice is obtained.

An approach for detecting latent bugs caused by concurrency and race conditions among concurrent processes [116] is proposed by its authors. In this work, the authors propose a method for detecting errors and monitoring system constraints in partially synchronous distributed systems using a monitoring framework with SMT as its foundation.

In the work on runtime monitoring of LTL formulas for synchronous distributed systems in the absence of a central data collection point [9], the authors propose an approach where LTL formulas are decomposed into sub-formulas, such that satisfaction or violation of specifications can be detected by local monitors alone. This work is later expanded upon with the introduction of a synchronous global clock [28], in which monitors are organized as a tree across the distributed system, and each child feeds intermediate results to its parent. A similar approach using LTL, but for stream runtime verification of CPS [31], is later proposed. However, these approaches assume perfectly synchronous clocks, which is rarely achievable.

The four-valued Runtime Verification Linear Temporal Logic (RV-LTL) [11] introduces a logic where the system behavior either (i) satisfies the monitored property, (ii) violates the property, (iii) will presumably violate the property, or (iv) will presumably conform to the property in the future, once the system has stabilized. This work is later improved upon with a fault-tolerant verification technique, LTL2k+4 [16], proposed for asynchronous systems. In our automata-based monitoring technique, we used LTL3 over four-valued LTL or LTL2k+4, because the unknown verdict in LTL3 was sufficient for our monitoring purposes, and the distinction between 'will presumably violate the property' and 'will presumably satisfy the property' served no additional benefit.

8.4 Synchronous Distributed Monitoring
Two approaches for runtime monitoring of LTL formulas have been studied by the authors of the monitoring framework THEMIS [41]. The first approach introduces a data structure that keeps track of the execution of an automaton, has predictable parameters and size, and guarantees strong eventual consistency. The second approach defines decentralized specifications, wherein multiple specifications are provided for separate parts of the system. The THEMIS framework can be used to analyze systems using both approaches.

An adaptive synchronous parallel method for distributed machine learning [131] is explored, where a performance monitoring model adaptively adjusts the synchronization method of each computing node with the parameter server by taking into account the overall performance of each node, ensuring improved accuracy. Furthermore, this technique guards against the machine learning model being influenced by irrelevant tasks in the same cluster.

A hybrid approach to monitoring is taken by the authors of the monitoring tool SMEDL [132], where low-level properties are checked synchronously, while higher-level ones are checked asynchronously. SMEDL can be used to construct and deploy monitors based on an architecture specification.
The specification language LOLA [119], intended for industrial use, provides a syntactic characterization of efficiently monitorable specifications, for which the space requirement of the online monitoring algorithm is independent of the size of the trace and linear in the size of the specification. LOLA can express properties involving both the past and the future. Both online and offline verification techniques using temporal logics [107] are studied by the authors of the specification language LOLA, who present the online and offline monitoring algorithms in detail. To this end, the authors use a temporal logic for stream runtime verification [17].

A novel, efficient two-layered monitoring technique [119] aims to overcome the time and space constraints introduced by most synchronous monitoring approaches. The first layer is imprecise but efficient, whereas the second layer is exact but (relatively) inefficient. The two-layered monitor also supports the use of O(1)-sized Hybrid Logical Clocks. Another approach that aims to overcome the time and space constraints is a monitoring method that incorporates a control-based notion of synchronization on CPS [60], which includes dividing a node's main-loop program into several processes and using two trigger signals to activate synchronization control.

The impact of synchronous and asynchronous monitoring instrumentation on runtime overheads in the context of a runtime verification framework for actor-based systems [21] is thoroughly studied, and the authors show that, in such a context, asynchronous monitoring incurs substantially lower overhead costs. They also demonstrate how, for certain properties that require synchronous monitoring, a hybrid approach can be used that ensures timely violation detection for the important events while, at the same time, incurring lower overhead costs that are closer to those of an asynchronous instrumentation.

A solution to the decentralized monitoring problem for the more general setting of stream runtime verification [31] is provided by its authors, and a property on specifications is also introduced that guarantees that online monitoring can be performed with bounded resources. An algorithm for distributing and monitoring LTL formulas [10] employs a technique where satisfaction or violation of specifications can be detected by local monitors alone, even when the system's implementation details are hidden from the user. However, these approaches have the shortcoming of assuming a global clock across all distributed processes.

Predicate detection for asynchronous systems [114] has been studied extensively, where three distinct detection modalities are achieved by introducing the notions of 'definitely occurred before' and 'possibly occurred before' event orderings. However, doing so makes the assumptions needed to evaluate the happened-before relationship too strong. In this dissertation, we utilize HLC, which not only is more realistic but also decreases the level of concurrency. Finally, an automata-based fault-tolerant verification technique [64] is proposed for synchronous systems with no clock skew across the distributed processes. A fault-tolerant distributed membership protocol for determining the set of active nodes in a synchronous distributed real-time system [66] is presented. We, in contrast, use a clock synchronization algorithm which guarantees bounded clock skews.
Our solution is also SMT-based and, to our knowledge, is the first SMT-based distributed monitoring algorithm for LTL, which results in better scalability.

8.5 Partially Synchronous Distributed Monitoring
In the context of monitoring partially synchronous systems, the feasibility of monitoring partially synchronous distributed systems in order to detect latent bugs was first investigated in [116]. The authors provide a monitoring framework where both the system constraints and the latent bugs are modeled as SMT formulas, and the latent bugs are identified using SMT solvers. This technique was later generalized to full LTL [49], where the presence of latent bugs is detected using SMT solvers in a discrete setting. The authors introduce two monitoring techniques in which the LTL specification is either represented by a deterministic finite automaton or handled by a progression-based formula rewriting technique, reducing the distributed runtime verification problem to an SMT problem.

SPIDER [102], an automated tool for identifying data races in distributed system traces, is introduced to handle non-deterministic discrete event orderings. However, these approaches cannot fully capture the continuous-time and continuous-valued behavior of CPS.

There is extensive work on identifying a subclass of systems [22] for which convergence features may be confirmed using the proof of convergence for the related discrete-time shared-state system. The method is extended to systems in which an agent's state evolves continuously over time. The proof approach was formalized in the PVS interface for timed I/O automata and used to verify the convergence of a mobile agent pattern formation algorithm.

A failure detector for partially synchronous distributed systems [121] is proposed, which is based on a clock synchronization algorithm. A solution to the processor group membership problem [29] is achieved by precisely specifying the problem in order to define the system model and failure assumptions; the author then provides two protocols for solving this problem.

A technique to monitor predicates on a partially synchronous distributed system by retiming continuous signals [95] is explored. While this approach improves monitoring efficiency by leveraging knowledge about system dynamics, it is limited to monitoring predicates and cannot capture temporal behavior. A method for runtime monitoring of blockchain executions for partially synchronous distributed computations [50] is proposed, where the specification language is metric temporal logic [67].

The effects of the impedance mismatch between the monitor and the underlying program on the detection of conjunctive predicates [130] are analyzed. An interesting observation of this work is that the authors identify a small interval where the monitor assumptions are hypersensitive to the underlying program environment.

A domain-specific language called PSync [38], based on the Heard-Of model [24, 25], is demonstrated, where asynchronous faulty systems are viewed as synchronous ones with an adversarial environment that simulates asynchrony and faults by dropping messages. While the approaches above provide various techniques for monitoring partially synchronous discrete systems, they are unable to fully capture the continuous nature of CPS.
8.6 Decentralized Distributed Monitoring
There is a rich literature dealing with decentralized predicate detection in the discrete-time setting. These works range from detection of regular discrete-time predicates [26] to detecting lattice-linear predicates over discrete states [54]. There is recent work on performing detection for a regular subset of Computation Tree Logic [108] that aims to avoid the state explosion problem. The textbooks by Garg [52] and Kshemkalyani and Singhal [68] elaborate extensively on decentralized monitoring in discrete-time settings. By contrast, we are concerned with monitoring continuous-time signals in a decentralized setting; such signals have uncountably many events and necessitate new techniques. For instance, one cannot iterate through events as done in the discrete setting. Recent works [94, 95] monitor temporal formulas over partially synchronous analog distributed systems; however, they only find one satisfaction, not all. Moreover, their solution is centralized.

Generally, there is a plethora of work on monitoring temporal logic properties, especially Linear Temporal Logic (LTL) and Metric Temporal Logic (MTL). Notably, these works involve using a three-valued MTL for monitoring in the presence of failures and non-FIFO communication channels [7], monitoring satisfaction of an LTL formula [10], using a three-valued LTL for distributed systems with asynchronous properties [96], using a tableau technique for three-valued LTL [8], and, finally, using a past-time distributed temporal logic that emphasizes distributed properties over time [109]. However, all these methods either focus on centralized monitoring or work in discrete settings. In our work, we provide methodologies for decentralized monitoring in continuous-time settings.

8.7 Monitoring Reliability in CPS
Resource trade-offs are broadly studied with respect to monitoring reliability in CPS, such as in the work exploring the trade-offs between power and reliability of wireless sensor networks [30]. This work proposes a model for evaluating the reliability of WSNs considering the battery level as a key factor. The problem of modeling and evaluating the coverage-oriented reliability of CPS subject to common-cause failures is explored in [110], where the proposed methodology takes advantage of reduced ordered binary decision diagrams, which is similar to our binary probing technique. A methodology based on automatic generation of a fault tree [111] is proposed in order to evaluate the reliability and availability of CPS when permanent faults occur on network devices.

One stream of work in CPS is concerned with security-related trade-offs, where security comes at a cost in energy or performance. A relevant survey in this regard provides a classification of existing security concerns and research [3]. A method to determine when to inject cryptographic checks without interfering with control tasks [75] is demonstrated by its authors, where the general idea behind the methodology is maximizing security checks while maintaining a predefined level of control quality. In a similar work, the authors propose a feedback scheduling technique for maintaining network quality of service in wireless sensor networks [125]. Both of these works fall under soft real-time constraints, where essentially security is traded off against deadline adherence. A notable literature review further studies and elaborates on the challenges in designing reliable CPS [74].
The work emphasizes the necessity of raising the level of abstraction in designing reliable CPS, as current networking technologies often do not provide an adequate foundation for CPS.

There is a line of work in the parallel and distributed processing domain on distributing power resources efficiently. For example, a method to bound the energy consumption of a message passing interface program [106] is proposed, where the authors use a linear programming model that knows the execution time of jobs on machines and the effect of changing the frequency on their speedup. This work has since been extended to a scalable method for determining individual task power bounds in a distributed setting given a global power bound [84]. Our work involving resource consumption in a producer-consumer network draws inspiration from existing work that addresses the problem of energy consumption in a producer-consumer network using learning mechanisms to reduce the energy consumption of the overall system [82].

Many researchers target the problem of finding optimal energy savings without impacting performance. One notable study observes the trade-offs between energy and delay for a wide set of applications [48]; the work also studies metrics that can be used to predict memory or communication bottlenecks. Multiple works attempt to tackle this bottleneck problem on a single processor [62, 78]; these works mainly propose alternative dynamic voltage and frequency scaling strategies that maintain the same performance at reduced energy consumption. The work on formal control techniques for power-performance management [124] discusses the effectiveness of using control theory in power management. Furthermore, the series of works [126, 127, 128] constructs an integer linear programming (ILP) model to determine the minimum energy that a program can consume on a single processor. Our work on the multi-resource, multi-node optimization problem is similar to the work on managing the energy-security trade-off in a distributed cyber-physical system [120], with the difference that our approach is more generalized.

CHAPTER 9
CONCLUSION
In this chapter, we summarize our work and highlight our contributions for each methodology. We then discuss our current ongoing work and short-term goals. Finally, we conclude by exploring potential future avenues of research that could be logical next steps of our work.

9.1 Summary
We begin this dissertation with distributed runtime monitoring. Our proposed techniques take an LTL formula and a distributed computation as input and, assuming a bounded clock skew among all processes, chop the computation into multiple segments before applying the automata-based and progression-based monitoring algorithms, implemented as an SMT decision problem, to verify the correctness of the formula. We carried out rigorous synthetic experiments using LTL formulas of varying complexity. Although we attempted to keep our synthetic experiments as close to real-world scenarios as possible, we acknowledge that in these synthetic experiments (as well as any synthetic experiments in our following works), there could be missing environmental variables (which would otherwise be present in real-world scenarios) that could influence our run time. However, to partially account for this shortcoming, we carried out case studies on Cassandra consistency scenarios and a NASA air traffic control dataset.
Following that work, we show an online predicate detection strategy for distributed signals that do not share a global clock. To make the problem tractable, we use causality analysis between real-valued signals, a reasonable assumption on the maximum clock skew among local clocks, and rough knowledge of the system dynamics. We also studied the influence of signal dynamics information on monitoring efficiency. By testing on a real network of autonomous cars, a simulated network of UAVs, and a simulated water distribution system, we arrived at numerous noteworthy findings. Our method may be used to successfully monitor a distributed CPS in an online setting.

For distributed CPS, we presented an approach for monitoring specifications expressed in signal temporal logic (STL), where continuous-time and continuous-valued signals from a group of agents do not share a global clock. Our method relies on an off-the-shelf clock synchronization solution, such as NTP, to ensure a bounded maximum clock skew across all agents in the system. Leveraging our work in predicate detection, we also presented a signal retiming approach that effectively aligns continuous signals in order to detect potential STL violations. To address the complexity, we reduce our runtime monitoring problem to a basic SMT solving problem and cut the distributed signals into a sequence of smaller segments. We also presented a formula progression approach, similar to our work with distributed systems, which takes a distributed signal and an STL formula as input and outputs another STL formula that depicts the formula's progress through the signals. We also presented experimental results from the monitoring of an unmanned aerial vehicle (UAV) fleet and a water distribution system.

We then extend our work to decentralized monitoring, where we perform online conjunctive predicate detection for distributed signals. Our algorithm returns all possible violations of the predicate, which in turn allows us to identify and eliminate bugs from distributed systems regardless of the actual clock drift.

Finally, we provided a generalized model of a streaming network CPS as a producer-consumer network. Our approach incorporates tradeoffs between output quality and resource utilization. These tradeoffs were articulated as a multi-objective optimization problem with the goal of lowering resource utilization while maximizing the reliability (and quality) of devices (or jobs) in a network. To tackle the aforementioned optimization challenge, we provided an efficient technique based on constraint solving utilizing SMT solvers to identify the ideal processing quality selection for each node in the network while respecting resource limitations and minimizing error. This is a significant problem, since network applications frequently require stream processing, which entails a complicated network of processing nodes where data is collected, analyzed, and then communicated to subsequent nodes. We have fully implemented our approach and report experimental findings on an IoT device network.

9.2 Ongoing Work
We have thus far discussed methodologies for monitoring various formal specifications on partially synchronous distributed CPS under both centralized and decentralized monitoring settings. However, in every case, we assume that all the agents in these systems are honest, that is, the agents follow the intended behaviors and protocols without malicious intent.
Our current work involves designing secure monitoring techniques for both centralized and decentralized distributed CPS, where ensuring data privacy is the primary objective. We explain the necessity of data privacy in CPS with the following example. Alice uses health monitoring wearables to measure her heart rate, blood glucose level, etc. Alice's hospital has a server (monitor) that would like to monitor Alice's health data and, if a certain specification is met (e.g., Alice's heart rate is above a threshold and her glucose level is below a threshold), send an alert to Alice's caregiver. However, Alice does not wish to reveal her personal health data to the monitor, and the monitor does not want to reveal its specification to Alice. In other words, both Alice and the monitor wish runtime verification to be performed on Alice's data using the monitor's specification, while keeping each party's data private.

9.2.1 Monitoring with Secure Multi-Party Computation
Secure Multi-Party Computation, or simply Multi-Party Computation (MPC) [57], is a cryptographic protocol that allows multiple parties to jointly compute a function over their individual private inputs without ever revealing those inputs to each other. As an example of MPC, consider a scenario where three friends, Alice, Bob, and Charlie, wish to compute their average salary while never disclosing their actual salaries to one another. Let Sa, Sb, and Sc be the salaries of Alice, Bob, and Charlie, respectively. Only Alice knows the value of Sa, only Bob knows the value of Sb, and only Charlie knows the value of Sc. Alice privately splits her salary amount into three random pieces, such that Sa = a1 + a2 + a3. Bob and Charlie do the same, that is, Sb = b1 + b2 + b3 and Sc = c1 + c2 + c3. Now, Alice shares a2 with Bob and a3 with Charlie; Bob shares b1 with Alice and b3 with Charlie; Charlie shares c1 with Alice and c2 with Bob. Alice then computes S1 = a1 + b1 + c1, Bob computes S2 = a2 + b2 + c2, and Charlie computes S3 = a3 + b3 + c3. It should be noted that it is impossible to extract any salary amount from S1, S2, or S3 alone. However, if Alice, Bob, and Charlie now share S1, S2, and S3 with each other and compute (S1 + S2 + S3)/3, then the desired average salary is obtained without any party revealing their salary amount to the others. A small programmatic sketch of this additive secret-sharing scheme is given at the end of this subsection.

While the above example is fairly straightforward, MPC provides more complex protocols with which arithmetic operations can be carried out without any loss of precision [42]. In our work, we are mostly interested in performing addition and multiplication operations with MPC protocols efficiently, as we generally rely on these two operations for our retiming approach (recall (4.4e)). However, MPC does come with its own set of challenges. While addition protocols can be executed locally (i.e., on agents), multiplication protocols require agents to share partial data with each other multiple times before being able to compute the solution. Naturally, this is an issue for runtime verification, as various factors (e.g., network latency, workload, agent availability) can influence communication delay and, by extension, run time.

We have already made significant headway in addressing some of the challenges presented by runtime verification using MPC. We hope to continue our work in this direction and make significant progress in the near future.
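The following is a minimal illustrative sketch (in Python, with made-up salary values) of the additive secret-sharing computation described above. It is meant only to make the arithmetic concrete; it is not the MPC protocol used in our monitoring framework, and in an actual deployment each partial sum would be computed by a different party, with only those sums exchanged.

    import random

    MOD = 2**31  # in practice, shares live in a finite group so that no single share leaks information

    def additive_shares(secret, n=3):
        # split `secret` into n random pieces that sum to `secret` modulo MOD
        pieces = [random.randrange(MOD) for _ in range(n - 1)]
        pieces.append((secret - sum(pieces)) % MOD)
        return pieces

    # hypothetical private salaries (each known only to its owner)
    Sa, Sb, Sc = 70_000, 85_000, 64_000

    a1, a2, a3 = additive_shares(Sa)  # Alice keeps a1, sends a2 to Bob, a3 to Charlie
    b1, b2, b3 = additive_shares(Sb)  # Bob keeps b2, sends b1 to Alice, b3 to Charlie
    c1, c2, c3 = additive_shares(Sc)  # Charlie keeps c3, sends c1 to Alice, c2 to Bob

    S1 = (a1 + b1 + c1) % MOD         # computed locally by Alice
    S2 = (a2 + b2 + c2) % MOD         # computed locally by Bob
    S3 = (a3 + b3 + c3) % MOD         # computed locally by Charlie

    # only S1, S2, S3 are exchanged; their sum reveals Sa + Sb + Sc, hence the average
    total = (S1 + S2 + S3) % MOD
    print("average salary:", total / 3)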
9.3 Future Work
The work done in this dissertation paves the way for various intriguing directions for further investigation. In this section, we discuss the possible avenues of future work that are currently under our consideration.

First of all, for the monitoring approaches proposed in Chapters 3, 4, and 5, a study of the trade-off between accuracy and scalability can be conducted. We can define the accuracy of verdicts as follows:

accuracy = (actual number of correct verdicts − number of missed verdicts) / (actual number of correct verdicts)

An interesting scope of research would be to observe and report the relationship between the degradation of accuracy and the improvement of run time for the aforementioned monitoring techniques.

While monitoring predicates on distributed signals, our approach finds the first global states that violate a predicate in a segment. A crucial step in debugging distributed CPS is to find all such states. Thus, it is important to investigate data structures that can efficiently represent a set of global states of distributed continuous signals that violate a predicate. In the discrete setting, computation slices [91] are an example of such a data structure. One way to achieve this is by using the long-known notion of regions in timed automata [4].

Because we are reducing the monitoring problem to an SMT solving problem, the problem may become undecidable in some cases. The inevitable next step is to identify the fragment of STL for which the problem is undecidable.

Another conceivable aim is for the monitor of our framework to become fully distributed, as we assume a central monitor in all cases in this dissertation. Having a centralized monitor also exposes our techniques to a single point of failure. Furthermore, we have every reason to suspect that individual monitors in the system may have faults, such as crashing or reporting false verdicts. This necessitates the development of distributed fault-tolerant monitoring techniques.

For our approach on monitoring reliability of CPS, one obvious extension of our method is to represent networks that are not necessarily acyclic, that is, networks that may include feedback loops. Another intriguing line of study is to observe and report on the trade-off between monitor reliability and runtime overhead, as well as network communication.

BIBLIOGRAPHY
[1] Abbas, H., Mittelmann, H., and Fainekos, G. (2014). Formal property verification in a conformance testing framework. In 2014 Twelfth ACM/IEEE Conference on Formal Methods and Models for Codesign (MEMOCODE), pages 155–164. IEEE.
[2] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., and Arshad, H. (2018). State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11):e00938.
[3] Alguliyev, R., Imamverdiyev, Y., and Sukhostat, L. (2018). Cyber-physical systems and their security issues. Computers in Industry, 100:212–223.
[4] Alur, R. and Dill, D. L. (1994). A theory of timed automata. Theoretical Computer Science, 126(2):183–235.
[5] Annpureddy, Y., Liu, C., Fainekos, G., and Sankaranarayanan, S. (2011). S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 254–257. Springer.
[6] Barrett, C. and Tinelli, C. (2018). Satisfiability modulo theories. Springer.
[7] Basin, D., Klaedtke, F., and Zălinescu, E. (2015). Failure-aware runtime verification of distributed systems. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015), volume 45, pages 590–603. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[8] Bataineh, O., Rosenblum, D. S., and Reynolds, M. (2019). Efficient decentralized ltl mon- itoring framework using tableau technique. ACM Transactions on Embedded Computing Systems (TECS), 18(5s):1–21. [9] Bauer, A. and Falcone, Y. (2012). Decentralised ltl monitoring. In International Sympo- sium on Formal Methods, pages 85–100. Springer. [10] Bauer, A. and Falcone, Y. (2016). Decentralised ltl monitoring. Formal Methods in System Design, 48(1):46–93. [11] Bauer, A., Leucker, M., and Schallhart, C. (2010). Comparing ltl semantics for runtime verification. Journal of Logic and Computation, 20(3):651–674. [12] Bauer, A., Leucker, M., and Schallhart, C. (2011). Runtime verification for ltl and tltl. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(4):1–64. [13] Benndorf, M. and Haenselmann, T. (2016). Time synchronization on android devices In The Tenth International Conference on Sensor for mobile construction assessment. 171 Technologies and Applications. Thinkmind. [14] Benveniste, A., Haar, S., Fabre, E., and Jard, C. (2003). Distributed monitoring of con- current and asynchronous systems. In International Conference on Concurrency Theory, pages 1–26. Springer. [15] Bhuyan, B., Sarma, H. K. D., Sarma, N., Kar, A., Mall, R., et al. (2010). Quality of service (qos) provisions in wireless sensor networks and related challenges. Wireless Sensor Network, 2(11):861. [16] Bonakdarpour, B., Fraigniaud, P., Rajsbaum, S., Rosenblueth, D. A., and Travers, C. (2016). Decentralized asynchronous crash-resilient runtime verification. In 27th Interna- tional Conference on Concurrency Theory (CONCUR 2016). Schloss Dagstuhl-Leibniz- Zentrum fuer Informatik. [17] Bozzelli, L. and Sánchez, C. (2014). Foundations of boolean stream runtime verification. In International Conference on Runtime Verification, pages 64–79. Springer. [18] Brunelli, D. and Caione, C. (2015). Sparse recovery optimization in wireless sensor networks with a sub-nyquist sampling rate. Sensors, 15(7):16654–16673. [19] Cassandra, A. (2014). Apache cassandra. Website. Available online at http://planetcassandra. org/what-is-apache-cassandra, 13. [20] Cassandras, C. G. and Lafortune, S. (2008). Introduction to discrete event systems. Springer. [21] Cassar, I. and Francalanza, A. (2015). On synchronous and asynchronous monitor instrumentation for actor-based systems. arXiv preprint arXiv:1502.03514. [22] Chandy, K. M., Mitra, S., and Pilotto, C. (2008). Convergence verification: From shared memory to partially synchronous systems. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 218–232. Springer. [23] Charron-Bost, B. (1991). Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39(1):11–16. [24] Charron-Bost, B. and Schiper, A. (2006). The heard-of model: Unifying all benign failures. EPFL Scientific Publications. [25] Charron-Bost, B. and Schiper, A. (2009). The heard-of model: computing in distributed systems with benign faults. Distributed Computing, 22(1):49–71. [26] Chauhan, H., Garg, V. K., Natarajan, A., and Mittal, N. (2013). A distributed abstrac- tion algorithm for online predicate detection. In 2013 IEEE 32nd International Symposium 172 on Reliable Distributed Systems, pages 101–110. IEEE. [27] Chen, H. (2017). Applications of cyber-physical system: a literature review. Journal of Industrial Integration and Management, 2(03):1750012. [28] Colombo, C. and Falcone, Y. (2016). 
Organising ltl monitors over distributed systems with a global clock. Formal Methods in System Design, 49(1):109–158. [29] Cristian, F. (1988). Agreeing on who is present and who is absent in a synchronous distributed system. In 1988 The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers, pages 206–207. IEEE Computer Society. [30] Dâmaso, A., Rosa, N., and Maciel, P. (2014). Reliability of wireless sensor networks. Sensors, 14(9):15760–15785. [31] Danielsson, L. M. and Sánchez, C. (2019). Decentralized stream runtime verification. In International Conference on Runtime Verification, pages 185–201. Springer. [32] De Moura, L. and Bjørner, N. (2008). Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer. [33] Deshmukh, J. V., Donzé, A., Ghosh, S., Jin, X., Juniwal, G., and Seshia, S. A. (2017). Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30. [34] Dokhanchi, A., Hoxha, B., and Fainekos, G. (2014). On-line monitoring for temporal In International Conference on Runtime Verification, pages 231–246. logic robustness. Springer. [35] Donzé, A., Ferrere, T., and Maler, O. (2013). Efficient robust monitoring for stl. In International Conference on Computer Aided Verification, pages 264–279. Springer. [36] Donzé, A. and Maler, O. (2010). Robust satisfaction of temporal logic over real-valued signals. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 92–106. Springer. [37] Dorigo, M., Birattari, M., and Stutzle, T. (2006). Ant colony optimization. IEEE computational intelligence magazine, 1(4):28–39. [38] Drăgoi, C., Henzinger, T. A., and Zufferey, D. (2016). Psync: a partially synchronous language for fault-tolerant distributed algorithms. ACM SIGPLAN Notices, 51(1):400– 415. [39] Drone Life (2019). FAA UTM project: Decentralized uas traffic management 173 demonstration. https://dronelife.com/2019/09/09/decentralized-uas-traffic-management- demonstration. [40] Dwork, C., Lynch, N., and Stockmeyer, L. (1988). Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2):288–323. [41] El-Hokayem, A. and Falcone, Y. (2020). On the monitoring of decentralized specifi- cations: semantics, properties, analysis, and simulation. ACM Transactions on Software Engineering and Methodology (TOSEM), 29(1):1–57. [42] Evans, D., Kolesnikov, V., Rosulek, M., et al. (2018). A pragmatic introduction to secure multi-party computation. Foundations and Trends® in Privacy and Security, 2(2- 3):70–246. [43] FAA (2019). DOT UAS initiatives. https://www.faa.gov/uas/programs_partnerships/ DOT_initiatives. [44] Fabre, E., Benveniste, A., Haar, S., and Jard, C. (2005). Distributed monitoring of concurrent and asynchronous systems. Discrete Event Dynamic Systems, 15(1):33–84. [45] Fainekos, G. E. and Pappas, G. J. (2007). Robust sampling for mitl specifications. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 147–162. Springer. [46] Fraigniaud, P., Rajsbaum, S., and Travers, C. (2013). Locality and checkability in wait-free computing. Distributed Computing, 26(4):223–242. [47] Fraigniaud, P., Rajsbaum, S., and Travers, C. (2020). A lower bound on the number of opinions needed for fault-tolerant decentralized run-time monitoring. Journal of Applied and Computational Topology, 4(1):141–179. [48] Freeh, V. W., Lowenthal, D. 
K., Pan, F., Kappiah, N., Springer, R., Rountree, B. L., and Femal, M. E. (2007). Analyzing the energy-time trade-off in high-performance computing applications. IEEE Transactions on Parallel and Distributed Systems, 18(6):835–848. [49] Ganguly, R., Momtaz, A., and Bonakdarpour, B. (2021). Distributed runtime verifica- tion under partial synchrony. In 24th International Conference on Principles of Distributed Systems (OPODIS 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik. [50] Ganguly, R., Xuey, Y., Jonckheere, A., Ljungy, P., Schornsteiny, B., Bonakdarpour, B., and Herlihy, M. (2022). Distributed runtime verification of metric temporal properties for cross-chain protocols. arXiv preprint arXiv:2204.09796. [51] Gareth, J., Daniela, W., Trevor, H., and Robert, T. (2013). An introduction to statistical learning: with applications in R. Spinger. 174 [52] Garg, V. (2002a). Elements of Distributed Computing. John Wiley & Sons. [53] Garg, V. K. (2002b). Elements of distributed computing. John Wiley & Sons. [54] Garg, V. K. (2020). Predicate detection to solve combinatorial optimization problems. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 235–245. [55] Garg, V. K. and Chase, C. M. (1995). Distributed algorithms for detecting conjunctive In Proceedings of 15th International Conference on Distributed Computing predicates. Systems, pages 423–430. IEEE. [56] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/- variance dilemma. Neural computation, 4(1):1–58. [57] Goldreich, O. (1998). Secure multi-party computation. Manuscript. Preliminary ver- sion, 78(110). [58] Hasabelnaby, M. (2016). Decentralized runtime verification of ltl specifications in dis- tributed systems. Master’s thesis, University of Waterloo. [59] Havelund, K. and Rosu, G. (2001). Monitoring programs using rewriting. In Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001), pages 135–143. IEEE. [60] He, F. and Zhao, S. (2008). Research on synchronous control of nodes in distributed network system. In 2008 IEEE International Conference on Automation and Logistics, pages 2999–3004. IEEE. [61] Hendry-Brogan, M. (2019). Global unmanned aerial vehicle (uav) market report. Tech- nical report, Technical report, May. [62] Hsu, C.-H. and Kremer, U. (2003). The design, implementation, and evaluation of a compiler algorithm for cpu energy reduction. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 38–48. [63] Jiang, Y., Song, H., Wang, R., Gu, M., Sun, J., and Sha, L. (2016). Data-centered IEEE transactions on runtime verification of wireless medical cyber-physical system. industrial informatics, 13(4):1900–1909. [64] Kazemlou, S. and Bonakdarpour, B. (2018). Crash-resilient decentralized synchronous In 2018 IEEE 37th Symposium on Reliable Distributed Systems runtime verification. (SRDS), pages 207–212. IEEE. [65] Ketkar, N. (2017). Introduction to keras. In Deep learning with Python, pages 97–111. 175 Springer. [66] Kopetz, H., Grünsteidl, G., and Reisinger, J. (1991). Fault-tolerant membership service In Dependable Computing for Critical in a synchronous distributed real-time system. Applications, pages 411–429. Springer. [67] Koymans, R. (1990). Specifying real-time properties with metric temporal logic. Real- time systems, 2(4):255–299. [68] Kshemkalyani, A. and Singhal, M. (2011). Distributed Computing: Principles, Algo- rithms, and Systems. 
Cambridge University Press. [69] Kuhn, M., Johnson, K., et al. (2013). Applied predictive modeling, volume 26. Springer. [70] Kuila, P. and Jana, P. K. (2014). A novel differential evolution based clustering algo- rithm for wireless sensor networks. Applied soft computing, 25:414–425. [71] Kulkarni, S. S., Demirbas, M., Madappa, D., Avva, B., and Leone, M. (2014). Logical physical clocks. In International Conference on Principles of Distributed Systems, pages 17–32. Springer. [72] Lakshman, A. and Malik, P. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40. [73] Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565. [74] Lee, E. A. (2008). Cyber physical systems: Design challenges. In 2008 11th IEEE international symposium on object and component-oriented real-time distributed computing (ISORC), pages 363–369. IEEE. [75] Lesi, V., Jovanov, I., and Pajic, M. (2017). Security-aware scheduling of embedded control tasks. ACM Transactions on Embedded Computing Systems (TECS), 16(5s):1–21. [76] Lim, K. K., Park, J., and Shon, J. G. (2019). Differential data processing technique to improve the performance of wireless sensor networks. The Journal of Supercomputing, 75(8):4489–4504. [77] Liu, L., Kong, W., Ando, T., Yatsu, H., and Fukuda, A. (2013). A survey of acceleration techniques for smt-based bounded model checking. In 2013 international conference on computer sciences and applications, pages 554–559. IEEE. [78] Lorch, J. R. and Smith, A. J. (2001). Improving dynamic voltage scaling algorithms with pace. ACM SIGMETRICS Performance Evaluation Review, 29(1):50–61. 176 [79] Maler, O. and Nickovic, D. (2004). Monitoring temporal properties of continuous signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, pages 152–166. Springer. [80] Manna, Z. and Pnueli, A. (2012). Temporal verification of reactive systems: safety. Springer Science & Business Media. [81] Mattern, F. et al. (1988). Virtual time and global states of distributed systems. Univ., Department of Computer Science. [82] Medhat, R., Bonakdarpour, B., and Fischmeister, S. (2018). Energy-efficient multiple producer-consumer. IEEE Transactions on Parallel and Distributed Systems, 30(3):560– 574. [83] Medhat, R., Bonakdarpour, B., Kumar, D., and Fischmeister, S. (2015). Runtime moni- toring of cyber-physical systems under timing and memory constraints. ACM Transactions on Embedded Computing Systems (TECS), 14(4):1–29. [84] Medhat, R., Funk, S., and Rountree, B. (2017). Scalable performance bounding under multiple constrained renewable resources. In Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, pages 1–8. [85] Mehlitz, P., Giannakopoulou, D., and Shafiei, N. (2019). Analyzing airspace data with race. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pages 1–10. IEEE. [86] Mehmood, I., Ullah, A., Muhammad, K., Deng, D.-J., Meng, W., Al-Turjman, F., Sajjad, M., and de Albuquerque, V. H. C. (2019). Efficient image recognition and retrieval on iot-assisted energy-constrained platforms from big data repositories. IEEE Internet of Things Journal, 6(6):9246–9255. [87] Metropolis, N. and Ulam, S. (1949). The monte carlo method. Journal of the American statistical association, 44(247):335–341. [88] Mills, D., Martin, J., Burbank, J., and Kasch, W. (2010). Network time protocol version 4: Protocol and algorithms specification. 
RFC 5905, RFC Editor. [89] Mitsch, S. and Platzer, A. (2016). Modelplex: Verified runtime validation of verified cyber-physical system models. Formal Methods in System Design, 49(1):33–74. [90] Mittal, N. and Garg, V. K. (2001). On detecting global predicates in distributed compu- tations. In Proceedings 21st International Conference on Distributed Computing Systems, pages 3–10. IEEE. [91] Mittal, N. and Garg, V. K. (2005). Techniques and applications of computation slicing. 177 Distributed Computing, 17(3):251–277. [92] Mittal, V., Gupta, S., and Choudhury, T. (2018). Comparative analysis of authentica- tion and access control protocols against malicious attacks in wireless sensor networks. In Smart computing and informatics, pages 255–262. Springer. [93] Mogull, R. and Securosis, L. (2007). Understanding and selecting a data loss prevention solution. Technicalreport, SANS Institute, 27. [94] Momtaz, A., Abbas, H., and Bonakdarpour, B. (2023). Monitoring signal temporal logic in distributed cyber-physical systems. In Proceedings of the ACM/IEEE 14th Inter- national Conference on Cyber-Physical Systems (with CPS-IoT Week 2023), ICCPS ’23, page 154–165, New York, NY, USA. Association for Computing Machinery. [95] Momtaz, A., Basnet, N., Abbas, H., and Bonakdarpour, B. (2021). Predicate mon- In International Conference on Runtime itoring in distributed cyber-physical systems. Verification, pages 3–22. Springer. [96] Mostafa, M. and Bonakdarpour, B. (2015). Decentralized runtime verification of ltl specifications in distributed systems. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 494–503. IEEE. [97] Moura, L. d. and Bjørner, N. (2008). Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer. [98] National Science Foundations (2014). Revolutionizing how we keep track of time in cyber-physical systems. https://nsf.gov/news/news_summ.jsp?cntnid=131691. [99] Ogale, V. A. and Garg, V. K. (2007). Detecting temporal logic predicates on distributed In International Symposium on Distributed Computing, pages 420–434. computations. Springer. [100] Pant, Y. V., Abbas, H., and Mangharam, R. (2017). Smooth operator: Control using the smooth robustness of temporal logic. In 2017 IEEE Conference on Control Technology and Applications (CCTA), pages 1235–1240. IEEE. [101] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blon- del, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830. [102] Pereira, J. C., Machado, N., and Sousa Pinto, J. (2020). Testing for race conditions in distributed systems via smt solving. In International Conference on Tests and Proofs, pages 122–140. Springer. 178 [103] Petri, C. A. and Reisig, W. (2008). Petri net. Scholarpedia, 3(4):6477. [104] Pnueli, A. (1977). The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46–57. ieee. [105] Quesel, J.-D. (2013). Similarity, logic, and games: bridging modeling layers of hybrid systems. PhD thesis, Univ., Fak. II, Department für Informatik. [106] Rountree, B., Lowenthal, D. K., Funk, S., Freeh, V. W., De Supinski, B. R., and Schulz, M. (2007). Bounding energy consumption in large-scale mpi programs. In SC’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–9. IEEE. [107] Sánchez, C. (2018). 
Online and offline stream runtime verification of synchronous systems. In International Conference on Runtime Verification, pages 138–163. Springer. [108] Sen, A. and Garg, V. K. (2004). Detecting temporal logic predicates in distributed programs using computation slicing. In Principles of Distributed Systems: 7th Interna- tional Conference, OPODIS 2003, La Martinique, French West Indies, December 10-13, 2003, Revised Selected Papers 7, pages 171–183. Springer. [109] Sen, K., Vardhan, A., Agha, G., and Rosu, G. (2004). Efficient decentralized moni- toring of safety in distributed systems. In Proceedings. 26th International Conference on Software Engineering, pages 418–427. IEEE. [110] Shrestha, A., Xing, L., and Liu, H. (2007). Modeling and evaluating the reliability of wireless sensor networks. In 2007 Annual Reliability and Maintainability Symposium, pages 186–191. IEEE. [111] Silva, I., Guedes, L. A., Portugal, P., and Vasques, F. (2012). Reliability and availabil- ity evaluation of wireless sensor networks for industrial applications. Sensors, 12(1):806– 838. [112] Sistla, A. P., Žefran, M., and Feng, Y. (2011). Runtime monitoring of stochastic cyber- physical systems with hybrid state. In International Conference on Runtime Verification, pages 276–293. Springer. [113] Sodhro, A. H., Chen, L., Sekhari, A., Ouzrout, Y., and Wu, W. (2018). Energy efficiency comparison between data rate control and transmission power control algorithms for wireless body sensor networks. International Journal of Distributed Sensor Networks, 14(1):1550147717750030. [114] Stoller, S. D. (1997). Detecting global predicates in distributed systems with clocks. In International Workshop on Distributed Algorithms, pages 185–199. Springer. [115] Stoller, S. D. and Schneider, F. B. (1995). Verifying programs that use causally-ordered 179 message-passing. Science of computer programming, 24(2):105–128. [116] Tekken Valapil, V., Yingchareonthawornchai, S., Kulkarni, S., Torng, E., and Demir- bas, M. (2017). Monitoring partially synchronous distributed systems using smt solvers. In International Conference on Runtime Verification, pages 277–293. Springer. [117] USNRC (2021a). Emergency core cooling systems. https://www.nrc.gov/docs/ML1122/ML11223A220.pdf. [118] USNRC (2021b). Pressurized water reactor systems. https://www.nrc.gov/reading- rm/basic-ref/students/for-educators/04.pdf. [119] Valapil, V. T., Kulkarni, S., Torng, E., and Appleton, G. (2020). Efficient two-layered monitor for partially synchronous distributed systems (technical report). arXiv preprint arXiv:2007.13030. [120] Vu, A.-D., Medhat, R., and Bonakdarpour, B. (2019). Managing the security-energy In Proceedings of the 10th ACM/IEEE tradeoff in distributed cyber-physical systems. International Conference on Cyber-Physical Systems, pages 118–128. [121] Widder, J., Lann, G. L., and Schmid, U. (2005). Failure detection with booting in In European Dependable Computing Conference, pages partially synchronous systems. 20–37. Springer. [122] Wolf, W. (2009). Cyber-physical systems. Computer, 42(03):88–89. [123] Wong, T.-T. and Yeh, P.-Y. (2019). Reliable accuracy estimates from k-fold cross validation. IEEE Transactions on Knowledge and Data Engineering, 32(8):1586–1594. [124] Wu, Q., Juang, P., Martonosi, M., Peh, L.-S., and Clark, D. W. (2005). Formal control techniques for power-performance management. IEEE micro, 25(5):52–62. [125] Xia, F., Ma, L., Dong, J., and Sun, Y. (2008). Network qos management in cyber- physical systems. 
In 2008 International Conference on Embedded Software and Systems Symposia, pages 302–307. IEEE. [126] Xie, F., Martonosi, M., and Malik, S. (2003). Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 49–62. [127] Xie, F., Martonosi, M., and Malik, S. (2004). Intraprogram dynamic voltage scaling: Bounding opportunities with analytic modeling. ACM Transactions on Architecture and Code Optimization (TACO), 1(3):323–367. [128] Xie, F., Martonosi, M., and Malik, S. (2005). Bounds on power savings using runtime 180 dynamic voltage scaling: an exact algorithm and a linear-time heuristic approximation. In Proceedings of the 2005 international symposium on Low power electronics and design, pages 287–292. [129] Xu, S. and Chen, L. (2008). A novel approach for determining the optimal number of hidden layer neurons for fnn’s and its application in data mining. 5th International Conference on Information Technology and Applications. [130] Yingchareonthawornchai, S., Nguyen, D. N., Valapil, V. T., Kulkarni, S. S., and Demir- bas, M. (2016). Precision, recall, and sensitivity of monitoring partially synchronous dis- tributed systems. In International Conference on Runtime Verification, pages 420–435. Springer. [131] Zhang, J., Tu, H., Ren, Y., Wan, J., Zhou, L., Li, M., and Wang, J. (2018). An adaptive synchronous parallel strategy for distributed machine learning. IEEE Access, 6:19222–19230. [132] Zhang, T., Gebhard, P., and Sokolsky, O. (2016). Smedl: combining synchronous and In International Conference on Runtime Verification, pages asynchronous monitoring. 482–490. Springer. [133] Zheng, X., Julien, C., Podorozhny, R., Cassez, F., and Rakotoarivelo, T. (2016). Effi- cient and scalable runtime monitoring for cyber–physical system. IEEE Systems Journal, 12(2):1667–1678. [134] Zhou, Y., Zhang, Y., and Fang, Y. (2007). Access control in wireless sensor networks. Ad Hoc Networks, 5(1):3–13. 181