RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED CYBER-PHYSICAL SYSTEMS

By

Anik Momtaz

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

This dissertation addresses the problem of runtime verification of distributed cyber-physical systems (CPS) with respect to a given formal specification. Cyber-physical systems are computer systems with integrated software and physical (hardware) components that, in an ideal environment, seamlessly interact with the real world, as well as with each other. Since exhaustively validating the correctness of a distributed CPS is usually infeasible (if not impossible), many modern validation methods involve runtime verification of distributed CPS based on safety properties. Our work focuses on developing time- and resource-efficient verification techniques that can run in parallel with the execution of these systems to ensure reliability.

In this dissertation, we propose different methodologies to reason about the correctness of distributed CPS in real time, depending on the system settings and architecture. We also provide case studies relevant to each approach in order to demonstrate real-world applications. In all our proposed techniques, we assume a partially synchronous setting, where a clock synchronization algorithm guarantees a bound on clock drifts among all signals.

To this end, we first introduce two monitoring methods for distributed systems with discrete events, where a specification in linear temporal logic (LTL) [12] is evaluated on a system using (1) a deterministic finite automaton-based technique, and (2) a progression-based formula rewriting technique.

We then extend this work to detecting violations of predicates over distributed continuous-time and continuous-valued signals in CPS. We introduce a novel retiming technique that allows reasoning about the correctness of predicates among continuous-time signals that do not share a global view of time. In addition, we show that leveraging simple knowledge of physical dynamics allows for further reduction in run time.

Leveraging the previous two methods, we then introduce a monitoring technique for solving the problem of runtime verification for distributed CPS using signal temporal logic (STL) [36]. We employ a formula progression technique, together with a signal retiming method, that enables reasoning about the correctness of formulas among continuous-time and continuous-valued signals in CPS, even when only a partial signal is available.

We also extend our previous work on detecting violations of predicates over distributed signals in CPS from a centralized monitoring setting to a decentralized monitoring setting. We employ a technique that allows us to identify all possible violations, not just one, which in turn allows for the identification and elimination of bugs from distributed systems regardless of the actual clock drift.

Finally, we introduce the notion of monitoring reliability on a network of monitors in a decentralized monitoring setting. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device (or node) in a layered network of producers and consumers. Our model monitors the events in nodes where resource usage occurs, and captures the trade-offs between the reliability of the system and resource usage.
We present an efficient algorithm to determine the optimal selection of processing quality for each node in this producer-consumer network, such that the target system reliability is achieved while respecting the given resource bounds and minimizing resource usage. In addition, we present a lightweight machine learning-based solution to improve our model in terms of run time.

To you, Nuban Mama. I will forever endeavor to illuminate the void created by your absence with the light you have inspired.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere and heartfelt gratitude to my PhD advisor, Dr. Borzoo Bonakdarpour, for his unwavering guidance and invaluable mentorship throughout the journey to complete my degree. His deep expertise, relentless dedication, and insightful feedback have been instrumental in shaping the quality and depth of my research. It is impossible for me to overstate the pivotal role his support has played in my academic growth. I am truly fortunate to have had the privilege of working under his tutelage.

I would also like to express my gratitude to the remaining members of my PhD guidance committee, Dr. Betty Cheng, Dr. Bahare Kiumarsi, and Dr. Sandeep Kulkarni, for their continuous support and indispensable feedback on my research.

Throughout my academic endeavors, I have had the pleasure of closely working with Dr. Houssam Abbas, from Oregon State University. His immeasurable contributions to my work in runtime verification of signals have made it possible to produce multiple high-quality papers, including one on predicate detection for signals that received the Best Paper Award at the 21st International Conference on Runtime Verification.

I have also co-authored several papers with my exceptional colleague and dear friend, Ritam Ganguly. In every paper we co-authored, his contributions were instrumental and second to none. My sincere gratitude is also extended to the rest of my brilliant colleagues, Oyendrila Dobe, Tzu-Han Hsu, and Eshita Zaman, who generously devoted many hours from their busy schedules to review and provide valuable suggestions on numerous aspects of my research.

Words cannot fully express the depth of my gratitude to my family for giving me unconditional love and continuous encouragement from halfway across the globe. I want to specifically extend my heartfelt thanks to my parents, Asfia Sabina (Ammu) and Motazid Momtaz (Abba), my little sister, Monisha Momtaz (Monomono), my aunt, Simin Seury (Khamma), and my grandmother, Banu Tarafdar (Didda). Additionally, I am immensely grateful to my wonderful wife, Tiana Momtaz, for her love, support, and boundless sacrifices throughout this journey, and for always believing in me when, at times, even I could not.

I am also thankful to Sadika Amreen, Reazul Hoque, Balabhadra Khatiwada, Meena Khatiwada, and Emily Mui, for being extraordinary friends and possessing the remarkable ability to make good times better and not-so-good times bearable.

Last, but most certainly not least, I must express my gratitude to Michigan State University, specifically the Department of Computer Science and Engineering, for affording me the opportunity to pursue my dream of obtaining a PhD in a field I am passionate about. I am truly and unequivocally proud to call myself a Spartan.

TABLE OF CONTENTS

LIST OF TABLES  ix
LIST OF FIGURES  x
LIST OF ABBREVIATIONS  xiii
CHAPTER 1  INTRODUCTION  1
1.1 Motivating Examples  2
1.2 Challenges  4
1.3 Thesis Statement  6
1.4 Contributions  6
1.5 Organization  14
CHAPTER 2  PRELIMINARIES  16
2.1 Linear Temporal Logics (LTL)  16
2.2 Distributed Computation  18
2.3 Hybrid Logical Clocks  20
2.4 Signal Model  24
2.5 Signal Temporal Logic (STL)  28
2.6 Producer-Consumer Network  29
CHAPTER 3  RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED DISCRETE-EVENT SYSTEMS  32
3.1 Problem Statement  33
3.2 Formula Progression for LTL  33
3.3 SMT-based Solution  41
3.4 Optimization  46
3.5 Case Studies and Evaluation  51
3.6 Conclusion  62
CHAPTER 4  PREDICATE MONITORING IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS  63
4.1 Signal Transmission to the Monitor  63
4.2 Problem Statement  64
4.3 SMT-based Monitoring Algorithm  65
4.4 Exploiting the Knowledge of System Dynamics  74
4.5 Case Studies and Evaluation  75
4.6 Conclusion  85
CHAPTER 5  MONITORING SIGNAL TEMPORAL LOGIC IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS  86
5.1 Problem Statement  86
5.2 Monitoring Algorithm  87
5.3 SMT-based Monitoring Algorithm  91
5.4 Case Studies and Evaluation  102
5.5 Conclusion  107
CHAPTER 6  DECENTRALIZED PREDICATE DETECTION OVER PARTIALLY SYNCHRONOUS CONTINUOUS-TIME SIGNALS  109
6.1 Problem Statement  110
6.2 The Structure of Satisfying Cuts  112
6.3 The Abstractor Process  116
6.4 The Slicer Process for Detecting Predicates  118
6.5 Case Studies and Evaluation  125
CHAPTER 7  RESOURCE OPTIMIZATION OF STREAM PROCESSING IN LAYERED SENSOR NETWORKS  127
7.1 Producer-Consumer Network with Resource Constraints  127
7.2 Problem Statement  134
7.3 SMT-based Solution  135
7.4 Machine Learning-based Optimization  142
7.5 Case Studies and Evaluation  144
7.6 Conclusion  153
CHAPTER 8  RELATED WORK  154
8.1 Lattice-based Distributed Monitoring  154
8.2 Runtime Monitoring in CPS  154
8.3 Asynchronous Distributed Monitoring  156
8.4 Synchronous Distributed Monitoring  158
8.5 Partially Synchronous Distributed Monitoring  161
8.6 Decentralized Distributed Monitoring  162
8.7 Monitoring Reliability in CPS  163
CHAPTER 9  CONCLUSION  166
9.1 Summary  166
9.2 Ongoing Work  168
9.3 Future Work  169
BIBLIOGRAPHY  171

LIST OF TABLES

Table 4.1 Impact of clock skew in network of cars on verdicts using varying ε.  80
Table 4.2 Impact of clock skew in network of UAVs on verdicts using varying ε.  80
Table 4.3 Impact of clock skew in water tanks on verdicts using varying ε.  85
Table 5.1 Impact of ε.  106
Table 7.1 Nodes v[1,5] resource usage.  133
Table 7.2 Nodes v[6,10] resource usage.  133
Table 7.3 Nodes v[1,9] power consumption (in watts) for different quality levels.  146
Table 7.4 Quality level tables for different nodes.  149

LIST OF FIGURES

Figure 1.1 Hybrid dynamic cooling system with water tanks.  3
Figure 1.2 A distributed CPS composed of autonomous aerial vehicles with drifting clocks. The violation property to be monitored is, for any two aerial vehicles, that the distance along the x axis is within 1 and the distance along the y axis is within 1.7. Asynchronous signals produced by the vehicles must be monitored for predicate violations, while leveraging some knowledge of system dynamics.  5
Figure 1.3 Monitoring automaton for formula φ.  7
Figure 1.4 A distributed computation.  7
Figure 1.5 Progression and segmentation.  8
Figure 2.1 LTL3 monitor for φ = a U b.  17
Figure 2.2 HLC example.  21
Figure 2.3 Two partially synchronous continuous concurrent timelines with ε = 0.5, and corresponding signals x and y. (Solid dot indicates signal value at discontinuity.) C is a consistent cut but C′ is not.  27
Figure 2.4 A trace σ generated by a system.  29
Figure 2.5 A producer-consumer network of 10 nodes.  30
Figure 3.1 Progression example.  37
Figure 3.2 Removing non-loop cycles in an LTL3 Monitor.  41
Figure 3.3 Reachability Matrix for a U b.  50
Figure 3.4 Reachability Tree for a U b.  50
Figure 3.5 Synthetic experiments - impact of different parameters.  54
Figure 3.6 Impact of parallelization on different data.  57
Figure 3.7 Cassandra experiments.  59
Figure 4.1 Predicate violation between two signals x and y measured using partially synchronized clocks t and s.  67
Figure 4.2 Piece-wise interpolations.  72
Figure 4.3 Piece-wise linear signals vs. piece-wise quadratic signals.  72
Figure 4.4 Leveraging dynamics.  73
Figure 4.5 Impact of signal segmentation on run time with varying signal duration (S.D.) and fixed ε = 0.001s.  77
Figure 4.6 Best run time (network of cars) for different signal durations.  77
Figure 4.7 Impact of clock skew on run time. Signal duration = 2s.  79
Figure 4.8 Impact of agents on run time.  81
Figure 4.9 Impact of communication (between two agents) on run time.  82
Figure 4.10 Run time (network of cars) vs. segment count.  82
Figure 4.11 Impact of Algorithm 4.1 on monitoring run time. ε = 0.001s.  83
Figure 4.12 Effect of segment duration and the number of water tanks on run time when ε = 0.05s.  84
Figure 5.1 A valid ccf.  86
Figure 5.2 Conversion of STL syntax trees to their corresponding SMT syntax trees.  93
Figure 5.3 SMT syntax tree of STL formulas ¬φ1 and ¬φ2.  93
Figure 5.4 Examples of partitioned SMT syntax trees of STL formulas ¬φ1 and ¬φ2 at t = 5.  101
Figure 5.5 Effect of number of segments and agents on run time for different flight properties.  104
Figure 5.6 Effect of segment duration and the number of water tanks on run time for φP.  107
Figure 6.1 An example of a continuous-time distributed signal with 3 agents. Three timelines are shown, one per agent. The signals xn are also shown, and the local time intervals over which they are non-negative are solid black. The skew ε is 1. The happened-before relation is illustrated with solid arrows. Some satisfying cuts for the predicate φ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) are shown as dashed arcs, and the extremal cuts as solid arcs. All extremal cuts contain root events, and leftmost cut A also contains non-root events.  109
Figure 6.2 Two satcuts for a pair of agents A1 and A2, shown by the crossed solid lines (s, t′) and (s′, t). Their intersection is (s, t), shown by a dashed arc, and their union is (s′, t′), shown by a dotted arc. For a conjunctive predicate φ, the intersection and union are also satcuts, forming a lattice of satcuts.  111
Figure 6.3 A distributed signal of two agents (top) and the output of the abstractor (bottom). The abstractor marks zero-crossings as discrete root events and creates new events (dark circles) to maintain consistency.  117
Figure 6.4 Example of subsection 6.4.1. Bold intervals are where the local signals are non-negative. The happened-before relation is illustrated with solid arrows. The predicate is φ = (x1 ≥ 0) ∧ (x2 ≥ 0). Solid circles represent discrete events returned by the abstractor; hollow circles are those created by the slicers. The leftmost satcut of this example is [3.5 − ε, 3.5] and the rightmost is [6, 5.8].  123
Figure 6.5 Runtime vs. root rate and N on synthetic data.  125
Figure 6.6 Runtime vs. number of agents.  126
Figure 7.1 Synthetic experiment results.  145
Figure 7.2 A producer-consumer network of 8 nodes.  146
Figure 7.3 A Multi-Layer Network of Raspberry Pi Devices.  148
Figure 7.4 Case study results.  150

LIST OF ABBREVIATIONS

AATC  Automated Air Traffic Control
ANN   Artificial Neural Network
AP    Atomic Proposition
CLA   Cold Leg Accumulator
CPS   Cyber-Physical Systems
CPU   Central Processing Unit
ECCS  Emergency Core Cooling System
FAA   Federal Aviation Administration
HLC   Hybrid Logical Clock
IoT   Internet of Things
LTL   Linear Temporal Logic
MPC   Multi-Party Computation
MTL   Metric Temporal Logic
NTP   Network Time Protocol
PTP   Precision Time Protocol
PVC   Physical Vector Clock
RAM   Random Access Memory
RWST  Refueling Water Storage Tank
SMT   Satisfiability Modulo Theory
STL   Signal Temporal Logic
UAV   Unmanned Aerial Vehicle

CHAPTER 1
INTRODUCTION

Distributed monitoring is the process of analyzing the execution of distributed systems with a centralized or decentralized monitor in relation to a given formal specification. Distributed systems often consist of numerous subsystems that do not share a global clock or memory while attempting to complete a collaborative job. In a distributed database, for example, data is kept in several physical locations, usually spread across a network of interlinked computers. A monitor may want to guarantee that queries to the distributed database fulfill some form of consistency requirements.

A prominent class of distributed systems comprises systems containing both software and physical (hardware) components that interact with the real world as well as with each other. These systems are referred to as cyber-physical systems (CPS) [122]. Our reliance on CPS has grown rapidly over the past decade, as these systems are more and more frequently deployed over networks of agents due to the emergence of the Internet of Things (IoT) and edge applications [27]. Therefore, validating the accuracy of these systems, especially for the class of CPS that is safety-critical, is now of paramount importance. Software applications deployed among networked nodes, referred to as agents, form a critical class of CPS. Examples include autonomous car fleets, sensor networks in infrastructure, health-monitoring wearables, and medical device networks. Because CPS are often safety-sensitive, obtaining assurance regarding their accuracy is vital. CPS are distinguished by three defining characteristics:

• First, because the signals are analog, they include an infinite number of events, rendering traditional reasoning approaches designed for discrete systems ineffective, if not inapplicable in most circumstances. The applications we target, such as those mentioned above, require continuous-time behavior. It is not enough, for example, to assert that a voltage does not spike at sample times. As a result, increasing the signal sample rate does nothing to alleviate the necessity for analog signal reasoning.

• Second, each agent in these CPS has a local clock that drifts from the clocks of other agents. Hence, the concept of time, which is taken for granted in centralized systems, must be revisited, as it is unclear whether events are consecutive or concurrent.
Furthermore, it is unclear how continuous events in various processes respect the happened-before relation [73], and how one may reason about the sequence of occurrence of continuous events.

• Third, CPS signals obey physical laws and dynamics. An understanding of these dynamics may be used to reason about distributed signals and predict their behavior, as well as to improve the efficiency of reasoning.

The characteristics listed above define the concept of distributed signals, and reasoning about them necessitates the establishment of some notion of ordering. Building such an ordering for an infinite number of events from different signals while clock drifts occur at runtime is a difficult undertaking.

1.1 Motivating Examples

We demonstrate the crucial need for monitoring distributed CPS through a critical application in automated air traffic control (AATC). The market for unmanned aerial vehicles (UAVs) is expanding rapidly [61]. In the United States, the Federal Aviation Administration (FAA) envisions a federated framework in which UAVs that contribute to monitoring global air safety parameters are rewarded with faster free-flight pathways to their destinations [39, 43].

To support this federated structure, AATC tower software must monitor analog inputs such as UAV location and velocity to determine if they violate global instantaneous safety characteristics, also known as predicates. These predicates are Boolean expressions defined over the concurrent states of the several CPS agents, such as mutual separation, conditional speed limitations, and minimal energy storage. These predicates must be evaluated on the global state, which is the combined state of all UAVs at the same time. However, in the absence of a perfect shared clock across all UAVs, UAV1's clock may report t = 5 and UAV2's clock may report t = 5.2 at the same physical 'real' time. Equivalently, the same value on two clocks may represent distinct physical moments. If the central AATC monitor relies on these two states to determine if the predicate has been violated, then it may result in false negatives (i.e., missing violations) or false positives (i.e., declaring a violation when none exists).

The UAV example has two characteristics that are shared by many different distributed CPS. First, while perfect continuous-time synchrony is often impossible to achieve, clock synchronization algorithms such as the Network Time Protocol (NTP) [88] ensure that drift among local clocks remains within some bound. Second, the central monitor frequently recognizes certain restrictions on the UAV dynamics, such as velocity limits. In this case, the AATC tower would be aware of the UAVs' speed limitations. In developing our solution, we make use of these two characteristics.

Figure 1.1 Hybrid dynamic cooling system with water tanks.

As another example, consider the water distribution system shown in Figure 1.1, where several tanks deliver water to an offsite location via a common pipe. Water tank outflow rate and pressure are monitored locally using drifting local clocks. If the compounded pressure or flow rate on the pipe is a concern and has to be monitored, correctly measuring these values becomes difficult since the continuous signals indicating the pressure and/or flow rate of the tanks are not synchronized. If the flow rate and pressure must always remain below a given threshold, clock drift among the local clocks may cause values for which the threshold is breached to be missed.
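To make the role of the skew bound concrete, the following minimal Python sketch (purely illustrative; the function names, the fixed ε value, and the simple pairwise check are ours, not the machinery developed in later chapters) asks whether two locally timestamped samples could be concurrent under a skew bound ε, and whether some ε-consistent pairing of samples from two tanks could push an aggregate value over a threshold.

```python
# Illustrative sketch: under partial synchrony with skew bound EPS, two samples
# taken at local times t1 and t2 may correspond to the same physical instant
# whenever |t1 - t2| < EPS.  A central monitor must therefore consider every
# such pairing when it looks for a possible violation of a global predicate.

EPS = 0.3  # assumed skew bound (seconds); hypothetical value

def possibly_concurrent(t1: float, t2: float, eps: float = EPS) -> bool:
    """Local times within eps of each other cannot be ordered."""
    return abs(t1 - t2) < eps

def possible_violation(samples_a, samples_b, threshold: float, eps: float = EPS) -> bool:
    """samples_a, samples_b: lists of (local_time, value) pairs from two agents.
    True if some eps-consistent pairing makes the sum of the two values exceed
    the threshold (a toy aggregate predicate, like compounded pipe pressure)."""
    return any(
        possibly_concurrent(ta, tb, eps) and (va + vb > threshold)
        for ta, va in samples_a
        for tb, vb in samples_b
    )

if __name__ == "__main__":
    tank1 = [(5.0, 40.0), (5.4, 55.0)]
    tank2 = [(5.2, 58.0), (5.9, 30.0)]
    # (5.0, 40.0) and (5.2, 58.0) may be concurrent but sum to 98 < 100;
    # (5.4, 55.0) and (5.2, 58.0) may also be concurrent and sum to 113 > 100.
    print(possible_violation(tank1, tank2, threshold=100.0))  # True
```

A monitor that only compared samples with equal local timestamps would miss this violation, which is exactly the failure mode described above.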
1.2 Challenges

While there are approaches for monitoring temporal logic for distributed discrete-event systems (e.g., [49, 58, 96, 99]), we still lack a good understanding of distributed CPS. Although the literature on distributed computing is decades old, and many important problems have been solved in the context of discrete-event systems, the main challenge with distributed monitoring is that it is not always possible for the monitor to establish the right order of occurrence of events across different agents in the absence of a global clock. Given the non-deterministic nature of distributed programs, a runtime monitor is expected to provide multiple results for the same distributed computation. This leads to a combinatorial explosion of possibilities that the monitor must examine at runtime, making the task computationally costly.

Monitoring and detecting violations of formal specifications is a common and effective technique for reasoning about the health of CPS. Broadly speaking, the state of the art in runtime monitoring focuses on either (1) centralized monitoring for stand-alone applications or multi-agent systems that share a global clock while being blind to system dynamics [1, 5, 34, 35, 33, 83], or (2) decentralized monitoring in pure discrete time for ordering discrete events [10, 16, 26, 28, 31, 46, 47, 49, 55, 64, 96, 99, 58], which is appropriate for pure software, but not for CPS. As a result, solutions for monitoring CPS where analog signals are created by distributed agents that do not share a global clock are currently lacking (see the related work in Chapter 8). Lack of synchronization, in particular, poses substantial issues since the monitor must reason about signal levels at distinct agents' local times, which may result in conflicting monitoring verdicts. This problem is exacerbated by the fact that agents often communicate with one another, imposing extra limits on event ordering. Furthermore, in a distributed system, a central monitor that receives all signals is subject to a single point of failure. That is, if the monitor fails, predicate detection fails altogether.

Figure 1.2 A distributed CPS composed of autonomous aerial vehicles with drifting clocks. The violation property to be monitored is, for any two aerial vehicles, that the distance along the x axis is within 1 and the distance along the y axis is within 1.7. Asynchronous signals produced by the vehicles must be monitored for predicate violations, while leveraging some knowledge of system dynamics.

In decentralized monitoring, the concept of reliability of a network of monitors adds another layer to the list of challenges. To handle trade-offs, most systems use manual controls. Some network applications, for example, enable administrators to alter the quality of malicious activity detection systems based on predicted traffic [92, 134]. This strategy frequently focuses on a subset of resources and lacks the flexibility required by huge dynamic systems. Another strategy is to aggressively over-provision the processing infrastructure in terms of machine capabilities (e.g., CPU, memory, etc.), network bandwidth, and assigned power budget to ensure that no limitations are reached [15]. This is an extremely expensive approach that is frequently not feasible and is not future-proof.
The major problem in resource management and optimization is that monitors in a network often receive data, process it, and then transmit it to succeeding monitors. This results in a quality vs. cost trade-off across distinct monitors, where resources are determined not simply by pairs of consecutively interacting monitors, but by the interaction of all monitors in the network. In other words, lowering the processing quality of a monitor might have an impact on subsequent monitors in the network that receive lower quality data. This means that quality versus resource utilization must be optimized across the entire network, not simply for pairs of monitors communicating with each other. On top of that, it is easy to see that quality and resource utilization are frequently at odds; that is, greater quality and dependability require higher resource usage, making optimization more challenging.

1.3 Thesis Statement

Now that we have provided the challenges and motivation for this dissertation, we state the thesis as follows:

Thesis Statement. It is possible to develop trustworthy verification methodologies under both centralized and decentralized monitoring settings in order to reason about the correctness of safety-critical partially synchronous distributed cyber-physical systems in real time.

1.4 Contributions

In this dissertation, we take steps toward rigorous, automated reasoning about distributed CPS, the accuracy and integrity of which is critical to ensuring the safety of the environment in which they function. Based on the proposed verification approaches, our contributions are grouped into five primary segments. These techniques differ in terms of (1) system architecture (i.e., discrete events vs. continuous time), (2) monitor architecture (i.e., centralized vs. decentralized), and (3) specification language (i.e., LTL vs. STL).

1.4.1 Monitoring Discrete-Event Systems using LTL

First, we present two sound and complete solutions to the problem of distributed runtime verification (RV) with regard to LTL formulas. Both approaches employ a fault-proof central monitor, and to address the explosion of various interleavings, we propose a practical assumption, namely, a bounded skew ε between the local clocks of each pair of processes, which is guaranteed by a fault-proof clock synchronization mechanism (e.g., NTP [88]). This implies that time instants from multiple local clocks within ε are deemed concurrent, i.e., their order of occurrence cannot be determined. This is a partial synchrony setting that does not presume a global clock but restricts the impact of asynchrony to within the clock drifts.

Figure 1.3 Monitoring automaton for formula φ.

Figure 1.4 A distributed computation.

Our first approach is based on constructing the LTL3 [12] monitor automaton of an LTL formula and constructing multiple Satisfiability Modulo Theory (SMT) [6] queries to determine which states of the monitor automaton are reachable for a given distributed computation. For example, Figure 1.3 shows the monitor automaton for the formula φ mentioned earlier, and one has to construct 4 different SMT queries to determine the set of all possible reachable states at the end of the computation in Figure 1.4. We transform our monitoring decision problem into an SMT solving problem.
The SMT instance includes constraints that encode (1) our monitoring algorithm based on the 3-valued semantics of LTL, (2) the behavior of communicating processes and their local state changes in terms of a distributed computation, and (3) the happened-before relation subject to the ε clock skew assumption. Afterwards, it attempts to concretize an uninterpreted function whose evaluation provides the possible verdicts of the monitor with respect to the given computation. We divide a computation into multiple segments to make the verification problem tractable, significantly reducing the search space of each SMT query. Thus, the result of monitoring each segment (the possible LTL3 states) should be carried to the next segment. Furthermore, because distributed applications are now operated on large cloud services, we extend our method to a parallel monitoring algorithm to take advantage of the available computational resources and gain greater scalability.

The intuition behind our second monitoring technique is that, since running SMT queries to test whether each state of the LTL3 monitor automaton is reachable (as in the first approach) is excessive, it should be sufficient to test whether temporal sub-formulas of an LTL formula hold in a distributed computation. Similar to the first approach, we utilize segmentation to break down the problem size. In the second approach, to carry the result of monitoring from one segment to the next, we also develop a formula progression technique. Specifically, given a finite trace α and an LTL formula φ, we define a function Pr such that Pr(α, φ) characterizes the progression of φ over α. Progression is defined as the rewritten formula for future extensions of α, which yields true, false, or an LTL formula based on what has been seen thus far. We emphasize that a fundamental distinction between our approach and the standard rewriting technique [59] is that the function Pr accepts a finite trace as input, whereas the algorithm in [59] rewrites the input LTL formula in a state-by-state manner. This suggests that rewriting based on the fixed-point representation of temporal operators is not possible in our context. Our motivation stems from the fact that when a given distributed computation is divided into a number of segments, a state-by-state rewriting approach will generate too many SMT queries, rendering it unscalable.

Figure 1.5 Progression and segmentation.

For example, in Figure 1.5 (which is the computation in Figure 1.4 chopped into two segments), our progression-based approach needs the same 4 SMT queries for seg1 (2 for each of the temporal sub-formulas of φ) as compared to [49]. The evaluation yields the corresponding progressed formulas as the possible formulas for seg2, and as a result we only need to build 4 SMT queries, compared to 5 for the automata-based approach in [49].

We make a detailed comparison between the proposed approaches through not only a set of rigorous synthetic experiments, but also by monitoring the same set of consistency conditions in Cassandra. We also put our approach to the test using a real-time airspace monitoring dataset (RACE) from NASA [85]. Our experiments show that the progression-based approach has 35% reduced overhead as compared to the automata-based approach.
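To make the notion of progression concrete, the following is a minimal, generic sketch of state-by-state LTL rewriting in Python. It is only meant to convey how a residual obligation is handed from one segment to the next; it is not the trace-level function Pr of Chapter 3, and all class and function names are our own.

```python
# A minimal, generic sketch of LTL formula progression (state-by-state
# rewriting), illustrating how the obligation left over from one segment can be
# handed to the next.  This is NOT the trace-level function Pr of Chapter 3,
# which deliberately avoids per-state rewriting.
from dataclasses import dataclass

class F:
    """Base class for formulas."""

@dataclass(frozen=True)
class Atom(F):
    name: str

@dataclass(frozen=True)
class Not(F):
    sub: F

@dataclass(frozen=True)
class And(F):
    left: F
    right: F

@dataclass(frozen=True)
class Or(F):
    left: F
    right: F

@dataclass(frozen=True)
class Next(F):
    sub: F

@dataclass(frozen=True)
class Until(F):
    left: F
    right: F

TRUE, FALSE = Atom("true"), Atom("false")

def simplify(phi: F) -> F:
    if isinstance(phi, Not):
        s = simplify(phi.sub)
        return FALSE if s == TRUE else TRUE if s == FALSE else Not(s)
    if isinstance(phi, And):
        l, r = simplify(phi.left), simplify(phi.right)
        if FALSE in (l, r): return FALSE
        return r if l == TRUE else l if r == TRUE else And(l, r)
    if isinstance(phi, Or):
        l, r = simplify(phi.left), simplify(phi.right)
        if TRUE in (l, r): return TRUE
        return r if l == FALSE else l if r == FALSE else Or(l, r)
    return phi

def progress(phi: F, state: frozenset) -> F:
    """Rewrite phi after observing one state (a set of atomic propositions)."""
    if phi in (TRUE, FALSE):  return phi
    if isinstance(phi, Atom): return TRUE if phi.name in state else FALSE
    if isinstance(phi, Not):  return simplify(Not(progress(phi.sub, state)))
    if isinstance(phi, And):  return simplify(And(progress(phi.left, state),
                                                  progress(phi.right, state)))
    if isinstance(phi, Or):   return simplify(Or(progress(phi.left, state),
                                                 progress(phi.right, state)))
    if isinstance(phi, Next): return phi.sub
    if isinstance(phi, Until):  # phi1 U phi2 = phi2 or (phi1 and X(phi1 U phi2))
        return simplify(Or(progress(phi.right, state),
                           And(progress(phi.left, state), phi)))
    raise ValueError(phi)

def progress_segment(phi: F, segment) -> F:
    """Consume a finite segment (a list of states) and return the residual formula."""
    for state in segment:
        phi = progress(phi, state)
    return phi

if __name__ == "__main__":
    phi = Until(Atom("a"), Atom("b"))                 # a U b
    seg1 = [frozenset({"a"}), frozenset({"a"})]       # b has not been seen yet
    print(progress_segment(phi, seg1))                # obligation is still a U b
```

Running the example leaves the obligation a U b unchanged after a segment in which b has not yet occurred, which is exactly the kind of residual information that must be carried into the next segment.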
1.4.2 Monitoring Predicates on CPS

We provide a sound and complete solution to the problem of predicate monitoring for distributed systems when extended to CPS. Our system, which employs a central monitor to receive distributed signals, may be characterized as follows. We assume a clock synchronization mechanism guarantees a limited skew ε between all local clocks. That is, time instants from separate clocks within ε are regarded as concurrent, i.e., their sequence of occurrence cannot be determined below a resolution of ε. The limited skew assumption is used to supplement the classic happened-before relation [73]. We introduce a retiming technique that leverages the concept of retiming functions from stochastic processes to make the monitor align the locally timed agent signals. A retiming function aligns the supports of two signals while taking into account the order, the ε-skew, and arbitrary message exchanges between agents. Our monitoring decision problem is transformed into a Satisfiability Modulo Theory (SMT) problem that seeks a retiming function that observes a predicate violation. We show how to simplify the general SMT problem of searching for arbitrary retiming functions to the considerably simpler problem of looking for piece-wise linear retimings. Furthermore, knowledge about agent dynamics constraints may be used to decrease monitoring overhead. The following are our contributions:

1. An SMT-based algorithm for centralized monitoring of distributed analog signals for predicate violations, supplemented with a clock synchronization algorithm that ensures finite skew between all local clocks, employing the classic happened-before relation [73];

2. A signal retiming approach based on the concept of retiming functions as used in stochastic processes to address the challenges presented by time asynchrony;

3. A lightweight approach for adding system dynamics constraints in order to decrease monitoring overhead;

4. An analysis of the relationship between the monitoring overhead's sensitivity to the skew bound and the quantity of communication between agents; and

5. A method for parallelizing the monitoring algorithm in order to improve scalability.

We have fully implemented our methodologies and provide the results of experiments on monitoring a network of autonomous ground vehicles (in the real world), aerial vehicles (in simulation), and a water distribution system (in simulation). It should be noted that systems with a central monitor are inherently vulnerable to a single point of failure. Our work is concerned with establishing the suggested theory and does not take into consideration fault tolerance. The following are our observations. First, while our solution is based on SMT solving, it may be used for online monitoring if the monitor is run at an acceptable frequency (i.e., the monitoring overhead does not exceed the system's regular operating time). Second, adding knowledge of system dynamics is hugely beneficial in decreasing monitoring overhead. In some cases, the speedup (as compared to when the information is not used) can be an order of magnitude. Third, when practical clock synchronization protocols (e.g., NTP and PTP) are used, monitoring overhead is independent of clock skews.
Finally, we notice that communication between agents does not always reduce monitoring overhead in the continuous-time context; this contradicts the popular perception in the discrete-time setting, where communication event orderings are thought to make automated reasoning more efficient.

1.4.3 Monitoring CPS using STL

We expand our approach from monitoring just Boolean predicates across distributed signals to full signal temporal logic (STL) [36]. To this end, we start with a partially synchronous scenario, in which a clock synchronization mechanism ensures a maximum bound ε on clock drifts across all signals. This can be ensured by off-the-shelf algorithms such as NTP [88]. We use the signal retiming approach presented in [95] to align continuous-time signals that do not share a global sense of time. Assuming the bound ε, the decision problem is to find a retiming function that violates an input STL formula. If no such function exists, then the distributed signals have not yet violated the formula (they may or may not in the future).

To reduce a distributed signal to more manageable smaller problems, we break the original signal into smaller signals known as segments. The problem here is that the outcome of monitoring one segment should be carried over to the next. For example, consider the STL formula φ = □[0,5] p (which means proposition p should hold at all times in the time interval [0, 5]) and a current segment of signals that ends at time 3. If p holds in the interval [0, 3], then the formula has to be rewritten to φ′ = □[0,2] p for the second segment. Of course, such rewriting can become challenging when the formulas have multiple nested temporal operators with relative time intervals. To this end, we propose a formula progression technique that takes as inputs an STL formula φ and a finite-time distributed signal σ and returns an STL formula φ′ such that for any extension σ′, we have σσ′ |= φ if and only if σ′ |= φ′. We encode the resulting problem as an SMT problem that searches for a retiming function given the constraints of the current segment and the STL formula. We provide approaches for solving the SMT encoding efficiently. We should highlight that we are not concerned in this dissertation with problems such as monitoring fault tolerance (i.e., we assume a flawless centralized monitor with no noise or communication failures).

We have fully implemented our approach on two distributed CPS applications: monitoring of (1) a network of aerial vehicles for a set of properties such as mutual separation and formation, and (2) a water distribution system for the property in which the outflow pressure exceeds the threshold pressure. The results indicate that in some circumstances, a distributed CPS can be monitored fast enough for online deployment.
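The rewriting in the □[0,5] p example above can be sketched as follows for a single Boolean signal and a top-level bounded 'always' only; this restriction is purely for illustration, and the actual progression of Chapter 5 handles arbitrary nesting and distributed, partially synchronous signals.

```python
# Simplified sketch: progress an STL formula of the shape  G[a,b] p  over an
# observed segment [0, d] of a single Boolean signal.  Only this one pattern is
# handled here; all names are illustrative.

def progress_always(a, b, holds, d, step=0.01):
    """holds(t) -> bool is the truth of p at time t in the observed prefix [0, d].
    Returns 'violated', 'satisfied', or a pair (a2, b2) meaning the obligation
    G[a2, b2] p remains for the next segment (times relative to its start)."""
    t = max(a, 0.0)
    while t <= min(b, d):            # check the part of [a, b] we have observed
        if not holds(t):
            return "violated"
        t += step                    # a real monitor would use the piece-wise
                                     # signal representation, not fixed sampling
    if b <= d:
        return "satisfied"           # the whole interval [a, b] was observed
    return (max(a - d, 0.0), b - d)  # unobserved tail, shifted left by d

if __name__ == "__main__":
    p = lambda t: True               # p held throughout the observed segment
    print(progress_always(0.0, 5.0, p, d=3.0))   # -> (0.0, 2.0), i.e. G[0,2] p
```

For the segment ending at time 3, the sketch returns the shifted interval [0, 2], matching the rewritten obligation φ′ = □[0,2] p discussed above.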
1.4.4 Decentralized Monitoring of Predicates on CPS

In order to address the issue of a single point of failure in a distributed system, we also expand our approach of centralized predicate detection for distributed CPS with drifting clocks under partial synchrony to a decentralized monitoring approach. To this end, our contributions are as follows:

1. A fully decentralized monitoring approach, where each agent only has access to its own signal, and exchanges a limited amount of information with other agents;

2. A detection technique that identifies all violating predicates, not just one;

3. An online algorithm for a class of global properties that are conjunctions of local propositions, which can be executed in parallel with the tasks carried out by the agents;

4. A novel physical vector clock that orders continuous-time events in a distributed computation without a shared clock; and

5. A method to deploy our algorithm on existing infrastructure. Specifically, our algorithm includes a modified version of the classical detector described in [26] that can be deployed on top of existing infrastructure.

Our methodologies are fully implemented, and we provide the results of experiments on two synthetically generated signal datasets.

1.4.5 Monitoring Reliability in a Multi-Layered CPS

Finally, we introduce the notion of monitoring reliability on a network of monitors in a decentralized monitoring setting. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device or a node in a layered network of producers and consumers. Assuming a layered producer-consumer network with stream processing, each node in the network faces a trade-off between processing quality and resource utilization. An abstract model of stream processing applications is presented. The processing nodes, in particular, are modeled as a network of producers-consumers, which is a directed acyclic graph in which a node can be a producer, a consumer, or both, based on its incoming/outgoing edges. Each node in the network consumes data that flows through its incoming edges and produces data that flows through its outgoing edges. The processing of data consumed/produced by a node can be done at various quality levels. The quantity of resources utilized by the node is determined by the processing quality level. Power, energy, RAM, disk, or network bandwidth are all examples of resources. In addition to these resources, we represent reliability as a nonrenewable resource that flows across the network and is partially depleted based on the quality levels of the nodes through which it flows. Individual and collective resource limits and bounds apply to nodes. Lower quality leads to more error, which propagates across the network and has the potential to affect the quality of subsequent nodes as well as overall reliability. Our goal is to provide an efficient framework for modeling a system in such a way that resource bounds are respected and a designer-specified goal is optimized. This goal is supplemented with optimization objectives such as maximizing reliability and minimizing energy or other resource usage in the system.

To answer the above-mentioned multi-objective optimization problem, we reduce it to the satisfiability problem for satisfiability modulo theories (SMT). SMT-solving technology has advanced dramatically over the last two decades [77], and we use its improvements to solve our problem. To that end, we represent (1) the elements of the producers-consumers graph, as well as the concepts of data rates, quality, reliability, and resource consumption, as SMT entities (e.g., variables, functions, constants, and so on), (2) the resource constraints and bounds as a set of SMT constraints, (3) the pillars of our original optimization problem as additional SMT constraints that are checked and searched over using a binary search algorithm to find the optimal solution, and (4) a machine learning-based model that aims to further optimize the problem in terms of execution time at the cost of minimal loss in accuracy. The SMT aspects of our technique are implemented using the SMT solver Z3 [32], and the machine learning aspects are implemented using the machine learning toolkit Scikit-learn [101] and the Keras [65] artificial neural network interface.
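The overall search pattern (encode the constraints once, then binary-search the objective bound with repeated satisfiability checks) can be sketched with Z3's Python API as follows. The toy network, the per-level power and reliability numbers, and the additive reliability aggregation are placeholder assumptions for illustration, not the encoding of Chapter 7.

```python
# Hedged sketch of the "encode as SMT, then binary-search the objective" idea
# using Z3's Python API.  The tiny network below and its numbers are made up;
# only the search pattern is the point.
from z3 import And, If, Int, Solver, Sum, sat

POWER        = [[1, 3, 6], [2, 4, 7], [1, 2, 5]]   # per node, per quality level
RELIABILITY  = [[2, 5, 9], [1, 6, 8], [3, 4, 9]]
POWER_BUDGET = 12

def lookup(table, i, level):
    """table[i][level] for a symbolic level, expressed as a nested If."""
    expr = table[i][-1]
    for lvl in reversed(range(len(table[i]) - 1)):
        expr = If(level == lvl, table[i][lvl], expr)
    return expr

def feasible(min_reliability):
    s = Solver()
    q = [Int(f"q_{i}") for i in range(len(POWER))]       # chosen quality levels
    for i, qi in enumerate(q):
        s.add(And(qi >= 0, qi < len(POWER[i])))
    s.add(Sum([lookup(POWER, i, qi) for i, qi in enumerate(q)]) <= POWER_BUDGET)
    s.add(Sum([lookup(RELIABILITY, i, qi) for i, qi in enumerate(q)]) >= min_reliability)
    if s.check() == sat:
        m = s.model()
        return [m[qi].as_long() for qi in q]
    return None

def best_reliability():
    lo, hi, best = 0, sum(max(r) for r in RELIABILITY), None
    while lo <= hi:                    # binary search on the reliability target
        mid = (lo + hi) // 2
        levels = feasible(mid)
        if levels is not None:
            best, lo = (mid, levels), mid + 1
        else:
            hi = mid - 1
    return best

if __name__ == "__main__":
    print(best_reliability())   # (achievable target, quality level per node)
```

Each feasibility query is an ordinary SMT check, so the number of solver calls grows only logarithmically in the range of the objective bound.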
The SMT aspects of our technique is implemented using the SMT-solver Z3 [32] and the machine learning aspects of our technique is implemented using the machine learning toolkit Scikit-learn [101] and Keras [65] artificial neural network interface. Our model aims to optimize reliability and resource consumption trade-offs. We explore these trade-offs through detailed synthetic experiments. We also apply our techniques on a real-world case study, where we optimize a network of embedded streaming devices, so that the network (1) delivers the best possible performance using the available resources, or it (2) uses the minimal amount of a certain resource while meeting a given performance goal. 1.5 Organization This chapter (Chapter 1) provided an overview of the motivation, challenges and con- tributions of this dissertation. The remainder is organized as follows. Chapter 2 discusses the background for our work. Chapter 3 provides details on our runtime verification of distributed systems using automata-based and progression-based techniques. Chapter 4 ex- tends this work to CPS and Boolean predicate detection. Chapter 5 further extends this work from Boolean predicates to STL, whereas, Chapter 6 extends this from a centralized monitoring setting to a decentralized monitoring setting. Chapter 7 introduces the notion of 14 reliability and provides a resource optimization technique. Chapter 8 elaborates on related work, and finally Chapter 9 summarizes the findings, discusses ongoing work and suggests avenues for further research. 15 CHAPTER 2 PRELIMINARIES In this Chapter, we present the background concepts of our work. We start with the formal specification languages we use in our approaches, and then introduce other crucial back- ground components of our work. 2.1 Linear Temporal Logics (LTL) Let AP be a set of atomic propositions and Σ = 2AP be the set of all possible states. A trace is a sequence s0s1 . . ., where si Σ for every i ∈ ≥ 0. We denote by Σ∗ (resp., Σω) the set of all finite (resp., infinite) traces. For a finite trace α = s0s1 . . . sk, denotes its length, α | | k + 1. Also, for α = s0s1 . . . sk, by αi, we mean trace sisi+1 . . . sk of α. 2.1.1 Infinite-trace Semantics of LTL The syntax and semantics of the linear temporal logic (LTL) [104] are defined for infinite traces. The syntax is defined by the following grammar: φ ::= p φ | ¬ φ | ∨ φ | φ φ | U φ where p ∈ AP, and where and U are the ‘next’ and ‘until’ temporal operators respectively. Other propositional and temporal operators are considered as abbreviations, that is, true = p p, false = ∨ ¬ φ), and φ = ¬ true, φ ψ = ( ¬ ¬ φ (always φ). We denote the set of all LTL formulas by ΦLTL. φ = true ψ = ∨ ¬ → ¬ ∨ ∧ φ φ U ψ, φ ψ), φ (eventually ¬ ¬ The infinite-trace semantics of LTL is defined as follows. Let σ = s0s1s2 Σω, i 0, ≥ · · · ∈ and let = denote the satisfaction relation: | (σ, i) (σ, i) (σ, i) (σ, i) (σ, i) φ = p | = | ¬ = φ | ∨ = φ | = φ | U iff iff iff iff iff ψ ψ si p ∈ (σ, i) (σ, i) = φ ̸| = φ or (σ, i) | = ψ | (σ, i + 1) = φ | i : (σ, k) k ∃ ≥ = ψ and | j ∀ ∈ [i, k) : (σ, j) = φ | 16 a } { q0 {} q⊥ a, b } { , { b } q⊤ true true Figure 2.1 LTL3 monitor for φ = a b. U 2.1.2 Finite-trace Semantics of LTL In the context of RV, the 3-valued LTL (LTL3 for short) [12] evaluates LTL formulas for finite traces, but with an eye on possible future extensions, whereas the finite LTL, or FLTL [80] solely considers the present trace with no regard for the future. In LTL3, the set of truth values is B3 = , {⊤ , where , ? 
} ⊥ ⊤ (resp., ⊥ ) denotes that the formula is permanently satisfied (resp., violated), regardless of how far the current finite trace extends, and ‘ ?’ denotes an unknown verdict, i.e., there exists an extension that can violate the formula, and another extension that can satisfy the formula. Let α Σ∗ be a non-empty finite trace. ∈ The truth value of an LTL3 formula φ with respect to α, denoted by [α =3 φ], is defined as | follows: [α =3 φ] = |    ⊤ ⊥ ? if Σω : ασ σ ∀ ∈ if σ Σω : ασ ∀ ∈ otherwise. = φ | = φ ̸| Definition 1. The LTL3 monitor for a formula φ is the unique deterministic finite state machine M φ = (Σ, Q, q0, δ, λ), where Q is the set of states, q0 is the initial state, δ : Q B3 is a function such that λ(cid:0)δ(q0, α)(cid:1) = [α Q is the transition function, and λ : Q → Σ × → =3 φ], | for every finite trace α Σ∗. ■ ∈ As an example, Figure 2.1, shows the monitor automaton for formula φ = a U has the same syntax as LTL, and its semantics is based on the truth values B2 = b. FLTL , {⊤ , ⊥} 17 where ⊤ (resp., ) denotes that the formula is satisfied (resp., violated) given the current ⊥ finite trace. For atomic propositions and Boolean operators, the semantics of FLTL is iden- tical to those of LTL. Let φ, φ1, and φ2 be LTL formulas, α = s0s1 . . . sn be a non-empty finite trace, and =F denote the satisfaction relation in FLTL. The semantics of FLTL for | the temporal operators are as follows: [α =F | φ] = [α =F φ1 | U φ2] =    ⊤ ⊥    if [α1 =F φ] | if α1 = ε ⊥ otherwise. [0, n] : ([αk k ∃ ∈ =F φ2] = | ) ∧ ⊤ [0, k) : ([αl ∈ l ∀ otherwise. =F φ1] = | ) ⊤ Consider the formula φ = p, and a finite trace α = s0s1 sn to further illustrate the difference between LTL and FLTL and LTL3. If p ̸∈ that is, the formula is permanently violated and so is the case in FLTL where, [α · · · si for some i Now, consider formula φ = p. If p si for all i ̸∈ ∈ [0, n], then [α there exist infinite extensions to α that can satisfy or violate φ in the infinite semantics of LTL. But, this is not the case in FLTL where [α =F φ] = | ⊥ as it did not observe any p in the observed finite trace. 2.2 Distributed Computation We assume a loosely coupled asynchronous message passing system, consisting of n re- liable processes (that do not fail), denoted by A = A1, A2, . . . , An { } , without any shared memory or global clock. Channels are assumed to be First In, First Out (FIFO), and loss- less. In our model, each local state change is considered an event, and every message activity (send or receive) is also represented by a new event. Message transmission does not change the local state of processes and the content of a message is immaterial to our purposes. We will need to refer to some global clock that acts as a ‘real’ timekeeper. It is to be understood, 18 ∈ [0, n], then [α , =3 φ] = ⊥ | . =F φ] = ⊥ | =3 φ] =?. This is because | ̸ however, that this global clock is a theoretical object used in definitions, and is not available to the processes. We make a practical assumption, known as partial synchrony [40]. The local clock (or time) of a process Ai, where i ∈ [1, n], can be represented as an increasing function ci : R≥0 → R≥0, where ci(χ) is the value of the local clock at global time χ. Therefore, for any two processes Ai and Aj, we have: R≥0. χ ∀ ∈ ci(χ) | − cj(χ) | < ε with ε > 0 being the maximum clock skew. The value ε is assumed to be fixed and known by the monitor in the rest of this dissertation. 
2.2 Distributed Computation

We assume a loosely coupled asynchronous message passing system, consisting of n reliable processes (that do not fail), denoted by A = {A1, A2, ..., An}, without any shared memory or global clock. Channels are assumed to be First In, First Out (FIFO) and lossless. In our model, each local state change is considered an event, and every message activity (send or receive) is also represented by a new event. Message transmission does not change the local state of processes, and the content of a message is immaterial to our purposes. We will need to refer to some global clock that acts as a 'real' timekeeper. It is to be understood, however, that this global clock is a theoretical object used in definitions, and is not available to the processes.

We make a practical assumption, known as partial synchrony [40]. The local clock (or time) of a process Ai, where i ∈ [1, n], can be represented as an increasing function ci : R≥0 → R≥0, where ci(χ) is the value of the local clock at global time χ. Therefore, for any two processes Ai and Aj, we have:

∀χ ∈ R≥0. |ci(χ) − cj(χ)| < ε,

with ε > 0 being the maximum clock skew. The value ε is assumed to be fixed and known by the monitor in the rest of this dissertation. In the sequel, we make it explicit when we refer to 'local' or 'global' time. This assumption is met by using a clock synchronization algorithm, like NTP [88], to ensure bounded clock skew among all processes.

An event in process Ai is of the form e^i_{τ,σ}, where σ is a logical time (i.e., a natural number) and τ is the local time at global time χ, that is, τ = ci(χ). We assume that for every two events e^i_{τ,σ} and e^i_{τ′,σ′}, we have (τ < τ′) ⇔ (σ < σ′).

Definition 2. A distributed computation on N processes is a tuple (E, ⇝), where E is a set of events partially ordered by Lamport's happened-before (⇝) relation [73], subject to the partial synchrony assumption:

• In every process Ai, 1 ≤ i ≤ N, all events are totally ordered, that is, ∀τ, τ′ ∈ R+. ∀σ, σ′ ∈ Z≥0. (σ < σ′) → (e^i_{τ,σ} ⇝ e^i_{τ′,σ′}).
• If e is a message send event in a process, and f is the corresponding receive event by another process, then we have e ⇝ f.
• For any two processes Ai and Aj, and any two events e^i_{τ,σ}, e^j_{τ′,σ′} ∈ E, if τ + ε < τ′, then e^i_{τ,σ} ⇝ e^j_{τ′,σ′}, where ε is the maximum clock skew.
• If e ⇝ f and f ⇝ g, then e ⇝ g. ■

Definition 3. Given a distributed computation (E, ⇝), a subset of events C ⊆ E is said to form a consistent cut iff when C contains an event e, then it contains all events that happened-before e. Formally, ∀e, f ∈ E. (e ∈ C) ∧ (f ⇝ e) → (f ∈ C). ■

The frontier of a consistent cut C, denoted front(C), is the set of events that happen last in the cut. front(C) is a set of events e^i_last, one for each i ∈ [1, N] with e^i_last ∈ C. We denote by e^i_last the last event of process Pi in C, that is, ∀e^i_{τ,σ} ∈ C. (e^i_{τ,σ} ≠ e^i_last) → (e^i_{τ,σ} ⇝ e^i_last).

2.3 Hybrid Logical Clocks

A hybrid logical clock (HLC) [71] is a tuple (τ, σ, ω) for detecting one-way causality, where τ is the local time, σ ensures the order of send and receive events between two processes, and ω indicates causality between events. Thus, in the sequel, we denote an event by e^i_{τ,σ,ω}. More specifically, for a set of events E:

• τ is the local clock value of events, where for any process Ai and two events e^i_{τ,σ,ω}, e^i_{τ′,σ′,ω′} ∈ E, we have τ < τ′ iff e^i_{τ,σ,ω} ⇝ e^i_{τ′,σ′,ω′}.
• σ stipulates the logical time, where:
  – For any process Ai and any event e^i_{τ,σ,ω} ∈ E, τ never exceeds σ, and their difference is bounded by ε (i.e., σ − τ ≤ ε).
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, where event e^i_{τ,σ,ω} receives a message sent by event e^j_{τ′,σ′,ω′}, σ is updated to max{σ, σ′, τ}. The maximum of the three values is chosen to ensure that σ remains updated with the largest τ observed so far. Observe that σ has similar behavior to τ, except that the communication between processes has no impact on the value of τ for an event.
• ω : E → Z≥0 is a function that maps each event in E to the causality updates, where:
  – For any process Ai and a send or local event e^i_{τ,σ,ω} ∈ E, if τ < σ, then ω is incremented. Otherwise, ω is reset to 0.
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, where event e^i_{τ,σ,ω} receives a message sent by event e^j_{τ′,σ′,ω′}, ω(e^i_{τ,σ,ω}) is updated based on max{σ, σ′, τ}.
  – For any two processes Ai and Aj, and any two events e^i_{τ,σ,ω}, e^j_{τ′,σ′,ω′} ∈ E, (τ = τ′) ∧ (ω < ω′) → e^i_{τ,σ,ω} ⇝ e^j_{τ′,σ′,ω′}.

Figure 2.2 HLC example.

We presume that HLC is fault-proof in our implementation.
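The update rules above can be summarized operationally as in the following simplified Python sketch. It is an illustration only, not the implementation used in our experiments; the physical_clock callback, class layout, and method names are assumptions made for presentation, and the tuple (τ, σ, ω) mirrors the notation of this section.

# Illustrative HLC sketch: σ tracks the largest physical time observed, ω breaks ties.
class HybridLogicalClock:
    def __init__(self, physical_clock):
        self.now = physical_clock   # returns the local physical time τ
        self.sigma = 0.0            # logical time σ
        self.omega = 0              # causality counter ω

    def local_or_send(self):
        tau = self.now()
        if tau >= self.sigma:       # physical clock caught up: reset ω
            self.sigma, self.omega = tau, 0
        else:                       # τ < σ: increment ω
            self.omega += 1
        return (tau, self.sigma, self.omega)

    def receive(self, sender_timestamp):
        tau = self.now()
        _, sigma_m, omega_m = sender_timestamp
        new_sigma = max(self.sigma, sigma_m, tau)   # σ := max{σ, σ', τ}
        if new_sigma == self.sigma == sigma_m:
            self.omega = max(self.omega, omega_m) + 1
        elif new_sigma == self.sigma:
            self.omega += 1
        elif new_sigma == sigma_m:
            self.omega = omega_m + 1
        else:                       # σ taken from the local physical clock
            self.omega = 0
        self.sigma = new_sigma
        return (tau, self.sigma, self.omega)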
Figure 2.2 depicts an HLC with partially synchronous concurrent timelines of three processes with ε = 10. Note that the local times of all events in front(C1) are bounded by ε. As a result, C1 is a consistent cut, but C0 and C2 are not. 2.3.1 Physical Vector Clocks We first define Physical Vector Clocks (PVCs), which generalize vector clocks [81] from countable to uncountable sets of events. They are used by the abstractor process (next section) to track the happened-before relation. A PVC captures one agent’s knowledge, at appropriate local times, of events at other agents. Definition 4. Given a distributed signal (E, ⇝) on N agents, a Physical Vector Clock, or PVC, is a set of N -dimensional timestamp vectors vt n ∈ RN + , where vector vt n is defined by the following: 1. Initialization: v0 n[i] = 0, i ∀ 1, . . . , N ∈ { } 21 2. Timestamps store the local time of their agent: vt n[n] = t for all t > 0. 3. Timestamps keep a consistent view of time: Let V t n be the set of all timestamps vs m s.t. es m happened-before et n in E. Then: vt n[i] = max m∈V t n vs (vs m[i]), i ∀ ∈ [N ] n , t > 0 } \ { PVCs are partially ordered: vt n < vt′ m iff vt n ̸ = vt′ m and vt n[i] vt′ m[i] [N ]. ■ ≤ . The detection algorithm can now know the happened-before ∈ i ∀ We say vt n is assigned to et n relation by comparing PVCs. Lemma 1. Let n = m and t, t′ = 0. Then (et n ⇝ et′ m) iff (vt′ m[n] t). ≥ Proof. We split the bidirectional implication into its two directions: 1. (et n ⇝ et′ Since vt (vt′ m[n] m) = ≥ n[n] = t by Definition 4 2 and et ⇒ t) n ⇝ et′ m , then by Definition 4 3, vt′ m[n] t. ≥ 2. (et n ⇝ et′ m) = (vt′ m[n] ⇐ t) ≥ a) Case (vt′ m[n] = t) = ⇒ (et n ⇝ et′ m): Besides initialization, the only case in Definition 4 where a value is assigned which did not come from another timestamp is Definition 4 2. Consider an event et n . The timestamp of this event at index n is t, by Definition 4 2. At the point in time when this event is created (local time t on agent An), no other timestamp has the value t at index n. All other vt′ m which have the value t at index n must be assigned by Definition 4 3. This means that they have the relation et n ⇝ et′ m , due to the transitive property of the happened-before relation. b) Case (vt′ m[n] > t) = ⇒ Consider a t′′ where vt′ (et n ⇝ et′ m): m[n] = t′′ and t′′ > t. Then by the previous case, et′′ n ⇝ et′ m . Since by the happened-before relation all events on an agent are totally ordered (Definition 7 2), et n ⇝ et′′ n . By the transitive property of the happened-before relation (Definition 7 2), et n ⇝ et′ m . 22 ̸ ̸ ■ Theorem 1. Given a distributed signal (E, ⇝), let V be the corresponding set of PVC timestamps. Then (V, <) and (E, ⇝) are order isomorphic, i.e., there is a bijective mapping between V and E s.t. et n ⇝ et′ m iff vt n < vt′ m . Proof. Since each PVC timestamp corresponds to exactly one event and all events have a timestamp, there is clearly a bijective mapping. To show it preserves order, we need to confirm that (et n ⇝ et′ m) ⇐⇒ (vt n < vt′ m). 1. et n m = ⇝ et′ n < vt′ vt By Definition 4 3, each element of vt n ⇒ m must be less than or equal to the corresponding element of vt′ m . So then we need to show that vt n ̸ n[m] = t′ then et′ m vt′ m[m] = t′. By Theorem 1 if vt the happened-before order relation, then vt = vt′ m . Definition 4 2 indicates that ⇝ et n ; but there cannot be cycles in n[m] < t′. This implies that vt n < vt′ m . 2. 
(et n ⇝ et′ m) = (vt n < vt′ m) ⇐ m means that vt n < vt′ vt By Definition 4 2, vt vt′ m[i], ≤ n[n] = t, so vt′ n[i] i ∀ m[n] [N ]. Consider index n, where vt vt′ m[n]. t. Then Theorem 1 states that this implies n[n] ≤ ∈ ≥ et n ⇝ et′ m . ■ Definition 4 is not quite a constructive definition. We need a way to actually compute PVCs. This is enabled by the next theorem. Theorem 2. The assignment vt n =   [0, . . . , 0, t, 0, . . . , 0], t < ϵ  [t − ϵ, . . . , t ϵ, t, t − − ϵ, . . . , t ϵ], t ϵ ≥ − where the t is in the nth position in both cases, satisfies the conditions of PVC in Definition 4. 23 Proof. Consider Definition 7 2. This indicates that all events et−ϵ i happened-before et n , n . Therefore, if these events directly happened-before et n (there is no et′ m i ∀ ∈ where [N ] et−ϵ i \ { } ⇝ et′ m and et′ m ⇝ et n ), then this vector is a correct assignment. By looking at each point in Definition 7, we can see that the only case where one event happened-before another on a different process is when there is at least ϵ difference, Def- inition 2. While an event may have happened-before et n by indirectly following Defini- tion 2 by way of 2 and 2, we do not need to consider this event because there is not a direct happened-before relation with et n (no event in between). Therefore, the assignment ϵ, . . . , t [t − ϵ, t, t − − ϵ, . . . , t − ϵ] is suitable for timestamp vt n . ■ 2.4 Signal Model In this section, we introduce our signal model, i.e., our model of the output signal of an agent. To this end, first, we set some notations. The set of reals is R, the set of non-negative reals is R+, and the set of positive reals is R∗ + . The set of integers 1, . . . , N { } is abbreviated as [N ]. Global time values, kept track of by a hypothetical global clock are denoted by χ, χ′, etc., while the letters t, t′, t1, t2, s, s′, s1, s2, etc. denote corresponding local clock values particular to individual signals/agents, which are always clear from the context. Definition 5. An output signal (of some agent A) is a function x : [a, b] Rd, which is → right-continuous, left-limited, and is not Zeno. Here, [a, b] is an interval in R+, and will be referred to as the timeline of the signal. ■ Definition 6. A root is an event et n where xn(t) = 0 or a discontinuity at which the signal changes sign: sgn(xn(t)) = sgn(lims→t− xn(s)). A left root et n is a root preceded by negative values: there exists a positive real δ s.t. xn(t − α) < 0 for all 0 < α δ. A right root et n is a root followed by negative values: xn(t + α) < 0 for all 0 < α ≤ ≤ δ. ■ We assume that x is one-dimensional, i.e., d = 1. Therefore, Right-continuity implies that for each t in its support, lims→t+ x(s) = x(t). The function is Left-limitedness if it has . Not being Zeno means that x a finite left-limit at every t in its support: lims→t− x(s) < ∞ 24 ̸ has a finite number of discontinuities in any bounded interval in its support. This prevents the signal from jumping indefinitely many times in a finite length of time. A discontinuity ) can be caused by a discrete event within agent A (such as a variable updated in a signal x( · by software), or to a message transmitted to or received from another agent A′. We assume a loosely linked system with N reliable agents that never fail, denoted by A1, . . . , AN , without any shared memory or global clock. The output signal of agent An { } is denoted by xn, for 1 N . We refer to some global clock which acts as a ‘real’ time- n ≤ ≤ keeper. 
However, this global clock is a hypothetical object used in definitions and theorems, and is not available to the agents. We make two assumptions: • (A1) Partial synchrony. The local clock (or time) of an agent An can be represented as an increasing function cn : R+ → R+, where cn(χ) is the value of the local clock at global time χ. Then, for any two agents An and Am, where m, n [N ], we have: ∈ χ ∀ ∈ R+. cn(χ) | − cm(χ) < ε | where the maximum clock skew presumed fixed and known by the monitor is ε > 0. When we refer to ‘local’ or ‘global’ time in the sequel, we make it clear. • (A2) Deadlock-freedom. The agents being analyzed do not enter a deadlock state. Assumption (A1) is met by using a clock synchronization algorithm, like NTP [88], to ensure bounded clock skew across all agents. An event in the discrete-time setting is a change in value of an agent’s variables. We now update this definition for the continuous-time setting of this work. Specifically, in an agent An, an event is either a (i) a pair (t, xn(t)), where t is the local time (i.e., returned by function cn); (ii) a message transmission, or (iii) a message reception. The communications that the agents transmit to each other are free of assumptions. Messages that are sent to the monitor are timestamped by their respective local clocks. Since the agents evolve in continuous time and their output signals are defined for all local times t, a message transmission or reception always coincides with a signal value; i.e., if An receives a message at local time t, its signal 25 has value xn(t) at that time. Thus, without loss of generality, every event will be represented as a (local time, value) pair (t, xn(t)), often abbreviated as en t (n and t will be omitted when irrelevant). A distributed signal is modeled as a set of signals, where events in each signal are partially ordered by a variation of the happened-before (⇝) relation [73], extended by our assumption (A1) on bounded clock skew among all agents. The following defines a continuous-time/value distributed signal under partial synchrony. Definition 7. A distributed signal on N agents is a pair (E, ⇝), where E = (x1, . . . , xN ) is a vector of signals, the set In is a bounded nonempty interval, and the relation ⇝ is a relation between events in signals such that: 1. In every signal xn, all events are totally ordered, that is, for all n [N ], for any ∈ t, t′ ∈ In, if t < t′, then (t, xn(t)) ⇝ (t′, xn(t′)). That is, n ∀ ∈ [N ]. t, t′ ∀ ∈ (cid:16) t < t′(cid:17) In. ⇒ (cid:16) (cid:17) (t, xn(t)) ⇝ (t′, xn(t′)) , where the set In is a bounded nonempty interval. 2. If the time between any two events is more than the maximum clock skew ε, then the events are totally ordered, that is, for all m, n [N ], for any t, t′ In, if t + ε < t′, ∈ ∈ then (t, xn(t)) ⇝ (t′, xn(t′)). That is, m, n ∀ ∈ [N ]. t, t′ ∀ ∈ (cid:16) In. t + ε < t′(cid:17) ⇒ (cid:16) (cid:17) (t, xm(t)) ⇝ (t′, xn(t′)) . 3. If e is a message send event in an agent and f is the corresponding receive event by another agent, then we have e ⇝ f . 4. For any three events e, f , and g, if e ⇝ f and f ⇝ g, then e ⇝ g. ■ Setting ε = ∞ yields the classic instance of total asynchrony. The constraints on In (bounded and non-empty) are required in the continuous-time context and will be discussed more in the next section. Because the agents are synchronized within ε, it is not possible to 26 Figure 2.3 Two partially synchronous continuous concurrent timelines with ε = 0.5, and corresponding signals x and y. 
(Solid dot indicates signal value at discontinuity. C is a consistent cut but C′ is not.)

analyze all signals in global time simultaneously. The following definition of consistent cut captures plausible global states, that is, states that might be legitimate global states. Figure 2.3 shows two partially synchronous concurrent timelines generated by two agents. Every moment in each timeline corresponds to an event (t, xn(t)), n ∈ [2]. Thus, the following hold: (1, x1(1)) ⇝ (2.3, x1(2.3)), (2.3, x1(2.3)) ⇝ (2.94, x2(2.94)), (1, x2(1)) ⇝ (2.94, x2(2.94)), and (2.94, x2(2.94)) ⇝ (3, x1(3)).

Definition 8. Let (E, ⇝) be a distributed signal over N agents and S be the set of all events, defined as follows:

S = {(t, xn(t)) | xn ∈ E ∧ t ∈ In ∧ In ⊆ R+}.

A consistent cut C is a subset of S if and only if when C contains an event e, then it contains all events that happened before e. Formally, ∀e, f ∈ S. (e ∈ C) ∧ (f ⇝ e) ⇒ (f ∈ C). ■

From this definition and Definition 7 it follows that if (t′, xn(t′)) is in C, then C also contains every event (t, xm(t)) s.t. t + ε < t′. Note that due to time asynchrony, there exists an infinite number of consistent cuts, represented by C(χ), at any global time χ ∈ R+. This is due to the fact that there is an infinite number of time instances between any two local time instances t1 and t2 on some signal x. As a result, an infinite number of consistent cuts can be created.

A consistent cut C can be represented by its frontier front(C) = {(t1, x1(t1)), ..., (tN, xN(tN))}, in which each (tn, xn(tn)), where 1 ≤ n ≤ N, is the last event of agent An appearing in C. Formally:

∀n ∈ [N]. (tn, xn(tn)) ∈ C and tn = max{t ∈ In | ∃(t, xn(t)) ∈ C}.

Example. Assuming ε = 0.1 in Figure 2.3, it follows that all events below (thus, before) the solid arc form a consistent cut C with frontier front(C) = {(3, x1(3)), (2.94, x2(2.94))}. On the other hand, all events below the dashed arc do not form a consistent cut, since (2.3, x1(2.3)) ⇝ (3.1, x2(3.1)) and (3.1, x2(3.1)) is in the set C′, but (2.3, x1(2.3)) is not in C′.
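Restricted to cuts described by their frontiers, Definition 8 yields a simple membership test: in particular, the frontier times of any two agents may differ by at most ε, and every message received inside the cut must have its send event inside the cut as well. The small Python sketch below illustrates this test on the cuts C and C′ of Figure 2.3; the data layout, function name, and message list are illustrative assumptions, not part of our monitoring algorithm.

# Illustrative consistent-cut test for a cut given by its frontier (one local time per agent).
EPS = 0.1

def is_consistent(frontier, messages, eps=EPS):
    """frontier: dict agent -> last local time in the cut;
       messages: list of (sender, send_time, receiver, receive_time)."""
    times = list(frontier.values())
    # clock-skew rule: an event more than eps older than a frontier event
    # must already lie inside the cut on every agent
    if max(times) - min(times) > eps:
        return False
    # message rule: a receive event inside the cut requires its send event inside the cut
    for sender, t_send, receiver, t_recv in messages:
        if t_recv <= frontier[receiver] and t_send > frontier[sender]:
            return False
    return True

messages = [("A1", 2.3, "A2", 3.1)]
print(is_consistent({"A1": 3.0, "A2": 2.94}, messages))   # True  (cut C)
print(is_consistent({"A1": 2.2, "A2": 3.1}, messages))    # False (cut C': both rules violated)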
2.5 Signal Temporal Logic (STL)

Let AP be a set of atomic propositions. The syntax of signal temporal logic (STL) [79] is defined for infinite traces using the following grammar:

φ ::= p | ¬φ | φ ∧ φ | φ U_[a,b] φ

where p ∈ AP and U is the 'until' temporal operator. We view other propositional and temporal operators as abbreviations, that is, ⊤ = p ∨ ¬p (true), ⊥ = ¬⊤ (false), ◇_[a,b] φ = ⊤ U_[a,b] φ (eventually or F), and □_[a,b] φ = ¬◇_[a,b] ¬φ (always or G). We denote the set of all STL formulas by Φ_STL.

Let a trace σ = (x1, ..., xN) be a vector of N continuous-time and continuous-valued signals. In the context of STL, we express p as f(x1[t], ..., xn[t]) > 0, where (x1[t], ..., xn[t]) ∈ R^n is a vector of signal values at time t, and f : R^n → R is a function that evaluates a vector of signal values.

Figure 2.4 A trace σ generated by a system.

The infinite-trace semantics of STL is defined as follows. Let ⊨ be the satisfaction relation, and the satisfaction of formula φ by a trace σ at time t be:

(σ, t) ⊨ p             iff  f(x1[t], ..., xn[t]) > 0
(σ, t) ⊨ ¬φ            iff  ¬((σ, t) ⊨ φ)
(σ, t) ⊨ φ ∧ ψ         iff  (σ, t) ⊨ φ and (σ, t) ⊨ ψ
(σ, t) ⊨ φ U_[a,b] ψ   iff  ∃t′ ∈ [t + a, t + b] : (σ, t′) ⊨ ψ and ∀t′′ ∈ [t, t′] : (σ, t′′) ⊨ φ

For the sake of simplification, from this point onward, we write σ ⊨ φ if and only if (σ, 0) ⊨ φ holds. As an example of STL, given the trace σ shown in Figure 2.4, the STL formula φ = p U_[4,6.5] q holds at time 0, that is, σ ⊨ φ. However, φ does not hold after time 2, as in that case, q must hold after time 2 + 4 and before 2 + 6.5, which does not happen.

The STL semantics are defined over infinite signals, whereas a distributed signal E has a fixed duration (In is bounded), which is suited for online monitoring. Given a (completely synchronous) finite-duration signal x, we say it satisfies/violates φ iff every extension x.y, where y is an infinite signal, satisfies/violates φ. Otherwise, Unknown is returned by the monitor. The dot '.' here represents time concatenation.

2.6 Producer-Consumer Network

A producer-consumer network is a directed acyclic graph (DAG) G = (V, E), in which each vertex v ∈ V (a node) may be either a producer, a consumer, or both, based on its incoming/outgoing edges. A producer node only has outgoing edges, a consumer node only has incoming edges, and a producer/consumer node has both incoming and outgoing edges. Let Pred(v) denote the finite set of predecessor nodes from which v receives data, and Succ(v) denote the finite set of successor nodes which receive data from v. The set E of edges is represented as ordered pairs of vertices such that:

E = {(u, v) | v ∈ Succ(u)}.

An edge from u to v represents a stream of items flowing from u to v, in which case u is a producer (potentially also a consumer) and v is a consumer. A node v ∈ V where Pred(v) = ∅ is called a source, and a node u ∈ V where Succ(u) = ∅ is called a sink.

Figure 2.5 A producer-consumer network of 10 nodes.

Figure 2.5 depicts a producer-consumer network. The network represents a hierarchical monitoring system, in which v[1,4] are producers of events that are consumed and manipulated by nodes v[5,8]. Nodes v[5,8] then transmit the manipulated events to v9.

A producer (respectively, consumer) node v ∈ V may receive (respectively, emit) data at a set of possible input rates denoted by IRate(v) (respectively, possible output rates ORate(v)). Let Out(u, v) denote the outgoing data rate from node u into node v. For example, in Figure 2.5, the incoming data for v1 is received from vs, and the outgoing data is sent to v5 and v6. For every node v ∈ V, we define In(v) such that

In(v) = Σ_{u ∈ Pred(v)} Out(u, v).
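For illustration, the incoming rate of a node can be computed directly from the outgoing rates of its predecessors, as in the following small Python sketch. The edge rates and node names below are illustrative (a reduced version of the topology of Figure 2.5), not data from our case studies.

# Illustrative producer-consumer network: Out[(u, v)] is the data rate on edge u -> v.
Out = {("vs", "v1"): 100, ("vs", "v2"): 80, ("v1", "v5"): 60,
       ("v2", "v5"): 40, ("v5", "v9"): 90}

def predecessors(v):
    return {u for (u, w) in Out if w == v}

def in_rate(v):
    # In(v) = sum of Out(u, v) over u in Pred(v)
    return sum(rate for (u, w), rate in Out.items() if w == v)

print(predecessors("v5"), in_rate("v5"))   # {'v1', 'v2'} 100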
CHAPTER 3
RUNTIME VERIFICATION OF PARTIALLY SYNCHRONOUS DISTRIBUTED DISCRETE-EVENT SYSTEMS

In this chapter, we present two sound and complete solutions to the distributed runtime verification (RV) problem with respect to LTL formulas. In order to address the explosion of different interleavings, we adopt a practical assumption, namely, a finite skew between the local clocks of each pair of processes, which is ensured by a fault-proof clock synchronization system, such as NTP [88]. Both approaches utilize a fault-proof central monitor. To this end, we consider discrete-event systems [20], where the discrete states in the said systems are transitioned via events. The events can be message send events, message receive events, or local processing events. As stated in Chapter 1, the agents in these systems do not share a global clock or memory while attempting to perform a joint task. However, a clock synchronization algorithm (see Section 2.3) guarantees a maximum clock skew among the agents, thus allowing partial synchrony. In other words, we make the following assumptions:

• The systems under observation are discrete-event systems. That is, for every agent, within any time period, there is a finite number of event executions. These events can be internal to agents (e.g., variable updates), message send events, or message receive events.
• There is a bounded skew ε between the local clocks of every pair of processes, guaranteed by a fault-proof clock synchronization algorithm (e.g., NTP). This means that time instants from different local clocks within ε are considered concurrent, i.e., it is not possible to determine their order of occurrence. This setting constitutes partial synchrony, which does not assume a global clock but limits the impact of asynchrony within clock drifts.

In the following sections, we elaborate on our runtime verification approach for partially synchronous distributed systems using an automata-based technique and a progression-based formula rewriting technique.

3.1 Problem Statement

Given a distributed computation (E, ⇝), as defined in Definition 2, and an LTL formula φ, we say (E, ⇝) satisfies φ iff there exists a trace α, defined by a sequence of frontiers in (E, ⇝), that satisfies φ. Formally, the evaluation of the LTL formula φ with respect to (E, ⇝) in the finite semantics is the following:

Problem Statement (Monitoring of Distributed Systems). Given a distributed computation (E, ⇝), a valid sequence of consistent cuts is of the form C0 C1 C2 ···, where for all i ≥ 0, we have (1) Ci ⊂ Ci+1, and (2) |Ci| + 1 = |Ci+1|. Let C denote the set of all valid sequences of consistent cuts. We define the set of all traces of (E, ⇝) as follows:

{front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}.

The evaluation of the LTL formula φ with respect to (E, ⇝) in the finite semantics is the following:

[(E, ⇝) ⊨3 φ] = {[α ⊨3 φ] | α ∈ {front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}}

and

[(E, ⇝) ⊨F φ] = {[α ⊨F φ] | α ∈ {front(C0) front(C1) ··· | C0 C1 C2 ··· ∈ C}}.

This means that evaluating a distributed computation against a formula yields a set of verdicts, because a computation may contain multiple traces. It should be noted that throughout this chapter, (E, ⇝) is used to denote a distributed computation.
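To make the problem statement concrete, the following brute-force Python sketch — purely illustrative, and not the SMT-based procedure developed in this chapter — enumerates every valid sequence of consistent cuts of a tiny two-process computation, builds the corresponding frontier traces, and collects the set of verdicts of a simple ordering property. The event data, helper names, and the property checked are all hypothetical.

# Two concurrent events (ε = 1.5 makes them unordered), one per process.
from itertools import permutations

events = {
    "e1": (1, 1.0, {"a"}),   # (process, local_time, propositions)
    "e2": (2, 1.2, {"b"}),
}
EPS = 1.5

def happened_before(x, y):
    px, tx, _ = events[x]
    py, ty, _ = events[y]
    return tx < ty if px == py else tx + EPS < ty   # rules of Definition 2

def valid_sequences():
    """Yield event orderings whose every prefix is a consistent cut."""
    for order in permutations(events):
        if all(not happened_before(order[j], order[i])
               for i in range(len(order)) for j in range(i + 1, len(order))):
            yield order

def frontier_trace(order):
    """Map each prefix (consistent cut) to the propositions on its frontier."""
    trace, latest = [], {}
    for e in order:
        proc, _, labels = events[e]
        latest[proc] = labels
        trace.append(set().union(*latest.values()))
    return trace

def a_no_later_than_b(trace):
    """Illustrative property: a is observed no later than b."""
    for state in trace:
        if "b" in state and "a" not in state:
            return False
        if "a" in state:
            return True
    return False

print({a_no_later_than_b(frontier_trace(o)) for o in valid_sequences()})
# -> {True, False}: the two interleavings of the concurrent events disagree.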
3.2 Formula Progression for LTL

Because of the existence of a total ordering of events in a synchronous system, verification of a computation may be accomplished in a state-by-state manner [10]. However, in a partially synchronous system, such an event ordering is not possible. A distributed computation (E, ⇝) may have different event orderings governed by different event interleavings. As a result, multiple verdicts might be obtained from the same distributed computation (E, ⇝). To explore these verdicts, we present a formula progression-based monitoring approach that, if possible, partially evaluates a formula on the current computation and, depending on the verdict, provides a rewritten formula to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored to be φ = ◇a → ◇b. Now, if in some trace in a computation the monitor observes a, then for the extensions of the computation it is enough to monitor the rewritten formula φ′ = ◇b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formula progression.

Definition 9. A progression function Pr : Σ* × Φ_LTL → Φ_LTL is one where for all finite traces α ∈ Σ*, infinite traces σ ∈ Σ^ω, and formulas φ ∈ Φ_LTL, we have: ασ ⊨ φ if and only if σ ⊨ Pr(α, φ). ■

Our method and the traditional rewriting method [59] differ primarily in that our function Pr accepts finite traces as input, whereas the algorithm in [59] rewrites the input LTL formula in a state-by-state manner. As a result, it is not feasible to rewrite using the fixed-point representation of temporal operators. The motivation for our method is that a given distributed computation is divided into a number of segments, so that an SMT query is used to verify each segment. A state-by-state approach would generate excessive amounts of SMT queries, rendering the approach inefficient and unscalable.

Remark 1. It is straightforward to see that for any α ∈ Σ* and φ ∈ Φ_LTL, if the progression function returns a non-trivial formula, which we denote by Pr(α, φ) = φ′ for some φ′ ∈ Φ_LTL, then the verdict of monitoring is unknown.

Atomic propositions. Let φ = p for some p ∈ AP. The verdict is provided depending upon whether or not p ∈ α(0). This is the only case where the output of Pr cannot be a rewritten formula; the possible verdicts are either true or false:

Pr(α, φ) =
  true   if p ∈ α(0)
  false  if p ∉ α(0)

Negation. Let φ = ¬ϕ. We have Pr(α, φ) = ¬Pr(α, ϕ).

Disjunction. Let φ = φ1 ∨ φ2. If either sub-formula φ1 or φ2 evaluates to false, then the progression of φ becomes the other sub-formula φ2 or φ1, respectively, since that will be the only sub-formula responsible for the verdict of all future computations:

Pr(α, φ) =
  true       if Pr(α, φ1) = true ∨ Pr(α, φ2) = true
  false      if Pr(α, φ1) = false ∧ Pr(α, φ2) = false
  φ′2        if Pr(α, φ1) = false ∧ Pr(α, φ2) = φ′2
  φ′1        if Pr(α, φ2) = false ∧ Pr(α, φ1) = φ′1
  φ′1 ∨ φ′2  if Pr(α, φ1) = φ′1 ∧ Pr(α, φ2) = φ′2

Next operator. Let φ = ○ϕ. The verdicts true, false, and ϕ′ can only be reached if α^1 ≠ ε. Otherwise, i.e., if we are at the last event in the trace, the progression of φ becomes ϕ, implying that ϕ must hold at the beginning of the future extension:

Pr(α, φ) =
  true   if α^1 ≠ ε ∧ Pr(α^1, ϕ) = true
  false  if α^1 ≠ ε ∧ Pr(α^1, ϕ) = false
  ϕ′     if α^1 ≠ ε ∧ Pr(α^1, ϕ) = ϕ′
  ϕ      if α^1 = ε

Always and eventually operators. Progression of the temporal operator 'always', □ (resp. 'eventually', ◇), may yield false (resp. true) or remain unchanged:

Pr(α, □ϕ) =
  false  if [α ⊨F □ϕ] = ⊥
  □ϕ     otherwise

Pr(α, ◇ϕ) =
  true   if [α ⊨F ◇ϕ] = ⊤
  ◇ϕ     otherwise

Note that the semantics of FLTL is not frequently used, due to LTL3 being generally more expressive, as shown in [11]. However, LTL3 cannot be used to construct the progression rules. To be more precise, the '?' (unknown) verdict in LTL3 semantics would raise additional and unnecessary complications in the progression rules, as this verdict does not provide any additional information as far as our progression-based approach is concerned. In fact, if progression results in a formula, it represents the '?' verdict in LTL3. Therefore, we use FLTL for specifying the progression rules without any loss of generality, as shown later in the proof of Lemma 2.
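The rules above translate directly into a recursive rewriting procedure, sketched below in Python for the cases introduced so far; the until operator, whose rule is given next, is handled by an analogous case analysis. This is an illustrative sketch only: it assumes the tuple encoding of formulas and the fltl_eval helper sketched in Section 2.1.2, and the sentinel values TRUE and FALSE stand for the verdicts true and false.

# Illustrative progression function Pr(α, φ) over a finite trace (list of sets of propositions).
TRUE, FALSE = ("true",), ("false",)

def progress(trace, phi):
    """Return TRUE, FALSE, or a rewritten formula to monitor on the extension."""
    kind = phi[0]
    if kind == "ap":
        return TRUE if phi[1] in trace[0] else FALSE
    if kind == "not":
        sub = progress(trace, phi[1])
        if sub == TRUE:  return FALSE
        if sub == FALSE: return TRUE
        return ("not", sub)
    if kind == "or":
        left, right = progress(trace, phi[1]), progress(trace, phi[2])
        if TRUE in (left, right): return TRUE
        if left == FALSE:         return right     # covers the both-false case as well
        if right == FALSE:        return left
        return ("or", left, right)
    if kind == "next":
        rest = trace[1:]
        # at the last event, defer the obligation to the next segment
        return phi[1] if not rest else progress(rest, phi[1])
    if kind == "always":
        return FALSE if not fltl_eval(trace, phi) else phi
    if kind == "eventually":
        return TRUE if fltl_eval(trace, phi) else phi
    raise ValueError(f"unsupported operator {kind}")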
Until operator. Let φ = φ1 φ2. Recall that φ1 φ2 = φ2 (φ1 ∨ ∧ (φ1 U U φ2)). We U divide the U formula into two parts, one with globally ( φ1) and the other eventuality ( φ2). These sub-formulas are evaluated independently, and the verdicts of each are used to establish the progression for the U operator. However, for the case when both φ1 and φ2 occur in the same computation, we cannot reach a verdict without taking the order of occurrence of these sub-formulas into account. That is, on a given finite trace α, if φ2 holds in α(i) (denoted iφ2) and φ1 holds throughout in all states from α(0) to α(i i−1φ1), then the progression of φ becomes true. If this is not the case, and φ1 does not 1) (denoted − hold in α, the progression of φ becomes false, since this signifies a break from the streak of φ1 required for φ to hold. The progression of φ remains unchanged if φ1 holds throughout α, but φ2 does not hold anywhere: 36 α α′ α′′ ∅ ∅ ∅ ∅ r ∅ ∅ q p Figure 3.1 Progression example. Pr(α, φ) =    true false Pr(α, φ1) Pr(α, φ1) U Pr(α, φ2) if i ∃ ∧ if [α ∧ if [α ∧ if [α ∈ [α [0, α | | − 1].[α =F | iPr(α, φ2)] = ⊤ =F | i−1Pr(α, φ1)] = ⊤ Pr(α, φ1)] = =F | not the first case ⊥ Pr(α, φ2)] = =F | not the second case ⊤ =F | Pr(α, φ1)] = ⊤ [α =F | ∧ Pr(α, φ2)] = ⊥ Example. Consider the formula φ = r p ( ¬ U → q), which can be broken into sub- formulas φs = { r, q, q, p } , according to our progression rules. Consider the trace in Figure 3.1 divided into three segments. In the first segment α, neither p, q nor r are present, and as far as the laws of the progression function defined above, φ remains unchanged for the next segment; i.e., Pr(α, φ) = φ. In the second segment α′, proposition r is observed, this satisfies sub-formula r the progressed formula becomes p ¬ U q; i.e., Pr(α′, φ) = q. In p ¬ U the next segment α′′, proposition q occurs before p. This falls under the first case of the until progression operator. Since q happens after a streak of p, we arrive at the verdict true; ¬ i.e., Pr(α′′, p ¬ U q) = true. Put it another way, Pr(αα′α′′, φ) = true. Lemma 2. Given an LTL formula φ, and a finite and infinite trace α tively, trace ασ satisfies φ if and only if σ satisfies Pr(α, φ). Formally, Σ∗, σ ∈ ∈ Σω respec- Proof. We distinguish the following cases: [ασ =F φ] | ⇐⇒ [σ =F Pr(α, φ)] | 37 Case 1: First, we consider the base case of this proof, where the formula is an atomic proposition, that is, φ = p. ) Let us first consider that p is observed on the first state of ασ. This implies, [ασ ( ⇒ =F φ] yields true, and Pr(α, φ) yields | true. . Therefore, [σ ⊤ =F Pr(α, φ)] must also yield | Now, let us consider that p is not observed on the first state of ασ. This implies, [ασ φ] yields false, and Pr(α, φ) yields . Therefore, [σ ) Let us first consider that [σ =F | =F Pr(α, φ)] must also yield false. | ⊥ =F Pr(α, φ)] yields true. This implies, Pr(α, φ) yields | =F φ] yields true. Therefore, p must have been observed on the first state of | ( ⇐ , and [ασ ⊤ ασ. Now, let us consider that [σ , ⊥ =F φ] yields false. Therefore, p must not have been observed on the first state of | =F Pr(α, φ)] yields false. This implies, Pr(α, φ) yields | and [ασ ασ. Case 2: Assume that the proof has been established for the case when the formula is φ = ϕ. Now, we consider the case where the formula is φ = ϕ. ¬ We can say [ασ =F | ¬ ϕ] is equivalent to [ασ =F ϕ] according to the finite-trace se- | mantics of LTL. 
We can also say [σ =F | Pr(α, ϕ) is defined as a progression rule. Furthermore, [σ ϕ)] is equivalent to [σ ¬ ¬ =F Pr(α, | Pr(α, ϕ)] since Pr(α, ϕ)] is ¬ ¬ =F | Pr(α, ϕ) = ¬ equivalent to ¬ [σ ¬ =F Pr(α, ϕ)] according to the finite-trace semantics of LTL. | Based on our assumption, the proof has already been established for [ασ ⇐⇒ =F Pr(α, ϕ)], and by extension, | =F ϕ] | [σ =F Pr(α, ϕ)]. Therefore, | [ασ ϕ] [σ =F | ¬ ⇐⇒ =F Pr(α, | =F ϕ] | [σ ⇐⇒ ¬ [ασ ¬ ϕ)] ¬ Case 3: Assume that the proof has been established for the case when the formula is φ = ϕ. Now, we consider the case where the formula is φ = ϕ. Let us first consider the case where the length of the trace α is 1, that is, α1 = 0. In this particular case, [ασ | | Pr(α, ϕ) = ϕ; which implies, [σ =F | ϕ] is equivalent to [σ | =F ϕ]. Furthermore, | =F Pr(α, ϕ)] is equivalent to [σ | =F ϕ]. Therefore, | α = 1 and | 38 [ασ ϕ] =F | Now, let us consider the case where the length of the trace α is longer than 1, that =F Pr(α, ϕ)]. | ⇐⇒ [σ is, [σ α 1 and α1 1. In this case, [ασ |≥ | | =F Pr(α, ϕ)] is equivalent to [σ | |≥ =F Pr(α1, ϕ)]. | =F | ϕ] is equivalent to [α1σ =F ϕ], and | Based on our assumption, the proof has already been established for [α1σ =F ϕ] | ⇐⇒ [σ =F Pr(α1, ϕ)]. Therefore, [ασ | =F | ϕ] ⇐⇒ Case 4: Assume that the proof has been established for the cases when the formulas are [σ =F Pr(α, ϕ)]. | φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∨ Based on our assumption, the proof has already been established for [ασ φ2. ⇐⇒ =F Pr(α, ϕ2)]. Therefore, we can derive the | =F ϕ1] | [σ =F Pr(α, ϕ1)] and [ασ | =F ϕ2] | ⇐⇒ [σ following: [ασ =F (φ1 | ∨ φ2)] ⇐⇒ ⇐⇒ ⇐⇒ [ασ [σ [σ [ασ ∨ =F φ1] | =F Pr(α, φ1)] | =F Pr(φ1 | ∨ ∨ φ2)]. =F φ2] | [σ =F Pr(α, φ2)] | Case 5: Now, we consider the case where the formula is φ = φ1 φ2. We prove this by U induction: Base Case: α = 0. | | [ασ =F φ] | ⇐⇒ ⇐⇒ [σ [σ =F Pr(α, φ)] | =F φ] | 39 Hypothesis Step: α = k. | | [ασ =F φ1 | [ασ ⇐⇒ φ2] (cid:16) U =F | (cid:0)φ1 φ2 ∨ ∧ (φ1 (cid:16) φ1 U ∧ φ2)(cid:1)(cid:17) ] (φ1 U (cid:17) φ2) ] ⇐⇒ [ασ =F φ2] | ∨ ⇐⇒ [ασ =F φ2] | ∨ [ασ =F | [ασ =F φ1] | ∧ [α1σ =F φ1 | (cid:17) ] φ2 U ⇐⇒ [ασ =F φ2] | ∨ [ασ =F φ1] | ∧ [α1σ (cid:16) φ2 =F | =F φ2](cid:1) | ∨ ∨ (cid:0)φ1 . . . ∨ (cid:17) ∧ (cid:16) φ2)(cid:1)(cid:17) (cid:19) ] (φ1 U [ασ =F φ1] | ∧ ⇐⇒ [α1σ (cid:16) [ασ [ασ =F φ2] | ∨ =F φ1] | ∧ . . . ∧ (cid:0)[ασ =F φ1] | [αk−2σ [α1σ ∧ =F φ1] | [(αk−1σ ∧ =F φ2] | =F φ1] | ∧ . . . [αk−1σ =F φ1] | [αkσ =F φ1 | U φ2] ∨ (cid:17) [ασ =F φ2] | [αk−1σ ⇐⇒ . . . ∧ =F φ1] | [σ =F φ1 | U ∧ φ2] ∧ [α1σ (cid:17) [ασ =F φ1] | ∧ ∨ (cid:17) =F φ2] | . . . ∨ ∨ (cid:16) [ασ =F φ1] | ∧ (cid:16) (cid:18) ∧ (cid:16) Inductive Step: α = k + 1 Trivially expanded from the above expansion. | | [ασ =F φ1 | U φ2] ⇐⇒ (cid:16) [ασ [ασ =F φ2] | ∨ =F φ1] | ∧ . . . ∧ [ασ =F φ1] | [αk−1σ ∧ =F φ1] | ∧ [α1σ =F φ2] | [αkσ =F φ1] | [σ =F φ1 | U ∧ (cid:17) φ2] (cid:17) . . . ∨ ∨ (cid:16) Now, in order for [ασ φ2] to yield true, there must be a k . . . φ1 ∧ ∧ αk−1σ =F φ1 | =F φ1 | ∧ U =F φ2], that is, αkσ | 1 such that [ασ =F | ≥ [ασ =F φ1 | U φ2] ⇐⇒ [ k ∃ αkσ =F φ1 | ∧ . . . ∧ αk−1σ =F φ1 | ∧ 1 . α0σ ≥ =F φ2] | ⇐⇒ [ k ∃ ≥ 1 . ασ =F | kφ2 ασ =F | ∧ k−1φ1] Note that the above recursive definition of Until allows us to evaluate any until for- mula, and by extension, any always ( φ = φ ) and eventually ( φ = U ⊥ φ) formula. ⊤ U Therefore, we can evaluate any sub-formula using this fixed point representation of until. 
■ 40 a1 q2 a3 q3 q0 q1 a2 a4 a5 q4 qr (a) a1a2a3 a3 q3 a3a1a2 a1 q2 a2a3a1 q0 q1 a2 a4 a5 q4 qr (b) a1a2a3 q3 a3a1a2 a1 q2 a2a3a1 q0 q1 a2 a4 a5 q4 qr (c) Figure 3.2 Removing non-loop cycles in an LTL3 Monitor. 3.3 SMT-based Solution In this section, we go into further detail about our approach to distributed monitoring utilizing the two previously discussed monitoring techniques: (1) automata-based approach, and (2) progression-based approach. 3.3.1 Overall Idea Automata-based approach. Recall from Figure 1.5 that monitoring a distributed com- putation may result in multiple verdicts depending upon different ordering of events. In other words, given a distributed computation ( E , ⇝) and an LTL formula φ, different ordering of events may reach different states in the monitor automaton φ = (Σ, Q, q0, δ, λ) (as defined M in Definition 1). In order to ensure that all possible verdicts are explored, we generate an SMT instance for (1) the distributed computation ( E , ⇝), and (2) each possible path in the LTL3 monitor. Thus, the corresponding decision problem is the following: given ( E , ⇝) and a monitor path q0q1 qm in an LTL3 monitor, can ( E · · · , ⇝) reach qm? If the SMT instance is satisfiable, then λ(qm) is a possible verdict. For example, for the monitor in Figure 2.1, we consider two paths q∗ 0q⊥ and q∗ 0q⊤ (and, hence, two SMT instances). Thus, if both instances turn out to be unsatisfiable, then the resulting monitor state is q0, where λ(q0) =?. We note that LTL3 monitors may contain non-self-loop cycles. In order to simplify the SMT instance creation process (for each possible path in the LTL3 monitor), we collapse each 41 M φ = (Σ, Q, q0, δ, λ) φ = (Σ, Q, q0, δ′, λ) ′ Data: Result: Let CP be the set of all possible paths containing cycles δ′ δ foreach q M ← Q do ∈ foreach q sm δ′(q, sm end CP do sn q −→ q ← ∈ −→ · · · sn) · · · qi sk −→ qj | q sm −→ · · · qi sk −→ qj · · · sn −→ q CP } ∈ do ∈ { end foreach qm s qn −→ if m > n then δ′(qm, s) ← ∅ end end return φ M Algorithm 3.1 Non-Self Loop Cycle Removal Algorithm non-self-loop cycle into one state with a self-loop labeled by the sequence of events in the cycle using Algorithm 3.1. As an example, in Figure 3.2, Algorithm 3.1 first takes an LTL3 monitor (Figure 3.2a) and adds the necessary self-loops (Figure 3.2b). Then it eliminates all non-self-loop cycles by removing transitions from states with higher identifiers to states with lower identifiers in cycles (Figure 3.2c). The non-deterministic nature of the final automata ensure that all the transitions and the accepting language of the automata are preserved. Lemma 3. Let M φ = (Σ, Q, q0, δ, λ) be the monitor automaton for LTL formula, φ, and ′ φ = (Σ, Q, q0, δ′, λ) be the monitor automaton with no non-self loop cycles, obtained from an and a initial state, φ. Given a finite trace, α = a1a2 M applying Algorithm 3.1 on M · · · Q, we prove that λ(δ(q, α)) = λ(δ′(q, α)). q ∈ Proof. We distinguish the following cases: Case 1: First we show, λ(δ(q, α)) λ(δ′(q, α)) Let α = a1a2 an, where · · · → i ∀ λ(δ′(q, α)), that is, Q . λ(δ(q, α)) = ⇒ Σ. Algorithm 3.1 removes non-self loop ∈ ∀ α, q ∀ [1, n].ai ∈ ∈ cycles by removing a transition such that the corresponding transition of δ(q, ai), δ′(q, ai), where i ∈ [1, m] does not exist. This is such that k ∃ ∈ [1, i] . q′ ai−k −−→ · · · q ai −→ q′. This 42 transition is same as δ′(q′, ai−k · · · ai) = q′ which was one of the added self-loops. The rest of the transitions are maintained such that δ(q, ai) = δ′(q, ai), where q Q and i [1, m]. 
∈ ∈ Case 2: Now, we show, λ(δ′(q, α)) λ(δ(q, α)) Let α = a1a1 · · · by i ∃ q ai −→ [1, n], k [1, n ∈ q′ ai+1 −−→ · · · ∃ ∈ ai+k −−→ − q in → i ∀ an, where ∈ i] . δ′(q, aiai+1 λ(δ(q, α)), that is, α, ∀ q ∀ ∈ [1, n].ai ∈ Σ. A self-loop in ′ φ M Q . λ(δ′(q, α)) = ⇒ can be represented ai+k) = q. In another words, there exists a path · · · φ. The rest of the non-self loop transitions are the same, such M that δ′(q, ai) = δ(q, ai), where q Q and i [1, m]. ∈ ∈ ■ Progression-based approach. Due to the existence of a total ordering of events in a synchronous system, verification on a computation may be carried out using a state-by-state methodology [10]. A partially synchronous system, however, makes such an ordering of events impossible. Varying interleavings of events can lead to different orderings of events in a distributed computation ( E , ⇝). Therefore, it is possible to obtain multiple verdicts on the same distributed computation ( E , ⇝). To explore these verdicts, we provide a formula progression monitoring approach that, if feasible, partially evaluates a formula on the current computation and, in response to the verdict, offers a rewritten formula that is to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored as, φ = (a → b). Now, if in some trace in a computation, the monitor observes a, then for the extensions of computations, it is enough to monitor the rewritten formula, φ′ = b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formula Progression, which we discuss in length later on. In the next two subsections, we present the SMT entities and constraints with respect to one monitor path and a distributed computation. 3.3.2 SMT Entities SMT entities represent the sub-formulas of an LTL formula and a distributed computa- tion. After the verdicts from all the sub-formulas are generated, we construct our rewritten 43 formula by attaching the said verdicts to their corresponding parent formulas in the parse tree and then performing an in-order traversal starting from the root of the parse tree. At the end of the traversal, the resulting formula is, in fact, the progression for the next computation. We now introduce the entities that represent a path in an LTL3 monitor φ = (Σ, Q, q0, δ, λ) for LTL formula φ and distributed computation ( E M noted that the SMT entities in this subsection are used in both the automata-based and the , ⇝). It should be progression-based approaches. Monitor automaton. Let q0 s0 −→ q1 s1 −→ · · · (qj sj −→ qj)∗ sm−1 −−−→ · · · qm be a path of monitor φ, which may or may not include a self-loop. We include a non-negative integer variable M ki for each transition qi sj −→ self-loop qj si −→ qi+1, where i [0, m − ∈ 1] and si ∈ Σ. This is also true for the qj, for which we include a non-negative interger kj. Distributed computation. In our SMT encoding, the set of events, are represented by a E bit vector, where each bit corresponds to an individual event in the distributed computation, ( E an , ⇝). We conduct a pre-processing of the distributed computation, during which we create matrix, hbSet to incorporate the additional happen-before relations obtained by E × E the clock-synchronization algorithm. Afterwards, we populate the hbSet with 0’s and 1’s, such that hbSet[i][j] = 1 if [i] ⇝ [j], and hbSet[i][j] = 0 otherwise. 
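For illustration, this pre-processing step can be sketched as follows; the event layout, message representation, and function name are assumptions made for presentation only. The matrix is seeded with program order, the message edges, and the ε-rule of Definition 2, and is then closed transitively.

# Illustrative construction of the hbSet happened-before matrix.
def build_hbset(events, messages, eps):
    """events: list of (process, local_time); messages: list of (send_idx, recv_idx)."""
    n = len(events)
    hb = [[0] * n for _ in range(n)]
    for i, (pi, ti) in enumerate(events):
        for j, (pj, tj) in enumerate(events):
            if i == j:
                continue
            if pi == pj and ti < tj:      # program order within a process
                hb[i][j] = 1
            elif ti + eps < tj:           # partial-synchrony rule
                hb[i][j] = 1
    for s, r in messages:                 # a send happened-before its receive
        hb[s][r] = 1
    for k in range(n):                    # transitive closure (Floyd-Warshall style)
        for i in range(n):
            for j in range(n):
                if hb[i][k] and hb[k][j]:
                    hb[i][j] = 1
    return hb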
We introduce a function µ : AP E × → { E E true, false } in order to establish a relation between each event and the atomic propositions in it. In the event that other variables or constants are used in defining the predicates (e.g. x1 + x2 an uninterpreted function ρ : Z≥0 ≥ → 2), µ is constructed accordingly. Finally, we introduce 2E that identifies a sequence of consistent cuts from to for reaching a verdict, while satisfying a number of given constraints explained in {} {E} Subsection 3.3.3. 3.3.3 SMT Constraints We next go on to the SMT constraints after defining the requisite SMT entities. The SMT constraints for consistent cuts that are enforced on both the automata-based and the 44 progression-based approaches are first defined. Afterwards we define the SMT constraints that are more dependant on the methodology. Consistent cut constraints over ρ. In order to ensure that the uninterpreted function ρ identifies a sequence of consistent cuts, we enforce certain consistent cut constraints. The first constraint enforces that each element in the range of ρ is in fact a consistent cut: i ∀ ∈ [0, m]. e, e′ ∀ ∈ E (cid:16) . (e′ ⇝ e) (e ∧ ∈ (cid:17) ρ(i)) (cid:16) e′ ∈ → (cid:17) ρ(i) Next, we enforce that the sequence of consistent cuts identified by ρ start from an empty set of events, and each successor cut of the sequence contains one more new event than its predecessor. | Finally, we ensure that each successive consistent cut is immediately reachable in ( ∈ | i ∀ [0, m]. ρ(i + 1) = ρ(i) | | + 1 , ⇝) by E enforcing a subset relation: i ∀ ∈ [0, m]. ρ(i) ρ(i + 1) ⊆ We determine if a series of consistent cuts conforms to the specification after it has been created. This is done using (1) progression-based approach, where the LTL formula is rep- resented by a SMT constrain and (2) LTL3 automata-based approach, where a path on the automata is represented as an SMT constraint. This is repeated for all sub-formulas of the original LTL formula and all paths in the LTL3 automata respectively as discussed below. Let C represent for the conjunction of the aforementioned constraints. Recall that there is only one valid path that is relevant to this conjunction C. Since there may be multiple paths in the monitor, we replicate the above constraints for each such path. Suppose there are n such paths and let C1, C2, . . . , Cn be the corresponding SMT constraints for these n paths. We include the following constraint: This means that if the SMT instance above satisfiable, then a valid path exists. C1 C2 ∨ ∨ C3 ∨ · · · ∨ Cn 45 Constraints for LTL progression over ρ. Given a distributed computation ( , ⇝ E ), the aforementioned constraints may provide a valid series of consistent cuts that may result in multiple verdicts depending on how the concurrent events are ordered. Therefore, while evaluating an LTL formula on ( E , ⇝), all potential outcomes are investigated in order to prevent false positives. To achieve this, we examine the sequence of consistent cuts C0C1C2 · · · Cm interpreted by the uninterpreted function ρ(m), looking for both satisfaction and violation. Note that applying our progression rules to monitor any LTL formula will cause it to eventually monitor sub-formulas that only include atomic propositions, globally, and eventually temporal operators: φ = p φ = ϕ φ = ϕ front(ρi) AP (satisfaction, i.e., = p, for p | [0, m]. front(ρi) ∈ ∈ [0, m]. 
front(ρi) i ∃ i ∃ ∈ = ϕ (violation, i.e., ̸| = ϕ (satisfaction, i.e., | ) ⊤ ) ⊥ ) ⊤ Situations to the contrary will lead to a rewritten formula that will go on to the following segment. In general, the verdict for any LTL formula will be derived using our progression rules in Section 3.2. 3.4 Optimization We employ several optimization techniques in our implementation to speed up and im- prove the monitoring process. In this section, we discuss two crucial optimization techniques, as well as their impact on run time. 3.4.1 Segmentation of Distributed Computation RV is known to be an NP-complete problem in the number of processes in a distributed setting [53]. The complexity exhibits even more exponential blowup during verifying for- mulas with nested temporal operators. In order to cope with this complexity, we divide our computation into smaller segments, (seg1, ⇝)(seg2, ⇝) albeit more SMT problems. Given a distributed computation ( (segl/g, ⇝) to create smaller, , ⇝) of length l, we divide · · · E it into l g smaller segments length g. The set of events in segment j, where j [1, l g ], is the ∈ 46 following: segj = (cid:110) en τ,σ,ω | σ 0, (j [max { ∈ 1) g , j ε } − × − × g] ∧ n ∈ (cid:111) [1, N ] Note that each segment (barring seg0 the previous segments ending point. This creates an overlap of ε time units between each ) has to be constructed starting at ε time units before pair of adjacent segments. Doing so ensures that no pair of possible concurrent become non-concurrent due to the splits caused by segmentation. Therefore, dividing the actual computation into segments does not have any effect on the final verdict of the said computa- tion. We also use parallelization to make our algorithm perform faster, while utilizing most of the computation power modern processors are capable of handling. Lemma 4. A distributed computation, ( , ⇝), of length l satisfies an LTL formula, φ, if and only if the distributed computation, ( E E , ⇝), is divided into l g segments of length g satisfies φ using the automata-based approach. That is, Given a distributed computation ( , ⇝) of E length l divided into l g segments of length g, the evaluation of the LTL formula φ on, by the automata-based approach is equal, i.e., [(seg1.seg2. .seg l g , ⇝) · · · ⇐⇒ = [(seg1.seg2. .seg l g , ⇝) =3 φ] | =3 φ], that is, | α { =3 φ | | [( , ⇝) =3 φ] | , ⇝) =3 φ] | E Proof. Let us assume [( E =3 φ | α = , ⇝) Tr( E ∈ ) Let Ck be a consistent cut such that Ck is in Tr( E Tr(seg1.seg2. α { · · · } ̸ α | , ⇝) } ∈ ( ⇒ ) for some k · · · .seg l g [0, ]. This implies that the frontier of Ck, front(Ck) seg1 and front(Ck) ̸⊆ . However, this is not possible, as according to the seg- ̸⊆ , ⇝), but not in Tr(seg1.seg2. .seg l g , ⇝ · · · ∈ |E| and front(Ck) and · · · seg2 mentation construction, there must be a segj Therefore, such Ck cannot exist, and seg l g ̸⊆ α =3 φ | . By extension, [( E } , ⇝) { α ∈ | =3 φ] | ⇒ Tr(seg1.seg2. .seg l g , ⇝) · · · ( ⇐ , ⇝) for some k ) Let Ck be a consistent cut such that Ck is in Tr(seg1.seg2. [0, ]. This implies, front(Ck) E [1, l ⊆ g ]. However, this is not possible due to the fact that |E| ∈ Tr( j ∈ where 1 j ≤ such that front(Ck) segj . ⊆ l g ≤ Tr( E , ⇝) } ⊆ { [(seg1.seg2. · · · .seg l g · · · α =3 φ | .seg l g | , ⇝) α ∈ =3 φ] | , ⇝), but not in segj and front(Ck) for some ̸⊆ E j ∀ ∈ [1, l g ] . segj ⊆ E . Therefore, 47 ̸ such Ck cannot exist, and | . By extension, [(seg1.seg2. { α =3 φ | α Tr(seg1.seg2. , ⇝) ∈ .seg l g · · · =3 φ] | .seg l g , ⇝) , ⇝) [( E ⇒ α α =3 φ | } ⊆ { ∈ =3 φ]. 
Therefore, | | · · · Tr( [( E E , ⇝) , ⇝) } =3 φ] | [(seg1.seg2. .seg l g , ⇝) =3 φ]. ■ | · · · ⇐⇒ Lemma 5. A distributed computation ( E , ⇝) of length l satisfies an LTL formula φ if and only if the distributed computation, ( E , ⇝), is divided into l g segments of length g satisfies φ using the progression-based approach. That is, , ⇝) [( E =F φ] | ⇐⇒ [(seg1.seg2. .seg l g , ⇝) =F φ] | · · · 3.4.2 Parallelized Monitoring Clusters of computers with several processing cores and processors are used by many cloud services. They can now create high-performance parallel/distributed applications and handle huge data rates as a result. Utilizing the extensive infrastructure should also be possible for monitoring such applications. In light of this, we will now talk about parallelizing our SMT-based monitoring technique. Let G be a sequence of g segments G = seg1seg2 · · · segg . For each computer core that is available, a task queue will be established.The segments will then be distributed evenly among all of the queues so that each core may independently monitor its queue. However, merely dividing up all the segments across cores will not guarantee a reliable outcome. For example, consider formula φ = a U b and two segments, seg1 and seg2 across two cores, Cr1 and Cr2, respectively. The monitor operating on Cr2 must be aware of the outcome of the monitor operating on Cr1 in order to render the proper verdict. In a scenario, where Cr1 observes one or more a in seg1 ¬ , a violation must be reported even if Cr2 does not observe b and no a. Generally speaking, the temporal order of events makes independent evaluation ¬ of segments impossible for LTL formulas. Of course, some formulas such as safety (e.g., p) and co-safety (e.g., q) properties are exceptions. For our automata-based approach, we address this problem in two steps. Let φ = M (Σ, Q, q0, δ, λ) be an LTL3 monitor. Our first step is to create a 3-dimensional reachability 48 matrix RM by solving the following SMT decision problem: given a current monitor state qj ∈ and j, k Q and segment segi , can this segment reach monitor state qk Q, for all i [1, g], ∈ ∈ [0, Q | | − ∈ 1]. If the answer to the problem is affirmative, then we mark RM [i][j][k] with true, otherwise with false. This is illustrated in Figure 3.3 for the monitor shown in Figure 2.1, where the grey cells are filled arbitrarily with the answer to the SMT prob- lem. This step can be made embarrassingly parallel, where each element of RM can be computed independently by a different computing core. One can optimize the construc- tion of RM by omitting redundant SMT executions. For example, if RM [i][j][ ⊤ ] = true, then RM [i′][ ][ ] = true for all i′ ⊤ ⊤ ] = true for all i′ RM [i′][ ][ ⊥ ⊥ ∈ [i, Q | 1]. [i, Q | | − ∈ 1]. Likewise, if RM [i][j][ ⊥ ] = true, then | − The second step is to generate a verdict reachability tree from RM . The goal of the tree is to check if a monitor state qm ∈ Q can be reached from the initial monitor state q0. This is achieved by setting q0 as the root and generating all possible paths from q0 using RM . That is, if RM [i][k][j] = true, then we create a tree node with label qj and add it as a child of the node with the label qk. Once the tree is generated, if qm is one of the leaves, only then we can say qm is reachable from q0. In general, all leaves of the tree are possible monitoring verdicts. Note that creation of the tree is achieved using a sequential algorithm. 
For example, Figure 3.4 shows the verdict reachability tree generated from the matrix in Figure 3.3.

For our progression-based approach, we adhere to a similar technique for parallelized monitoring as in our automata-based approach. The key difference is that the progression-based approach uses subformulas, whereas the automata-based approach uses the different monitor states. As an example, the previous formula φ = a U b will be broken into two subformulas, φ1 = □a and φ2 = ◇b, before creating the reachability matrix and then generating the verdicts for both of these subformulas.

Figure 3.3 Reachability Matrix for a U b.

Figure 3.4 Reachability Tree for a U b.
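The set of leaves of the verdict reachability tree — and hence the set of possible verdicts — can be computed level by level from RM, as in the following Python sketch. The state names, labels, and the RM entries shown here are illustrative, not the ones produced in our experiments.

# Illustrative computation of the possible verdicts from a reachability matrix RM.
def possible_verdicts(RM, q0, labels):
    """RM: list over segments of dict state -> dict state -> bool."""
    frontier = {q0}
    for seg in RM:
        frontier = {qk for qj in frontier
                    for qk, reachable in seg[qj].items() if reachable}
    return {labels[q] for q in frontier}

# two hypothetical segments for the monitor of Figure 2.1
RM = [
    {"q0": {"q0": True, "qT": True, "qF": False},
     "qT": {"qT": True}, "qF": {"qF": True}},
    {"q0": {"q0": False, "qT": True, "qF": True},
     "qT": {"qT": True}, "qF": {"qF": True}},
]
labels = {"q0": "?", "qT": "TRUE", "qF": "FALSE"}
print(possible_verdicts(RM, "q0", labels))   # {'TRUE', 'FALSE'}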
For the purpose of generating data, we create a synthetic program that at random generates a distributed computation (i.e., the behaviors of a set of programs in terms of their inter-process communication and local calculations). Generating synthetic experimental data offer benefits that enable us to draw comparison between differ- ent parameters and their effect on the approach. For example, generating data for different values of ε is beneficial to study its effect on the runtime and the number of false warning 51 verdicts of our approach. When developing the synthetic distributed system as part of our experiment, we ensure a partially-synchronous setting by including an HLC implementation. We use a uniform distribution (0, 2) to define the type of event (local computation, send and receive message) and a flip-coin distribution for computing the atomic propositions that are true at each local computation event. Although the events in our synthetic experiments in Section 3.5.2 are uniformly distributed over the length of the trace, the event distribution as part of the Cassandra experiments in Section 3.5.3 are affected by the network latency and other external factors. In addition, we assume that that there is an external data collection program which keeps track of the data/states of the system under verification. It generates the trace logs which is used by the monitoring program to verify against the given LTL specifications mentioned in Figure 3.5b. For data verification, we consider the following parameters: (1) number of processes (N ), (2) computation duration (l secs), (3) segment length (g), (4) event rate (r events/pro- cess/sec), (5) maximum clock skew (ϵ), and (6) number of nested temporal operators ( ) ϕ | | for the LTL formula under monitoring. The primary metric is to calculate the SMT solving runtime for each parameter configuration. In all of the charts shown in this section, the time axis is displayed in log scale. By keeping the values of all the other parameters at sensible fixed values, we can study the impact of changing one parameter. In all the graphs, we com- pare the runtime of our automata-based approach against the progression based approach. We use a MacBook Pro with Intel i7-7567U(3.5Ghz) processor, 16GB RAM, 512 SSD and g++ Apple clang version 12.0.5 (clang-1205.0.22.9) interface to the Z3 SMT-solver [97] to generate the traces. To evaluate our parallel algorithm, we use a server with 2x Intel Xeon Platinum 8180 (2.5GHz) processor, 768GB RAM, 112 vcores and g++(GCC) 9.3.1 interface to the Z3 SMT-solver [97]. 52 3.5.2 Analysis of Results – Synthetic Experiments In this series of experiments, we examine every parameter that is available and record how it impacts SMT solution. To investigate how each parameter affects runtime, we test each one separately. Since the created synthetic data is independent of any outside influences, we include a delay to both reduce the amount of events occurring at each time unit and to ensure that events are distributed equally across the execution of each process. We assign a value to each local computation event in each process using a uniform distribution (0, ). Σ | | The findings of the following experiments only make use of one CPU core. Overall, we notice an improvement of around 35% when the progression based technique is compared to the other automata based approach. 
Overall, we notice an improvement of around 35% when the progression-based technique is compared to the automata-based approach. This improvement in performance owes to two main reasons: (1) compared to the automata-based approach, the LTL constraints in our progression-based approach are less demanding in terms of computational complexity, since each sub-formula consists of mostly one atomic proposition, as opposed to multiple atomic propositions in each path of the automaton, which in turn speeds up the overall verification process; and (2) the total number of SMT instances needed is smaller, due to the smaller number of sub-formulas compared to automaton paths for the same specification. We now analyze the results in detail.

Impact of predicate structure. In this experiment (Figure 3.5a), we consider different predicate distributions over AP for the formula φ1, i.e., how many processes are involved with a particular predicate. We consider different predicate structures, O(1), O(n), O(n²), and O(n³), which signify the order of the number of SMT encodings that need to be generated for the given distribution of predicates. As can be seen, the progression-based technique outperforms the automata-based technique overall by 35% on average.

Figure 3.5 Synthetic experiments - impact of different parameters: (a) predicate structure, (b) LTL formula, (c) epsilon, (d) event rate, (e) segment length, (f) computation duration.

Having said that, during our experiments, when comparing the runtime of our monitoring approach for an increasing number of sub-formulas, we observe a slight decrease in the overall runtime efficiency of the progression-based approach compared to the automata-based approach. Since the progression-based approach is based on evaluating each sub-formula, there exists an LTL formula where the number of sub-formulas is larger than the number of paths in the corresponding automaton, and thus the progression-based approach might not be as efficient as the automata-based approach in such a scenario. For example, consider a formula φ = a ∨ b ∨ c, where the automaton has two states, which makes the number of paths 2.
However, the progression involves 3 sub-formulas, which makes the progression-based approach less efficient than its automata-based counterpart. We would like to point out that the formula can be rewritten as the single predicate (a ∨ b ∨ c), which makes both approaches yield similar results. Thus, we hypothesize that for all LTL formulas, the progression-based approach will be at least as efficient as the automata-based approach.

Impact of LTL formula. Given an LTL formula, the depth of nested temporal operators plays an important role, as suggested by Figure 3.5b. We experiment with six LTL formulas, φ1 through φ6, built from the atomic propositions p, q, r, s, and t using the until operator, negation, and Boolean connectives, with nesting depths d = 2, 3, 4, 5, 6, and 7 and sizes |φ| of 1, 2, 3, 8, 8, and 9, respectively. The progression-based technique achieved an average improvement of 32.8% compared to the automata-based one.

Impact of partial synchrony. Figure 3.5c depicts the anticipated outcome, wherein an exponential rise in the number of concurrent events across processes leads to longer runtime as the clock skew ε grows. When compared with the automata-based approach, the progression-based technique yields an improvement of 33.36%.

Impact of event rate. Figure 3.5d shows that our approach breaks even with the computation duration for N = 3 at an event rate of 5 events/process/sec. However, increasing the event rate increases the search space for the SMT solver. Overall, we improve by 34.4% by using the progression-based technique compared to the automata-based technique.

Impact of segment count. The number of events to be handled grows as the segment length rises, exponentially lengthening the time our method takes to operate. Since there are not enough occurrences to have an effect, N = 1, 2 do not show significant improvement in Figure 3.5e. For a greater number of processes, we see improved performance with shorter segments. It should be noted, however, that the runtime rises again for extremely short segment lengths, because the time required to construct a greater number of SMT encodings outweighs the performance benefit from smaller segments. Here too, we notice an improvement of 32.6% for the progression-based technique over the automata-based technique.

Impact of computation duration. In Figure 3.5f, we lengthen the computation and monitor the impact on runtime. The number of segments required to verify the lengthier computation grows as the duration of the computation rises, leading to a linear increase in runtime. The progression-based approach improves the runtime by 33.1% when compared to the automata-based approach.

Impact of parallelization. The technique performs significantly better when the verification is distributed over many cores. Figure 3.6a illustrates the dramatic improvement in performance that occurs when the number of cores is increased from 1 to 10. However, raising it further makes little progress, since the time required to generate the SMT encodings begins to take precedence over the time required to solve them. An improvement of 33.8% is achieved for the progression-based approach when compared to the automata-based approach.

Impact of ε on false warnings. As discussed in Section 2.3, since the monitor does not have access to the global clock, it can report events as concurrent when, in reality, one happened before the other in the system under observation.
However, during this experiment, we keep track of the global clock values separately, which gives us full knowledge of the total ordering of all events, allowing us to study and report the real verdicts alongside the reported verdicts. We observe that the monitor sometimes reports false warnings, that is, it reports both verdicts (satisfaction and violation) when, in reality, only one has occurred. Note that the monitor never fails to report real verdicts; however, it may report false warnings alongside real verdicts on some occasions. Although this does not change the correctness of the approach, it may still include false warnings as part of the set of evaluated results.

Figure 3.6 Impact of parallelization on different data: (a) synthetic data, (b) SBS data, (c) Google data, (d) false warnings.

In Figure 3.6d, we observe that as the maximum clock skew ε increases, the number of false warnings increases. The increase in false warnings is attributed to the fact that as the value of ε increases, so does the number of events considered concurrent by the monitor. Additionally, we observe that the number of false warnings is greatly influenced by the predicate structure of the LTL formula, as evident from Figure 3.6d. For O(n) conjunctive satisfaction formula monitoring and O(n) disjunctive violation formula monitoring, false warnings might occur if any one of the n sub-formulas is violated or satisfied, respectively; therefore, we see a higher number of false warnings. Similarly, for O(n) disjunctive satisfaction formula monitoring and O(n) conjunctive violation formula monitoring, false warnings might occur only if all of the n sub-formulas are violated or satisfied, respectively; therefore, we see a lower number of false warnings.

3.5.3 Case Study 1: Cassandra
In this case study, we observe read/write irregularities of a NoSQL distributed database management system called Cassandra [19, 72]. One node from each cluster serves as the seed node in our simulation of a distributed database with two data centers: one cluster with four nodes and the other with three. Each node in both clusters replicates all of the data. Each node runs on the Red Hat OpenStack Platform using 4 VCPUs, 4GB RAM, Ubuntu 18.04, Cassandra 3.11.6, and Java 1.8.0_252. Additionally, we have simulated a system with numerous processes, each of which is in charge of the fundamental database operations (read, write, and update). These processes are also capable of inter-process communication, which enables them to alert other processes in the event that they create a new database record. We compared our system's latency against that of Google Cloud, Microsoft Azure, and Amazon Web Services in order to make our simulated database as realistic as possible.
The quickest response was timed at 41ms, compared to our system's 100ms. The sluggish bandwidth and different infrastructure are to blame for the significant latency when compared to the industry norm. In all of our experiments, we therefore take a delay of 100ms into account. Each of the processes is capable of reading, writing, or updating the database entries, given the way the processes are designed. We choose the kind of operation that will be carried out by a process using a uniform distribution over (0, 2). The other processes are informed of any additions made by the write operation using inter-process communication. We assume no messages are lost during transmission, and as soon as a message is received, the receiving process reads it.

Figure 3.7 Cassandra experiments: (a) segment length, (b) computation duration.

The consistency level helps a database maintain the bare minimum number of replications required for an activity to be deemed successfully completed. In order to eliminate any potential read or write anomaly in the database, Cassandra recommends that the sum of the read and write consistency levels be greater than the replication factor. Using runtime monitoring, we want to detect read/write irregularities in the database. The corresponding LTL specification becomes:

φrw = ⋀_{i=0}^{n} ( write(i) → read(i) )

where n is the number of read/write operations.

One of the drawbacks of utilizing a distributed database like Cassandra is the absence of database normalization features. As a result, we intend to monitor both write and delete reference checks. We present two tables:

Student(id, name)
Enrollment(id, course)

On these tables, we enforce the write and delete reference checks. A write in the Enrollment table must always be preceded by a write in the Student table with the same id. Similarly, a delete from the Student table should always be preceded by a delete from the Enrollment table with the same id. This ensures no insertion and deletion anomalies, resulting in the following LTL specifications:

φwrc = ¬( ¬write(Student.id) U write(Enrollment.id) )
φdrc = ¬( ¬delete(Enrollment.id) U delete(Student.id) )

Extreme load scenario. Figures 3.7b and 3.7a depict runtime versus computation duration and runtime versus segmentation frequency under our network's maximum read/write load. These results are slightly noisier than the results of the synthetic experiments. This is because the events in the synthetic experiments were uniformly distributed across the whole computation length, but they are not uniform here.
Database operations requiring network communication (read, write, and update) require an average of 100ms, whereas sending and receiving messages involve inter-process communication and take roughly 10ms-15ms, resulting in a non-uniform event distribution. When comparing with the automata-based approach, we do not see much improvement when monitoring φwrc or φdrc using the progression-based approach. However, when monitoring φrw, we observe an average improvement of 55.53%.

Moderate load scenario. In Figure 3.7b, we were able to break even with as few as 2 processes. To find a real-world example with modest database activity, consider the Google Sheets API, which allows a maximum of 500 requests per 100s per project and a maximum of 100 requests per 100s per user, i.e., on average 5 events/sec per project and 1 event/sec per user. To see how our technique operates in such a scenario, we increase the number of processes and cores available to monitor such a system in order to investigate the time required to verify the trace created by such a system. In Figure 3.6c, we see that we break even at an event rate of 3 events/sec/user when using the progression-based strategy. Our algorithm operates effectively when the number of processes is 7, 8, or 9, which is far higher than Google allows. This allows us to be confident that our technique can be implemented online in real-world scenarios.

3.5.4 Case Study 2: RACE
In this case study, we monitor a mutual separation property between multiple aircraft. The dataset for this case study was generated using the Runtime for Airspace Concept Evaluation (RACE) [85] framework developed by NASA (https://github.com/NASARace/race-data). RACE is a framework for creating an event-based, reactive airspace simulation. This dataset consists of three data sets obtained on three distinct days. Each was captured at around 37°N latitude and 121°W longitude. The dataset contains all eight types of messages sent by the SBS unit when a Telnet application is used to listen to port 30003, but we only use the messages with ID MSG-3, which is the Airborne Position Message; it includes a flight's latitude, longitude, and altitude and is used to verify the mutual separation of all pairs of aircraft. We found that the time gap between the time a message was created and the time it was recorded was generally less than a second, thus we consider an ε = 1s over the time the message was generated. Furthermore, calculating the distance between two locations is computationally intensive, since we must account for characteristics such as the earth's curvature. To speed up distance computations, at the expense of a minuscule error margin, we use the constants 111.2km per degree of latitude and 87.62km per degree of longitude, multiply them by the difference in latitude and longitude, and factor in the altitude to get the distance between two aircraft. We verify mutual separation by assuming a minimum separation of 500m between each pair of aircraft. According to the dataset, each aircraft generates a message at least once per second. There are three distinct datasets: sbs-1 has 293 aircraft and 168,283 messages spread over 3 hours, 28 minutes, and 58 seconds; sbs-2 has 110 aircraft and 64,218 messages spread over 1 hour, 1 minute, and 46 seconds; and sbs-3 has 97 aircraft and 64,162 messages spread over 49 minutes and 42 seconds.
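As an illustration of the flat-earth distance approximation just described, here is a minimal sketch. The aircraft states and helper names are assumptions made for this example; the per-degree constants and the 500m threshold are the values used above.

```python
# Minimal sketch of the flat-earth distance approximation described above.
# Aircraft states are illustrative tuples (latitude_deg, longitude_deg, altitude_m).
KM_PER_DEG_LAT = 111.2
KM_PER_DEG_LON = 87.62
MIN_SEPARATION_M = 500.0

def separation_m(a, b):
    """Approximate 3D distance in meters between two aircraft states."""
    dlat_m = (a[0] - b[0]) * KM_PER_DEG_LAT * 1000.0
    dlon_m = (a[1] - b[1]) * KM_PER_DEG_LON * 1000.0
    dalt_m = a[2] - b[2]
    return (dlat_m**2 + dlon_m**2 + dalt_m**2) ** 0.5

def violates_mutual_separation(a, b):
    return separation_m(a, b) < MIN_SEPARATION_M

# Example: two aircraft about 0.003 degrees apart in latitude at the same altitude.
print(violates_mutual_separation((37.000, -121.000, 900.0), (37.003, -121.000, 900.0)))  # True
```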
In Figure 3.6b, we compare our obtained runtime on the three RACE datasets (labelled sbs-1, sbs-2, and sbs-3). We monitor the data in real time, with segments of 10s and an ε of 1s. We put our approach to the test by increasing the number of cores used on the processor and utilizing all available cores, as described in Section 3.4.2. Our results break even at 4 cores. This makes our approach desirable for aircraft monitoring and similar systems, such as IoT.

3.6 Conclusion
We elected to start our work with discrete-event systems (as opposed to continuous-time systems) because monitoring discrete-event systems is intuitively less expensive in terms of runtime and computational complexity than monitoring similar continuous-time systems. Both of our proposed techniques take an LTL formula and a distributed computation as input and, assuming a bounded clock skew among all processes, chop the computation into multiple segments before applying either the automata-based monitoring algorithm or the progression-based monitoring algorithm, implemented as an SMT decision problem, to verify the formula's correctness. In Section 3.5, we carried out extensive simulated experiments, as well as case studies on monitoring consistency conditions in Cassandra and a NASA air traffic control dataset. Our experiments demonstrate up to 35% improvement in performance of our progression-based algorithm over our automata-based algorithm. Furthermore, based on these experiments, we conclude that online monitoring is indeed possible with our techniques when distributed computations are properly segmented and parallelized. A natural course of action now is to carry over and apply the relevant aspects of this approach to monitoring continuous-valued systems, in other words, distributed CPS. We take the first steps toward monitoring distributed CPS in the next chapter.

CHAPTER 4
PREDICATE MONITORING IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS

In this chapter, we take first steps towards rigorously monitoring distributed CPS. To this end, we propose a monitoring technique to detect Boolean predicates over the analog (i.e., continuous-time and continuous-valued) signals generated by the agents in a distributed CPS. Similar to our approach described in Chapter 3, a clock synchronization algorithm guarantees a maximum clock skew across all signals generated by the agents. In the following sections, we first define the analog signal transmission and sampling method based on our signal model defined in Chapter 2. We then elaborate on our predicate detection approach for partially synchronous distributed CPS using a signal retiming technique.

4.1 Signal Transmission to the Monitor
Communication between nodes requires sampling the analog signal, sending the samples, and reconstructing the signal at the receiving node. Our goal is to monitor the reconstructed analog signals. This is not the same as monitoring a discrete-time signal composed of samples; the applications we are addressing are concerned with the value of the signal between samples and the possible violations revealed by it. Signal transmission methods, such as sampling and reconstruction, are common in communication theory. Errors caused by sampling and reconstruction (for example, owing to bandwidth constraints) can be accounted for by tightening the STL formula using the methods of [45]. The reconstruction algorithm is chosen based on the application and domain knowledge.
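For instance, a piece-wise linear reconstruction, the default choice in the remainder of this chapter, can be sketched in a few lines; the sample times and values below are illustrative, not taken from our experiments.

```python
# Minimal sketch (illustrative samples) of piece-wise linear reconstruction of an
# agent's output signal from the samples received for one segment, evaluated at
# an arbitrary local time.
def reconstruct_pwl(samples):
    """samples: list of (local_time, value) pairs, sorted by time."""
    def x(t):
        if t <= samples[0][0]:
            return samples[0][1]
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)   # linear interpolation
        return samples[-1][1]
    return x

x1 = reconstruct_pwl([(0.00, 1.0), (0.05, 1.4), (0.10, 1.2)])  # illustrative 20 Hz samples
print(x1(0.075))   # value between the 2nd and 3rd sample -> 1.3
```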
For the sake of simplicity and generality, we assume that every output signal xn is rebuilt as piece-wise linear between the samples, except in one experiment where we reconstruct a signal as both piece-wise linear and piece-wise quadratic to study the trade-offs. Other signal constructions, such as cubic splines, can also be employed with easy modifications to our algorithms, at the cost of increased run time, given that the choice of signal structure is orthogonal to our methodology and the aims of this work. Since we assume the agents do not deadlock, this transmission happens in segments of length T: at the kth transmission, agent An transmits xn|[(k−1)T,kT], the restriction of its output signal to the interval [(k − 1)T, kT] as measured by its local clock. The remainder of the work refers solely to the signal fragments received by the monitor during a specific transmission.

We now return to the constraint imposed on In in Definition 7, namely that it is a non-empty bounded interval. Non-emptiness models the absence of deadlock in computing; that is, an interval In expresses that no events are missed, or equivalently, that signal reconstruction is perfect at the monitor. The restriction that it be bounded models the above monitoring setup: the monitor is only ever dealing with bounded signal fragments xn|[(k−1)T,kT]; therefore,

In = [(k − 1)T, kT],   (4.1)

for every agent at the kth transmission, measured in local time. By the bounded skew assumption, we have:

Lemma 7. For any two agents An, Am, |min In − min Im| ≤ ε and |max In − max Im| ≤ ε.

4.2 Problem Statement
Predicates are frequently used to encapsulate several system requirements (e.g., invariants). A predicate φ is a global Boolean-valued function over the signal values of the agents. For instance, φ(x1, x2) = (x1 > 0) ∧ (ln(x2) < 3) is a predicate on two signals that evaluates to true when x1 > 0 and ln(x2) < 3, and to false otherwise. Because the agents are partially synchronized to within an ε, it is not possible to actually evaluate all signals at the same moment in global time. However, as noted above, the frontier of a consistent cut gives us a possible global state. Therefore, the monitoring problem can be worded as follows: given a distributed signal (E, ⇝) over N agents, as defined in Definition 7, and a Boolean predicate φ, (E, ⇝) satisfies φ iff there exists a frontier of a consistent cut in (E, ⇝) where φ is satisfied. It should be noted that throughout this chapter, (E, ⇝) is used to denote distributed signals. We now define distributed satisfaction below.

Definition 10. [Distributed satisfaction] Given a distributed signal (E, ⇝) over N agents, and a predicate φ over the N agents, we say that (E, ⇝) satisfies φ iff for all consistent cuts C ⊆ E with

front(C) = ( (t1, x1(t1)), . . . , (tN, xN(tN)) )

we have φ(x1(t1), x2(t2), . . . , xN(tN)) = true. We write this as (E, ⇝) ⊨ φ. ■

Thus, we formally define the problem as follows.

Problem Statement: Continuous-Time Monitoring of Distributed CPS. Given a distributed signal (E, ⇝) and a predicate φ over N agents, determine whether (E, ⇝) ⊨ φ.

When a distributed signal (E, ⇝) does not satisfy a predicate φ, we say that (E, ⇝) violates φ and write (E, ⇝) ⊭ φ. In this dissertation, we want to detect whether there exists a consistent cut C ⊆ E such that (E, ⇝) ⊭ φ. The main challenge in monitoring distributed signals is that the monitor has to reason about signals that are subject to time asynchrony.
For instance, consider two signals x1 and x2 and the case where x1(2) = 5, x2(3) = 1, φ(x1, x2) = (x1 > 4) ∧ (x2 < 0), and ε = 2, so that time points 2 and 3 form a consistent cut. In this case, since the above signal values occur at local times within the possible clock skew, one has to (conservatively) consider that the predicate is violated. In the next section, we present our solution to the problem.

4.3 SMT-based Monitoring Algorithm
In a nutshell, our solution has the following features:
• Central monitor. We assume that there is a central monitor that solves, at regular intervals, the monitoring problem described in Section 4.2.
• Signal retiming. As signals are measured using their local clocks, the monitor must somehow align them to detect possible violations of the predicate. To this end, we propose a retiming technique that establishes the happened-before relation in the continuous-time setting, and stretches or compacts signals to align them with each other within the ε clock skew bound.
• SMT encoding. We transform the monitoring decision problem into an SMT-solving problem, whose components (such as the input signals and the happened-before relation) are modeled as SMT entities and constraints.

4.3.1 Retiming Functions
Our signal model is continuous-time, that is, the signals are maps from R+ to R+. Therefore, to model the approximate re-synchronizing action of the monitor, we use a retiming function.

Definition 11. [Retiming functions] A retiming function, or simply retiming, is an increasing function ρ : R+ → R+. An ε-retiming is a retiming such that ∀t ∈ R+ : |t − ρ(t)| < ε. Given a distributed signal (E, ⇝) over N agents and any two distinct agents Ai, Aj, where i, j ∈ [N], a retiming ρ from Aj to Ai respects ⇝ if we have ((t, xi(t)) ⇝ (t′, xj(t′))) ⇒ (t < ρ(t′)) for any two events (t, xi(t)), (t′, xj(t′)) ∈ E. An ε-retiming that respects ⇝ is a valid retiming. ■

Figure 4.1 shows examples of retimings and how they relate to predicate monitoring. To detect a predicate violation, we must first retime y to the t axis via a retiming map ρ. Panel (c) shows three different retimings, including the identity. Panels (d)-(e) show the retimed y. For the predicate x > y, (e)-(f) show no violations, but (d) does. The conservative monitoring answer is that the predicate is violated.

Figure 4.1 Predicate violation between two signals x and y measured using partially synchronized clocks t and s.

An ε-retiming ρ maps R+ to itself, but it is easy to see that the restriction of ρ to a bounded interval I is an increasing function from I to ρ(I) that respects the constraint |t − ρ(t)| < ε for all t ∈ I. Thus, in what follows, we restrict our attention to the action of ε-retimings on bounded intervals. We now state and prove the main technical result of this chapter, which relates the existence of consistent cuts in distributed signals to the existence of retimings between the agents' local clocks.

Theorem 3. Given a predicate φ and a distributed signal (E, ⇝) over N agents, there exists a consistent cut C ⊆ E that violates φ if and only if there exist a finite A1-local clock value t and N − 1 ε-retimings ρn : In → I1 that respect ⇝, 2 ≤ n ≤ N, such that

φ( x1(t), x2 ∘ ρ2⁻¹(t), . . . , xN ∘ ρN⁻¹(t) ) = false   (4.2)

and such that ρm⁻¹ ∘ ρn : In → Im is an ε-retiming for all n ≠ m. Here, '∘' denotes the function composition operator.
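Before turning to the proof, a small check may help make Definition 11 concrete. The breakpoint representation and the helper below are hypothetical (they are not part of our algorithm); the check mirrors the three requirements: monotonicity, the ε bound, and respecting ⇝ for one message.

```python
# Hypothetical helper (not part of the monitoring algorithm): checking the three
# requirements of a valid retiming on a finite breakpoint representation.
# 'breakpoints' lists pairs (s, rho(s)) of a candidate piece-wise linear retiming
# from agent A2's timeline to A1's; t_send/t_recv are the local send/receive
# times of a single message from A2 to A1, if any.
def is_valid_retiming(breakpoints, epsilon, t_send=None, t_recv=None):
    pts = sorted(breakpoints)                                   # sort by local time s
    increasing = all(r0 < r1 for (_, r0), (_, r1) in zip(pts, pts[1:]))
    within_eps = all(abs(s - r) < epsilon for s, r in pts)      # epsilon-retiming condition
    respects_hb = True
    if t_send is not None and t_recv is not None:
        # the retimed send moment must precede the reception moment: rho(t_send) < t_recv
        rho_send = min((r for s, r in pts if s >= t_send), default=float("inf"))
        respects_hb = rho_send < t_recv
    return increasing and within_eps and respects_hb

# Example: three breakpoints of a candidate retiming with epsilon = 0.5.
print(is_valid_retiming([(0.0, 0.3), (1.0, 1.2), (2.0, 1.9)], epsilon=0.5))   # True
```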
Proof. We distinguish the following cases.

Case 1: Suppose that such retimings exist. Define the local time values t1 := t and tn := ρn⁻¹(t) for 2 ≤ n ≤ N, and the set C = { eⁿ_t | t ≤ tn }. By the construction of C and the fact that the retimings respect ⇝, it holds that if e ∈ C and f ⇝ e, then f ∈ C. For every n, m ≥ 2 with n ≠ m, it holds that tm = ρm⁻¹(ρn(tn)), so |tn − tm| ≤ ε. Thus C is a consistent cut with frontier (eⁿ_{tn}), n = 1, . . . , N, that witnesses the violation of φ.

Case 2: Suppose now that there exists a consistent cut C with frontier

front(C) = ( (t1, x1(t1)), . . . , (tN, xN(tN)) )

that witnesses the violation of φ. We need the following facts.

Fact 1. For every two events eⁿ_{tn} and eᵐ_{tm} in front(C), we have |tn − tm| ≤ ε. Indeed, since eⁿ_{tn} is in the frontier of a consistent cut, we have eᵐ_s ∈ C for all s s.t. s + ε ≤ tn. Thus tm ≥ s for all such s, and so tm ≥ tn − ε. By symmetry of the argument, tn ≥ tm − ε holds as well.

Fact 2. Given intervals [a, b] and [c, d] s.t. |c − a| ≤ ε and |d − b| ≤ ε, the map L : [a, b] → [c, d] defined by L(t) = c + ((d − c)/(b − a))(t − a) is a linear ε-retiming. This is immediate.

Suppose first that there are no message exchanges. For 2 ≤ n ≤ N, we define the retiming ρn : In → I1 in two pieces. First, set ρn(tn) = t1; by Fact 1, |tn − t1| ≤ ε. Write I1 = [a, b] and In = [c, d] for notational simplicity in this proof. Call a pair of intervals that satisfies the hypothesis of Fact 2 an admissible pair. Then, the following pairs are clearly admissible by Lemma 7 and Fact 1: [c, tn] and [a, t1], and [tn, d] and [t1, b]. Thus, there exist two linear retimings Ln : [c, tn] → [a, t1] and L′n : [tn, d] → [t1, b], and we can define a piece-wise ρn: ρn(t) = Ln(t) on c ≤ t ≤ tn and ρn(t) = L′n(t) on tn ≤ t ≤ d. It is easy to establish that ρn is an ε-retiming.

It remains to show that ρn⁻¹ ∘ ρm : Im = [f, g] → In = [c, d] is also an ε-retiming. This too can be established in parts, first over [f, tm] and then over [tm, g], using the same arguments as above and exploiting the linearity of these retimings. For instance, if we write αn and αm for the slopes of Ln and Lm, respectively, then over [f, tm],

ρn⁻¹(ρm(s)) = Ln⁻¹(Lm(s)) = Ln⁻¹( a + αm(s − f) ) = c + (αm/αn)(s − f) = c + ((tn − c)/(tm − f))(s − f),

which is a linear ε-retiming by Fact 2.

If there are message exchanges, the above argument still applies, but over a more fine-grained division of the timelines In obtained by partitioning each timeline at message transmission times. For the admissible pair I1 = [a, b] and In = [c, d], suppose the first message is sent from An to A1 at local time s < tn and is received at local time r < t1. Define t(s) := min(s + ε, r). Then the pair [a, t(s)] and [c, s] is admissible. Upon repeating this process for all messages, a collection of admissible pairs is obtained that can be retimed to each other, as above, without violating the ⇝ relation. These are concatenated to yield the desired retiming ρn. ■

Thus, finding a consistent cut that violates the predicate can be achieved by finding such retimings. The proof of Theorem 3 further shows that the retimings can always be chosen as piece-wise linear (rather than arbitrary increasing functions), which yields significant runtime savings in the SMT encoding in the next section.

Remark 2. An interesting consequence of Fact 2 in the proof is that it is enough to use piece-wise linear retimings.
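As a quick numerical check of Fact 2 (with illustrative interval endpoints), the affine map between an admissible pair stays within ε of the identity, since the gap |L(t) − t| is affine in t and is therefore maximized at an endpoint:

```python
# Numerical check of Fact 2 with illustrative endpoints: the affine map between
# an admissible pair of intervals is a linear epsilon-retiming.
def linear_retiming(a, b, c, d):
    return lambda t: c + (d - c) / (b - a) * (t - a)

a, b, c, d, eps = 0.0, 2.0, 0.4, 2.3, 0.5   # |c - a| = 0.4 <= eps, |d - b| = 0.3 <= eps
L = linear_retiming(a, b, c, d)
worst = max(abs(L(t) - t) for t in (a, b))   # |L(t) - t| is affine, so endpoints suffice
print(L(1.0), worst < eps)                   # 1.35 True
```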
This results in the following concrete problem.

Concrete Problem Statement. Given ε > 0, a distributed signal (E, ⇝) over N agents, and a predicate φ over the N agents, find N − 1 piece-wise linear ε-retiming functions ρ2, . . . , ρN that satisfy the hypotheses of Theorem 3 and such that

φ( x1(t1), x2(ρ2⁻¹(t1)), . . . , xN(ρN⁻¹(t1)) ) = false   (4.3)

4.3.2 SMT Formulation
We solve the monitoring problem by transforming it into an instance of satisfiability modulo theories (SMT) [6]. Specifically, we ask whether there exist N − 1 retimings such that (4.3) holds; equivalently, whether there exists a consistent cut that witnesses satisfaction of ¬φ.

Without loss of generality, we start with our encoding for two agents, A1 and A2 (shown in Figure 2.3). A1 outputs signal x supported over the bounded timeline I1, which is discretized to D1 ⊂ I1 and sent to the monitor. Similarly, A2 outputs signal y supported over the bounded timeline I2, which is discretized to D2 ⊂ I2 and sent to the monitor. D1 and D2 are finite. Let δk > 0 be the sampling period of agent Ak, so two consecutive elements of Dk differ by δk, k ∈ {1, 2}. Consider further that A2 transmits a message at local time t1 which is received by A1 at local time t2, and that A1 sends a message at local time t3 which is received by A2 at local time t4. The distributed signal violates the predicate iff the following SMT problem returns SAT.

SMT entities. In our encoding, the entities are the retimings ρn, included as uninterpreted functions (which the solver will interpret), the signals x and y, the intervals I1 and I2, and the real numbers t, s, s′, t1, t2, t3, and t4. All these entities have been defined in the previous sections. The following quantities are all constants in the encoding, since they are known to the monitor: the sampling time sets Dk and sampling periods δk, the sampled values {x(ti) | ti ∈ D1} and {y(si) | si ∈ D2}, and the message transmission and reception local times.

SMT constraints. The encoding is a conjunction of the following constraints:

• (Predicate violation) The first constraint 'finds' local times t and s at which predicate φ is violated (up to ε-synchrony):

∃t ∈ I1. ∃s ∈ I2. ∃t⁻ ∈ D1. ∃s⁻ ∈ D2.   (4.4a)
( t⁻ ≤ t ≤ t⁻ + δ1 ) ∧   (4.4b)
( s⁻ ≤ s ≤ s⁻ + δ2 ) ∧   (4.4c)
( ρ(s) = t ) ∧   (4.4d)
¬φ( x(t⁻), y(s⁻) )   (4.4e)

Eq. (4.4b) finds the time sample t⁻ such that x(t) = x(t⁻); this is the result of our assumption that signals are piece-wise constant. Eq. (4.4c) does the same for y. Eq. (4.4d) specifies that s is retimed to t; this is what guarantees that (x(t), y(s)) is a possible global state as per Theorem 3. Eq. (4.4e) checks violation of the predicate at (x(t), y(s)) = (x(t⁻), y(s⁻)).

• (Valid retiming) Eq. (4.5) ensures that ρ is a valid ε-retiming from I2 to I1:

∀s ∈ I2. ∃t ∈ I1. (ρ(s) = t) ∧ (|t − s| < ε)   (4.5)

and Eq. (4.6) ensures that the retiming function is increasing:

∀s ∈ I2. ∀s′ ∈ I2. ( s < s′ ⇒ ρ(s) < ρ(s′) )   (4.6)

• (Happened-before) Eq. (4.7) enforces the happened-before relation for message transmissions:

( ρ(t1) < t2 ) ∧ ( t3 < ρ(t4) )   (4.7)

• (Inverse retiming) When there are more than 2 agents, we must also encode the constraint that for all n ≠ m, ρm⁻¹ ∘ ρn is an ε-retiming. Thus, for all n ≠ m, letting fm be the uninterpreted function that represents the inverse of the uninterpreted ρm, we add

∀t ∈ In. fm(ρn(t)) = t   (4.8)

in addition to the analogs of Eqs. (4.6) and (4.5) for fm ∘ ρn.
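The following is a minimal z3py sketch of the two-agent encoding above. The signal samples, sampling periods, intervals, message times, and the example predicate are all illustrative assumptions; the actual implementation uses a C++ interface to Z3, so treat this as a transliteration of Eqs. (4.4)-(4.7) rather than the tool itself.

```python
# Minimal z3py sketch of the two-agent SMT encoding (Eqs. 4.4-4.7).
# All concrete values (samples, periods, intervals, message times, predicate)
# are illustrative assumptions for this sketch.
from z3 import And, ForAll, Function, Implies, Not, Or, Real, RealSort, Solver, sat

eps, d1, d2 = 0.002, 0.05, 0.05                      # clock skew bound and sampling periods
D1 = {0.00: 1.0, 0.05: 5.0, 0.10: 4.0}               # sampled x values (piece-wise constant)
D2 = {0.00: 2.0, 0.05: -1.0, 0.10: 1.0}              # sampled y values
I1, I2 = (0.0, 0.10), (0.0, 0.10)                    # bounded timelines
t1, t2, t3, t4 = 0.02, 0.03, 0.06, 0.07              # message send/receive local times

rho = Function('rho', RealSort(), RealSort())        # uninterpreted retiming I2 -> I1
t, s, u, v = Real('t'), Real('s'), Real('u'), Real('v')

def phi(xv, yv):                                     # example predicate: (x > 4) and (y < 0)
    return And(xv > 4, yv < 0)

solver = Solver()

# (4.4): there exist t in I1, s in I2 and samples t-, s- with rho(s) = t such that
#        the piece-wise constant values at t-, s- violate phi.
solver.add(I1[0] <= t, t <= I1[1], I2[0] <= s, s <= I2[1])
solver.add(Or([And(tm <= t, t <= tm + d1,
                   sm <= s, s <= sm + d2,
                   rho(s) == t,
                   Not(phi(xv, yv)))
               for tm, xv in D1.items() for sm, yv in D2.items()]))

# (4.5) + (4.6): rho is an increasing epsilon-retiming from I2 into I1.
solver.add(ForAll([u], Implies(And(I2[0] <= u, u <= I2[1]),
                               And(I1[0] <= rho(u), rho(u) <= I1[1],
                                   u - eps < rho(u), rho(u) < u + eps))))
solver.add(ForAll([u, v], Implies(And(I2[0] <= u, u < v, v <= I2[1]), rho(u) < rho(v))))

# (4.7): happened-before constraints for the two messages.
solver.add(rho(t1) < t2, t3 < rho(t4))

print("violation witnessed" if solver.check() == sat else "no violation found / unknown")
```

Note that, because of the quantifiers over an uninterpreted function, Z3 may return unknown on such a direct encoding; the restriction to piece-wise linear retimings discussed below is what keeps the search tractable in practice.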
Other signal models. If output signals were piece-wise linear, say, Eq. (4.4e) would be modified accordingly:

φ( x(t⁻) + ((x(t⁻ + δ1) − x(t⁻))/δ1)(t − t⁻),  y(s⁻) + ((y(s⁻ + δ2) − y(s⁻))/δ2)(s − s⁻) ) = false   (4.9)

Similarly, if output signals were piece-wise quadratic, Eq. (4.4e) would be modified as follows:

φ( x(t), y(s) ) = false   (4.10)

Figure 4.2 Piece-wise interpolations: (a) piece-wise linear signals, (b) piece-wise quadratic signals.
Figure 4.3 Piece-wise linear signals vs. piece-wise quadratic signals.

In some systems, piece-wise quadratic signals may be used to represent signals more accurately. For example, Figure 4.3 shows two piece-wise quadratic constructions having the same value at some point in time, whereas their piece-wise linear counterpart signals do not. Our choice of signal models is limited by the SMT solver: it must be able to handle the corresponding interpolation equations, like the piece-wise linear interpolation in Eq. (4.9).

As an example, in Figure 4.2a, let x and y be two signals, where the violating predicate φ to be monitored is x(t) = y(s). Let ρ be a retiming of y on x, such that ρ(s⁻) = t⁻ and ρ(s⁻ + δ2) = t⁻ + δ1. It can be observed that although the discretized signal samples do not violate φ, because the signals are piece-wise linear it is easy to identify a violation at times t and s on signals x and y, respectively, where x(t) = 3, y(s) = 3, and ρ(s) = t. Another example is demonstrated in Figure 4.2b, where x and y are two signals expressed by their corresponding quadratic formulas. The violating predicate φ to be monitored is d(x(t), y(s)) ≤ 2, where d is a function that yields the distance between any two points. Let ρ be a retiming of y on x, such that ρ(s⁻) = t⁻ and ρ(s⁻ + δ2) = t⁻ + δ1. Furthermore, let the evaluation of d(x(t⁻), y(s⁻)) be 3 and the evaluation of d(x(t⁻ + δ1), y(s⁻ + δ2)) be 3. It can be observed that although the discretized signal samples do not violate φ, because the signals are piece-wise quadratic it is easy to identify a violation at times t and s on signals x and y, respectively, where d(x(t), y(s)) ≤ 2 and ρ(s) = t.

It is worth mentioning that restricting the SMT search to piece-wise linear retimings results in a significant decrease in run time, compared to the approach where the SMT solver is tasked with determining a general retiming. For example, for two UAVs with ε = 1ms over 5s-long signals, at segment count 5, the search for a general retiming requires 3.42s, whereas searching for a piece-wise linear retiming requires only 1.01s. Since, by Remark 2, there is no loss of generality in this restriction, from this point on, all the reported experiments are obtained using the piece-wise linear retiming approach.

Remark 3. (i) ρm⁻¹ ∘ ρn respects ⇝ automatically, so it is not necessary to encode that explicitly. (ii) Because we can restrict the SMT search to piece-wise linear retimings (see the remark following the proof of Theorem 3), constraint (4.8) can be simplified; namely, the expression for the inverse can be hard-coded. We do not show this to maintain clarity of exposition.
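To illustrate how the piece-wise linear variant of the violation constraint (Eq. (4.9)) plugs into the encoding sketched earlier, here is a small helper; the sample dictionaries and the predicate passed in are illustrative assumptions.

```python
# Sketch of the piece-wise linear variant of the violation constraint (Eq. 4.9):
# the predicate is evaluated on linearly interpolated signal values instead of
# the piece-wise constant samples. Inputs are illustrative; 'phi' is a function
# returning a z3 Boolean over two real-valued expressions.
from z3 import And, Not, Or, Real

def interp(v0, v1, delta, tau, tau0):
    # value of the linear segment that starts at (tau0, v0) and ends delta later at v1
    return v0 + (v1 - v0) / delta * (tau - tau0)

def pwl_violation(t, s, D1, D2, d1, d2, phi):
    clauses = []
    ts1, ts2 = sorted(D1), sorted(D2)
    for ta, tb in zip(ts1, ts1[1:]):
        for sa, sb in zip(ts2, ts2[1:]):
            xv = interp(D1[ta], D1[tb], d1, t, ta)
            yv = interp(D2[sa], D2[sb], d2, s, sa)
            clauses.append(And(ta <= t, t <= tb, sa <= s, s <= sb, Not(phi(xv, yv))))
    return Or(clauses)

# Example usage with Figure 4.2a-style samples; here phi(x, y) := (x != y), so a
# 'violation' is a pair of retimed moments where the interpolated signals coincide.
t, s = Real('t'), Real('s')
constraint = pwl_violation(t, s, {0.0: 1.0, 0.05: 5/3}, {0.0: 2.0, 0.05: 4/3},
                           0.05, 0.05, lambda x, y: x != y)
```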
4.4 Exploiting the Knowledge of System Dynamics
Physical processes in a CPS follow the laws of physics. A runtime monitor can leverage this knowledge of the CPS dynamics to make monitoring more efficient. We explain our idea with the following example (see Figure 4.4). From knowing the rate bound |ẋ| ≤ 1 (shown by a dashed line), the monitor concludes that the earliest x can satisfy the atom x ≥ 3 is τ1. Similarly for y. Given that τ1 > τ2, the monitor discards, roughly speaking, the fragment [0, τ1] from each signal and monitors the remaining pieces.

Figure 4.4 Leveraging dynamics.

Note that x(0) = 1 and y(0) = 2. Consider the predicate φ = ¬(a ∨ b), where a := x ≥ 3 and b := y ≤ 0.5, and let a and b be the atoms of predicate φ. There are 3 Boolean assignments to atoms a and b that falsify the predicate. Let us fix one such assignment, a = b = true. If the monitor knows a uniform bound on the rate of change ẋ of x, say |ẋ(t)| ≤ 1 for all t, then it can infer that a = true cannot hold before τ1 = 2 (local time). Similarly, if the monitor knows that |ẏ| ≤ 3, then b = true cannot hold before τ2 = 0.5 (local time). Taking into account the ε-synchrony, the monitor can limit itself to monitoring x|[2,T] (the restriction of x to [2, T]) and y|[2−ε,T+ε].

Now, if this yields UNSAT in the SMT instance, we select the next Boolean assignment (in terms of atoms a and b) that falsifies predicate φ (e.g., a = false and b = true), derive the useful portion of signals x and y, and repeat the same procedure until the answer to the SMT instance is affirmative or all falsifying Boolean assignments are exhausted. Of course, this requires exploring all such assignments to the atoms of the predicate, but since we expect the number of atoms in realistic predicates to be relatively small, the exhaustive nature of falsifying Boolean assignments will not be a bottleneck.

We generalize this idea to N agents and arbitrary predicates in Algorithm 4.1. We assume without loss of generality that every atom a that appears in φ is of the form xn ≥ va for some n and va ∈ R. A Boolean assignment is a map σ from atoms to {false, true}, and a violating assignment is one that makes the predicate false. Thus, given a violating assignment σ, for every atom a, a = σ(a) iff xn ≥ va (if σ(a) = true) or xn < va (if σ(a) = false). An obvious modification to Algorithm 4.1 allows the monitor to take advantage of knowing different rate bounds at different points along the signals.

Algorithm 4.1 Dynamics-aware monitoring.
Data: Distributed signal (E, ⇝), ε, predicate φ, bounds |ẋn| ≤ bn, n ∈ [N]
Result: (E, ⇝) ⊨ φ
Set tn = min In, n ∈ [N]
while not done do
    Get next violating assignment σ to the atoms of φ
    if there are no more violating assignments then
        done
    else
        for every atom a in φ do
            if σ(a) = true then
                τn = min{ τ | xn(tn + τ) ≥ va }, n ∈ [N]
            else
                τn = min{ τ | xn(tn + τ) < va }, n ∈ [N]
        end
        Set τ = maxn τn and m = argmaxn τn
        SMT-monitor the distributed signal Eσ made of the restrictions xn|[tn+τ−ε, max In], n ≠ m, and xm|[tm+τ, max Im]
        If SAT, done.
    end
end
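As a small numeric sketch of the pruning step in Algorithm 4.1, using the example above (the helper name is ours, not part of the algorithm), the earliest offset at which an atom can reach its required truth value follows directly from the current value and the rate bound:

```python
# Minimal numeric sketch of the pruning step of Algorithm 4.1 for the example
# above (phi = not(a or b) with a := x >= 3, b := y <= 0.5, x(0) = 1, y(0) = 2).
def earliest_crossing(current_value, rate_bound, threshold):
    """Earliest local-time offset at which a signal currently at current_value,
    with rate bounded by rate_bound, can reach the value threshold."""
    return abs(threshold - current_value) / rate_bound

tau_1 = earliest_crossing(1.0, 1.0, 3.0)   # a = true (x >= 3) cannot hold before 2.0
tau_2 = earliest_crossing(2.0, 3.0, 0.5)   # b = true (y <= 0.5) cannot hold before 0.5
tau = max(tau_1, tau_2)                    # monitor x|[tau, T] and y|[tau - eps, T + eps]
print(tau_1, tau_2, tau)                   # 2.0 0.5 2.0
```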
4.5 Case Studies and Evaluation
In this section, we evaluate our technique using case studies on networks of autonomous ground and aerial vehicles, as well as a water distribution system.

4.5.1 Case Study 1: Network of Ground Autonomous Vehicles
We collected data from two 1/10th-scale autonomous cars competing in a race around a closed track. Each car carries a LiDAR for perceiving the world, and uses Wi-Fi antennas to communicate with the central monitor. Each car runs a model predictive controller to track its racing line and RRT to adjust its path. The trajectory data is sampled at 25Hz. In this application, the useful signal length to monitor is 1-2s, as this is the control horizon (i.e., the controller repeatedly plans the next 1-2s). Thus, in Eq. (4.1), T = 1-2s. A reasonable range for ε is the interval [1, 5]ms, guaranteed by ROS clock synchronization based on NTP. Unless otherwise indicated, we monitor the predicate d(x1, x2) > δ ∧ d(x1, x2) ≤ ∆.

4.5.2 Case Study 2: Network of UAVs
We use Fly-by-Logic [100], a path planner software for UAVs, to simulate the operation of two UAVs performing various reach-avoid missions. In a reach-avoid mission, each UAV must reach a goal within a deadline, and must avoid static obstacles as well as other UAVs. The path planner uses a temporal logic robustness optimizer to find the most robust trajectory. The trajectories are sampled at 20Hz. In this application, the useful signal length to monitor is around 2s, as this is the UAV's 'reaction time' (depending on current speed). Thus, in Eq. (4.1), T ≈ 2s. A reasonable range for ε is again 1-5ms, guaranteed by ROS. Unless otherwise indicated, we monitor the predicate d(x1, x2) ≥ δ.

4.5.3 Case Study 3: Water Distribution System
We use a model of a hybrid dynamic high-pressure water distribution system consisting of two water tanks. Each water tank has an inlet pipe connected to an external water source, and an outlet pipe with a valve that can be used to regulate high-pressure water outflow from the tank. A controller on each water tank operates its valve, and samples the outflow pressure at 20Hz using its local clock. We model such a system in Simulink as a simplified emulation of the Refueling Water Storage Tanks (RWST) module of an Emergency Core Cooling System (ECCS) of a Pressurized Water Reactor Plant [118], as shown in Figure 1.1. The ECCS is tasked with providing core cooling to minimize fuel damage following a 'loss of coolant' accident by administering high-pressure water injection from the RWST. The water tanks, and by extension their controllers, operate even when the supply of power to the plant is lost. As a failsafe, the ECCS also incorporates Cold Leg Accumulators that do not require power to operate. These tanks contain large amounts of borated water with a pressurized nitrogen gas bubble at the top. If the outflow pressure drops below a certain threshold, the nitrogen forces the borated water out of the tank and into the reactor coolant system. A reasonable range for ε is 5ms-500ms [13], depending on how often the local clocks of the water tanks are synced with global time. In this case study, we monitor the property that the cumulative pressure of the RWSTs always remains above a certain threshold.

Figure 4.5 Impact of signal segmentation on run time with varying signal duration (S.D.) and fixed ε = 0.001s: (a) network of cars, (b) network of UAVs.
Figure 4.6 Best run time (network of cars) for different signal durations.
Note that the SMT solver's effort is mostly spent on finding a retiming rather than on the predicate's complexity. Thus, we pick simple predicates for our experiments.

4.5.4 Experimental Setup
In our experiments, we choose the following parameters: (1) signal duration, (2) maximum clock skew ε, and (3) distribution of communication among agents. We measure the monitor run time. All experiments are replicated to exhibit a 95% confidence interval to provide statistical significance. The experimental platform is a CentOS server with 112 Intel(R) Xeon(R) Platinum 8180 CPUs @ 3.80GHz and 754GB of RAM. Our implementation invokes the SMT solver Z3 [97] to solve the problem described in Section 4.3.

4.5.5 Analysis of Results
Impact of signal segmentation. Given a signal to be monitored, we have a choice of either passing the entire signal to the monitor, or chopping it into segments and monitoring each segment separately (while accounting for ε-synchrony). Monitoring a signal in one shot is computationally more expensive than monitoring a number of shorter segments. Figure 4.5 shows the results supporting this claim. Note that all curves are plotted in log2 scale to provide more clarity. As can be seen, for any signal duration, chopping the signal and invoking the monitor for the shorter segments reduces the run time significantly. For example, in the case of the UAV network (Figure 4.5b), for a signal duration of 2s, it takes 4.5s to monitor the signal in one shot, but only 0.55s if the monitor is invoked 20 times over the signal duration. We observe the same behavior in Figure 4.5a. This is due to the SMT solver having to deal with much smaller search spaces in each invocation.

Figure 4.6 shows the best achievable run time for different signal durations by searching over segment counts in the range [1, 25]. For example, a segment count of 4 yields the minimum run time of 0.17s for a 1s signal, while a segment count of 18 yields the minimum run time of 0.72s for a 5s signal. The best run time shown is achieved by distributing the monitoring tasks across all the available cores (4) on the monitoring device. Notice that our predicate detection algorithm can be parallelized trivially, by assigning one segment or a pool of segments to each core. An important consequence of segmentation is that it enables us to monitor signals in real time, since for 3 or more segments the run time of the monitor is less than the signal duration. For this reason, in all remaining experiments, the signal to be monitored is chopped into 20 segments and each segment is monitored separately. Cumulative run times (of monitoring all 20 segments) are reported.
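The segment-level parallelization mentioned above can be sketched as follows; the per-segment check is a stand-in placeholder for the actual SMT invocation.

```python
# Minimal sketch of segment-level parallelization: segments are independent, so
# they can be dispatched to a pool of workers. monitor_segment is a placeholder
# for the per-segment SMT check, not the actual implementation.
from multiprocessing import Pool

def monitor_segment(segment):
    """Return True iff a violation is found in this segment
    (stand-in logic: flag any pair of samples that are suspiciously close)."""
    events, epsilon = segment
    return any(abs(x - y) < 0.5 for x, y in events)

def monitor(segments, workers=4):
    with Pool(workers) as pool:
        return any(pool.map(monitor_segment, segments))

if __name__ == "__main__":
    segments = [([(1.0, 2.0), (1.4, 1.6)], 0.001), ([(0.2, 3.0)], 0.001)]
    print(monitor(segments))   # True: the first segment triggers the stand-in check
```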
Impact of clock skew. We now study the impact of different choices of ε on the monitoring run time. We choose realistic values for ε with millisecond resolution. Figure 4.7 shows the monitoring run time for a 2s signal chopped into 1-20 segments. Both Figures 4.7a and 4.7b show that high-resolution clock synchronization results in very stable execution time for the monitor.

Figure 4.7 Impact of clock skew on run time (signal duration = 2s): (a) network of cars, (b) network of UAVs.

This is a positive result, showing that for practical clock synchronization algorithms, the actual value of ε does not have an impact on the monitoring overhead. However, ε naturally has an impact on the number of violations detected, specifically false positives. To demonstrate this, we model the paths of a pair of UAVs and a pair of cars, where the agents periodically reside within the given mutual separation threshold and violate the mutual separation property. Tables 4.1 and 4.2 show the results for two cars and two UAVs, respectively, in operation for half an hour. The experiments report (1) the number of True Violations, a baseline that was reverse-calculated from the introduced clock drift ε; (2) the number of Detected Violations using our method; and (3) the number of False Positives, which is the difference between the detected violations and the true violations. Note that there were no False Negatives.

Table 4.1 Impact of clock skew in the network of cars on verdicts using varying ε.
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      6     3    2     4    1     4    4     5    3     3
Detected Violations:  13    19   29    41   46    52   60    70   80    89
False Positives:      7     16   27    37   45    48   56    65   77    86

Table 4.2 Impact of clock skew in the network of UAVs on verdicts using varying ε.
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      6     6    8     4    2     1    7     2    5     6
Detected Violations:  11    20   30    39   46    48   62    66   76    84
False Positives:      5     14   22    35   44    47   55    64   71    78

Furthermore, as the maximum clock skew is increased from 0.05s to 0.5s, the number of False Positives naturally increases as well.

Impact of number of agents. We now observe the impact of the number of UAVs on the monitor. Figure 4.8a shows the effect on run time of increasing the number of agents from 2 to 10 with ε = 1ms over 5s-long signals. As each segment of a signal can be monitored independently, we improve our run time by distributing the monitoring tasks across all available cores on the monitoring device. Observe that initially the run time improves drastically as more segments are used. However, eventually the improvement becomes negligible, due to the run time being dominated by non-SMT tasks, such as creating job queues, allocating jobs to cores, and so on. We refer to this run time as the best run time. Figure 4.8b shows the best run times for different numbers of agents with ε = 1ms over 5s-long signals.

Figure 4.8 Impact of agents on run time: (a) signal duration = 5s and ε = 0.001s, (b) signal duration = 5s and ε = 0.001s.

Impact of communication. We examine whether the number of messages exchanged between agents has a significant impact on monitor run time. Two opposing mechanisms exist: on the one hand, messages impose an order between the send and receive moments and so reduce concurrency; in the discrete-time setting this normally reduces the asynchronous monitoring complexity. On the other hand, messages result in extra constraints in the SMT encoding via Eq. (4.7), which could increase SMT run time. Figure 4.9 shows the results. In (a) we use ε = 1ms and a 1s-long signal. Run time varies with no clear trend, suggesting that neither of the above two opposing mechanisms dominates. In (b), we use ε = 2s for a 2s-long signal, i.e., all events are concurrent.
One can see that the order introduced by messages slightly increases the runtime, instead of decreasing it. No firm conclusion can be drawn, and future work should study this more closely.

Figure 4.9 Impact of communication (between two agents) on run time: (a) signal duration = 1s and ε = 1ms, (b) signal duration = 2s and ε = 2s.

Impact of piece-wise quadratic signals. We now compare the effect on run time of piece-wise quadratic signals against piece-wise linear signals. To this end, we consider 1s-long signals for each signal model, generated by the network of cars with ε = 0.001s. For quadratic signals, each formula is constructed by the corresponding agent with the help of an SMT solver, using the signal value at the current local time and the signal values of the last two samples. This formula is then sent to the monitor. The formulas are constructed at their corresponding agents instead of at the monitor because solving quadratic equations for all agents at each sample point can become an expensive task for the monitor, especially for a higher number of agents. In Figure 4.10 we observe the runtime for varying segment counts for both signal models. The runtime for piece-wise quadratic signals is generally higher than for their piece-wise linear counterparts, because quadratic segments are described by three discrete sample points (and are therefore longer) whereas linear segments are described by two (and are therefore shorter). In exchange for this cost in runtime, we achieve better accuracy, as a piece-wise quadratic signal model is a more accurate signal representation than its piece-wise linear counterpart (recall Figure 4.3).

Figure 4.10 Run time (network of cars) vs. segment count for piece-wise linear and piece-wise quadratic signals.

Impact of knowledge of dynamics bounds. Here the predicate of interest is φ = (v1 > 1.6) ∨ (v2 > 1.3), where vi is the velocity of the ith car. The acceleration limit from the system dynamics is 1m/s². The monitor samples the received signals (Figure 4.11a) at 0.25s intervals and applies the acceleration bounds as explained in Section 4.4 to discard irrelevant pieces of the signal. As shown in Figure 4.11, applying Algorithm 4.1 clearly reduces the monitor run time. In general, of course, the exact run time reduction varies. For instance, while the speedup is 10x for 3s-long signals, it is 15x for 2s-long signals.

Figure 4.11 Impact of Algorithm 4.1 on monitoring run time (ε = 0.001s): (a) velocity profile of two cars, (b) run time vs. signal duration.

Impact of segment duration and number of water tanks. Let P1 and P2 denote the outflow pressures indicated by the respective valve controllers attached to each water tank. For simplicity, we assume all the pipes have the same diameter. Therefore, the pressure exerted on the Cold Leg Accumulators is P1 + P2.
In this experiment, we monitor the property φP that, during an emergency, the outflow pressure remains above the threshold pressure of 600psig [117], that is, φP = P1 + P2 > 600psig.

Figure 4.12 Effect of segment duration and the number of water tanks on runtime when ε = 0.05s.

Figure 4.12 shows the effect on runtime of increasing the number of water tanks from 2 to 4 with ε = 0.05s, over segment durations ranging from 1s to 5s. As expected, both the segment duration and the number of water tanks contribute to driving up the runtime. We note that even when the monitor receives the distributed signals sent by the water tanks at a reasonable 1s interval, the monitor is still able to verify the property in under about half a second for four water tanks.

Impact of clock skew. We now study the impact of different choices of ε on the monitoring verdicts. To this end, we model two Refueling Water Storage Tanks with intentional 'faults', where the outflow pressure of either water tank can drop below the threshold pressure of the Cold Leg Accumulators. Therefore, if at some moment in time both tanks' pressures fall simultaneously, the Cold Leg Accumulators get triggered. We also introduce a clock drift in the valve controller of one of the water tanks. We choose realistic values for the clock drift with millisecond resolution.

Table 4.3 shows the results for two water tanks that were active for an hour. During this operation time, Tank 1 reported low pressures for a total of 35.5 seconds, and Tank 2 reported low pressures for a total of 36.1 seconds. The experiment reports the number of True Violations as a baseline that was reverse-calculated from the introduced clock drift ε, the number of Detected Violations using our method, and the number of False Positives, which is the difference between the Detected Violations and the True Violations. Note that there were no False Negatives. Furthermore, as the maximum clock skew is increased from 0.05s to 0.5s, the number of False Positives naturally increases as well.

Table 4.3 Impact of clock skew in water tanks on verdicts using varying ε (Tank 1 total low-pressure duration: 35.5s and Tank 2: 36.1s in every row).
Clock Skew (s):       0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5
True Violations:      9     4    12    11   4     7    5     7    10    7
Detected Violations:  25    42   65    80   86    99   112   127  145   160
False Positives:      16    38   53    69   82    92   107   120  135   153

4.6 Conclusion
In this chapter, we demonstrated a new approach to online predicate detection for distributed signals that do not share a global clock. To make the problem tractable, we use causality analysis between real-valued signals, a reasonable assumption of a maximum clock skew among local clocks, and some knowledge of the system dynamics. We also studied the influence of signal dynamics information on monitoring efficiency. By experimenting on a real network of autonomous cars, a simulated network of UAVs, and a simulated water distribution system in Section 4.5, we found that, under certain circumstances, our method can be used to successfully monitor a distributed CPS in an online setting. However, this approach only considers Boolean predicates over distributed CPS, and by extension does not capture more complex specifications, such as nested and/or temporal properties.
In the next chapter, we explore the avenue of monitoring temporal specifications in distributed CPS.

CHAPTER 5
MONITORING SIGNAL TEMPORAL LOGIC IN DISTRIBUTED CYBER-PHYSICAL SYSTEMS

In this chapter, we explore a runtime verification approach for partially synchronous distributed CPS, where we make use of the signal retiming mechanism from the predicate detection technique demonstrated in Chapter 4 and the idea of the progression-based formula rewriting technique demonstrated in Chapter 3. In Chapter 4, we proposed an online predicate monitoring approach for distributed CPS. As mentioned before, that approach can only detect Boolean predicates and, therefore, cannot handle richer formal specification languages.

5.1 Problem Statement

As the distributed agents are partially synchronized within an ε clock skew, a monitoring algorithm must explore all (infinitely many) possible reachable consistent cuts. We call the propagation of consistent cuts with respect to time a consistent cut flow. Our objective is to determine whether there exists some flow of moments that are within ε of each other for which at least one reachable consistent cut results in a violation of a given STL formula. This intuition is formalized below, starting with the notion of a consistent cut flow.

Definition 12. [Consistent cut flow] Let (E, ⇝) be a distributed signal over N agents with time interval [a, b], and S be the set of all events over E. A consistent cut flow is a function ccf : [a, b] → 2^S that maps each time χ ∈ [a, b] to the frontier of a consistent cut at time χ; i.e., ccf(χ) ∈ {front(C) | C ∈ C(χ)}. For each time χ′ ∈ [a, b] and for each n ∈ [N], if χ < χ′, then for all events (cn(χ), xn(cn(χ))) ∈ ccf(χ) and all events (cn(χ′), xn(cn(χ′))) ∈ ccf(χ′), (cn(χ), xn(cn(χ))) ⇝ (cn(χ′), xn(cn(χ′))) holds. ■

Figure 5.1 A valid ccf over three signals x1, x2, x3, with frontiers ccf(0), ccf(1.5), and ccf(3).

Notice that a consistent cut flow induces a vector of N signals that are fully synchronized and thus can be verified against an STL formula φ at time t as (ccf, t) |= φ using the semantics described in Section 2.5. That is, for a consistent cut flow ccf on (E, ⇝), individual signals (x′1, . . . , x′N) can be constructed such that, for all 1 ≤ i ≤ N and for all χ ∈ [a, b], if (ci(χ), xi(ci(χ))) ∈ ccf(χ), then x′i(χ) = xi(ci(χ)). For example, let (E, ⇝) be a distributed signal consisting of signals x1, x2, and x3 as shown in Figure 5.1. For the STL formula G[0,3](x1 + x2 + x3 ≤ 10), ccf is a valid consistent cut flow on (E, ⇝). Note that a distributed signal (E, ⇝) encodes uncountably many consistent cut flows. Let us denote the set of all consistent cut flows by CCF. Our decision problem consists of determining whether there is a violation of a given STL formula by some consistent cut flow.

Definition 13. [Distributed satisfaction] Let φ be an STL formula, (E, ⇝) be a distributed signal over N agents, and CCF be the set of all induced consistent cut flows. We say that ((E, ⇝), 0), or simply (E, ⇝), satisfies φ iff for each σ ∈ CCF, we have σ |= φ. ■

Problem Statement: Given a maximum clock skew ε > 0, a distributed signal (E, ⇝) over N agents, and an STL formula φ, decide whether there exists a consistent cut flow σ ∈ CCF where σ ̸|= φ.

5.2 Monitoring Algorithm

In this chapter, we assume the monitor receives the output signals xn as piece-wise linear signals (this is by choice, and other forms of discretization do not change the core monitoring algorithm).
This transmission happens in segments of length T: at the kth transmission, agent An transmits xn|[(k−1)T, kT], the restriction of its output signal to the interval [(k − 1)T, kT] as measured by its local clock. In the rest of this chapter, we refer exclusively to the signal fragments received by the monitor in a given transmission.

We now revisit the restriction placed on In in Definition 7, namely, that the monitor only deals with non-empty bounded signal fragments xn|[(k−1)T, kT]; therefore, In = [(k − 1)T, kT] for every agent at the kth transmission, measured in local time. By the bounded skew assumption, we have:

Lemma 8. [Bounded skew lemma] For any two agents An, Am with intervals In = [min In, max In] and Im = [min Im, max Im], |min In − min Im| ≤ ε and |max In − max Im| ≤ ε.

Proof. Assume |min In − min Im| > ε. However, both min In and min Im are lower bounds of In and Im, respectively, at the kth transmission. Therefore, by the definition of partial synchrony, the difference of their values must not exceed the maximum clock skew ε, so our assumption is not possible. Thus, |min In − min Im| ≤ ε. Similarly, we can show that |max In − max Im| ≤ ε. ■

Since online monitoring happens in segments, at the end of each segment the monitor either returns ⊤ (formula already satisfied), ⊥ (already violated), or unknown, and the next segment is processed. For simplicity, our solution employs a central monitor. Our monitoring algorithm involves three key ideas: (1) formula progression, (2) signal retiming, and (3) an SMT-based implementation, explained in the following sections.

5.2.1 Formula Progression

Let φ be an STL formula and (E, ⇝) be a distributed signal. Without loss of generality, let this signal be split into two segments: a prefix (E1, ⇝) and a suffix (E2, ⇝). That is, (E, ⇝) = (E1E2, ⇝). The monitor first evaluates φ on (E1, ⇝). If the verdict yields true or false, then this verdict is returned and monitoring for (E, ⇝) is already complete. Otherwise, the monitor computes a new progressed formula φ′ which will be evaluated on the segment (E2, ⇝).

Definition 14. [Formula progression] Let (E1, ⇝) be a finite distributed signal starting at time 0 whose duration is denoted by |(E1, ⇝)|, and (E2, ⇝) be a finite or infinite extension of (E1, ⇝). We say STL formula φ′ is a progression of STL formula φ for (E1, ⇝) if and only if: ((E1E2, ⇝), 0) |= φ ⇔ ((E2, ⇝), 0) |= φ′. ■

It stands to reason that if ((E1, ⇝), 0) |= φ (resp., ((E1, ⇝), 0) ̸|= φ), then the progression of φ is trivially φ′ = ⊤ (resp., φ′ = ⊥).

5.2.2 Signal Retiming

Recall that signals are measured using their local clocks. Since the signals in our setting are partially synchronized within ε, it is not possible to evaluate all signals at the same moment in global time. Rather, the best a monitor can do is explore all valid alignments of the concurrent local moments (i.e., those moments that are within ε of each other) and determine whether at least one such alignment violates the formula. This intuition is formalized below, starting with the notion of a retiming function, borrowed from Chapter 4, that establishes the happened-before relation in the continuous-time setting and stretches or compresses signals to align them with each other within the ε clock skew bound. A valid retiming formalizes the notion of alignment of timelines: given two ε-synchronous timelines (on two agents), we treat moments t and s = ρ(t) as being simultaneous.
Thus, the signal x(t) = [x1(t), x2(ρ(t))] is now a fully synchronous signal. An ε-retiming ρ maps R+ to itself, but the restriction of ρ to a bounded interval I is an increasing function from I to ρ(I) that respects the constraint |t − ρ(t)| < ε for all t ∈ I. Thus, we restrict our attention to ε-retimings on bounded intervals. Between 2 agents, we need one retiming ρ : I2 → I1, and between N agents, we need N − 1 retimings ρn : In → I1. In general there are infinitely many valid retimings, any of which might reveal a potential violation. The next theorem establishes the fundamental condition relating ε-retimings among agents and violation of an STL formula.

Theorem 4. Given a distributed signal (E, ⇝) over N agents and an STL formula φ with time interval [a, b], there exists a violation at time t ∈ R+ if and only if there exist N − 1 ε-retimings ρn : In → I1 that respect ⇝, where 2 ≤ n ≤ N, such that:

((x1, x2 ∘ ρ2⁻¹, . . . , xN ∘ ρN⁻¹), t) ̸|= φ    (5.1)

Here, ρm⁻¹ ∘ ρn : In → Im is an ε-retiming for all n ≠ m, and '∘' denotes the function composition operator, where given two functions f and g, h = g ∘ f is such that h(x) = g(f(x)).

Proof. We distinguish the following cases:

Case 1: Suppose that such retimings exist. We define local time values for each time χ ∈ [t + a, t + b] for agents A1, A2, . . ., AN respectively as tχ1 = c1(χ), tχ2 = ρ2⁻¹(c1(χ)), . . ., tχN = ρN⁻¹(c1(χ)). In other words, tχ1, tχ2, . . ., tχN are the local times of agents A1, A2, . . ., AN, respectively, at global time χ. Furthermore, define Cχ = {(tn, xn(tn)) | tn ≤ tχn, n ≤ N}. By the construction of Cχ and the fact that the retimings respect ⇝, it holds that if e ∈ Cχ and f ⇝ e, then f ∈ Cχ. For every n, m ≥ 2 with n ≠ m, it holds that tχm = ρm⁻¹(ρn(tχn)), so |tχn − tχm| ≤ ε. Thus, Cχ is a consistent cut, and the flow of frontiers front(Cχ), where χ ∈ R+, is a consistent cut flow σ ∈ CCF that witnesses the violation of φ.

Case 2: Suppose σ ∈ CCF is a consistent cut flow that violates φ. By definition, there must be consistent cuts in σ that violate φ. Let Cχ denote such consistent cuts, and let front(Cχ) denote their frontiers. Consider any two events (tn, xn(tn)) and (tm, xm(tm)) in front(Cχ). Since (tn, xn(tn)) ∈ front(Cχ), we have (s, xm(s)) ∈ Cχ for all s such that s + ε ≤ tn. Thus, tm ≥ s for all such s, and so tm ≥ tn − ε. By symmetry of the argument, tn ≥ tm − ε holds as well, implying that a retiming indeed exists. ■

5.2.3 SMT Encoding

We solve the monitoring problem by transforming it into an instance of satisfiability modulo theories (SMT). Specifically, we ask whether there exist N − 1 retimings such that (5.1) holds; equivalently, whether there exists a consistent cut flow that witnesses satisfaction of ¬φ. That is, the distributed signal violates φ iff the corresponding SMT problem is satisfiable. This transformation to SMT solving is the focus of the next section.

5.3 SMT-based Monitoring Algorithm

The SMT formulation part of our solution is constructed by encoding both formula progression and signal retiming into a single SMT-solving problem and then solving it with an SMT solver. First, we define the SMT entities and constraints; we then demonstrate our monitoring approach with two complete examples.
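As a brief aside before the encoding, the two conditions that Theorem 4 places on an ε-retiming can be illustrated directly. The following is a minimal Python sketch (the function name and the sampled representation of ρ are hypothetical): a candidate retiming, given by finitely many (t, ρ(t)) samples, must be increasing and must never move a point by ε or more.

    # Illustrative check of the eps-retiming conditions (not the monitoring code).
    def is_valid_retiming(pairs, eps):
        """pairs: (t, rho(t)) samples of a candidate retiming, sorted by t."""
        for (t, rt), (t2, rt2) in zip(pairs, pairs[1:]):
            if not (t < t2 and rt < rt2):                    # increasing
                return False
        return all(abs(t - rt) < eps for t, rt in pairs)     # |t - rho(t)| < eps

    print(is_valid_retiming([(0.0, 0.4), (1.0, 1.2), (2.0, 2.5)], eps=1.0))  # True
    print(is_valid_retiming([(0.0, 1.5), (1.0, 1.2)], eps=1.0))              # False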
In both examples, we consider a distributed signal (E, ⇝) comprised of two individual 10-time-unit-long signals x1 and x2, generated by agents A1 and A2 respectively, with a clock skew bound ε = 1. Our running examples involve monitoring the formulas φ1 = ¬F[0,10] p and φ2 = ¬F[0,10](p ∧ G[0,5] ¬q).

5.3.1 SMT Entities

In our encoding, the N signals and time intervals are defined in the same fashion as in the mathematical representation of the previous sections. We also include the retiming functions ρn, where 2 ≤ n ≤ N, a consistent cut flow function ccf as an uninterpreted function, and real numbers t, s, and χ. Identifying interpretations of these functions is the output of SMT solving and, hence, the verdict of monitoring. The sampled signal values are constants in the encoding that are known to the monitor: {xn(tn) | tn ∈ In}.

5.3.2 SMT Constraints

Recall from Section 5.1 that (ccf, t) |= p denotes that the consistent cut flow at time t on signals (x′1, . . . , x′N) satisfies the atom p. To express this as an SMT problem, we encode (ccf, t) |= p as f(x′1[t], . . . , x′N[t]) > 0, where (x′1[t], . . . , x′N[t]) ∈ R^N is the vector of signal values at time t, and f : R^N → R is a function that evaluates a vector of signal values. The SMT constraints are primarily comprised of (1) a set of constraints that ensures a valid consistent cut flow, (2) a set of constraints that finds a violation, and (3) a set of constraints that enforces valid retimings under a given clock skew.

Consistent cut flow constraints. In order to ensure that ccf identifies a valid consistent cut flow on (E, ⇝) over the time interval [a, b], we first define the happened-before relation (⇝) in SMT according to Definition 7, and ensure that the events in the consistent cuts mapped by ccf respect the happened-before relation:

SMT_flow1 = ∀χ ∈ [a, b]. ∀(tn, xn(tn)), (t′n, xn(t′n)) ∈ E. (((t′n, xn(t′n)) ⇝ (tn, xn(tn))) ∧ ((tn, xn(tn)) ∈ ccf(χ))) ⇒ ((t′n, xn(t′n)) ∈ ccf(χ)).

We also require that the consistent cuts mapped by ccf always increase and never intersect:

SMT_flow2 = ∀χ, χ′ ∈ [a, b]. ∀n ∈ [N]. (χ < χ′ ⇒ cn(χ) < cn(χ′)).

Thus, the SMT constraint for a consistent cut flow is the following: SMT_flow = SMT_flow1 ∧ SMT_flow2.

Retiming constraints over ccf. We ensure

SMT_retime1 = ∀χ ∈ [a, b]. ∀c2(χ) ∈ I2. ∃c1(χ) ∈ I1. (ρ(c2(χ)) = c1(χ)) ∧ (|c1(χ) − c2(χ)| < ε),

and that ρ is always increasing:

SMT_retime2 = ∀χ, χ′ ∈ [a, b]. ∀c2(χ), c2(χ′) ∈ I2. (c2(χ) < c2(χ′) ⇒ ρ(c2(χ)) < ρ(c2(χ′))).

When there are more than 2 agents, we must also encode the constraint that for all n ≠ m, ρm⁻¹ ∘ ρn is an ε-retiming. Therefore, for all n ≠ m, denoting by fm the uninterpreted function that represents the inverse of the uninterpreted ρm:

SMT_retime3 = ∀t ∈ In. fm(ρn(t)) = t.

Thus, the SMT constraint for signal retiming is the following: SMT_retime = SMT_retime1 ∧ SMT_retime2 ∧ SMT_retime3.

Figure 5.2 Conversion of STL syntax trees to their corresponding SMT syntax trees: (a) φ U[a,b] ψ, (b) φ R[a,b] ψ, (c) G[a,b] φ, (d) F[a,b] φ, (e) ¬p.

Figure 5.3 SMT syntax trees of the STL formulas ¬φ1 and ¬φ2: (a) ¬φ1 = F[0,10] p, (b) ¬φ2 = F[0,10](p ∧ G[0,5] ¬q).
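To give a flavor of how the retiming constraints are handed to a solver, the following is a small sketch using the z3-solver Python bindings (our implementation also relies on Z3). It is a simplification: the retiming is constrained only at finitely many sample times of the segment rather than over the dense interval, and the consistent-cut-flow and violation constraints are omitted.

    # Sketch of SMT_retime1 and SMT_retime2 at finitely many sample times (Python z3).
    from z3 import Real, Solver, And, sat

    eps = 1.0
    samples = [0.0, 2.5, 5.0, 7.5, 10.0]                      # local times of A2 in I2
    rho = [Real(f'rho_{i}') for i in range(len(samples))]     # rho(samples[i]) in I1

    s = Solver()
    for i, t in enumerate(samples):
        # SMT_retime1: the retimed point stays within eps of the original time
        s.add(And(rho[i] - t < eps, t - rho[i] < eps))
    for i in range(len(samples) - 1):
        # SMT_retime2: rho is increasing
        s.add(rho[i] < rho[i + 1])

    assert s.check() == sat
    print(s.model())     # one admissible alignment of A2's timeline onto A1's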
Violation constraints over (E, ⇝). Let γφ be the syntax tree representation of an STL formula φ, where each internal node represents an operator and each leaf node represents an atomic proposition. We convert γφ to its SMT syntax tree representation τφ. An SMT syntax tree τφ is a tree obtained from an STL syntax tree γφ by replacing each temporal operator in the non-leaf nodes of γφ with its corresponding SMT encoding. In τφ as well, each leaf represents an atomic proposition. The purpose of converting an STL formula φ to its SMT syntax tree representation τφ is to be able to easily manipulate the syntax tree and parse its corresponding SMT encoding. Figure 5.2 shows the process of converting all five subtrees with STL operators to their corresponding SMT syntax tree representations. For nested formulas, this process is done for every formula in the STL syntax tree, starting from the root of the tree.

For example, Figures 5.3a and 5.3b show the SMT syntax trees created for ¬φ1 and ¬φ2 using the technique shown in Figure 5.2. Let τ¬φ1 (resp., τ¬φ2) be the SMT syntax tree created from ¬φ1 (resp., ¬φ2). Let us first consider the case where the monitor has the whole distributed signal (E, ⇝) (i.e., no segmentation); the case of a segmented signal will be handled by the formula progression explained in Section 5.3.3. Thus, we keep the SMT syntax trees unchanged and denote the corresponding SMT constraint by SMT_τφ. From Figure 5.3a, for ¬φ1, the distributed signal (E, ⇝), and the SMT syntax tree τ¬φ1, we have:

SMT_τ¬φ1 = ∃i ∈ [0, 10]. ((ccf, i) |= p).

Recall from the beginning of this section that '(ccf, i) |= p' is replaced with f(·) > 0 in the SMT constraint. For ¬φ2, we have:

SMT_τ¬φ2 = ∃i ∈ [0, 10]. (((ccf, i) |= p) ∧ (∀j ∈ [0 + i, 5 + i]. ((ccf, j) |= ¬q))).

Putting everything together. The final SMT constraint is the following:

FinalSMT = SMT_flow ∧ SMT_retime ∧ SMT_τ¬φ.

Since there is a logical equivalence between an STL formula φ and its corresponding SMT encoding SMT_τφ, for any given distributed signal (E, ⇝) over N agents, we have (E, ⇝) ̸|= φ if and only if FinalSMT is satisfiable (assuming all time intervals of temporal operators are within [0, |(E, ⇝)|]).

5.3.3 Formula Progression

We now consider the case where the monitor does not have the entire distributed signal and receives it in segments, or where the time intervals of some temporal operators are not within [0, |(E, ⇝)|]. Given a segment (E, ⇝) and a formula φ, our goal is to obtain a progressed formula φ′ such that any (finite or infinite) extension (E′, ⇝) will be evaluated against φ′.

  Data: SMT syntax tree τφ, partition time t
  Result: SMT syntax tree τ′φ
  Let rootτ be the root node of τφ and nτ be a node
  Function PartitionTree(nτ):
    if nτ has a quantifier with range '[a, b]' then
      if a < t ≤ b then
        Let n′τ be an empty node
        if nτ has quantifier '∀' then label n′τ as '∧'
        if nτ has quantifier '∃' then label n′τ as '∨'
        n′τ.leftchild ← copy of the subtree rooted at nτ, with '[a, min(b, t))' as its quantifier range
        n′τ.rightchild ← copy of the subtree rooted at nτ, with '[max(a, t), b]' as its quantifier range
        if nτ ≠ rootτ then nτ.parent.child ← n′τ else rootτ ← n′τ
      end
    end
    foreach child of nτ do PartitionTree(child)
    return
  PartitionTree(rootτ)
Algorithm 5.1 Function Λ.
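For concreteness, the partitioning idea of Algorithm 5.1 can be sketched in a few lines of Python. The representation below is hypothetical (a quantifier node is a small dictionary rather than a full SMT syntax tree), and it ignores the open/half-open endpoint bookkeeping of the algorithm; it only shows how a quantifier whose range straddles the partition time is split into a Boolean node over the two sub-ranges.

    # Hypothetical sketch of the tree partitioning performed by Function Lambda.
    def partition(node, t):
        """node: either a string (atomic proposition) or a dict with keys 'op'
        ('exists', 'forall', 'and', 'or'), bounds 'lo'/'hi' for quantifiers, and
        'children' (a list of sub-nodes)."""
        if isinstance(node, str):
            return node
        children = [partition(c, t) for c in node['children']]
        new = dict(node, children=children)
        if node['op'] in ('exists', 'forall') and node['lo'] < t <= node['hi']:
            joiner = 'or' if node['op'] == 'exists' else 'and'
            left = dict(new, lo=node['lo'], hi=min(node['hi'], t))
            right = dict(new, lo=max(node['lo'], t), hi=node['hi'])
            return {'op': joiner, 'children': [left, right]}
        return new

    # Example: the tree for F[0,10] p, i.e. 'exists i in [0,10]. p', split at t = 5
    tree = {'op': 'exists', 'lo': 0, 'hi': 10, 'children': ['p']}
    print(partition(tree, 5))
    # -> an 'or' node over 'exists i in [0,5]. p' and 'exists i in [5,10]. p'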
We define the function Λ that takes as input an SMT syntax tree τφ and a segment duration |(E, ⇝)| and returns as output (see Algorithm 5.1) an SMT syntax tree τ′φ = Λ(τφ, |(E, ⇝)|). We construct an SMT syntax tree τ′φµ from τ′φ such that the following properties hold:

• The root of τ′φµ is the topmost (and leftmost, if there are two) node of τ′φ which has a quantifier label.

• For every subsequent node in τ′φµ, if the node n has the label ∧ or ∨ with children labelled with quantifiers, remove the node and only keep the left child by setting n.parent = n.leftchild.

As examples, let us partition the SMT syntax trees τ¬φ1 (Figure 5.3a) and τ¬φ2 (Figure 5.3b) at time t = 5 using Algorithm 5.1. For τ¬φ1, since the starting node nτ, which is the root node in this case, is labelled '∃i ∈ [0, 10]', we create a node n′τ and label it '∨'. We then create two copies of the tree rooted at nτ, change their ranges to '[0, 5)' and '[5, 10]' respectively, and attach them as the left and right children of n′τ; n′τ is our new nτ. Now, we repeat the process for each child of nτ. However, as none of the children nodes are labelled with quantifiers, τ′¬φ1 = nτ is our desired partitioned tree from τ¬φ1 at time t = 5, shown in Figure 5.4a. Following the same process, we get τ′¬φ2 as our partitioned tree from τ¬φ2 at time t = 5, shown in Figure 5.4b.

Lemma 9. [SMT partition tree lemma] Let (E, ⇝) be a distributed signal and φ be an STL formula. FinalSMT for (E, ⇝) and τφ is satisfiable if and only if FinalSMT for (E, ⇝) and Λ(τφ, |(E, ⇝)|) is satisfiable.

Proof. We distinguish the following cases:

Case 1: First, we consider the base case of this proof, where the formula is an atomic proposition, that is, φ = p.
(⇒) The SMT encoding for (E, ⇝) and τp is: (ccf, 0) |= p. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy p. Now, as the SMT syntax tree for p does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for p remains unchanged, and the SMT encoding using E and τ′φ = Λ(τφ, |E|) is: (ccf, 0) |= p.
(⇐) Trivial.

Case 2: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∧ φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1∧φ2 is: (ccf, 0) |= φ1 ∧ φ2. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy φ1 ∧ φ2. Now, as the SMT syntax tree for φ does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for φ remains unchanged, and the SMT encoding using E and τ′φ1∧φ2 = Λ(τφ1∧φ2, t′) is: (ccf, 0) |= (φ1 ∧ φ2) ∧ true.
(⇐) Trivial.

Case 3: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Now, we consider the case where the formula is φ = φ1 ∨ φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1∨φ2 is: (ccf, 0) |= φ1 ∨ φ2. In other words, when this encoding is satisfied, the events in the frontier of the consistent cut at time 0 satisfy φ1 ∨ φ2. Now, as the SMT syntax tree for φ does not have any quantifiers, the test a < t ≤ b in Algorithm 5.1 never succeeds. Hence, the SMT syntax tree for φ remains unchanged, and the SMT encoding for (E, ⇝) and τ′φ1∨φ2 = Λ(τφ1∨φ2, t′) is: (ccf, 0) |= φ1 ∨ φ2.
(⇐) Trivial.
Case 4: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. We consider the case where the formula is φ = φ1 U[a,b] φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1 U[a,b] φ2 is:

∃i ∈ [a, b]. ((ccf, i) |= φ2 ∧ ∀j ∈ [0, i). ((ccf, j) |= φ1)).

If the above encoding is SAT, then both ∃i ∈ [a, b]. ((ccf, i) |= φ2) and ∀j ∈ [0, i). ((ccf, j) |= φ1) are SAT. For a < |E| ≤ b, this can be written as:

(∃i1 ∈ [a, |E|). ((ccf, i1) |= φ2 ∧ ∀j1 ∈ [0, i1]. ((ccf, j1) |= φ1))) ∨ (∃i2 ∈ [|E|, b]. ((ccf, i2) |= φ2 ∧ ∀j2 ∈ [|E|, b]. ((ccf, j2) |= φ1))).

Note that this is the SMT encoding for (E, ⇝) and τ′φ1 U[a,b] φ2 = Λ(τφ1 U[a,b] φ2, |E|) when a < |E| ≤ b. For any other value of |E|, the SMT syntax tree remains unchanged. When the SMT encoding of τφ1 U[a,b] φ2 is SAT, either (1) φ1 U[a,|E|] φ2 is satisfied, or (2) φ1 is satisfied throughout [0, |E|) and φ1 U[|E|,b] φ2 is satisfied. If φ1 U[a,|E|] φ2 is satisfied, then the first part of the SMT encoding of τ′φ1 U[a,b] φ2 becomes SAT; and if φ1 is satisfied throughout [0, |E|) and φ1 U[|E|,b] φ2 is satisfied, then the second part of the SMT encoding of τ′φ1 U[a,b] φ2 becomes SAT. Therefore, in all possible cases, if the SMT encoding of τφ1 U[a,b] φ2 yields SAT, then the SMT encoding of τ′φ1 U[a,b] φ2 also yields SAT.
(⇐) Trivial.

Case 5: Assume that the claim has been established for the cases where the formula is φ = φ1 and φ = φ2. Finally, we consider the case where the formula is φ = φ1 R[a,b] φ2.
(⇒) The SMT encoding for (E, ⇝) and τφ1 R[a,b] φ2 is:

∃i ∈ [a, b]. ((ccf, i) |= φ1 ∧ ∀j ∈ [0, i). ((ccf, j) |= φ2)).

If the above encoding is SAT, then both ∃i ∈ [a, b]. ((ccf, i) |= φ1) and ∀j ∈ [0, i). ((ccf, j) |= φ2) are SAT. For a < |E| ≤ b, this can be written as:

(∃i1 ∈ [a, |E|). ((ccf, i1) |= φ1 ∧ ∀j1 ∈ [0, i1]. ((ccf, j1) |= φ2))) ∨ (∃i2 ∈ [|E|, b]. ((ccf, i2) |= φ1 ∧ ∀j2 ∈ [|E|, b]. ((ccf, j2) |= φ2))).

Note that this is the SMT encoding for (E, ⇝) and τ′φ1 R[a,b] φ2 = Λ(τφ1 R[a,b] φ2, |E|) when a < |E| ≤ b. For any other value of |E|, the SMT syntax tree remains unchanged. When the SMT encoding of τφ1 R[a,b] φ2 is SAT, either (1) φ1 R[a,|E|] φ2 is satisfied, or (2) φ2 is satisfied throughout [0, |E|) and φ1 R[|E|,b] φ2 is satisfied. If φ1 R[a,|E|] φ2 is satisfied, then the first part of the SMT encoding of τ′φ1 R[a,b] φ2 becomes SAT; and if φ2 is satisfied throughout [0, |E|) and φ1 R[|E|,b] φ2 is satisfied, then the second part of the SMT encoding of τ′φ1 R[a,b] φ2 becomes SAT. Therefore, in all possible cases, if the SMT encoding of τφ1 R[a,b] φ2 yields SAT, then the SMT encoding of τ′φ1 R[a,b] φ2 also yields SAT.
(⇐) Trivial. ■

Given a distributed signal (E′, ⇝) and an STL formula φ, the following theorem shows that the subtree τ′φµ of Λ(τ¬φ, |(E, ⇝)|) allows computing the progressed formula by discharging τ′φµ.

Theorem 5. [Partial evaluation theorem] Let (E, ⇝) be a distributed signal and φ be an STL formula. It is the case that (E, ⇝) |= φµ if and only if FinalSMT for (E, ⇝) and τ′φµ is satisfiable.

Proof. Let us assume that τ′φ = Λ(τφ, |E|), that (E, ⇝) |= φµ, and that FinalSMT for (E, ⇝) and τ′φµ is not satisfiable.
This implies that τ′φµ has at least one subtree where the root node is the nth nested quantifier with an interval [αn, βn] and βn > |E|. However, while constructing τ′φµ, only the left child is kept for any node that has the label ∧ or ∨ with children labelled with quantifiers (see Section 5.3.3). Furthermore, in Algorithm 5.1, the maximum range of the quantifier labelled on the left child is min(βn, |E|). Therefore, βn > |E| is not possible, such a subtree cannot exist, and by extension such a τ′φµ cannot exist. Thus, (E, ⇝) |= φµ if and only if FinalSMT for (E, ⇝) and τ′φµ is satisfiable. ■

Simply evaluating FinalSMT for (E, ⇝) and τ′φµ is not enough, as we must ensure that there is no loss of information when modifying τ′φ using the said evaluation results. For example, in Figure 5.4b, since (σ, j2) |= ¬q cannot be evaluated on the first segment, finding only one value of i1 in this segment may lead to a loss of information, as this may ignore other valid values of i1 that are required to evaluate (σ, j2) |= ¬q on the next segment. Note that any modification to τ′φ would naturally occur only in its τ′φµ subtree. To this end, we define a function υ that takes as inputs an SMT syntax tree τ′φµ and a distributed signal (E, ⇝), and returns an SMT syntax tree τ′φυ such that, upon replacing τ′φµ with τ′φυ in τ′φ, τ′φ can sufficiently evaluate (E′, ⇝). In other words, the STL representation of τ′φ becomes the desired progression of φ on (E, ⇝). Before defining υ, we specify the following shorthand notations used throughout its definition:

• 'τφ = p': the root of the tree τφ is labelled p ∈ AP.
• 'τφ = τφ1 X τφ2', where X ∈ {∧, ∨}: the root of the tree τφ is labelled X, and it has two children τφ1 and τφ2.
• 'τφ = G[a,b] τψ': the root of the tree τφ carries the label ∀i ∈ [a, b], and it has a child τψ.
• 'τφ = F[a,b] τψ': the root of the tree τφ carries the label ∃i ∈ [a, b], and it has a child τψ.
• '((E, ⇝), t) |= τφ': at time instance t, FinalSMT for (E, ⇝) and τφ is satisfiable.

Now we define υ case by case for the relevant STL operators:

Atomic propositions. Let τφµ = p for some p ∈ AP. We have:
υ((E, ⇝), τφµ) = ⊤ if ((E, ⇝), 0) |= p, and ⊥ otherwise.

Conjunction. Let τφµ = τφµ1 ∧ τφµ2. We have:
υ((E, ⇝), τφµ) = υ((E, ⇝), τφµ1) ∧ υ((E, ⇝), τφµ2).

Disjunction. Let τφµ = τφµ1 ∨ τφµ2. We have:
υ((E, ⇝), τφµ) = υ((E, ⇝), τφµ1) ∨ υ((E, ⇝), τφµ2).

Figure 5.4 Examples of partitioned SMT syntax trees of the STL formulas ¬φ1 and ¬φ2 at t = 5: (a) partitioned SMT syntax tree for τ′¬φ1; (b) partitioned SMT syntax tree for τ′¬φ2.

Always operator. Let τφµ = G[a,b] τφ′µ. In this case, the transformation of τφµ is fairly straightforward:
υ((E, ⇝), τφµ) = G[a,b] τφ′µ if ∀k ∈ [a, b]. ((E, ⇝), k) |= τφ′µ, and ⊥ if ∃k ∈ [a, b]. ((E, ⇝), k) ̸|= τφ′µ.

Eventually operator. Let τφµ = F[a,b] τφ′µ.
In this case, instead of finding a single time instance where FinalSMT for (E, ⇝) and τφ′µ is satisfiable, a valid range [k, b] must be identified, where k ∈ [a, b] is the earliest time instance at which FinalSMT for (E, ⇝) and τφ′µ is satisfiable:

υ((E, ⇝), τφµ) = F[k,b] τφ′µ with k = argmin over k ∈ [a, b] such that ((E, ⇝), k) |= τφ′µ, and ⊥ if ∀k ∈ [a, b]. ((E, ⇝), k) ̸|= τφ′µ.

Remark 4. Since the Until (Figure 5.2a) and Release (Figure 5.2b) operators are expressed using existential and universal quantifiers in SMT syntax trees, the definition of υ does not need cases for them.

Now that we have defined υ, we state the steps required to compute the progression of an STL formula φ on a distributed signal (E, ⇝) as follows:

• First, we create the SMT syntax tree τφ that corresponds to the STL formula φ using the methods detailed in Figure 5.2. As examples, let us consider the SMT syntax trees for the STL formulas ¬φ1 = F[0,10] p (Figure 5.3a) and ¬φ2 = F[0,10](p ∧ G[0,5] ¬q) (Figure 5.3b).

• Next, we partition τφ at time |(E, ⇝)| using Algorithm 5.1 and obtain τ′φ = Λ(τφ, |(E, ⇝)|), such that τ′φµ is the subtree in τ′φ that can be evaluated on (E, ⇝). In our example, we consider the case where the monitor only has the first 5 time units, that is, |(E, ⇝)| = 5. Figure 5.4a (resp., Figure 5.4b) shows the partitioned SMT syntax tree for Figure 5.3a (resp., Figure 5.3b) at time instance |(E, ⇝)| = 5 with the subtree τ′¬φ1µ (resp., τ′¬φ2µ) that can be evaluated on (E, ⇝).

• Finally, we partially evaluate φ on (E, ⇝) by transforming τ′φµ into τ′φυ = υ((E, ⇝), τ′φµ). The STL representation of this new SMT syntax tree τ′φ is our desired progression of φ on the extension of (E, ⇝).

In our first example, let us assume that p is never true in (E, ⇝). In that case, according to the rules specified for υ, the label of the root of τ′¬φ1µ stays unchanged and its child becomes false. Therefore, the progression becomes (F[0,5) false) ∨ (F[5,10] p), which is F[5,10] p upon simplification. In our second example, let us assume that the minimum i for which ∃i ∈ [0, 5). (((E, ⇝), i) |= p ∧ ∀j ∈ [i + 0, min(i + 5, 5)]. (((E, ⇝), j) |= ¬q)) is satisfied is at time 3.5. In that case, according to the rules specified for υ, the label of the root of τ′¬φ2µ is changed to ∃i1 ∈ [3.5, 5). Therefore, the progression becomes (F[3.5,5)(p ∧ G[0,5] ¬q)) ∨ (F[5,10](p ∧ G[0,5] ¬q)).

5.4 Case Studies and Evaluation

In this section, we evaluate our algorithm for monitoring STL specifications on distributed signals using two case studies.

5.4.1 Case Study 1: Network of UAVs

In a similar manner as in Section 4.5, we use the Fly-by-Logic framework [100], a path planner software for UAVs, to simulate the flight paths of two UAVs that take off after 1.5s, hover, and then land after 4.5s. The trajectories are sampled at 20Hz as xn, yn, and zn coordinates for each UAV An, with ε ranging between 1 and 5 ms.

5.4.2 Case Study 2: Water Distribution System

We use the same model of a hybrid dynamic high-pressure water distribution system consisting of two water tanks that we used in Section 4.5. Therefore, the specifications of the water tank model are identical to those mentioned above. We use an ε range of 5 to 500 ms. However, despite using the same model, we verify the system against STL and observe different results from what we witnessed in Chapter 4.
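Before turning to the experimental setup, the progression computed in the first running example can be summarized by a small self-contained sketch. The representation below is hypothetical and deliberately specialized to F[a,b] p (it is not the SMT-based implementation): it only shows how the verdict on the observed segment selects between 'already satisfied', 'unchanged', and the residual formula F[|E|, b] p.

    # Progression of F[a,b] p after a segment of length seg_len (illustrative only).
    def progress_eventually(a, b, seg_len, p_holds_in_segment):
        """p_holds_in_segment: True iff p held at some time of the segment within [a, b]."""
        if not (a < seg_len <= b):
            return ('F', a, b, 'p')        # nothing evaluable yet
        if p_holds_in_segment:
            return True                    # left disjunct already satisfied
        return ('F', seg_len, b, 'p')      # residual formula F[seg_len, b] p

    print(progress_eventually(0, 10, 5, p_holds_in_segment=False))   # ('F', 5, 10, 'p')
    print(progress_eventually(0, 10, 5, p_holds_in_segment=True))    # True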
5.4.3 Experimental Setup

In our UAV-related experiments, we monitor three STL properties: (1) the mutual separation between UAVs never falls below a threshold; (2) all UAVs take off simultaneously from the standby state and hover at the same altitude; and (3) all UAVs eventually land simultaneously. The monitor receives a distributed signal every second, and we measure its execution time for each formula progression to verify the truthfulness of the given formulas. In our water-tank-related experiments, we simulate a plant failure where the RWST in the ECCS is triggered upon receiving an emergency actuation signal. The monitor receives a distributed signal at varying time intervals from multiple water tanks. Our goal is to find possible violations caused by clock drift, where the water pressure falls below the threshold required to keep the failsafe CLA from triggering. In other words, we want to monitor the property that, during an emergency, the outflow pressure reaches above the threshold pressure and remains above it forever. All experiments are replicated to exhibit 95% confidence intervals to provide statistical significance. The experimental platform is a CentOS server with an Intel(R) Xeon(R) Platinum 8180 CPU at a 3.80GHz clock rate and 754GB of RAM. Our implementation invokes the SMT solver Z3 [97] to solve the problem described in Section 5.3.

Figure 5.5 Effect of the number of segments and agents (2, 3, and 4 agents) on run time for different flight properties: (a) φms (mutual separation), (b) φeh (eventually hover), (c) φel (eventually land).

5.4.4 Analysis of Results

Mutual separation. This property states that the distance between every pair of UAVs in the fleet always remains above a given threshold δ. The corresponding STL formula φms is:

φms = ⋀_{i,j ∈ [N], i ≠ j} G[0,∞] (√((xi − xj)² + (yi − yj)² + (zi − zj)²) > δ).

Figure 5.5a shows the run time of each segment for the evaluation of φms on the distributed signal. In each segment the progression formula remains unchanged. However, the first segment shows minimal run time because the UAVs are stationary throughout the entirety of that segment and therefore require very few 'unique' distance calculations. The run times for the second segment and the last segment are only slightly higher than that of the first segment for the same reason: the UAVs are partially grounded throughout these two segments. Note that despite φms seemingly being a simple STL formula, the average run time per segment is relatively high (compared to the run time of the other formulas) due to the quadratic equations that must be solved.

Eventually hover. This property states that the UAVs in the fleet are eventually (within 2s) airborne and hover within a λ height margin. Formally, the corresponding STL formula φeh is:

φeh = ⋀_{i,j ∈ [N], i ≠ j} (F[0,2] (zi, zj > 0) ⇒ G[0,∞] (|zi − zj| < λ)).

Figure 5.5b shows the run time of each segment for the evaluation of φeh on the distributed signal. The first segment has the lowest run time as the UAVs are stationary. The second segment has a higher run time because (zi, zj > 0) is observed and progression is needed for the following segments, where the progressed formula simply becomes G[0,∞] (|zi − zj| < λ).
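The atomic predicates in these properties are evaluated following the f(·) > 0 convention of Section 5.3.2. As an illustration, the following hypothetical helper computes the mutual-separation atom for a pair of UAV positions and checks it over a whole (already retimed) position vector; it is a sketch, not the evaluation code used in the experiments.

    # Mutual-separation atom following the f(.) > 0 convention (illustrative).
    import math

    def separation_margin(pos_i, pos_j, delta):
        """pos_i, pos_j: (x, y, z) coordinates of two UAVs; delta: threshold."""
        return math.dist(pos_i, pos_j) - delta     # > 0 iff sufficiently separated

    def mutual_separation_holds(positions, delta):
        """positions: list of (x, y, z) for all UAVs at one retimed time point."""
        return all(separation_margin(p, q, delta) > 0
                   for i, p in enumerate(positions)
                   for q in positions[i + 1:])

    print(mutual_separation_holds([(0, 0, 0), (3, 4, 0), (0, 0, 10)], delta=4.0))  # True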
Eventually land. This property states that the UAVs in the fleet eventually land on the ground simultaneously. Formally, the corresponding STL formula φel is:

φel = ⋀_{i,j ∈ [N], i ≠ j} F[2,∞] (zi = 0 ∧ zj = 0).

Figure 5.5c shows the run time of each segment for the evaluation of φel on the distributed signal. The temporal interval of φel is intentionally [2, ∞] instead of [0, ∞], since the UAVs are on the ground at the start of the distributed signal. The run-time behavior shown in this figure is the opposite of what we witnessed in Figure 5.5b. In segments 3 and 4, the UAVs are airborne, and therefore the search space of the SMT problem is exhaustively traversed. However, in segment 5, φel is satisfied and the progression becomes true.

Impact of segment duration and number of water tanks. Let P1, P2, . . . , PN denote the outflow pressures of N water tanks. For simplicity, we assume all the pipes are of the same diameter. Thus, the pressure exerted on the CLA is P1 + P2 + . . . + PN. We monitor the property stating that the outflow pressure remains above the threshold pressure of 600 psig [117] indefinitely. The corresponding STL formula φP is:

φP = G[0,∞] (Σ_{n=1}^{N} Pn ≥ 600).

Figure 5.6 Effect of segment duration and the number of water tanks on run time for φP.

Figure 5.6 shows the effect on run time of increasing the number of tanks from 2 to 4 with ε = 0.05s, over segment durations ranging from 1s to 5s. As expected, both segment duration and the number of tanks drive up the run time. We note that even when the monitor receives the distributed signals sent by the water tanks at reasonable 1s intervals, it is still able to verify the property online in around half a second for four tanks.

Table 5.1 Impact of ε: (a) water tanks, (b) UAVs.

  (a) Water tanks
  Clock Skew (s)   True Violations   Detected Violations   False Positives   False +ve Percentage
  0.05                    9                  11? — see data below
  0.05                    9                  25                  16            64%
  0.10                    4                  42                  38            90.48%
  0.15                   12                  65                  53            81.54%
  0.20                   11                  80                  69            86.25%
  0.25                    4                  86                  82            95.35%
  0.30                    7                  99                  92            92.93%
  0.35                    5                 112                 107            95.54%
  0.40                    7                 127                 120            94.49%
  0.45                   10                 145                 135            93.1%
  0.50                    7                 160                 153            95.63%

  (b) UAVs
  Clock Skew (s)   True Violations   Detected Violations   False Positives   False +ve Percentage
  0.05                    6                  11                   5            45.45%
  0.10                    6                  20                  14            70%
  0.15                    8                  30                  22            73.33%
  0.20                    4                  39                  35            89.74%
  0.25                    2                  46                  44            95.65%
  0.30                    1                  48                  47            97.92%
  0.35                    7                  62                  55            88.71%
  0.40                    2                  66                  64            96.97%
  0.45                    5                  76                  71            93.42%
  0.50                    6                  84                  78            92.86%

Impact of clock skew. In order to study the impact of ε on monitoring verdicts, we model two RWST modules with intentional 'faults', where the outflow pressure of either tank can drop below the threshold pressure of the CLA. Thus, if both tanks' pressures fall simultaneously, the CLA gets triggered. We also introduce a clock drift in the valve controller of one of the tanks. Table 5.1a shows the results for two tanks that were active for an hour. During this time, Tank 1 and Tank 2 reported low pressures for a total of 35.5s and 36.1s, respectively. Although generally we are interested in finding a single violation, in order to demonstrate the effect of clock skew, in this experiment we find multiple violation instances by tallying up the pairs of piece-wise linear interpolations between samples where violations are detected.
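The tallying just described can be sketched as follows. The helper below is hypothetical and coarser than the actual procedure (which interpolates the piece-wise linear signals): it simply counts a detected violation whenever a low-pressure interval of one tank can be made to overlap a low-pressure interval of the other by shifting one timeline by at most ε.

    # Rough sketch of tallying violation instances under a clock-skew bound eps.
    def count_detected_violations(low1, low2, eps):
        """low1, low2: lists of (start, end) low-pressure intervals in local time."""
        count = 0
        for a1, b1 in low1:
            for a2, b2 in low2:
                # the two intervals can be aligned to overlap iff they are within eps
                if a1 < b2 + eps and a2 < b1 + eps:
                    count += 1
        return count

    tank1 = [(10.0, 10.4), (25.0, 25.3)]
    tank2 = [(10.5, 10.9), (40.0, 40.2)]
    print(count_detected_violations(tank1, tank2, eps=0.2))   # 1 (0 if eps were 0)

With ε = 0 the two episodes around t = 10 do not overlap, so the extra detection is precisely the kind of skew-induced false positive reported below.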
We report the number of true violations as a baseline that was reverse-calculated from the introduced clock drift ε, the number of detected violations using our method, and the number of false positives, which is the difference between the detected violations and the true violations.¹ Note that there are no false negatives. Furthermore, as the clock drift is increased from 0.05s to 0.5s, the number of false positives increases as well. Similarly, we model a path for a pair of UAVs, where the agents periodically come within the given mutual separation threshold, thereby violating the mutual separation property. Table 5.1b shows the results for two UAVs in operation for half an hour. We again report the number of true violations, detected violations, and false positives.

¹We emphasize that due to the uncertainty caused by asynchrony, the existence of false positives is inevitable, as there is no global clock to ensure a total order of events.

5.5 Conclusion

In this chapter, we developed a mechanism for monitoring requirements defined in signal temporal logic (STL) for distributed CPS, where the continuous-time and continuous-valued signals from a group of agents do not share a global clock. Our method relies on an off-the-shelf clock synchronization algorithm, such as NTP, to ensure a maximum constrained clock skew across all agents in the system. We also presented a signal retiming approach, borrowed from Chapter 4, that effectively aligns continuous signals in order to detect potential STL specification breaches. To address the complexity, we reduce our runtime monitoring problem to an SMT-solving problem and cut the distributed signals into a series of smaller parts. To that purpose, similar to Chapter 3, we presented a formula progression approach that takes a distributed signal and an STL formula as input and outputs another STL formula that captures the formula's progress over the signals. Furthermore, in Section 5.4, we presented experimental results from the monitoring of an unmanned aerial vehicle (UAV) fleet and a water distribution system. These experiments indicate that, in certain cases, it is indeed possible to monitor STL formulas online on a distributed signal using our technique.

CHAPTER 6
DECENTRALIZED PREDICATE DETECTION OVER PARTIALLY SYNCHRONOUS CONTINUOUS-TIME SIGNALS

In this chapter, we set our sights toward decentralized monitoring of distributed CPS. The natural first step is decentralized monitoring of predicate violations over continuous-time and continuous-valued signals under partial synchrony. To this end, we propose a decentralized monitoring algorithm to detect all Boolean predicates over the analog (i.e., continuous-time and continuous-valued) signals generated by the agents in a distributed CPS. Similar to the approaches described in previous chapters, a clock synchronization algorithm (see Subsection 2.3) guarantees a maximum clock skew across all signals generated by the agents. It is helpful to overview our algorithm and key notions via an example before delving into the technical details.

An example is shown in Figure 6.1. Three agents produce three signals x1, x2, x3. The decentralized detector consists of three local detectors D1, D2, D3, one on each agent. Each xn is observed by the corresponding Dn. The predicate ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) is being detected. It is true over the intervals shown with solid black bars; their endpoints are measured on the local clocks.
The detector only knows that the maximal clock skew is ϵ = 1, but not the actual value, which might be time-varying.

Figure 6.1 An example of a continuous-time distributed signal with 3 agents. Three timelines are shown, one per agent. The signals xn are also shown, and the local time intervals over which they are non-negative are solid black. The skew ϵ is 1. The happened-before relation is illustrated with solid arrows, e.g. e^1_1 ⇝ e^2_2 and e^4_3 ⇝ e^5_2. Some satisfying cuts for the predicate ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ (x3 ≥ 0) are shown as dashed arcs, and the extremal cuts as solid arcs. All extremal cuts contain root events, and the leftmost cut A also contains non-root events.

Because of clock skew, any two local times within ϵ of each other must be considered potentially concurrent, i.e., they might be measured at a truly synchronous moment. For example, the consistent cut at local times [4, 4.5, 3.6] might have been measured at global time 4, in case the true skews were 0, 0.5, and −0.4, respectively. The detector's task is to find all consistent cuts that satisfy the predicate. In continuous time, there can be uncountably many, as in Figure 6.1; the dashed lines show two satisfying consistent cuts, or satcuts for short. In this example, the detector outputs two satcuts, [1.5, 2, 2.5] and [4, 5, 4], shown as thin solid lines. These two have the special property (shown in this chapter) that every satcut lies between them, and every cut between them is a satcut. For this reason we call them extremal satcuts, a notion that is formally defined later. Thus these two satcuts are a finite representation of the uncountable set of satcuts, and they encode all the ways in which the predicate might be satisfied. We note three further things: the extremal satcuts are not just the endpoints of the intervals; simply inflating each interval by ϵ and intersecting them does not yield the satcuts; and each local detector must somehow learn of the relevant events (and only those) on other agents, to determine whether they constitute extremal satcuts.

In the following sections, we state the necessary technical definitions, establish fundamental properties of the uncountable set of events satisfying the predicate, present a methodology for computing a finite representation of this uncountable set, provide a complexity analysis, and finally describe our implementation and experiments.

6.1 Problem Statement

Before stating the problem, we define the class of predicates we monitor using our decentralized monitoring algorithm. This chapter focuses on specifications expressible as conjunctive predicates φ, which are conjunctions of N linear inequalities:

φ := (x1 ≥ 0) ∧ (x2 ≥ 0) ∧ . . . ∧ (xN ≥ 0).    (6.1)

Figure 6.2 Two satcuts for a pair of agents A1 and A2, shown by the crossed solid lines (s, t′) and (s′, t). Their intersection is (s, t), shown by a dashed arc, and their union is (s′, t′), shown by a dotted arc. For a conjunctive predicate φ, the intersection and union are also satcuts, forming a lattice of satcuts.

These predicates model the simultaneous co-occurrence, in global time, of events of interest, like 'all drones are dangerously close to each other'. Equation 6.1 also captures the cases where some conjuncts are of the form xn ≤ 0 or xn = 0. If N numbers (an) satisfy predicate φ (i.e., are all non-negative), we write this as (a1, . . . , aN) |= φ. Henceforth, we say 'predicate' to mean a conjunctive predicate in this chapter. Definition 15.
[Distributed Satisfaction; SE] Given a predicate φ, a distributed signal (E, ⇝) over N agents, and a consistent cut C of E with frontier front(C) = ((t1, x1(t1)), . . . , (tN, xN(tN))), we say that C satisfies φ iff (x1(t1), x2(t2), . . . , xN(tN)) |= φ. We write this as C |= φ, and say that C is a satcut. The set of all satcuts in E is written SE. ■

6.1.1 Decentralized Predicate Detection

As stated before, our algorithm seeks to find all possible global states that satisfy a given predicate, i.e., all satcuts in SE. In general, SE is uncountable.

Architecture. The system consists of N agents with partially synchronous clocks whose drift is bounded by a known ϵ, generating a continuous-time distributed signal (E, ⇝). Agents communicate in a FIFO manner.

Problem Statement: Given (E, ⇝) and a conjunctive predicate φ, find a decentralized detection algorithm that computes a finite representation of SE.

The detector is decentralized, meaning that it consists of N local detectors, one on each agent, with access only to the local signal xn (measured against the local clock) and to messages received from other agents' detectors. By computing a representation of all of SE (and not some subset), we account for asynchrony and the unknown orderings of events within ϵ of each other. One might be tempted to propose something like the following algorithm: detect all roots on all agents, then see if any N of them are within ϵ of each other. This quickly runs into difficulties: first, a satisfying cut is not necessarily made up of roots; some or all of its events can be interior to the intervals where the xn's are positive (see Figure 6.2). Second, the relation between roots and satcuts must be established: it is not clear, for example, whether even satcuts made only of roots are enough to characterize all satcuts (it turns out they are not). Third, we must carefully control how much information is shared between agents, to avoid the detector degenerating into a centralized solution where everyone shares everything with everyone else.

6.2 The Structure of Satisfying Cuts

We establish fundamental properties of satcuts. In the rest of this chapter we exclude the trivial case C = E. Proposition 1 mirrors a discrete-time result [26].

Proposition 1. The set of satcuts for a conjunctive predicate is a lattice where the join and meet are the union and intersection operations, respectively.

Proof. Define the intersection I = C ∩ C′ and let e be an element of I. Then, by the definition of a cut, every event that happened-before e is in C and in C′, and therefore is in their intersection, so I is a cut. The frontier of I is made of events (tn, xn(tn)) such that tn = max{t | (t, xn(t)) ∈ C ∩ C′}. In words, (tn, xn(tn)) is the last event on signal xn belonging to both satcuts, which implies it is the last event on at least one of the cuts, say C. Therefore (tn, xn(tn)) is on the frontier of C, and so xn(tn) ≥ 0 by the definition of a conjunctive predicate. Since this is true for every n in [N], the frontier of I is a consistent state that satisfies the predicate, and so I |= φ. The union C ∪ C′ is also a satcut by similar arguments, so the set of satcuts is a lattice. ■

We show that the set of satcuts is characterized by special elements, which we call the leftmost and rightmost cuts.

Definition 16. [Extremal cuts] Let SE be the set of all satcuts in a given distributed signal (E, ⇝).
For an arbitrary C ∈ SE with frontier (e^{tn}_n)_n and a positive real α, define C − α to be the set of cuts whose frontiers are given by (e^{t1−δ1}_1, e^{t2−δ2}_2, . . . , e^{tN−δN}_N) such that for all n: 0 ≤ δn ≤ α, and ∃n. δn > 0. A leftmost satcut is a satcut C ∈ SE for which there exists a positive real α such that C − α and SE do not intersect. A rightmost cut C (not necessarily satisfying) is one for which there exists a positive real α such that C + α and SE do not intersect, and C − α ⊂ SE. We refer to leftmost and rightmost (sat)cuts as extremal cuts. ■

Intuitively, C − α is the set of all cuts one obtains by slightly moving the frontier of C to the left by amounts less than α. If doing so always yields non-satisfying cuts, then C is a leftmost satcut. Analogous intuition applies to rightmost cuts. If the signals xn are all continuous, then rightmost cuts are all satisfying as well. In a signal, there are multiple extremal cuts. Figure 6.2 suggests, and Lemma 10 proves, that all satcuts live between a leftmost satcut and a rightmost cut.

Lemma 10. [Satcut intervals] Every satcut of a conjunctive predicate lies in-between a leftmost satcut and a rightmost cut, and there are no non-satisfying cuts between a leftmost satcut and the first rightmost cut that is greater than it.

Proof. Let C be a satcut, so that xn(tn) ≥ 0 for every (tn, xn(tn)) in its frontier. Let sn be the biggest shift backwards in time preserving positivity:

sn := sup{s | s ≥ 0 and ∀ 0 ≤ σ ≤ s. xn(tn − σ) ≥ 0}.    (6.2)

By the starvation-freedom assumption derived from Assumption 2.2, sn is finite, and by the right-continuity of xn, xn(tn − sn) ≥ 0. Now the cut with frontier (e^{tn−sn}_n)_n satisfies the predicate, but might not be consistent because it could be that |tn − sn − (tm − sm)| > ϵ for some n, m. Suppose without loss of generality that t1 − s1 is the largest of all the tn − sn. Define bn = max(tn − sn, t1 − s1 − ϵ) for all n > 1. Note that bn ≤ tn because C is consistent (t1 − tn ≤ ϵ ≤ ϵ + s1, and so bn = t1 − s1 − ϵ ≤ tn, whereas the other case, bn = tn − sn ≤ tn, is immediate), and xn(bn) ≥ 0. Then the cut L with frontier (e^{t1−s1}_1, e^{b2}_2, . . . , e^{bN}_N) is consistent and satisfies the predicate. It is also leftmost by the construction of s1. Therefore L is a leftmost satcut.

The reasoning for rightmost cuts follows the above lines, except for predicate satisfaction. Namely, let sn now be the biggest shift forwards in time preserving positivity:

sn := sup{s | s ≥ 0 and ∀ 0 ≤ σ ≤ s. xn(tn + σ) ≥ 0}.    (6.3)

By the starvation-freedom assumption derived from Assumption 2.2, sn is finite. Now the cut with frontier (e^{tn+sn}_n)_n might not be consistent because it could be that |tn + sn − (tm + sm)| > ϵ for some n, m. Suppose without loss of generality that t1 + s1 is the smallest of all the tn + sn. Define bn = min(tn + sn, t1 + s1 + ϵ) for all n > 1. Note that bn ≥ tn because C is consistent. Then the cut R with frontier (e^{t1+s1}_1, e^{b2}_2, . . . , e^{bN}_N) is consistent, but does not necessarily satisfy the predicate because of possible discontinuities (namely, if tn + sn is a point of discontinuity for xn, then possibly xn(tn + sn) < 0). R is also rightmost by the construction of s1 and the bn. Therefore R is a rightmost cut. Thus every satcut is between a leftmost satcut and a rightmost cut. Also, by the construction of L and R (specifically, Equations 6.2 and 6.3), there is no cut in-between that does not satisfy the predicate. That is, there is no C such that L ⊑ C ⊑ R and C ̸|= φ. (Here ⊑ is the ordering relation on the lattice of cuts.) ■
Thus we may visualize satcuts as forming N-dimensional intervals with endpoints given by the extremal cuts. The main result of this section states that there are finitely many extremal satcuts in any bounded time interval, so the extremal satcuts are the finite representation we seek for SE.

Theorem 6. A distributed signal has finitely many extremal satcuts in any bounded time interval.

In order to prove the above theorem, we will need the following definitions: the leftmost event of a cut C is an event e^t_n ∈ front(C) where t ≤ t′ for all other events e^{t′}_m ∈ front(C). With β a real number, an event e^{t′}_m is said to be β-offset from e^t_n if and only if t′ = t + β. We will need the following three lemmas.

Lemma 11. The leftmost event of a rightmost cut is a right root.

Proof. Consider the leftmost event e^t_n of a rightmost cut C. Because C is rightmost, xn(t − δ) ≥ 0 for all sufficiently small positive δ. Assume for a contradiction that e^t_n is not a right root, so xn(t + α) ≥ 0 for all sufficiently small α ≥ 0, say all α strictly less than some ᾱ. Since e^t_n is leftmost, we can add the events e^{t+α}_n, 0 ≤ α ≤ min{ϵ/2, ᾱ/2, γ}, to C to form a new cut C′. Choosing γ small enough guarantees that C′ is consistent. C′ is also satisfying because we only added events such that xn(t + α) ≥ 0. This shows C is not a rightmost cut, which contradicts our choice of C. ■

Lemma 12. All events of the frontier of a rightmost cut are either right roots or ϵ-offset from a right root.

Proof. Let e^t_n be the leftmost event in the frontier of a rightmost cut C. By Lemma 11 this event is a right root. Now consider any other event e^{t′}_m ∈ front(C) which is not a right root, and assume for contradiction that t′ ≠ t + ϵ. Then xm(t′) ≥ 0 and (as in the proof of Lemma 11) xm(t′ + α) ≥ 0 for all sufficiently small α. If t′ < t + ϵ, then it is possible to add the events {e^{t′+α}_m | α ∈ [0, γ)} to C, with γ small enough, to obtain a satcut to its immediate right, which contradicts C being rightmost. On the other hand, if t′ > t + ϵ, this contradicts that e^t_n and e^{t′}_m are part of the same frontier. Thus t′ = t + ϵ. ■

The next lemma (and its proof) parallels Lemmas 11 and 12, but for leftmost satcuts.

Lemma 13. The rightmost event of a leftmost satcut is a left root. Moreover, every event of the frontier of a leftmost satcut is either a left root or is (−ϵ)-offset from a left root.

Thus every extremal satcut has a left root or a right root as one of its constituent events. Since there are only finitely many roots in any bounded interval, this gives us the desired conclusion. Therefore, it is conceivably possible to algorithmically recover the extremal satcuts, and therefore all satcuts by Lemma 10. The rest of this chapter shows how.

6.3 The Abstractor Process

Having captured the structure of satcuts, we now define the distributed abstractor process that turns our continuous-time problem into a discrete-time one, amenable to further processing by our modified version of the slicer algorithm of [26]. This abstractor also has the task of creating a happened-before relation. We first note a few complicating factors. First, this will not simply be a matter of sampling the roots of each signal. That is because extremal satcuts can contain non-root events, as shown in Figure 6.1. Thus the abstractor must somehow find and sample these non-root events as part of its operation.
Second, as in the discrete case, we need a kind of clock that allows the local slicer to know the happened-before relation between events; the local timestamp of an event, and existing clock notions, are not adequate for this. Third, to establish the happened-before relation, there is a need to exchange event information between the processes without degenerating everything into a centralized process (by sharing everything with everyone). This complicates the operation of the local abstractors, but allows us to cut the number of messages in half.

6.3.1 Abstractor Description

The abstractor is described in Algorithm 6.1. Its output is a stream of discrete-time events, their correct PVC values, and the relation ⇝ between them, i.e., a discrete-time distributed signal. This signal is processed by the local slicers as it is being produced by the abstractor.

Figure 6.3 A distributed signal of two agents (top) and the output of the abstractor (bottom). The abstractor marks zero-crossings as discrete root events and creates new events (dark circles) to maintain consistency.

  Data: Signal of agent An
  Result: A stream of discrete events which are roots or ϵ-offset from roots
  trigger (found a root e^t_n at local time t):
    add the info of e^t_n (n, t, PVC, left or right root) to the local buffer
    if e^t_n is a right root: for each agent m ≠ n, send the info of e^t_n to agent m
  trigger (received a message about a right root e^t_m from agent Am):
    set t′ := t + ϵ, where ϵ is the maximum clock skew
    create a local event e^{t′}_n (setting the PVC for e^{t′}_n appropriately)
    create the relation e^t_m ⇝ e^{t′}_n
    add the info of e^{t′}_n (n, t′, PVC, left or right root) to the local buffer
    /* Visit events in the buffer, forwarding ones that are ready to the slicer.
       Ready events are those whose PVCs will not be updated anymore; see text for details. */
    for each event e^s_n in the local buffer:
      if An has received at least one message about a right root e^{tk}_k from every other agent Ak such that tk ≥ s:
        set v^s_n[n] = s and v^s_n[k] = s − ϵ for all k ≠ n
        remove e^s_n from the buffer and send it to the local slicer
Algorithm 6.1 Local abstractor for agent An.

The abstractor runs as follows. It is decentralized, meaning that there is a local abstractor running on each agent. Agent An's local abstractor maintains a buffer of discrete events and consists of two trigger processes. The first is triggered when a root is detected (by a local zero-finding algorithm). It stores the root's information in a local buffer (for future processing). If it is a right root, it also sends it to the other agents. The second trigger process fires when the agent receives right-root information from some other process, at which point it does three things: it creates a local discrete event and a corresponding relation ⇝ between events, it updates events in its local buffer to see which ones can be sent to the local slicer process (described later), and then it sends them. It is clear, by construction, that ⇝ is a happened-before relation: it is the subset of ⇝ needed for detection purposes. Before an event e^t_n is sent to the slicer, it must have a PVC that correctly reflects the happened-before relation. This means that all events that happened-before e^t_n must be known to agent An, which uses them to update the PVC timestamps. This happens when events have reached agent An from every other agent, with timestamps that place them after e^t_n. This is guaranteed to happen by the starvation-free assumption.
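The two triggers of Algorithm 6.1 can be sketched as a small Python class. The data structures and the send callback below are hypothetical and simplified (in particular, PVC maintenance and the readiness scan of the buffer are only indicated by a comment); the sketch only shows the buffering of local roots, the broadcast of right roots, and the creation of the ϵ-offset event when a right-root message arrives.

    # Simplified sketch of a local abstractor (hypothetical structures, no real I/O).
    class LocalAbstractor:
        def __init__(self, agent_id, num_agents, eps, send):
            self.n, self.N, self.eps = agent_id, num_agents, eps
            self.send = send      # send(m, payload): deliver payload to agent m
            self.buffer = []      # buffered discrete events awaiting their final PVC
            self.hb = []          # happened-before edges created locally

        def on_root(self, t, kind):                # trigger 1: root found at local time t
            self.buffer.append((self.n, t, kind))
            if kind == 'right':
                for m in range(self.N):
                    if m != self.n:
                        self.send(m, (self.n, t))

        def on_right_root_msg(self, sender, t):    # trigger 2: remote right root received
            t_new = t + self.eps                   # eps-offset event on this agent
            self.buffer.append((self.n, t_new, 'offset'))
            self.hb.append(((sender, t), (self.n, t_new)))
            # a real abstractor would now scan the buffer and forward events whose
            # PVCs can no longer change to the local slicer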
The output of a local abstractor is a stream of discrete events, so that the output of the decentralized abstractor as a whole is a distributed discrete-time signal. See Figure 6.3. Given that all right roots are assigned discrete events by the first trigger, and given that ϵ-offset events are also created from them by the second trigger, we obtain the following.

Theorem 7. All events in rightmost cuts are returned by the abstractor. Moreover, a rightmost cut of E is also a cut of the discrete signal returned by the abstractor.

Thus the slicer process can find the rightmost satcuts when it processes the discrete signal. The leftmost satcuts will be handled by the slicer using the PVCs, as will be shown in the next section. Doing it this way relieves the abstractor from having to communicate the left roots between processes, thus saving on messages and the corresponding wait times.

6.4 The Slicer Process for Detecting Predicates

The second process in our detector is a decentralized slicer process, so-called to keep with the common terminology in discrete distributed systems [53]. The slicer is decentralized: it consists of N local slicers S_n, one per agent. The slicer runs in parallel with the abstractor and processes the abstractor's output as it is produced. Recall that the abstractor's output consists of a stream of discrete events coming from the N agents. These events are either roots or ϵ-offset from roots. If an event is a left root or ϵ-offset from a left root, we will call it a left event; we define right events similarly. We will write F_n for those events, output by the abstractor, that occurred on A_n.

Every slicer S_n maintains a token T_n, which is a constant-size data structure used to keep track of satcuts that contain A_n events. Specifically, for every event e_n^t in F_n, the token T_n is forwarded between the agents, collecting information to determine whether there exists a satcut that contains e_n^t. We say the slicer is trying to complete e_n^t. The token's updates are such that it will find that satcut if it exists, or determine that none exists; either way, it is then reset and sent back to its parent process A_n to handle the next event in F_n.

Let e_n^t be an event that the slicer is currently trying to complete. The token's updates vary depending on whether it is currently completing a left event or a right event. If T_n is completing a right event, the token is updated as follows. The token currently has a cut whose frontier contains e_n^t, which is either a satcut or not. If it is, the token has successfully completed the event and is returned to A_n to handle the next event in F_n. If not, then by the property of regular predicates [26], there exists a forbidden event e_m^s on the frontier of the cut which either prevents the cut from being consistent or from satisfying the predicate. T_n is sent to the process A_m containing this forbidden event. T_n's so-called target event, whose inclusion may give T_n a satcut, is the event on A_m following the forbidden e_m^s. If the token does not find a next event following e_m^s, then the token is kept by S_m until it receives the next event from the abstractor (which is guaranteed to happen under the starvation-free assumption). After the token retrieves the next event, the updates to the token and the progression of S_n follow the CGNM slicer [26]. Space limitations make it impossible to describe the CGNM slicer here, and we refer the reader to the detailed description in [26].

If handling a left event, the token is updated as follows.
First, as before, T_n is sent to the process A_m which generates the forbidden e_m^s — i.e., which prevents T_n from completing e_n^t. T_n's target event may not be the next event on that process following e_m^s: this is because, if e_n^t is a left root, there may exist a left event e_m^{t−ϵ} on A_m which is part of a continuous-time leftmost satcut (by Definition 7), but which was not created by the abstractor. In this case, if the token were to follow the updates for a right event, it would skip a potential satcut. Instead, the slicer S_m will create this event: namely, if S_m sees a new event e_m^{s′} where s′ > t − ϵ, it knows that e_m^{t−ϵ} has not and will not show up (will not be produced by the abstractor), because messages are FIFO. The slicer at this point creates the new event e_m^{t−ϵ}. This is valid since, in continuous time, by definition every moment has a corresponding event on every agent. Once the token retrieves this created e_m^{t−ϵ} as its new target, the updates to the token and the progression of S_n follow the CGNM slicer [26], similarly to the right-event scenario.

Correctness of S. We will show that all extremal cuts of the continuous-time signal are included in the discrete lattice. Since the CGNM slicer computes the discrete lattice, this means in particular that it computes the extremal cuts that are in it. From these extremal cuts, we can then recover the continuous-time satcuts by Lemma 10.

Lemma 14. For all events e_n^t that are left roots, the token T_n incorporates all e_m^{t−ϵ} for all m ≠ n.

Proof. For a left root e_n^t, by Theorem 2 its PVC is v_n^t = [t − ϵ, . . . , t − ϵ, t, t − ϵ, . . . , t − ϵ]. Since token T_n is tasked with identifying consistent cuts, for each m ≠ n it must incorporate the earliest event on A_m which can form a consistent cut with e_n^t. The PVC identifies this event as e_m^{t−ϵ}. Therefore, T_n incorporates all e_m^{t−ϵ} events where e_n^t is a left root on A_n. ■

Lemma 15. The modified slicer processes all events of a leftmost satcut.

Proof. By Lemma 13, all events of a leftmost satcut are either at time t or t − ϵ, where t is the time of a left root. Since by Lemma 14 every token T_n will visit the t − ϵ event for any left root at t on A_n, every t − ϵ event will be processed for any left root. Thus, all events of a leftmost satcut will be processed. ■

Theorem 8. Our slicer returns all extremal cuts.

Proof. The abstractor creates discrete events for all roots, as well as ϵ-offsets from right roots. By Lemma 15, the slicer creates all events of a leftmost satcut. This means that all events of leftmost and rightmost satcuts are processed by the slicer. Therefore, since the modified slicer returns a lattice of satcuts, the extremal satcuts are included. ■
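To make the PVC characterization used in Lemma 14 concrete, here is a minimal illustration (not the authors' code; the concrete numbers are assumptions for the example) of the vector timestamp of a root event: the event's own coordinate carries its local time t, and every other coordinate carries t − ϵ.

    # Minimal illustration of the PVC timestamps appearing in Lemmas 14-15.
    # EPS stands for the maximum clock skew; the values are made up for the example.

    EPS = 0.05

    def pvc_of_root(agent: int, t: float, num_agents: int) -> list[float]:
        """PVC of a root event on `agent` at local time t: [t-eps, ..., t, ..., t-eps]."""
        return [t if k == agent else t - EPS for k in range(num_agents)]

    # A left root on A_2 at local time 3.5 in a 2-agent system (cf. the worked example):
    print(pvc_of_root(agent=1, t=3.5, num_agents=2))   # [3.45, 3.5]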
We give the space and time complexity of the overall detector. Since this is an online detector which runs forever (as long as the system is alive), we must fix a time interval for the analysis.

Theorem 9. The time complexity for each agent is O(2RN), where R is the number of right roots in the given analysis interval. The detector consumes O(N^3) memory to store the tokens. If roots are uniformly distributed, then the local buffers of the abstractor and slicer grow at most to size O(N^2).

Proof. We distinguish the following cases.

Time complexity. The calculations in our algorithm come from the abstractor and from the modification to the CGNM slicer. Finding a root of a signal x_n takes constant time in the system parameters. The abstractor has every process send right-root information to every other process, for a complexity of N − 1 per right root and a total complexity of (N − 1)R, where R is the number of right roots in the system in a given bounded window of time.

Consider slicer S_n, which is hosting token T_m. The slicer creates a new event for every target event of T_m that was not produced by the abstractor of A_m. Event creation is O(N), since it requires the creation of a size-N PVC assigned to the event. Event storage takes constant time if the new event is simply appended at the end of the local buffer, or O(k) if the event is inserted in order into the sorted local buffer of size k. Either one works: the first is cheaper, but an unsorted buffer costs more to search; the latter is more expensive up front, but the sorted buffer can be searched faster. Either way, the slicer modification costs a total of O(N · M) in a given bounded window of time with M missed events in the system. Now, the number of target events requiring creation is on the order of the number of right roots, since they result from left roots, and there are equal numbers of left and right roots. Thus M = O(R). Therefore, the total complexity for our algorithm in a given bounded window of time is O(R(N − 1 + N)). This is then added to the complexity of running the modified slicer, which is O(N^2 D), where D is the number of events in the discrete-time signal; at most, there are 2R events. So finally the total time complexity is O(R(N − 1 + N) + 2N^2 R), or O(R(2N + 2N^2)/N) = O(2RN) per agent.

Space complexity. A PVC timestamp has size O(N) (since it is an N-dimensional vector); this is in fact the optimal complexity for characterizing causality [23]. One token stores N PVCs at all times, and token updates replace old PVC values by new ones. Therefore one token has size O(N^2), and all N tokens (one per agent) require O(N^3) space. How long events stay in the abstractor's local buffers depends on message transmission times, since events are removed from the buffers after the appropriate messages are received (see Algorithm 6.1). It also depends on the distribution of events within the interval of analysis, not just their rate 1/R. E.g., if roots are uniformly distributed in the analysis interval, then the nth abstractor's local buffer grows at most to size O(N^2), as it receives roots from the other N − 1 agents and stores the O(N) PVC timestamp for each root; event removal then starts as A_n receives target events. Similar considerations apply to the slicer's local buffers. In such a case the detector's total space complexity is O(N^3 + 2N^2). ■

Finally, there is no bound on detection delay, since we do not assume any bounds on message transmission time. Assuming some bound on transmission delay easily yields a corresponding bound on detection delay.
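Throughout the worked example that follows, the token repeatedly asks two questions about a candidate frontier: is it consistent, and does it satisfy the predicate? The sketch below is an illustration only (not the detector's code); it assumes the pairwise reading of consistency for partially synchronous cuts — two frontier events are treated as concurrent when their local times differ by less than the clock skew ϵ — and a conjunctive predicate of the form "every x_n ≥ 0", as in the example.

    # Illustrative check of the two questions the token answers for a candidate
    # frontier: consistency and predicate satisfaction. Names and numbers are
    # assumptions for the example.

    from itertools import combinations

    def is_consistent(frontier: dict[int, float], eps: float) -> bool:
        """frontier maps agent id -> local time of its frontier event."""
        return all(abs(t1 - t2) < eps
                   for (_, t1), (_, t2) in combinations(frontier.items(), 2))

    def is_satcut(frontier: dict[int, float], signals: dict[int, callable],
                  eps: float) -> bool:
        """Satisfying cut for the conjunctive predicate 'every x_n >= 0'."""
        return (is_consistent(frontier, eps) and
                all(signals[n](t) >= 0.0 for n, t in frontier.items()))

    # Tiny usage example with made-up signals x1, x2 and a symbolic eps of 0.3:
    signals = {1: lambda t: 6.0 - t, 2: lambda t: t - 3.3}
    print(is_satcut({1: 5.9, 2: 5.8}, signals, eps=0.3))   # True
    print(is_satcut({1: 5.9, 2: 3.0}, signals, eps=0.3))   # False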
6.4.1 Worked-out example

We now work through an example execution of the detector on Figure 6.4. We focus on agent A2, its abstractor A_2, its slicer S_2, and its token T2.

Figure 6.4 Example of subsection 6.4.1. Bold intervals are where the local signals are non-negative. The happened-before relation is illustrated with solid arrows. The predicate is ϕ = (x1 ≥ 0) ∧ (x2 ≥ 0). Solid circles represent discrete events returned by the abstractor; hollow circles are those created by the slicers. The leftmost satcut of this example is [3.5 − ϵ, 3.5] and the rightmost is [6, 5.8].

1. Agent A2 encounters a left root in the signal at local time 3.5. This information is forwarded to the abstractor.
2. The abstractor A_2 adds the new root to its buffer with a PVC of [3.5 − ϵ, 3.5].
3. A2 finds a right root in the signal at local time 5.8 and forwards it to A_2.
4. The abstractor sends the root information to agent A1. It then adds this root to its buffer with a PVC timestamp of [5.8 − ϵ, 5.8].
5. Abstractor A_2 receives a message from A1 about a right root at A1's local time 6. Note that this is the first knowledge A2 has about anything that is occurring on A1, even though A1 has already found a left root.
6. A_2 uses A1's message to create a new local event at 6 + ϵ with PVC [6, 6 + ϵ].
7. A_2 also adds this new local event to its buffer. Since all messages are FIFO, A2 knows that there will be no new messages which will create events before 6 + ϵ. Thus, it can remove both of the events 3.5 and 5.8 from the buffer and forward them to its local slicer S_2. At this point both of A1's events have been forwarded to its slicer, although A2 has no knowledge of this.
8. The slicer S_2 receives an event with a PVC of [3.5 − ϵ, 3.5]. Token T2 is waiting for the next event, so it adds this event to its potential cut.
9. The token is processed with its new potential cut. The cut is found to be inconsistent, since T2 has no information about any A1 events.
10. The token's target is set to 3.5 − ϵ on A1 and the token is sent to A1.
11. A1 receives T2. It walks through its local events 2 and 6 and determines that T2's target event is between the two.
12. S_1 creates a new event e_1^{3.5−ϵ} and notes that x_1(3.5 − ϵ) ≥ 0.
13. Token T2 incorporates the new event into its potential cut. The new potential cut is consistent and satisfies the predicate. It is then sent back to A2.
14. A2 receives T2. T2 indicates a satisfying cut, which the agent outputs as a result. It then advances T2 to its next event at time 5.8.
15. T2 now has the current cut [3.5 − ϵ, 5.8]. This is not consistent, so it is given the target 5.8 − ϵ on A1. It is then sent to A1.
16. A1 receives the token. S_1 walks through its local events and finds that the token's target is between the left root and the right root.
17. S_1 creates a new event at 5.8 − ϵ and notes that x_1(5.8 − ϵ) ≥ 0.
18. The token adds the event to its potential cut. It finds that its new potential cut is consistent and satisfies the predicate. It is then sent back to A2.
19. A2 receives T2 and outputs the satcut. The algorithm then continues with new events as they occur.

Through this example, agent A2 discovered the satcuts [3.5 − ϵ, 3.5] and [5.8 − ϵ, 5.8]. The first is the leftmost satcut of the interval of satcuts. A1 discovered an additional satcut [6, 6 − ϵ]. Joining this satcut with A2's second satcut returns the result [6, 5.8], which is the rightmost satcut of the interval of satcuts.

Figure 6.5 Runtime vs root rate and N on synthetic data. (a) Runtime vs root rate on 4 synthetic signals (N = 4). (b) Online monitoring: the red horizontal plane indicates the runtime threshold (namely, 5 s) below which it is possible to do online detection.

6.5 Case Studies and Evaluation

We implemented our detection algorithm and ran experiments to (1) illustrate its operation, and (2) observe runtime scaling with the number of agents and with the average rate of events.
The detector was implemented in Julia for ease of prototyping; future versions will be in C for speed. Each experiment is replicated so that we can report 95% confidence intervals. Experiments were run on a single thread of an Ubuntu machine powered by an AMD Ryzen 7 5800X CPU @ 3.80GHz.

We consider two sources of data. The first is a set of N synthetically generated signals, N = 1, ..., 6. Each signal has a 5 s duration and is generated randomly while ensuring an average root rate of µ_n; that is, on average, µ_n roots exist in every second of signal x_n. For the second source of data, we use the Fly-by-Logic toolbox [100] to control up to 6 simulated UAVs (i.e., drones) performing various reach-avoid missions. Their 3-dimensional trajectories are recorded over 6 seconds. We monitor the predicate "All UAVs are at a height of at least 10 m simultaneously". The maximum clock skew ϵ is set to 0.05 s.

Effect of root rate (µ_n) on run time. We use 4 synthetic signals of 5 s duration and measure the detection runtime as the root rate for all signals is varied between 10 roots/s and 50 roots/s. Figure 6.5a shows the results. Naturally, as µ_n increases, so does the run time, due to having to process more tokens.

Online detection. We want to identify when it is possible for us to perform online detection with the Julia implementation, i.e., such that the detector finishes before the end of the signal being processed. To this end, we use the synthetic signals of duration 5 s and vary both the root rate and the number of agents. Figure 6.5b shows the results: all combinations of root rates and numbers of agents with runtimes under the threshold of 5 s can be performed online.

Effect of number of agents on run time. Figure 6.6 shows the effect of the number of agents N on runtime. As expected, the runtime increases with N.

Figure 6.6 Runtime vs number of agents. (a) Detection of synthetic signals at 50 roots/s. (b) Detection of UAV signals.

CHAPTER 7
RESOURCE OPTIMIZATION OF STREAM PROCESSING IN LAYERED SENSOR NETWORKS

In this chapter, we set our sights on monitoring reliability by optimizing resource consumption in a generalized class of CPS. In Chapters 3, 4, 5, and 6 we proposed different monitoring techniques for distributed systems with respect to different specifications, under both centralized and decentralized monitoring settings. However, solely monitoring a formal specification on a distributed CPS is not enough to guarantee its functionality. For example, in a decentralized monitoring setup, if one or more monitors start reporting erroneous results, then it is possible to reach false positive and/or false negative verdicts on the distributed CPS against some specification. Therefore, it is imperative that we ensure the reliability of all monitors, and by extension, the reliability of the distributed CPS — that is, the network of monitors or agents — as a whole.

However, determining the reliability of a distributed CPS depends on an array of factors, including the type of agents in the network. For example, the method for computing the reliability of a network of UAVs will vastly differ from the method for computing the reliability of a network of medical equipment. To this end, we present a generalized model of a class of CPS, where each monitor is represented by an Internet of Things (IoT) device or an agent in a layered network of producers and consumers.
We elaborate our technique for monitoring reliability of layered stream processing networks, while optimizing for minimal resource consumption by its nodes. 7.1 Producer-Consumer Network with Resource Constraints Before talking about our problem statement, we present our model that is used to capture a layered network of nodes tasked with stream processing jobs subject to resource constraints, flows, and target reliability. 7.1.1 Resource Bounds We first present the notions of reusable and consumable resources in our model: 127 • Reusable resources are not depleted when an item is processed. Examples of reusable resources are CPU, power, memory, network bandwidth, and quality. These resources are instantly reclaimed once an item is processed. We denote the finite set of reusable resources in the system as follows: R = { R1, R2, . . . , Rn , } 1. for some n ≥ • Consumable resources are depleted once an item is processed. Examples of consumable resources are energy, time, and reliability. For instance, once error is encountered during the processing of an item along its path in the network, it cannot be reclaimed. We denote the finite set of consumable (depletable) resources as follows: D = { D1, D2, . . . , Dm } for some m 1. ≥ Our model supports bounding resources on both nodes and edges. Let G = (V, E) be a producer-consumer network. A bound on a resource res R D for a subset of nodes V (respectively, a subset of edges E V ⊆ also set bres V = lb, ub ⟩ ⟨ (respectively, bres E = ∪ E) is denoted by bres V ∈ ⊆ ) as a pair that implies the sum of resource lb, ub ⟩ ⟨ (respectively, bres E ). We res R ∪ ∈ D unit (e.g., power) consumed by all nodes (respectively, edges) in V (respectively, E) must reside within the lower bound lb and the upper bound ub. Finally, we denote the set of all resource bounds for all resources in R D and for any subsets of nodes and edged ∪ by B. For instance, if a node v has 8 cores, using the conventional notation of multi-core systems, the maximum CPU usage is 800%. In this case, a bound bCPU {v} = is applied. Another 0, 800 ⟩ ⟨ example is applying power bounds to a cluster of nodes. This implies that the sum of power consumed by all nodes in the cluster should not exceed a specific value. A bound bPWR {v1,v2,v3} = exceed 500 watts. 0, 500 ⟩ ⟨ denotes the total power consumption of nodes v1, v2 and v3 should not 128 Let µres v (respectively, µres e ) be the amount of resource res R D unit consumed by node v ∈ V (respectively, edge e ∈ E). Formally, a bound bres ∈ V = ∪ lb, ub ⟩ ⟨ on vertices V V ⊆ and a resource res enforces the following: lb (cid:88) ≤ v∈V µres v ≤ ub. Likewise, a bound bres E = lb, ub ⟩ ⟨ on edges E ⊆ E and a resource res enforces the following: lb ≤ (cid:88) e∈E µres e ≤ ub. 7.1.2 Configurations There are various configuration parameters that impact the resource usage of a node and the reliability of its output. • Sampling rate. Some systems depend on sampling from continuous-time and continuous- valued signals and the amount of resources consumed by a node is proportional to the sampling rate [18]. Lower sampling rate is usually associated with reduced reliability or confidence. Hence, sampling rate is a configuration parameter that controls the tradeoff between resource usage and reliability. • Outgoing data rate. The outgoing data rate of a node impacts the resource usage of subsequent nodes [18, 113]. If subsequent nodes decide to sample this data, then reliability is negatively impacted. • Precision. 
Some algorithms support controllable precision. For instance, image processing may be accomplished with high or low precision [86]. The work in [84] demonstrates how configurable precision impacts accuracy and resource usage.

• Algorithm alternatives. In some systems, there are different algorithms that can be used to process the data, with varying degrees of resource usage and reliability [70, 76]. For instance, data loss prevention (DLP) systems employ different classifiers for malicious activity that are designed to have different processing costs [93].

To simplify our model, we abstract all the above parameters into a single quality symbol. This symbol encompasses sampling, buffering, precision, and algorithmic alternatives. Given a producer-consumer network G = (V, E), let us associate each node v ∈ V with a finite set of quality levels:

    Q_v = { Qual_1(v), Qual_2(v), . . . , Qual_k(v) }

where the number of levels k can be different for each node v. A node v can use each quality level Qual_i(v), where 1 ≤ i ≤ k, to process items that are being received at some input data rate in IRate(v) and being produced at some outgoing data rate in ORate(v). Part of our stream optimization (see Section 7.2) is to find the best quality for the possible input/output data rates. To this end, for each node v ∈ V, let

    ϑ_v : IRate(v) × Q_v → ORate(v)

be a function that maps an incoming data rate and a quality level to an outgoing data rate. That is, we have ORate_v = ϑ(IRate(v), Qual_i(v)), where Qual_i(v) is the ith quality level of node v.

7.1.3 Reliability

Quantifying reliability is generally a challenging task. The reliability of each node depends not only on its quality level, but on other environmental factors as well. For example, the reliability of a node that captures video streams may vary based on the time of day and the surrounding lighting conditions. Another example would be the case where a node becomes less reliable once it nears the end of its average life cycle. Let us assume each node v ∈ V is influenced by m_v environmental factors. We denote by U_v^j ∈ [0, 1], where 1 ≤ j ≤ m_v, the jth environmental factor of node v. An environmental factor of 1 indicates the best possible reliability when the other factors (as well as the quality) remain unchanged, whereas an environmental factor of 0 indicates the worst. All intermediary values are determined by the node's architecture. In a similar manner, we denote by U_v^Qual ∈ [0, 1] the quality factor of node v. A quality factor of 1 maps to the highest quality level Qual_max(v) supported by the implementation of the node's code. This could be a configuration where a computationally intensive algorithm is used, input data is not sampled or buffered, and numerical precision is set to the maximum supported precision. On the other hand, a quality factor of 0 maps to the lowest quality level Qual_min(v) supported by the node; this should be a configuration below which the system becomes unusable. Quality factors for the remaining quality levels in Q_v − {Qual_max(v), Qual_min(v)} are determined by the system design. Now that we have defined the quality factor U_v^Qual and the environmental factors U_v^1, U_v^2, . . . , U_v^{m_v} for a node v, we are ready to define its reliability α_v ∈ [0, 1] as follows:

    α_v = (U_v^Qual + W_v^1 · U_v^1 + W_v^2 · U_v^2 + . . . + W_v^{m_v} · U_v^{m_v}) / (1 + W_v^1 + W_v^2 + . . . + W_v^{m_v})

where W_v = {W_v^1, W_v^2, . . . , W_v^{m_v}} are the respective weights of the environmental factors U_v = {U_v^1, U_v^2, . . . , U_v^{m_v}}.
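As a concrete illustration of the definition above, the following minimal sketch computes α_v as the weighted combination of the quality factor and the environmental factors; the numeric values are made-up examples, not values from the dissertation.

    # Illustrative computation of a node's reliability alpha_v from its quality
    # factor and weighted environmental factors, per the definition above.

    def reliability(quality_factor: float, env_factors: list[float],
                    weights: list[float]) -> float:
        """alpha_v = (U^Qual + sum_j W_j * U_j) / (1 + sum_j W_j)."""
        assert len(env_factors) == len(weights)
        numerator = quality_factor + sum(w * u for w, u in zip(weights, env_factors))
        denominator = 1.0 + sum(weights)
        return numerator / denominator

    # A hypothetical camera node at 80% quality, with lighting and device-age factors:
    print(reliability(0.8, env_factors=[0.9, 0.95], weights=[0.5, 0.25]))  # 0.85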
Note that it is difficult to determine the discrete quality levels of a node and to map the said quality levels to numerical quality factor values. This is mainly because different nodes in a producer-consumer network carry out different tasks and, therefore, require their own method of quality level determination. For example, the quality level of a node that is tasked with capturing and streaming video can be determined by its current video resolution. In other words, the maximum operational resolution can be considered the highest quality and mapped to a quality factor of 1, the minimum operational resolution can be considered the lowest quality and mapped to a quality factor of 0, and every other operational resolution in between can be mapped between 0 and 1 based on its pixel count. However, this method will clearly fail to determine the quality levels of a node that is tasked with detecting motion, where the polling interval could be a better representation of quality levels for the said node. Note that we do not attempt to provide an absolute method for determining quality levels and mapping them to appropriate quality factor values; we merely propose an abstraction that allows tweaking the system into yielding desirable results.

7.1.4 Relationship between Configurations and Resources

We now define the relationships between configurations and resources. Let CRate(res) denote the set of possible rates of consumption of resource res. Also, let

    φ_v^res : IRate(v) × Q_v → CRate(res)

be a function that maps the rate of incoming data and the quality level of node v ∈ V to a possible consumption rate value in CRate(res). For example, for a node v with a quality level of Qual(v) that is receiving data at the rate of IRate(v), we determine the rate at which resource PWR is consumed on the said node using φ_v^PWR(IRate(v), Qual(v)). Recall that resource res can be either reusable or consumable. Hence, each node defines a set of functions whose elements are the functions φ_v^res for all resources res ∈ R ∪ D, as follows:

    Φ_v = { φ_v^res | res ∈ R ∪ D − {REL} }

where REL is the reliability resource. While reliability depends on the quality level of a node like other resources, it also depends on the reliability of incoming data, as well as on the environmental factors; therefore, we exclude reliability, since it is defined differently. We incorporate this notion of reliability to model systems where error is compounded, i.e., receiving erroneous data may impact the reliability of produced data differently even at the same quality level and under the same environmental factors. This behavior is common in precision-based quality levels, where rounding error is compounded as more mathematical operations are performed along a data path. Thus, we introduce the following recursive function ψ_v to determine the compounded reliability of node v ∈ V:

    ψ_v(Qual(v)) = α_v                                                            if Pred(v) = ∅
    ψ_v(Qual(v)) = comp(Qual(v), U_v, W_v, { ψ_u(Qual(u)) | u ∈ Pred(v) })        if Pred(v) ≠ ∅

where α_v is the reliability of v when it is a source node (by system design), and comp denotes a function that computes the reliability of a node given its quality, environmental factors and their weights, and the reliability of its predecessors. For instance, comp could be instantiated with a function that computes the average (or maximum) reliability of all predecessors times the quality level of the node.
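To make the recursion concrete, here is a small illustrative sketch (not the dissertation's code) that instantiates comp as suggested above — the average reliability of the predecessors times the node's quality factor; the tiny example network and all values are invented.

    # Illustrative sketch of the compounded reliability psi_v with comp =
    # "average predecessor reliability times the node's quality factor".

    def psi(node: str, pred: dict[str, list[str]], alpha: dict[str, float],
            quality_factor: dict[str, float]) -> float:
        """Compounded reliability: alpha for source nodes, comp(...) otherwise."""
        if not pred.get(node):                      # source (producer-only) node
            return alpha[node]
        upstream = [psi(u, pred, alpha, quality_factor) for u in pred[node]]
        return quality_factor[node] * sum(upstream) / len(upstream)   # comp(...)

    # v1, v2 are sources feeding v3, which feeds the sink v4:
    pred = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"]}
    alpha = {"v1": 0.95, "v2": 0.90}
    quality_factor = {"v3": 0.9, "v4": 1.0}
    print(psi("v4", pred, alpha, quality_factor))   # 0.9 * (0.95 + 0.90)/2 = 0.8325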
For example, we introduce the characteristics of the nodes in the network of Figure 7.2. Table 7.1 lists the production rate (ORate) and the usage of the resources power (PWR) and response time (TIME) of nodes v1–v5 at different quality levels; Table 7.2 does the same for nodes v6–v10, together with their reliabilities α_v. We abbreviate the quality level Qual_i as q_i. The quality level for these nodes is designated by the sampling rate; thus, the highest quality level q1 has the highest rate of outgoing items, versus the lowest quality level q3.

Table 7.1 Nodes v[1,5] resource usage.
        ORate               PWR              TIME
        q1    q2    q3      q1   q2   q3     q1    q2    q3
  v1    100   75    50      80   65   40     10    13.3  20
  v2    90    60    30      75   55   35     11.1  16.6  33.3
  v3    100   70    40      80   65   40     10    14.3  25
  v4    120   90    60      85   70   40     8.3   11.1  16.6
  v5    110   80    70      85   75   36     10.3  15.1  21.1

Table 7.2 Nodes v[6,10] resource usage.
        PWR               TIME                 αv
        q1    q2   q3     q1    q2    q3       q1    q2   q3
  v6    85    70   55     16    20.1  25       100   90   82
  v7    80    65   50     18.2  18.2  38.3     100   92   84
  v8    88    55   50     15    17.7  29       100   88   79
  v9    90    70   55     13    14.5  19       100   93   83
  v10   100   80   60     17    22    45       −     −    −

7.1.5 Revised Definitions

Based on the definitions introduced in the previous subsections, we now redefine a node as follows:

    v = ⟨ Pred(v), Succ(v), Qual(v), ORate(v), Φ_v, ψ_v ⟩

Thus, the node now includes a set of quality levels (i.e., Qual(v)), a function that determines the rate of outgoing data (i.e., ORate(v)), a set of functions that determine resource usage (i.e., Φ_v), and a function that determines the reliability of the node's output (i.e., ψ_v). Finally, we redefine the graph as follows:

    G = ⟨ V, E, R, D, B ⟩

Thus, the graph now defines a set D of consumable resources, a set R of reusable resources, and a set B of bounds on resources.

7.2 Problem Statement

First, observe that in the model proposed in Section 7.1, quality levels may affect the following:

1. Node reliability ψ_v, which is a function of the quality level, the environmental factors and their weights, and the incoming reliability values of all predecessors.
2. Resource consumption φ_v^res, which is a function of the quality level and the incoming data rate.
3. Production rate ORate(v), which is a function of the quality level and the incoming data rate.

The majority of stream processing systems benefit greatly from knowing how to answer one or both of the following two questions: (1) how can the usage of available resources be optimized to reach maximum reliability, and (2) how can the usage of available resources be minimized while ensuring that reliability is maintained above a target threshold? Thus, roughly speaking, our multi-objective problem statement is as follows. Given (1) a producer-consumer network on which a set of bounds is defined, and (2) a target reliability for all consumer-only nodes:

• Quality Maximization. Our first objective is to identify a single quality level for every node, such that the reliability of the consumer-only nodes is maximized, while satisfying all bounds. For example, maximizing the efficiency of each device in a sensor network of smart home devices while not exceeding the specified renewable resources such as power, CPU, and bandwidth.

• Resource Usage Minimization. Our second objective is to minimize the consumption of a given resource for all nodes while achieving a target reliability. For example, minimizing the power usage of a producer-consumer network that does not demand maximum reliability from its nodes.
Formally, our optimization problem is as follows:

Problem Statement. Given a producer-consumer network G = ⟨V, E, R, D, B⟩ and a resource res ∈ R ∪ D, identify quality levels Qual(u) for all u ∈ V subject to:

    ∀v ∈ {v′ | Succ(v′) = ∅}.  max ( ψ_v(Qual(v)) )
    ∀v ∈ V.                    min ( φ_v^res(IRate(v), Qual(v)) )

In the next section, we will present our solution to the above optimization problem.

7.3 SMT-based Solution

In this section, we present our solution to the multi-objective optimization problem presented in Section 7.2. Our solution is based on a reduction to the satisfiability problem for SMT. Practically, we utilize an SMT solver in order to optimize reliability and resource consumption tradeoffs. The SMT problem is solved on a remote network monitor that polls each sensor node in the network at a fixed interval in order to keep track of available resources, as well as to control the quality levels of the said node. Each SMT instance is described in terms of (1) SMT entities (e.g., variables, functions, constants, etc.) and (2) SMT constraints (e.g., Boolean conditions over first-order predicates).

7.3.1 SMT Entities

We now introduce the entities that are used to represent the components of our producer-consumer network G = (V, E). In some SMT entity definitions, we use free variables that the SMT solver can manipulate in order to provide a satisfaction verdict.

Nodes. In our SMT encoding, we represent the set of nodes V as a set of integers {1, 2, . . . , |V|}, where each element represents a node in V.

Edges. We store the information of the edges in E in the form of a |V| × |V| Boolean array edge such that

    ⋀_{i=1}^{|V|} ⋀_{j=1}^{|V|}   edge[i][j] = true if (i, j) ∈ E, and false if (i, j) ∉ E

where edge[i][j] implies there exists an edge from node v_i to node v_j in G.

Successor Nodes. We encode the function Succ as an SMT function succ that maps a node to its set of successor nodes:

    ⋀_{i=1}^{|V|}   succ(i) = { j | (i, j) ∈ E }

Predecessor Nodes. We encode the function Pred as an SMT function pred that maps a node to its set of predecessor nodes:

    ⋀_{i=1}^{|V|}   pred(i) = { j | (j, i) ∈ E }

Node Resource Consumption. We define the function rcon that maps a resource and a node to a free variable that denotes the resource consumption of the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   rcon(res, i) = µ_i^res

Edge Resource Consumption. We define the function rcoe that maps a resource and an edge to a free variable that denotes the resource consumption of the said edge:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   rcoe(res, (i, j)) = µ_(i,j)^res

Resource Inflow. We define the function iflo that maps a resource and a node to a free variable that denotes the inflow of the resource to the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   iflo(res, i) = ζ_i^res

where ζ_v^res denotes the inflow of resource res into node v.

Resource Outflow. We define the function oflo that maps a resource and a node to a free variable that denotes the outflow of the resource from the said node:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) = ξ_i^res

where ξ_v^res denotes the outflow of resource res from node v.
Edge Flow. We define the function eflo that maps a resource and an edge to a free variable that denotes the amount of flow going through the said edge:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   eflo(res, (i, j)) = ν_(i,j)^res

where ν_(i,j)^res denotes the flow of resource res through edge (i, j).

Quality. We encode the function Qual as an SMT function qual that maps a node to all of its abstracted quality levels in disjunction (see Section 7.1):

    ⋀_{i=1}^{|V|}   ( qual(i) = ⋁_{j=1}^{|Q|} Qual_j(v_i) )

Reliability. We encode the function ψ as an SMT function rel that maps the quality of a node to its reliability:

    rel(qual(i)) = α_i                                                          if pred(i) = ∅
    rel(qual(i)) = eval(qual(i), U_v, W_v, { rel(qual(j)) | j ∈ pred(i) })      if pred(i) ≠ ∅

for all 1 ≤ i ≤ |V|. Note that when pred(i) = ∅, node i is a source (producer-only) node in the producer-consumer network, and therefore its reliability α_i is known.
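As one concrete (and purely illustrative) way to realize these entities, the sketch below declares a tiny instance with the Z3 SMT solver's Python bindings (package z3-solver); the dissertation does not prescribe a particular solver, and the three-node chain, resource set, and all numbers are invented. It also previews the inflow and conservation constraints defined in the next subsection.

    # Hedged illustration, not the dissertation's implementation: encoding a tiny
    # producer-consumer network's SMT entities with the Z3 Python API.

    from z3 import Int, Real, Optimize, And

    NODES = [1, 2, 3]
    EDGES = [(1, 2), (2, 3)]
    RESOURCES = ["PWR"]                       # one reusable resource for the example

    opt = Optimize()

    # Free variables: per-node quality level, resource consumption, inflow/outflow, edge flow.
    qual = {i: Int(f"qual_{i}") for i in NODES}
    rcon = {(r, i): Real(f"rcon_{r}_{i}") for r in RESOURCES for i in NODES}
    iflo = {(r, i): Real(f"iflo_{r}_{i}") for r in RESOURCES for i in NODES}
    oflo = {(r, i): Real(f"oflo_{r}_{i}") for r in RESOURCES for i in NODES}
    eflo = {(r, e): Real(f"eflo_{r}_{e[0]}_{e[1]}") for r in RESOURCES for e in EDGES}

    # Each node must pick one of its (here: three) quality levels.
    for i in NODES:
        opt.add(And(qual[i] >= 1, qual[i] <= 3))

    # Inflow of a resource is the sum of the flows on incoming edges.
    for r in RESOURCES:
        for i in NODES:
            incoming = [eflo[(r, e)] for e in EDGES if e[1] == i]
            opt.add(iflo[(r, i)] == sum(incoming))      # sum([]) == 0 for source nodes

    # A reusable resource is conserved: outflow equals inflow.
    for r in RESOURCES:
        for i in NODES:
            opt.add(oflo[(r, i)] == iflo[(r, i)])

    print(opt.check())   # 'sat' for this unconstrained skeleton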
7.3.2 SMT Constraints

We now introduce the constraints that address our problem statement using the SMT entities we defined in the previous section.

Resource Inflow Constraint. For resources res ∈ R ∪ D, the amount of a resource flowing into a node depends on the amount of flow carried over all its incoming edges. Traditionally, the inflow is the sum of all flows on incoming edges. That is,

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   ζ_i^res = Σ { eflo(res, (j, i)) | j ∈ pred(i) }

For instance, power is a resource that can be summed over incoming edges.

Resource Outflow Constraint. For resources res ∈ R ∪ D, the amount of a resource flowing out from a node is traditionally equal to the amount of the resource flowing in, which is the conservation-of-flow principle. This models renewable resources efficiently, yet does not capture non-renewable resources. We generalize the resource outflow constraint using a function Ξ:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) = Ξ_i^res(iflo(res, i), rcon(res, i))

For instance, power is a renewable resource, and thus

    ⋀_{i=1}^{|V|}   Ξ_i^PWR(iflo(PWR, i), rcon(PWR, i)) = iflo(PWR, i)

However, energy is depletable, and therefore

    ⋀_{i=1}^{|V|}   Ξ_i^EGY(iflo(EGY, i), rcon(EGY, i)) = iflo(EGY, i) − rcon(EGY, i)

Resource Bound Constraint on Nodes. For all resources res ∈ R ∪ D, we enforce the given upper bound and lower bound on nodes as follows:

    ⋀_{res ∈ R∪D}   lb ≤ Σ_{i=1}^{|V|} rcon(res, i) ≤ ub

Resource Bound Constraint on Edges. We use bounds on edges to control the distribution of resources res ∈ R ∪ D across outgoing edges of a node. We identify two main methods of assigning flow to outgoing edges: broadcast and distribution.

Broadcast. In this case, nodes broadcast their outflow to all outgoing edges. Reliability is broadcast, since all outgoing edges of a node carry data with the same reliability level that the node produces. We can enforce resources res ∈ R ∪ D to be broadcast using the following constraint:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   oflo(res, i) ≤ rcoe(res, (i, j)) ≤ oflo(res, i)

Upon simplification, we have:

    ⋀_{res ∈ R∪D} ⋀_{(i,j) ∈ E}   rcoe(res, (i, j)) = oflo(res, i)

Distribution. In this case, the outflow is distributed across all outgoing edges. For instance, in a multiple-consumer setting, any one of a set of receiving nodes can process items; the outgoing data flow of the producer is then distributed among all consumer nodes. The objective here is to determine the fraction of data flowing to each consumer such that resource bounds are respected and the usage of a specific resource is optimized. We can enforce resources res ∈ R ∪ D to be distributed using the following constraint:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   oflo(res, i) ≤ Σ_{j ∈ succ(i)} rcoe(res, (i, j)) ≤ oflo(res, i)

Upon simplification, we have:

    ⋀_{res ∈ R∪D} ⋀_{i=1}^{|V|}   Σ_{j ∈ succ(i)} rcoe(res, (i, j)) = oflo(res, i)

Data Flow Constraint. Data outflow can be either broadcast or distributed. We encode the function Out as an SMT function out that maps an edge to the outgoing data rate of that edge.

Broadcast. In case the data outflow is broadcast, we enforce the following constraint on all edges:

    ⋀_{(i,j) ∈ E}   out((i, j)) = ORate(v_i)

Distribution. In case the data outflow is distributed, we enforce the following constraint on all edges:

    ⋀_{i=1}^{|V|}   Σ_{j ∈ succ(i)} out((i, j)) = ORate(v_i)

Reliability Maximization Constraint. Finally, let C denote the conjunction of all the above constraints. The constraint for maximizing reliability on sink (consumer-only) nodes is as follows:

    C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  max( rel(qual(i)) ) )

Resource Optimization Constraint. If we want to minimize the total consumption of some res ∈ R ∪ D across all nodes, while ensuring that the reliability of sink (consumer-only) nodes remains above a given threshold α, then we enforce the following constraint instead:

    C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  rel(qual(i)) ≥ α ) ∧ min( Σ_{i=1}^{|V|} rcon(res, i) )

7.3.3 Solver Optimization

Solving the Reliability Maximization Constraint and the Resource Optimization Constraint both require a significant amount of computation power and time (see Figure 7.1a). This is mostly because C is a conjunction of a large set of constraints, coupled with the fact that it is a minimization or maximization problem. This means there is only one solution for which the value of the objective is maximized or minimized, which in turn means our SMT solver has to explore a large search space. To this end, we employ some optimization techniques in our model in order to reduce the run time for the SMT solver. In this subsection, we show one such technique and report the improvement it yields in terms of run time over the naive method.

Binary Probing. First, let A_α be an SMT constraint such that

    A_α = C ∧ ( ⋀_{i ∈ {j | succ(j) = ∅}}  rel(qual(i)) ≥ α )

We say A_α |= G iff A_α is satisfied for α, and otherwise A_α ̸|= G. Let solve be a function that, given G and A_α, returns some value β such that α ≤ β ≤ 1 if A_α |= G, and otherwise returns β = −1. Formally,

    solve(A_α, G) = β ∈ [α, 1]    if A_α |= G
    solve(A_α, G) = −1            otherwise

    Data: Producer-consumer network G, SMT constraint A_α, target reliability α, error margin ϵ
    Result: Estimated best reliability α′
    α_min ← 0; α_max ← 1; pivot ← 0.5; α′ ← 0; found ← false
    while ¬found do
        if |α_max − α_min| ≤ ϵ then found ← true end
        β ← solve(A_pivot, G)
        if β ≠ −1 then
            if β > α′ then α′ ← β end
            α_min ← pivot
        else
            α_max ← pivot
        end
        if found then break end
        pivot ← (α_min + α_max)/2
    end
    return α′
    Algorithm 7.1 Best Reliability Estimation Algorithm

Now, using the Binary Probing technique described in Algorithm 7.1, we can invoke our SMT solver in a pattern similar to traditional binary search, and find an estimate α′ that is sufficiently close to — that is, within the error margin ϵ of — the real best reliability.
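A compact sketch of the probing loop follows. It only illustrates the bisection pattern: the solve call is stubbed out with a made-up ground truth, whereas Algorithm 7.1 invokes the SMT solver on A_α.

    # Illustrative sketch of binary probing (Algorithm 7.1), with the SMT call stubbed.

    TRUE_BEST = 0.857   # pretend this is the (unknown) best achievable reliability

    def solve(alpha: float) -> float:
        """Stub for solve(A_alpha, G): a feasible beta in [alpha, 1], or -1 if unsat."""
        return TRUE_BEST if alpha <= TRUE_BEST else -1.0

    def estimate_best_reliability(eps: float = 0.03125) -> float:
        lo, hi, pivot, best = 0.0, 1.0, 0.5, 0.0
        while hi - lo > eps:
            beta = solve(pivot)
            if beta != -1.0:
                best = max(best, beta)
                lo = pivot          # feasible: probe for something even higher
            else:
                hi = pivot          # infeasible: back off
            pivot = (lo + hi) / 2.0
        return best

    print(estimate_best_reliability())   # converges to 0.857 in a handful of probes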
Using a similar technique, we can also find the estimated minimum of any res ∈ R ∪ D. It should be mentioned that even in the worst case, after just five SMT invocations, the error from binary probing will only be ≈ 3.125%, which usually falls within the acceptable error margins for devices in sensor networks, as far as reliability and other resources are concerned.

7.4 Machine Learning-based Optimization

On top of our SMT-based solution described in Section 7.3, we employ a machine learning-based optimization technique to further improve our solution in terms of run time, at the cost of negligible (details in the next section) accuracy.

7.4.1 Artificial Neural Network

We first create an Artificial Neural Network (ANN) [2] whose output-layer neurons denote the resources that need to be optimized, and whose input-layer neurons denote the remaining resources. For example, when solving the Quality Maximization problem, each neuron in the output layer represents the quality level of a node v ∈ V; that is, the number of neurons in the output layer is l_o = |V|. Each neuron in the input layer represents one of the remaining resources, such as power, CPU, memory, and bandwidth; that is, l_i = |R ∪ D − {QUAL}|, where QUAL ∈ R is the quality resource. For determining the number of neurons in the hidden layer, we chose the method proposed by the authors in [129]; that is, the number of neurons in the hidden layer is

    l_h = (2/3) × |R ∪ D − {QUAL}| + |V|

As an example, let us consider the producer-consumer network in Figure 2.5. If we want to maximize the quality of the network with respect to power (PWR), CPU (CPU), memory (MEM) and bandwidth (BW), then the corresponding ANN should have l_o = 9, l_i = 4, and l_h = 12.

7.4.2 Training Dataset

Our training dataset for the ANN is generated using the SMT-based solution detailed in Section 7.3. For example, in order to generate the training dataset for the Quality Maximization problem, we find the best qualities for all nodes for resources with random values, and populate the dataset with the results. This allows us to carry out the machine learning process in an unsupervised manner.

During training, while the traditional approach is to split the dataset into two subsets (i.e., a training dataset and a testing dataset), for smaller datasets this may introduce biased estimates [56]. As our model must be applicable to both small and large datasets, in order to reduce statistical bias, we employ k-fold cross validation [123] to train our ANN. The process of k-fold cross validation is as follows:

1. Split the dataset into k randomized groups of equal size: g1, g2, . . . , gk.
2. For i ∈ [1, k] do:
   • Assign group gi as the test dataset.
   • Assign groups g1, g2, . . . , g(i−1), g(i+1), . . . , gk as the training dataset.
   • Train the model on the training set and evaluate it on the test dataset.

Using k-fold cross validation ensures that each group is used as the testing dataset once, and as part of the training dataset k − 1 times. There are various ways of selecting the value of k; in our work, we assign k = 10, as this has been shown to generally yield minimal statistical bias [51, 69]. A sketch of this procedure is given below. Note that for generating our dataset, we normalize the sample values to avoid unwanted weights. However, when we report the experimental results, we use the actual values for ease of comparison and understandability.
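The following minimal sketch illustrates the k = 10 split described above; it is illustrative only, with a placeholder "model" in place of the ANN and randomly generated samples in place of the SMT-generated dataset.

    # Minimal k-fold cross validation sketch for the training procedure above.
    # The model step is a placeholder; the real pipeline trains the ANN on
    # SMT-generated (resource values -> best quality levels) samples.

    import numpy as np

    def k_fold_indices(num_samples: int, k: int = 10, seed: int = 0):
        """Split sample indices into k randomized, (nearly) equal-sized groups."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_samples), k)

    def cross_validate(samples: np.ndarray, labels: np.ndarray, k: int = 10):
        scores = []
        folds = k_fold_indices(len(samples), k)
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            # Placeholder "model": predict the mean training label, score by MSE.
            prediction = labels[train_idx].mean(axis=0)
            scores.append(float(((labels[test_idx] - prediction) ** 2).mean()))
        return scores

    # 500 synthetic samples: 4 resource inputs -> 10 quality-level outputs (made up).
    X = np.random.rand(500, 4)
    y = np.random.rand(500, 10)
    print(np.mean(cross_validate(X, y)))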
7.4.3 Model Accuracy

We determine the accuracy of our trained model by directly comparing its results with the results from the SMT-based solution. For the Quality Maximization problem, we define the accuracy acc_q of the model as follows:

    acc_q = 1 − ( Σ_{v∈V} |SMT_qv − ML_qv| ) / ( Σ_{v∈V} |Q_v| )

where the quality level reported by the SMT-based solution is the SMT_qv-th quality level of node v, the quality level reported by the machine learning model is the ML_qv-th quality level of node v, and Q_v is the set of quality levels of node v. For the Resource Minimization problem, we define the accuracy acc_r of the model as follows:

    acc_r = 1 − ( Σ_{res∈R∪D} |SMT_res,v − ML_res,v| / (MAX_res,v − MIN_res,v) ) / |R ∪ D|

where SMT_res,v is the value of resource res at node v reported by the SMT-based solution, ML_res,v is the value reported by the machine learning model, and MAX_res,v (resp. MIN_res,v) denotes the upper bound (resp. lower bound) of resource res observed in the dataset. Note that 0 ≤ acc_q, acc_r ≤ 1, where 0 indicates the worst accuracy and 1 indicates the best accuracy.

7.5 Case Studies and Evaluation

In this section, we evaluate our technique for resource optimization using synthetic data generated from a simulated layered network of nodes, as well as real-world data collected from a network of embedded devices, where the nodes in the network are Raspberry Pi devices tasked with specific streaming objectives.

7.5.1 Synthetic Experiments

In this subsection, we introduce our synthetic experiments to demonstrate how our proposed model can be used to optimize various resources.

7.5.1.1 Experimental Setup

We construct our producer-consumer network using 8 nodes, V = {v1, v2, ..., v8}, with v[1,7] being the producer nodes and v[2,8] being the consumer nodes. We add two placeholder nodes to the network, vin and vout, along with two edges (vin, v1) and (v8, vout). Figure 7.2 shows our producer-consumer network. We use edge (vin, v1) to regulate bounds on resources, and we use node vout to compute network reliability.

7.5.1.2 Resource Bounds

In this experiment, we consider the two resources power (PWR) and reliability (REL), where PWR ∈ R and REL ∈ D. We assign three possible quality levels to each node. Table 7.3 shows different power consumption values for all possible quality levels in each node.

Figure 7.1 Synthetic experiment results. (a) Naive vs. optimized algorithm. (b) Reliability vs. power. (c) Run time vs. power. (d) Run time vs. reliability.

Figure 7.2 A producer-consumer network of 8 nodes (vin, v1, ..., v10, vout).

Table 7.3 Nodes v[1,9] power consumption (in watts) for different quality levels.
          q1    q2    q3    q4    q5
  vin     -     -     -     -     -
  v1      200   195   190   185   180
  v2      185   180   175   170   165
  v3      190   185   180   175   170
  v4      195   190   185   180   175
  v5      180   175   170   165   160
  v6      195   190   185   180   175
  v7      185   180   175   170   165
  v8      180   175   170   165   160
  v9      190   185   180   175   170
  v10     200   195   190   185   180
  vout    -     -     -     -     -

7.5.1.3 Machine Learning Setup

For our machine learning dataset, we generate 500 data samples using our SMT-based solution;
each sample contains a randomly selected PWR value, and the optimized quality levels for the 10 nodes in Figure 7.2. We train our ANN with this dataset for 10 epochs (full iterations) using the k-fold cross validation method described in Section 7.4. 7.5.1.4 Experimental Results We now run a variety of experiments on the setup described above and report our findings below. 146 Naive Model vs. Optimized Model. vs. Machine Learning Model. First we run the model to find the best possible reliability, given bounds on other resources in the producer-consumer network. We want to observe the improvement in run time between a model that performs regular constraint solving, and a model that performs constraint solving using the binary probing technique shown in Algorithm 7.1. To this end, we assign the PWR resource bPWR V to 0, 1835 ⟩ ⟨ and run our solvers. As shown in Figure 7.1a, using the naive (brute force) technique, we get a best reliability value of 0.855 within a run time of 18.855 seconds, whereas using the binary probing technique, we get a best reliability value of 0.857 within a run time of 3.185 seconds. Using our machine learning model, we get a best reliability value of 0.839 in under 0.036 seconds. Reliability vs. Power. We now observe the tradeoff between reliability and power. We start off by assigning the PWR resource bPWR V to . We can observe in Figure 7.1b 0, 1900 ⟩ ⟨ the burndown of reliability as we tighten the power bound by 5 watts on each iteration. We stop at 0, 1700 ⟩ ⟨ when the network is no longer functional due to the given power being lower than the minimal requirement. In this scenario, our machine learning-based model report a more uniform reliability drop than our SMT-based model. Figure 7.1c demonstrates the run time measurements for the same experiment. As the available power is reduced, the search space for the SMT solver gets smaller as well due to having to check for fewer valid configurations. Which is why a gradual decline in run time can be observed for the SMT-based model. However, the machine learning-based model reports the results through inference, and therefore show very little variation in run time. Run time vs. Node availability. In this experiment, we observe the effect of node availability, and by extension overall reliability on run time. To this end, we assign the PWR resource bPWR V to 0, 1835 ⟩ ⟨ and reduce reliability of the nodes v5, v2, v3, v8, v7 and v9 to 0 one node at a time in the given order. Figure 7.1d shows the run time of this experiment. As expected, as more nodes become inactive, the overall run time for the solver decreases for the SMT-based model. However, similar to the previous observation, for the ML-based 147 Figure 7.3 A Multi-Layer Network of Raspberry Pi Devices. model, the run time remains steady. Note that removing any more nodes from the network will render the network inactive, as the solver will fail to find any valid path from vin to vout. 7.5.2 Case Study In this subsection, we introduce our case study on a real world layered sensor network, where each sensor is tasked with a streaming or processing job, and is operated with a Raspberry Pi device. 7.5.2.1 Experimental Setup We construct our layered sensor network with five nodes as shown in Figure 7.3, where v0, v1 and v2 are producer nodes, and v1, v2, v3 and v4 are consumer nodes in V. Just like before, we add two place holder nodes vin and vout such that, vin has an outgoing edge to v0, and vout has two incoming edges from v3 and v4. 
Below we explain the streaming tasks of each device, along with the resources and quality levels.

Motion Sensor Node (v0). This node comprises four motion sensors that are able to detect objects and movement. When any one of these motion sensors is activated, v0 sends an activation signal to its subsequent nodes. Table 7.4a shows resource consumption for node v0 under different quality levels. In this case, the quality levels simply indicate the number of active motion sensors. Motion sensors are fairly reliable under normal operational circumstances, which is why v0 has high reliability as long as at least one sensor is active.

Table 7.4 Quality level tables for different nodes.

(a) Quality levels and resource usage of v0.
        Active Sensors   Reliability   Power Usage
  q1    4                1             470
  q2    3                0.98          465
  q3    2                0.96          460
  q4    1                0.94          455
  q5    0                0             450

(b) Quality level and resource usage of v1.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    2500        566
  q2    1280 x 720     275         512
  q3    854 x 480      125         500
  q4    640 x 360      50          488
  q5    426 x 240      35          476
  q6    256 x 144      15          446

(c) Quality level and resource usage of v2.
        Illuminance   Brightness   Power Usage
  q1    0             100          542
  q2    12000         90           535
  q3    24000         80           528
  q4    36000         70           521
  q5    48000         60           514
  q6    60000         50           507
  q7    72000         40           500
  q8    84000         30           493
  q9    96000         20           486
  q10   108000        10           479
  q11   120000        0            472

(d) Quality level and resource usage of v3.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    2500        488
  q2    1280 x 720     275         482
  q3    854 x 480      125         476
  q4    640 x 360      50          470
  q5    426 x 240      35          464
  q6    256 x 144      15          458

(e) Quality level and resource usage of v4.
        Resolution     Bandwidth   Power Usage
  q1    1920 x 1080    5000        452
  q2    1280 x 720     550         452
  q3    854 x 480      250         452
  q4    640 x 360      100         452
  q5    426 x 240      70          452
  q6    256 x 144      30          452

(f) Quality level change due to power.
  Power   MOTION   CAM   LIGHT   LOCAL   CLOUD
  1410    q1       q1    q1      q1      q1
  1405    q2       q1    q1      q1      q1
  1400    q3       q1    q1      q1      q1
  1395    q4       q1    q1      q1      q1
  1390    q4       q1    q1      q2      q1
  1385    q4       q1    q1      q3      q1
  1380    q4       q1    q1      q4      q1
  1375    q4       q1    q1      q5      q1
  1370    q4       q1    q1      q6      q1
  1365    q4       q1    q1      q6      q1
  1360    q4       q1    q1      q6      q3
  1355    q4       q1    q1      q6      q6

Figure 7.4 Case study results. (a) Resolution vs. power. (b) Local vs. cloud processing. (c) Reliability vs. power. (d) Run time vs. power.

Camera Node (v1). This node is tasked with operating a 5MP camera at varying resolution/bitrate, and is activated upon receiving an activation signal from v0. The resolution/bitrate of the captured stream is changed at runtime, which allows us to map the different resolution/bitrate modes to the quality levels of node v1. The node consumes two other resources: bandwidth and power. At the highest quality level, this node uses ≈ 2.5 Mbit/s and ≈ 566 mAh. Table 7.4b shows resource consumption for node v1 under different quality levels, and Figure 7.4a shows the tradeoff between stream resolution vs. power consumption and bandwidth usage. Note that a change in stream resolution/bitrate is not always proportional to a change in bandwidth, due to video compression standards (e.g., H.264).

Illuminance Detection Node (v2). This node is connected to an illuminance detector and a smart light bulb with adjustable brightness.
As both of these devices are powered and operated by node v2 with minimal data transfer and delay, we consider them to be a part of node v2 itself. Similar to v1, this node is activated upon receiving an activation signal from v0. Depending on how dark or bright the general area that is being streamed by the camera on node v1 is, node v2 adjusts the brightness of the smart bulb accordingly. We keep the illuminance detector some distance away from the network, in order to prevent light flickering due to feedback loop. Table 7.4c shows resource consumption for node v1 under different quality levels. Local Storage Node (v3). This node compresses, checksum verifies, and stores the video stream received through edge (v1, v3) into a secure external hard disk drive. Table 7.4d shows resource consumption for v1 under different quality levels. Cloud Storage Node (v4). Similar to node v3, node v4 is also tasked with storing the video stream received through edge (v1, v3). However, instead of storing it locally, the stream is uploaded directly to a cloud storage. Offloading the compression and checksum verifica- tion task to the cloud allows node v4 to consume minimal power, at the cost of additional bandwidth. Table 7.4e shows resource consumption for node v4 under different quality levels. 151 Note that the bandwidth requirement for node v4 is double the amount in comparison to that of v3 at the same quality level. This is due to the fact that when v4 receives a video stream from node v1, it uploads the same stream to the cloud, effectively doubling the required bandwidth. Furthermore, we assume that while maintaining similar qualities and stream, node v4 generally more reliable than node v3 for being able to store and backup data in the cloud as shown in Figure 7.4b. 7.5.2.2 Machine Learning Setup For our machine learning dataset, similar to our synthetic experiments, we generate 500 data samples using our SMT-based solution. each sample contains a randomly selected PWR value, and the optimized quality levels for the 5 nodes in the Pi network shown in Figure 7.3. We train our ANN with this dataset for 10 epochs (full iterations) using the k-fold cross validation method described in Section 7.4. 7.5.2.3 Experimental Results We run a variety of experiments on the setup described above and report our findings below. Reliability vs. Power. We now observe the tradeoff between reliability and power in our multi-layer network of nodes. We start off by assigning the PWR resource bPWR V to as shown in Figure 7.4c. From this point we measure the best obtainable reliability 0, 1410 ⟨ ⟩ and tighten the bound by 5 mAh on each iteration in a similar manner as our synthetic experiment. Figure 7.4d shows the run time for the same experiment. The run time is low at the beginning due to the system having adequate power flow to operate all nodes at a near maximum reliability, and therefore having very few SMT constraints to solve. Observe that our machine learning-based model exhibits similar behavior as seen during our synthetic experiments. The reliability drop is a steady decline, whereas the run time does not very to a great degree. However, for bother models this changes when bPWR V is 0, 1410 ⟩ ⟨ and onward, as the available power is no longer sufficient for all nodes to operate at maximum quality 152 and reliability. We now observe the changes in quality levels for which the burndown in reliability has occurred. 
We now observe the changes in quality levels for which the drop in reliability has occurred. From Table 7.4f, we can see that the model gradually lowered the quality level of v0 to q4 first, due to the small difference in reliability between its quality levels. Afterwards, the quality level of v3 (local node) is lowered instead of v4 (cloud node), due to v3 consuming more power than v4. Once the quality level of v3 has reached the lowest point, lowering the available power further finally caused the model to lower the quality level of v4. Finally, below 900 mAh, the available power was not sufficient for keeping the nodes running, even at the lowest quality levels, and therefore the network was shut down.

We run the same experiment again with available bandwidth as the tightening resource. In this case, we observe the exact opposite behavior, where the quality level of v4 (cloud node) is lowered first, and then v3 (local node). This is due to the fact that v4 requires double the bandwidth when compared to v3, as shown in Table 7.4e.

7.6 Conclusion
In this chapter, we developed a generalized model of a streaming sensor network CPS as a network of producers and consumers. Our approach incorporates tradeoffs between output quality and resource utilization. These tradeoffs were articulated as a multi-objective optimization problem with the goal of minimizing resource consumption while maximizing the reliability (and quality) of devices (or tasks) in a network. To tackle the aforementioned optimization challenge, we provided an efficient technique based on constraint solving utilizing SMT solvers to identify the ideal processing quality selection for each node in the network while respecting resource limitations and minimizing error. We further improve this work by incorporating machine learning, dramatically speeding up the resource optimization. This is a significant problem, since sensor network applications frequently require stream processing, which entails a complicated network of processing nodes where data is collected, analyzed, and then communicated to succeeding nodes. We put our work into practice, and provide the results of an experiment using an IoT device network.

CHAPTER 8
RELATED WORK
In this chapter, we summarize a portion of the vast body of work in the field of distributed CPS that has influenced this dissertation, going as far back as the origins of distributed monitoring.

8.1 Lattice-based Distributed Monitoring
Early work on distributed monitoring shows that predicate detection in distributed computing [53, 90, 115] is, in general, NP-complete. To obtain a more efficient solution, a computation slicing technique [91] is used to reduce the computation size, which in turn results in a smaller search space for predicate detection, as far as state space is concerned. This work is later extended to an online distributed monitoring algorithm [26].

Lattice-based distributed monitoring solutions generally suffer from two shortcomings: (1) having to handle an enormous number of concurrent states, and (2) lacking methods to handle temporal properties. A methodology for detecting Basic Temporal Logic [99] partially addresses the latter issue by providing methodologies for detecting a subset of temporal operators in distributed systems. A bound-based monitoring approach [130] addresses the state-space problem, and is later extended to a more efficient technique [116] that utilizes SAT solvers.
In our work, we avoid using any lattice-based distributed monitoring approaches and, by extension, avoid having to handle an unmanageable number of concurrent states.

8.2 Runtime Monitoring in CPS
Accurate time-keeping for CPS was thoroughly investigated by the Roseline project [98]. The Roseline team addresses the problem that local clocks have little, if any, knowledge of the quality of time needed by the software, nor any ability to adapt to it. They achieve this by rethinking and re-engineering how the knowledge of time is handled across a computing system's hardware and software, and by driving accurate timing information deep into the software system.

Assuming perfect time synchrony, an offline toolbox called S-TALIRO [5] searches for falsifying trajectories based on robust MTL [67] semantics. S-TALIRO can analyze arbitrary Simulink models or user-defined functions that model the system, and operates using randomized testing based on stochastic optimization techniques such as Monte Carlo methods [87] and ant colony optimization [37].

An online monitoring technique [35] for STL [36] over continuous and hybrid systems employs an efficient algorithm for computing the robustness degree with which a piecewise-continuous signal satisfies or violates an STL formula.

An efficient monitoring solution [34] utilizes dynamic programming algorithms for online monitoring of the state robustness of MTL [67] specifications with past-time operators. The authors provide an approach for predictive monitoring by computing the robustness of MTL with unbounded past and bounded future temporal operators over sampled traces of CPS. However, in order to do so, prior knowledge of the full dynamical model of the system is required.

Our work relating to predicate detection is closer to the online monitoring approach [33], where robust online monitoring of partial traces is formalized. However, this work assumes worst-case a priori bounds on signal values, without factoring in system dynamics.

Differential dynamic game logic [105] aims to demonstrate how the satisfaction of a temporal property is affected by imperfect implementations. This work is similar to a conformance testing framework [1], where the authors quantify the closeness between two systems via a distance measure between their outputs, and study how the satisfaction of a temporal property is affected by timing inaccuracies.

A control-theoretic software monitoring solution [83] is proposed for coordinating time predictability and memory utilization in runtime monitoring of systems that interact with the physical world. This method maximizes memory utilization while employing a minimally intrusive monitoring tactic.

A tool called Brace [133] allows users to attempt to minimize false positives and false negatives while trying to stay under a given threshold for the computation overhead of CPS. The authors do not provide any guarantees of completely eliminating false positives or false negatives, only of minimizing them. This aspect differs from our approach to monitoring distributed signals in CPS, as our approach guarantees no false positives.

Another tool, called ModelPlex [89], is introduced as a method for ensuring that verification results established for models carry over to the running CPS.
ModelPlex also allows the said models to account for the effect of environmental variances and disturbances on a CPS, while considering only the relevant part of the surrounding physics.

In the medical field, a specification language called DRTV [63] is introduced in order to specify vital real-time data sampled by medical devices. DRTV also allows for runtime monitoring of temporal properties originating from clinical guidelines.

A hybrid approach to runtime monitoring in CPS called Extended Hidden Markov Systems [112] is explored, where the systems under inspection comprise both integer-valued and real-valued variables.

While the above works propose various techniques and tools for monitoring CPS, they account for neither partial synchrony nor system dynamics in their monitoring methodologies.

8.3 Asynchronous Distributed Monitoring
The notion of computation slicing [91] introduces the ability to monitor distributed systems in an asynchronous setting. In this approach, the slice of a computation with respect to a predicate is a sub-computation with the least number of consistent cuts that contains all consistent cuts of the computation satisfying the predicate. This work is later extended to a distributed setting [26], where a distributed algorithm is presented for computing the slice of a distributed computation with respect to a regular predicate.

The work on distributed monitoring of concurrent and asynchronous systems [14] investigates the problem of distributed monitoring under time asynchrony, with application to distributed fault management in telecommunication networks. To this end, the authors combine compositional unfoldings to handle concurrency with a variant of graphical algorithms and belief propagation, originating from statistics and information theory. This work is later further extended [44], where the authors study the diagnosis of distributed asynchronous systems with concurrency. In this work, diagnosis is performed by a peer-to-peer distributed architecture of supervisors. This approach relies on Petri net [103] unfoldings and event structures as a means to manipulate trajectories of systems with concurrency.

A tool called DIANA [109] is introduced in order to monitor temporal properties of distributed systems. The authors use past-time distributed temporal logic (a variant of past-time linear temporal logic) as the specification language. In this approach, the notion of a knowledge vector is introduced, where each process is kept aware of other processes' local states. This approach, however, suffers from producing false negatives.

A decentralized runtime verification technique for LTL specifications [96] demonstrates a method for runtime verification of asynchronous distributed programs against the 3-valued semantics of LTL specifications. This approach, however, also suffers from false negative results. On the other hand, a temporal logic predicate detection approach [99] introduces the concept of a compact representation of all global cuts that satisfy a predicate.

The approaches mentioned above all operate within a fully asynchronous setting. In contrast to these approaches, we leverage a practical assumption and employ an off-the-shelf clock synchronization algorithm to limit the time window of asynchrony.

A method for designing parallel algorithms [54] is proposed to solve constrained combinatorial optimization problems such as the marriage problem, the shortest path problem, the market clearing price problem, and so on.
The authors achieve this by transforming these problems into a search problem, in which an element satisfying an appropriate predicate in a distributive lattice is obtained.

An approach for detecting latent bugs caused by concurrency and race conditions among concurrent processes [116] is proposed by its authors. In this work, the authors propose a method for detecting errors and monitoring system constraints in partially synchronous distributed systems using a monitoring framework with SMT as its foundation.

In the work on runtime monitoring of LTL formulas for synchronous distributed systems in the absence of a central data collection point [9], the authors propose an approach where LTL formulas are decomposed into sub-formulas, such that satisfaction or violation of specifications can be detected by local monitors alone. This work is later expanded upon with the introduction of a synchronous global clock [28], in which monitors are organized as a tree across the distributed system, and each child feeds intermediate results to its parent. A similar approach using LTL, but for stream runtime verification of CPS [31], is later proposed. However, these approaches assume perfectly synchronous clocks, which is rarely achievable.

The four-valued Runtime Verification Linear Temporal Logic (RV-LTL) [11] introduces a logic where the system behavior either (i) satisfies the monitored property, (ii) violates the property, (iii) will presumably violate the property, or (iv) will presumably conform to the property in the future, once the system has stabilized. This work is later improved upon with a fault-tolerant verification technique, LTL2k+4 [16], proposed for asynchronous systems. In our automata-based monitoring technique, we used LTL3 over four-valued LTL or LTL2k+4, because the unknown verdict in LTL3 was sufficient for our monitoring purposes, and the distinction between 'will presumably violate the property' and 'will presumably satisfy the property' served no additional benefit.

8.4 Synchronous Distributed Monitoring
Two approaches for runtime monitoring of LTL formulas have been studied by the authors of the monitoring framework THEMIS [41]. The first approach introduces a data structure that keeps track of the execution of an automaton, has predictable parameters and size, and guarantees strong eventual consistency. The second approach defines decentralized specifications, wherein multiple specifications are provided for separate parts of the system. The THEMIS framework can be used to analyze systems using both approaches.

An adaptive synchronous parallel method for distributed machine learning [131] is explored, where a performance monitoring model adaptively adjusts the synchronization method of each computing node with the parameter server by taking into account the overall performance of each node, ensuring improved accuracy. Furthermore, this technique guards against the machine learning model being influenced by irrelevant tasks in the same cluster.

A hybrid approach to monitoring is taken by the authors of the monitoring tool SMEDL [132], where low-level properties are checked synchronously, while higher-level ones are checked asynchronously. SMEDL can be used to construct and deploy monitors based on an architecture specification.
The specification language LOLA [119], intended for industrial use, provides a syntactic characterization of efficiently monitorable specifications, for which the space requirement of the online monitoring algorithm is independent of the size of the trace and linear in the size of the specification. LOLA can express properties involving both the past and the future. Both online and offline verification techniques using temporal logics [107] are studied by the authors of the specification language LOLA, who present the online and offline monitoring algorithms in detail. To this end, the authors use a temporal logic for stream runtime verification [17].

A novel, efficient two-layered monitoring technique [119] aims to overcome the time and space constraints introduced by most synchronous monitoring approaches. The first layer is imprecise but efficient, whereas the second layer is exact but (relatively) inefficient. The two-layered monitor also supports the use of O(1)-sized Hybrid Logical Clocks. Another approach that aims to overcome the time and space constraints is a monitoring method that incorporates a control-based notion of synchronization on CPS [60], which includes dividing a node's main-loop program into several processes and using two trigger signals to activate synchronization control.

The impact of synchronous and asynchronous monitoring instrumentation on runtime overheads in the context of a runtime verification framework for actor-based systems [21] is thoroughly studied, and the authors show that, in such a context, asynchronous monitoring incurs substantially lower overhead costs. They also demonstrate how, for certain properties that require synchronous monitoring, a hybrid approach can be used that ensures timely violation detection for the important events while, at the same time, incurring lower overhead costs that are closer to those of an asynchronous instrumentation.

A solution to the decentralized monitoring problem for the more general setting of stream runtime verification [31] is provided by its authors, and a property on specifications is also introduced that guarantees that online monitoring can be performed with bounded resources. An algorithm for distributing and monitoring LTL formulas [10] employs a technique where satisfaction or violation of specifications can be detected by local monitors alone, even when the system's implementation details are hidden from the user. However, these approaches have the shortcoming of assuming a global clock across all distributed processes.

Predicate detection for asynchronous systems [114] has been studied extensively, where three distinct detection modalities are achieved by introducing the notions of 'definitely occurred before' and 'possibly occurred before' event orderings. However, doing so makes the assumptions needed to evaluate the happened-before relationship too strong. In this dissertation, we utilize HLC, which not only is more realistic but also decreases the level of concurrency. Finally, an automata-based fault-tolerant verification technique [64] is proposed for synchronous systems with no clock skew across the distributed processes. A fault-tolerant distributed membership protocol for determining the set of active nodes in a synchronous distributed real-time system [66] is presented. We, in contrast, use a clock synchronization algorithm which guarantees bounded clock skews.
Our solution is also SMT-based and, to our knowledge, is the first SMT-based distributed monitoring algorithm for LTL, which results in better scalability.

8.5 Partially Synchronous Distributed Monitoring
In the context of monitoring partially synchronous systems, the feasibility of monitoring partially synchronous distributed systems in order to detect latent bugs was first investigated in [116]. The authors provide a monitoring framework where both the system constraints and the latent bugs are modeled as SMT formulas, and the latent bugs are identified using SMT solvers. This technique was later generalized to full LTL [49], where the presence of latent bugs is detected using SMT solvers in a discrete setting. The authors introduce two monitoring techniques in which the LTL specification is either represented by a deterministic finite automaton or handled by a progression-based formula rewriting technique, reducing the distributed runtime verification problem to an SMT problem.

SPIDER [102], an automated tool for identifying data races in distributed system traces, is introduced to handle non-deterministic discrete event orderings. However, these approaches cannot fully capture the continuous-time and continuous-valued behavior of CPS.

There is extensive work on identifying a subclass of systems [22] for which convergence features may be confirmed using the proof of convergence for the related discrete-time shared-state system. The method is extended to systems in which an agent's state evolves continuously over time. The proof approach was formalized in the PVS interface for timed I/O automata and used to verify the convergence of a mobile agent pattern formation algorithm.

A failure detector for partially synchronous distributed systems [121] is proposed, which is based on a clock synchronization algorithm. A solution to the processor group membership problem [29] is achieved by precisely specifying the problem in order to define the system model and failure assumptions; the author then provides two protocols for solving this problem.

A technique to monitor predicates on a partially synchronous distributed system by retiming continuous signals [95] is explored. While this approach improves monitoring efficiency by leveraging knowledge about system dynamics, it is limited to monitoring predicates and cannot capture temporal behavior. A method for runtime monitoring of blockchain executions for partially synchronous distributed computations [50] is proposed, where the specification language is metric temporal logic [67].

The effects of the impedance mismatch between the monitor and the underlying program on the detection of conjunctive predicates [130] are analyzed. An interesting observation of this work is that the authors identify a small interval where the monitor assumptions are hypersensitive to the underlying program environment.

A domain-specific language called PSync [38], based on the Heard-Of model [24, 25], is demonstrated, where asynchronous faulty systems are viewed as synchronous ones with an adversarial environment that simulates asynchrony and faults by dropping messages. While the approaches above provide various techniques for monitoring partially synchronous discrete systems, they are unable to fully capture the continuous nature of CPS.
8.6 Decentralized Distributed Monitoring
There is a rich literature dealing with decentralized predicate detection in the discrete-time setting. These works range from detection of regular discrete-time predicates [26] to detecting lattice-linear predicates over discrete states [54]. There is recent work on performing detection for a regular subset of Computation Tree Logic [108] that aims to avoid the state explosion problem. The textbooks by Garg [52] and Kshemkalyani and Singhal [68] elaborate extensively on decentralized monitoring in discrete-time settings. By contrast, we are concerned with monitoring continuous-time signals in a decentralized setting; such signals have uncountably many events and necessitate new techniques. For instance, one cannot iterate through events as done in the discrete setting. Recent works [94, 95] monitor temporal formulas over partially synchronous analog distributed systems; however, they only find one satisfaction, not all. Moreover, their solution is centralized.

Generally, there is a plethora of work on monitoring temporal logic properties, especially Linear Temporal Logic (LTL) and Metric Temporal Logic (MTL). Notably, these works involve using a three-valued MTL for monitoring in the presence of failures and non-FIFO communication channels [7], monitoring satisfaction of an LTL formula [10], using a three-valued LTL for distributed systems with asynchronous properties [96], using a tableau technique for three-valued LTL [8], and, finally, using a past-time distributed temporal logic that emphasizes distributed properties over time [109]. However, all these methods either focus on centralized monitoring or work in discrete settings. In our work, we provide methodologies for decentralized monitoring in continuous-time settings.

8.7 Monitoring Reliability in CPS
Resource trade-offs are broadly studied with respect to monitoring reliability in CPS, such as in the work exploring the trade-offs between power and reliability of wireless sensor networks [30]. This work proposes a model for evaluating the reliability of WSNs considering the battery level as a key factor. The problem of modeling and evaluating the coverage-oriented reliability of CPS subject to common-cause failures is explored in [110], where the proposed methodology takes advantage of reduced ordered binary decision diagrams, which is similar to our binary probing technique. A methodology based on automatic generation of a fault tree [111] is proposed in order to evaluate the reliability and availability of CPS when permanent faults occur on network devices.

One stream of work in CPS is concerned with security-related trade-offs, where security comes at a cost in energy or performance. A relevant survey in this regard provides a classification of existing security concerns and research [3]. A method to determine when to inject cryptographic checks without interfering with control tasks [75] is demonstrated by its authors, where the general idea behind the methodology is maximizing security checks while maintaining a predefined level of control quality. In a similar work, the authors propose a feedback scheduling technique for maintaining network quality of service in wireless sensor networks [125]. Both of these works fall under soft real-time constraints, where essentially security is traded off against deadline adherence. A notable literature review further studies and elaborates on the challenges in designing reliable CPS [74].
The work emphasizes the necessity of raising the level of abstraction in designing reliable CPS, as current networking technologies often do not provide an adequate foundation for CPS.

There is a line of work in the parallel and distributed processing domain on distributing power resources efficiently. For example, a method to bound the energy consumption of a message passing interface program [106] is proposed, where the authors use a linear programming model that knows the execution time of jobs on machines and the effect of changing the frequency on their speedup. This work has since been extended to a scalable method for determining individual task power bounds in a distributed setting given a global power bound [84]. Our work involving resource consumption in a producer-consumer network draws inspiration from existing work that addresses the problem of energy consumption in a producer-consumer network using learning mechanisms to reduce the energy consumption of the overall system [82].

Many researchers target the problem of finding optimal energy savings without impacting performance. One notable study observes the trade-offs between energy and delay for a wide set of applications [48]; the work also studies metrics that can be used to predict memory or communication bottlenecks. Multiple works attempt to tackle this bottleneck problem on a single processor [62, 78]; these works mainly propose alternative dynamic voltage and frequency scaling strategies that maintain the same performance at reduced energy consumption. The work on formal control techniques for power-performance management [124] discusses the effectiveness of using control theory in power management. Furthermore, the series of works [126, 127, 128] constructs an integer linear programming (ILP) model to determine the minimum energy that a program can consume on a single processor. Our work on the multi-resource, multi-node optimization problem is similar to the work on managing the energy-security trade-off in a distributed cyber-physical system [120], with the difference that our approach is more generalized.

CHAPTER 9
CONCLUSION
In this chapter, we summarize our work and highlight our contributions for each methodology. We then discuss our current ongoing work and short-term goals. Finally, we conclude by exploring potential future avenues of research that could be logical next steps of our work.

9.1 Summary
We begin this dissertation with distributed runtime monitoring. Our proposed techniques take an LTL formula and a distributed computation as input and, assuming a bounded clock skew among all processes, chop the computation into multiple segments before applying the automata-based and progression-based monitoring algorithms, implemented as an SMT decision problem, to verify the correctness of the formula. We carried out rigorous synthetic experiments using LTL formulas of varying complexity. Although we attempted to keep our synthetic experiments as close to real-world scenarios as possible, we acknowledge that in these synthetic experiments (as well as any synthetic experiments in our following works), there could be missing environmental variables (which would otherwise be present in real-world scenarios) that could influence our run time. However, to partially account for this shortcoming, we carried out case studies on Cassandra consistency scenarios and a NASA air traffic control dataset.
Following that work, we show an online predicate detection strategy for distributed signals that do not share a global clock. To make the problem tractable, we use causality analysis between real-valued signals, a reasonable assumption on the maximum clock skew among local clocks, and rough knowledge of the system dynamics. We also studied the influence of signal dynamics information on monitoring efficiency. By testing on a real network of autonomous cars, a simulated network of UAVs, and a simulated water distribution system, we arrived at numerous noteworthy findings. Our method may be used to successfully monitor a distributed CPS in an online setting.

For distributed CPS, we presented an approach for monitoring specifications expressed in signal temporal logic (STL), where continuous-time and continuous-valued signals from a group of agents do not share a global clock. Our method relies on an off-the-shelf clock synchronization solution, such as NTP, to ensure a bounded maximum clock skew across all agents in the system. Leveraging our work in predicate detection, we also presented a signal retiming approach that effectively aligns continuous signals in order to detect potential STL violations. To address the complexity, we reduce our runtime monitoring problem to a basic SMT solving problem and cut the distributed signals into a sequence of smaller segments. We also presented a formula progression approach, similar to our work with distributed systems, which takes a distributed signal and an STL formula as input and outputs another STL formula that depicts the formula's progress through the signals. We also presented experimental results from the monitoring of an unmanned aerial vehicle (UAV) fleet and a water distribution system.

We then extend our work to decentralized monitoring, where we perform online conjunctive predicate detection for distributed signals. Our algorithm returns all possible violations of the predicate, which in turn allows us to identify and eliminate bugs from distributed systems regardless of the actual clock drift.

Finally, we provided a generalized model of a streaming network CPS as a producer-consumer network. Our approach incorporates tradeoffs between output quality and resource utilization. These tradeoffs were articulated as a multi-objective optimization problem with the goal of lowering resource utilization while maximizing the reliability (and quality) of devices (or jobs) in a network. To tackle the aforementioned optimization challenge, we provided an efficient technique based on constraint solving utilizing SMT solvers to identify the ideal processing quality selection for each node in the network while respecting resource limitations and minimizing error. This is a significant problem, since network applications frequently require stream processing, which entails a complicated network of processing nodes where data is collected, analyzed, and then communicated to subsequent nodes. We have fully implemented our approach and report experimental findings on an IoT device network.

9.2 Ongoing Work
We have thus far discussed methodologies for monitoring various formal specifications on partially synchronous distributed CPS under both centralized and decentralized monitoring settings. However, in every case, we assume that all the agents in these systems are honest, that is, the agents follow the intended behaviors and protocols without malicious intent.
Our current work involves designing secure monitoring techniques for both centralized and decentralized distributed CPS, where ensuring data privacy is the primary objective. We explain the necessity of data privacy in CPS with the following example. Alice uses health monitoring wearables to measure her heart rate, blood glucose level, etc. Alice's hospital has a server (monitor) that would like to monitor Alice's health data and, if a certain specification is met (e.g., Alice's heart rate is above a threshold and her glucose level is below a threshold), send an alert to Alice's caregiver. However, Alice does not wish to reveal her personal health data to the monitor, and the monitor does not want to reveal its specification to Alice. In other words, both Alice and the monitor wish runtime verification to be performed on Alice's data using the monitor's specification, while keeping each party's data private.

9.2.1 Monitoring with Secure Multi-Party Computation
Secure Multi-Party Computation, or simply Multi-Party Computation (MPC) [57], is a cryptographic protocol that allows multiple parties to jointly compute a function over their individual private inputs without ever revealing those inputs to each other. As an example of MPC, consider a scenario where three friends, Alice, Bob, and Charlie, wish to compute their average salary while never disclosing their actual salaries to one another. Let Sa, Sb, and Sc be the salaries of Alice, Bob, and Charlie, respectively. Only Alice knows the value of Sa, only Bob knows the value of Sb, and only Charlie knows the value of Sc. Alice privately splits her salary amount into three random pieces, such that Sa = a1 + a2 + a3. Bob and Charlie do the same, that is, Sb = b1 + b2 + b3 and Sc = c1 + c2 + c3. Now, Alice shares a2 with Bob and a3 with Charlie; Bob shares b1 with Alice and b3 with Charlie; Charlie shares c1 with Alice and c2 with Bob. Alice then computes S1 = a1 + b1 + c1, Bob computes S2 = a2 + b2 + c2, and Charlie computes S3 = a3 + b3 + c3. It should be noted that it is impossible to extract any salary amount from S1, S2, or S3 alone. However, if Alice, Bob, and Charlie now share S1, S2, and S3 with each other and compute (S1 + S2 + S3)/3, then the desired average salary is obtained without any party revealing their salary amount to the others. A small programmatic sketch of this additive secret-sharing scheme is given at the end of this subsection.

While the above example is fairly straightforward, MPC provides more complex protocols with which arithmetic operations can be carried out without any loss of precision [42]. In our work, we are mostly interested in performing addition and multiplication operations with MPC protocols efficiently, as we generally rely on these two operations for our retiming approach (recall (4.4e)). However, MPC does come with its own set of challenges. While addition protocols can be executed locally (i.e., on agents), multiplication protocols require agents to share partial data with each other multiple times before being able to compute the solution. Naturally, this is an issue for runtime verification, as various factors (e.g., network latency, workload, agent availability) can influence communication delay and, by extension, run time.

We have already made significant headway in addressing some of the challenges presented by runtime verification using MPC. We hope to continue our work in this direction and make significant progress in the near future.
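The following is a minimal illustrative sketch (in Python, with made-up salary values) of the additive secret-sharing computation described above. It is meant only to make the arithmetic concrete; it is not the MPC protocol used in our monitoring framework, and in an actual deployment each partial sum would be computed by a different party, with only those sums exchanged.

    import random

    MOD = 2**31  # in practice, shares live in a finite group so that no single share leaks information

    def additive_shares(secret, n=3):
        # split `secret` into n random pieces that sum to `secret` modulo MOD
        pieces = [random.randrange(MOD) for _ in range(n - 1)]
        pieces.append((secret - sum(pieces)) % MOD)
        return pieces

    # hypothetical private salaries (each known only to its owner)
    Sa, Sb, Sc = 70_000, 85_000, 64_000

    a1, a2, a3 = additive_shares(Sa)  # Alice keeps a1, sends a2 to Bob, a3 to Charlie
    b1, b2, b3 = additive_shares(Sb)  # Bob keeps b2, sends b1 to Alice, b3 to Charlie
    c1, c2, c3 = additive_shares(Sc)  # Charlie keeps c3, sends c1 to Alice, c2 to Bob

    S1 = (a1 + b1 + c1) % MOD         # computed locally by Alice
    S2 = (a2 + b2 + c2) % MOD         # computed locally by Bob
    S3 = (a3 + b3 + c3) % MOD         # computed locally by Charlie

    # only S1, S2, S3 are exchanged; their sum reveals Sa + Sb + Sc, hence the average
    total = (S1 + S2 + S3) % MOD
    print("average salary:", total / 3)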
9.3 Future Work
The work done in this dissertation paves the way for various intriguing directions for further investigation. In this section, we discuss the possible avenues of future work that are currently under our consideration.

First of all, for the monitoring approaches proposed in Chapters 3, 4, and 5, a study of the trade-off between accuracy and scalability can be conducted. We can define the accuracy of verdicts as follows:

accuracy = (actual number of correct verdicts − number of missed verdicts) / (actual number of correct verdicts)

An interesting scope of research would be to observe and report the relationship between the degradation of accuracy and the improvement of run time for the aforementioned monitoring techniques.

While monitoring predicates on distributed signals, our approach finds the first global states that violate a predicate in a segment. A crucial step in debugging distributed CPS is to find all such states. Thus, it is important to investigate data structures that can efficiently represent a set of global states of distributed continuous signals that violate a predicate. In the discrete setting, computation slices [91] are an example of such a data structure. One way to achieve this is by using the long-known notion of regions in timed automata [4].

Because we are reducing the monitoring problem to an SMT solving problem, the problem may become undecidable in some cases. The inevitable next step is to identify the fragment of STL for which the problem is undecidable.

Another conceivable aim is for the monitor of our framework to become fully distributed, as we assume a central monitor in all cases in this dissertation. Having a centralized monitor also exposes our techniques to a single point of failure. Furthermore, we have every reason to suspect that individual monitors in the system may have faults, such as crashing or reporting false verdicts. This necessitates the development of distributed fault-tolerant monitoring techniques.

For our approach on monitoring reliability of CPS, one obvious extension of our method is to represent networks that are not necessarily acyclic, that is, networks that may include feedback loops. Another intriguing line of study is to observe and report on the trade-off between monitor reliability and runtime overhead, as well as network communication.

BIBLIOGRAPHY
[1] Abbas, H., Mittelmann, H., and Fainekos, G. (2014). Formal property verification in a conformance testing framework. In 2014 Twelfth ACM/IEEE Conference on Formal Methods and Models for Codesign (MEMOCODE), pages 155–164. IEEE.
[2] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., and Arshad, H. (2018). State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11):e00938.
[3] Alguliyev, R., Imamverdiyev, Y., and Sukhostat, L. (2018). Cyber-physical systems and their security issues. Computers in Industry, 100:212–223.
[4] Alur, R. and Dill, D. L. (1994). A theory of timed automata. Theoretical Computer Science, 126(2):183–235.
[5] Annpureddy, Y., Liu, C., Fainekos, G., and Sankaranarayanan, S. (2011). S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 254–257. Springer.
[6] Barrett, C. and Tinelli, C. (2018). Satisfiability modulo theories. Springer.
[7] Basin, D., Klaedtke, F., and Zălinescu, E. (2015). Failure-aware runtime verification of distributed systems. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015), volume 45, pages 590–603. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[8] Bataineh, O., Rosenblum, D. S., and Reynolds, M. (2019). Efficient decentralized ltl mon- itoring framework using tableau technique. ACM Transactions on Embedded Computing Systems (TECS), 18(5s):1–21. [9] Bauer, A. and Falcone, Y. (2012). Decentralised ltl monitoring. In International Sympo- sium on Formal Methods, pages 85–100. Springer. [10] Bauer, A. and Falcone, Y. (2016). Decentralised ltl monitoring. Formal Methods in System Design, 48(1):46–93. [11] Bauer, A., Leucker, M., and Schallhart, C. (2010). Comparing ltl semantics for runtime verification. Journal of Logic and Computation, 20(3):651–674. [12] Bauer, A., Leucker, M., and Schallhart, C. (2011). Runtime verification for ltl and tltl. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(4):1–64. [13] Benndorf, M. and Haenselmann, T. (2016). Time synchronization on android devices In The Tenth International Conference on Sensor for mobile construction assessment. 171 Technologies and Applications. Thinkmind. [14] Benveniste, A., Haar, S., Fabre, E., and Jard, C. (2003). Distributed monitoring of con- current and asynchronous systems. In International Conference on Concurrency Theory, pages 1–26. Springer. [15] Bhuyan, B., Sarma, H. K. D., Sarma, N., Kar, A., Mall, R., et al. (2010). Quality of service (qos) provisions in wireless sensor networks and related challenges. Wireless Sensor Network, 2(11):861. [16] Bonakdarpour, B., Fraigniaud, P., Rajsbaum, S., Rosenblueth, D. A., and Travers, C. (2016). Decentralized asynchronous crash-resilient runtime verification. In 27th Interna- tional Conference on Concurrency Theory (CONCUR 2016). Schloss Dagstuhl-Leibniz- Zentrum fuer Informatik. [17] Bozzelli, L. and Sánchez, C. (2014). Foundations of boolean stream runtime verification. In International Conference on Runtime Verification, pages 64–79. Springer. [18] Brunelli, D. and Caione, C. (2015). Sparse recovery optimization in wireless sensor networks with a sub-nyquist sampling rate. Sensors, 15(7):16654–16673. [19] Cassandra, A. (2014). Apache cassandra. Website. Available online at http://planetcassandra. org/what-is-apache-cassandra, 13. [20] Cassandras, C. G. and Lafortune, S. (2008). Introduction to discrete event systems. Springer. [21] Cassar, I. and Francalanza, A. (2015). On synchronous and asynchronous monitor instrumentation for actor-based systems. arXiv preprint arXiv:1502.03514. [22] Chandy, K. M., Mitra, S., and Pilotto, C. (2008). Convergence verification: From shared memory to partially synchronous systems. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 218–232. Springer. [23] Charron-Bost, B. (1991). Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39(1):11–16. [24] Charron-Bost, B. and Schiper, A. (2006). The heard-of model: Unifying all benign failures. EPFL Scientific Publications. [25] Charron-Bost, B. and Schiper, A. (2009). The heard-of model: computing in distributed systems with benign faults. Distributed Computing, 22(1):49–71. [26] Chauhan, H., Garg, V. K., Natarajan, A., and Mittal, N. (2013). A distributed abstrac- tion algorithm for online predicate detection. In 2013 IEEE 32nd International Symposium 172 on Reliable Distributed Systems, pages 101–110. IEEE. [27] Chen, H. (2017). Applications of cyber-physical system: a literature review. Journal of Industrial Integration and Management, 2(03):1750012. [28] Colombo, C. and Falcone, Y. (2016). 
Organising ltl monitors over distributed systems with a global clock. Formal Methods in System Design, 49(1):109–158. [29] Cristian, F. (1988). Agreeing on who is present and who is absent in a synchronous distributed system. In 1988 The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers, pages 206–207. IEEE Computer Society. [30] Dâmaso, A., Rosa, N., and Maciel, P. (2014). Reliability of wireless sensor networks. Sensors, 14(9):15760–15785. [31] Danielsson, L. M. and Sánchez, C. (2019). Decentralized stream runtime verification. In International Conference on Runtime Verification, pages 185–201. Springer. [32] De Moura, L. and Bjørner, N. (2008). Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer. [33] Deshmukh, J. V., Donzé, A., Ghosh, S., Jin, X., Juniwal, G., and Seshia, S. A. (2017). Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30. [34] Dokhanchi, A., Hoxha, B., and Fainekos, G. (2014). On-line monitoring for temporal In International Conference on Runtime Verification, pages 231–246. logic robustness. Springer. [35] Donzé, A., Ferrere, T., and Maler, O. (2013). Efficient robust monitoring for stl. In International Conference on Computer Aided Verification, pages 264–279. Springer. [36] Donzé, A. and Maler, O. (2010). Robust satisfaction of temporal logic over real-valued signals. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 92–106. Springer. [37] Dorigo, M., Birattari, M., and Stutzle, T. (2006). Ant colony optimization. IEEE computational intelligence magazine, 1(4):28–39. [38] Drăgoi, C., Henzinger, T. A., and Zufferey, D. (2016). Psync: a partially synchronous language for fault-tolerant distributed algorithms. ACM SIGPLAN Notices, 51(1):400– 415. [39] Drone Life (2019). FAA UTM project: Decentralized uas traffic management 173 demonstration. https://dronelife.com/2019/09/09/decentralized-uas-traffic-management- demonstration. [40] Dwork, C., Lynch, N., and Stockmeyer, L. (1988). Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2):288–323. [41] El-Hokayem, A. and Falcone, Y. (2020). On the monitoring of decentralized specifi- cations: semantics, properties, analysis, and simulation. ACM Transactions on Software Engineering and Methodology (TOSEM), 29(1):1–57. [42] Evans, D., Kolesnikov, V., Rosulek, M., et al. (2018). A pragmatic introduction to secure multi-party computation. Foundations and Trends® in Privacy and Security, 2(2- 3):70–246. [43] FAA (2019). DOT UAS initiatives. https://www.faa.gov/uas/programs_partnerships/ DOT_initiatives. [44] Fabre, E., Benveniste, A., Haar, S., and Jard, C. (2005). Distributed monitoring of concurrent and asynchronous systems. Discrete Event Dynamic Systems, 15(1):33–84. [45] Fainekos, G. E. and Pappas, G. J. (2007). Robust sampling for mitl specifications. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 147–162. Springer. [46] Fraigniaud, P., Rajsbaum, S., and Travers, C. (2013). Locality and checkability in wait-free computing. Distributed Computing, 26(4):223–242. [47] Fraigniaud, P., Rajsbaum, S., and Travers, C. (2020). A lower bound on the number of opinions needed for fault-tolerant decentralized run-time monitoring. Journal of Applied and Computational Topology, 4(1):141–179. [48] Freeh, V. W., Lowenthal, D. 
K., Pan, F., Kappiah, N., Springer, R., Rountree, B. L., and Femal, M. E. (2007). Analyzing the energy-time trade-off in high-performance computing applications. IEEE Transactions on Parallel and Distributed Systems, 18(6):835–848. [49] Ganguly, R., Momtaz, A., and Bonakdarpour, B. (2021). Distributed runtime verifica- tion under partial synchrony. In 24th International Conference on Principles of Distributed Systems (OPODIS 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik. [50] Ganguly, R., Xuey, Y., Jonckheere, A., Ljungy, P., Schornsteiny, B., Bonakdarpour, B., and Herlihy, M. (2022). Distributed runtime verification of metric temporal properties for cross-chain protocols. arXiv preprint arXiv:2204.09796. [51] Gareth, J., Daniela, W., Trevor, H., and Robert, T. (2013). An introduction to statistical learning: with applications in R. Spinger. 174 [52] Garg, V. (2002a). Elements of Distributed Computing. John Wiley & Sons. [53] Garg, V. K. (2002b). Elements of distributed computing. John Wiley & Sons. [54] Garg, V. K. (2020). Predicate detection to solve combinatorial optimization problems. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 235–245. [55] Garg, V. K. and Chase, C. M. (1995). Distributed algorithms for detecting conjunctive In Proceedings of 15th International Conference on Distributed Computing predicates. Systems, pages 423–430. IEEE. [56] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/- variance dilemma. Neural computation, 4(1):1–58. [57] Goldreich, O. (1998). Secure multi-party computation. Manuscript. Preliminary ver- sion, 78(110). [58] Hasabelnaby, M. (2016). Decentralized runtime verification of ltl specifications in dis- tributed systems. Master’s thesis, University of Waterloo. [59] Havelund, K. and Rosu, G. (2001). Monitoring programs using rewriting. In Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001), pages 135–143. IEEE. [60] He, F. and Zhao, S. (2008). Research on synchronous control of nodes in distributed network system. In 2008 IEEE International Conference on Automation and Logistics, pages 2999–3004. IEEE. [61] Hendry-Brogan, M. (2019). Global unmanned aerial vehicle (uav) market report. Tech- nical report, Technical report, May. [62] Hsu, C.-H. and Kremer, U. (2003). The design, implementation, and evaluation of a compiler algorithm for cpu energy reduction. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 38–48. [63] Jiang, Y., Song, H., Wang, R., Gu, M., Sun, J., and Sha, L. (2016). Data-centered IEEE transactions on runtime verification of wireless medical cyber-physical system. industrial informatics, 13(4):1900–1909. [64] Kazemlou, S. and Bonakdarpour, B. (2018). Crash-resilient decentralized synchronous In 2018 IEEE 37th Symposium on Reliable Distributed Systems runtime verification. (SRDS), pages 207–212. IEEE. [65] Ketkar, N. (2017). Introduction to keras. In Deep learning with Python, pages 97–111. 175 Springer. [66] Kopetz, H., Grünsteidl, G., and Reisinger, J. (1991). Fault-tolerant membership service In Dependable Computing for Critical in a synchronous distributed real-time system. Applications, pages 411–429. Springer. [67] Koymans, R. (1990). Specifying real-time properties with metric temporal logic. Real- time systems, 2(4):255–299. [68] Kshemkalyani, A. and Singhal, M. (2011). Distributed Computing: Principles, Algo- rithms, and Systems. 
Cambridge University Press. [69] Kuhn, M., Johnson, K., et al. (2013). Applied predictive modeling, volume 26. Springer. [70] Kuila, P. and Jana, P. K. (2014). A novel differential evolution based clustering algo- rithm for wireless sensor networks. Applied soft computing, 25:414–425. [71] Kulkarni, S. S., Demirbas, M., Madappa, D., Avva, B., and Leone, M. (2014). Logical physical clocks. In International Conference on Principles of Distributed Systems, pages 17–32. Springer. [72] Lakshman, A. and Malik, P. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40. [73] Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565. [74] Lee, E. A. (2008). Cyber physical systems: Design challenges. In 2008 11th IEEE international symposium on object and component-oriented real-time distributed computing (ISORC), pages 363–369. IEEE. [75] Lesi, V., Jovanov, I., and Pajic, M. (2017). Security-aware scheduling of embedded control tasks. ACM Transactions on Embedded Computing Systems (TECS), 16(5s):1–21. [76] Lim, K. K., Park, J., and Shon, J. G. (2019). Differential data processing technique to improve the performance of wireless sensor networks. The Journal of Supercomputing, 75(8):4489–4504. [77] Liu, L., Kong, W., Ando, T., Yatsu, H., and Fukuda, A. (2013). A survey of acceleration techniques for smt-based bounded model checking. In 2013 international conference on computer sciences and applications, pages 554–559. IEEE. [78] Lorch, J. R. and Smith, A. J. (2001). Improving dynamic voltage scaling algorithms with pace. ACM SIGMETRICS Performance Evaluation Review, 29(1):50–61. 176 [79] Maler, O. and Nickovic, D. (2004). Monitoring temporal properties of continuous signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, pages 152–166. Springer. [80] Manna, Z. and Pnueli, A. (2012). Temporal verification of reactive systems: safety. Springer Science & Business Media. [81] Mattern, F. et al. (1988). Virtual time and global states of distributed systems. Univ., Department of Computer Science. [82] Medhat, R., Bonakdarpour, B., and Fischmeister, S. (2018). Energy-efficient multiple producer-consumer. IEEE Transactions on Parallel and Distributed Systems, 30(3):560– 574. [83] Medhat, R., Bonakdarpour, B., Kumar, D., and Fischmeister, S. (2015). Runtime moni- toring of cyber-physical systems under timing and memory constraints. ACM Transactions on Embedded Computing Systems (TECS), 14(4):1–29. [84] Medhat, R., Funk, S., and Rountree, B. (2017). Scalable performance bounding under multiple constrained renewable resources. In Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, pages 1–8. [85] Mehlitz, P., Giannakopoulou, D., and Shafiei, N. (2019). Analyzing airspace data with race. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pages 1–10. IEEE. [86] Mehmood, I., Ullah, A., Muhammad, K., Deng, D.-J., Meng, W., Al-Turjman, F., Sajjad, M., and de Albuquerque, V. H. C. (2019). Efficient image recognition and retrieval on iot-assisted energy-constrained platforms from big data repositories. IEEE Internet of Things Journal, 6(6):9246–9255. [87] Metropolis, N. and Ulam, S. (1949). The monte carlo method. Journal of the American statistical association, 44(247):335–341. [88] Mills, D., Martin, J., Burbank, J., and Kasch, W. (2010). Network time protocol version 4: Protocol and algorithms specification. 
RFC 5905, RFC Editor. [89] Mitsch, S. and Platzer, A. (2016). Modelplex: Verified runtime validation of verified cyber-physical system models. Formal Methods in System Design, 49(1):33–74. [90] Mittal, N. and Garg, V. K. (2001). On detecting global predicates in distributed compu- tations. In Proceedings 21st International Conference on Distributed Computing Systems, pages 3–10. IEEE. [91] Mittal, N. and Garg, V. K. (2005). Techniques and applications of computation slicing. 177 Distributed Computing, 17(3):251–277. [92] Mittal, V., Gupta, S., and Choudhury, T. (2018). Comparative analysis of authentica- tion and access control protocols against malicious attacks in wireless sensor networks. In Smart computing and informatics, pages 255–262. Springer. [93] Mogull, R. and Securosis, L. (2007). Understanding and selecting a data loss prevention solution. Technicalreport, SANS Institute, 27. [94] Momtaz, A., Abbas, H., and Bonakdarpour, B. (2023). Monitoring signal temporal logic in distributed cyber-physical systems. In Proceedings of the ACM/IEEE 14th Inter- national Conference on Cyber-Physical Systems (with CPS-IoT Week 2023), ICCPS ’23, page 154–165, New York, NY, USA. Association for Computing Machinery. [95] Momtaz, A., Basnet, N., Abbas, H., and Bonakdarpour, B. (2021). Predicate mon- In International Conference on Runtime itoring in distributed cyber-physical systems. Verification, pages 3–22. Springer. [96] Mostafa, M. and Bonakdarpour, B. (2015). Decentralized runtime verification of ltl specifications in distributed systems. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 494–503. IEEE. [97] Moura, L. d. and Bjørner, N. (2008). Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer. [98] National Science Foundations (2014). Revolutionizing how we keep track of time in cyber-physical systems. https://nsf.gov/news/news_summ.jsp?cntnid=131691. [99] Ogale, V. A. and Garg, V. K. (2007). Detecting temporal logic predicates on distributed In International Symposium on Distributed Computing, pages 420–434. computations. Springer. [100] Pant, Y. V., Abbas, H., and Mangharam, R. (2017). Smooth operator: Control using the smooth robustness of temporal logic. In 2017 IEEE Conference on Control Technology and Applications (CCTA), pages 1235–1240. IEEE. [101] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blon- del, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830. [102] Pereira, J. C., Machado, N., and Sousa Pinto, J. (2020). Testing for race conditions in distributed systems via smt solving. In International Conference on Tests and Proofs, pages 122–140. Springer. 178 [103] Petri, C. A. and Reisig, W. (2008). Petri net. Scholarpedia, 3(4):6477. [104] Pnueli, A. (1977). The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46–57. ieee. [105] Quesel, J.-D. (2013). Similarity, logic, and games: bridging modeling layers of hybrid systems. PhD thesis, Univ., Fak. II, Department für Informatik. [106] Rountree, B., Lowenthal, D. K., Funk, S., Freeh, V. W., De Supinski, B. R., and Schulz, M. (2007). Bounding energy consumption in large-scale mpi programs. In SC’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–9. IEEE. [107] Sánchez, C. (2018). 
Online and offline stream runtime verification of synchronous systems. In International Conference on Runtime Verification, pages 138–163. Springer. [108] Sen, A. and Garg, V. K. (2004). Detecting temporal logic predicates in distributed programs using computation slicing. In Principles of Distributed Systems: 7th Interna- tional Conference, OPODIS 2003, La Martinique, French West Indies, December 10-13, 2003, Revised Selected Papers 7, pages 171–183. Springer. [109] Sen, K., Vardhan, A., Agha, G., and Rosu, G. (2004). Efficient decentralized moni- toring of safety in distributed systems. In Proceedings. 26th International Conference on Software Engineering, pages 418–427. IEEE. [110] Shrestha, A., Xing, L., and Liu, H. (2007). Modeling and evaluating the reliability of wireless sensor networks. In 2007 Annual Reliability and Maintainability Symposium, pages 186–191. IEEE. [111] Silva, I., Guedes, L. A., Portugal, P., and Vasques, F. (2012). Reliability and availabil- ity evaluation of wireless sensor networks for industrial applications. Sensors, 12(1):806– 838. [112] Sistla, A. P., Žefran, M., and Feng, Y. (2011). Runtime monitoring of stochastic cyber- physical systems with hybrid state. In International Conference on Runtime Verification, pages 276–293. Springer. [113] Sodhro, A. H., Chen, L., Sekhari, A., Ouzrout, Y., and Wu, W. (2018). Energy efficiency comparison between data rate control and transmission power control algorithms for wireless body sensor networks. International Journal of Distributed Sensor Networks, 14(1):1550147717750030. [114] Stoller, S. D. (1997). Detecting global predicates in distributed systems with clocks. In International Workshop on Distributed Algorithms, pages 185–199. Springer. [115] Stoller, S. D. and Schneider, F. B. (1995). Verifying programs that use causally-ordered 179 message-passing. Science of computer programming, 24(2):105–128. [116] Tekken Valapil, V., Yingchareonthawornchai, S., Kulkarni, S., Torng, E., and Demir- bas, M. (2017). Monitoring partially synchronous distributed systems using smt solvers. In International Conference on Runtime Verification, pages 277–293. Springer. [117] USNRC (2021a). Emergency core cooling systems. https://www.nrc.gov/docs/ML1122/ML11223A220.pdf. [118] USNRC (2021b). Pressurized water reactor systems. https://www.nrc.gov/reading- rm/basic-ref/students/for-educators/04.pdf. [119] Valapil, V. T., Kulkarni, S., Torng, E., and Appleton, G. (2020). Efficient two-layered monitor for partially synchronous distributed systems (technical report). arXiv preprint arXiv:2007.13030. [120] Vu, A.-D., Medhat, R., and Bonakdarpour, B. (2019). Managing the security-energy In Proceedings of the 10th ACM/IEEE tradeoff in distributed cyber-physical systems. International Conference on Cyber-Physical Systems, pages 118–128. [121] Widder, J., Lann, G. L., and Schmid, U. (2005). Failure detection with booting in In European Dependable Computing Conference, pages partially synchronous systems. 20–37. Springer. [122] Wolf, W. (2009). Cyber-physical systems. Computer, 42(03):88–89. [123] Wong, T.-T. and Yeh, P.-Y. (2019). Reliable accuracy estimates from k-fold cross validation. IEEE Transactions on Knowledge and Data Engineering, 32(8):1586–1594. [124] Wu, Q., Juang, P., Martonosi, M., Peh, L.-S., and Clark, D. W. (2005). Formal control techniques for power-performance management. IEEE micro, 25(5):52–62. [125] Xia, F., Ma, L., Dong, J., and Sun, Y. (2008). Network qos management in cyber- physical systems. 
In 2008 International Conference on Embedded Software and Systems Symposia, pages 302–307. IEEE. [126] Xie, F., Martonosi, M., and Malik, S. (2003). Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 49–62. [127] Xie, F., Martonosi, M., and Malik, S. (2004). Intraprogram dynamic voltage scaling: Bounding opportunities with analytic modeling. ACM Transactions on Architecture and Code Optimization (TACO), 1(3):323–367. [128] Xie, F., Martonosi, M., and Malik, S. (2005). Bounds on power savings using runtime 180 dynamic voltage scaling: an exact algorithm and a linear-time heuristic approximation. In Proceedings of the 2005 international symposium on Low power electronics and design, pages 287–292. [129] Xu, S. and Chen, L. (2008). A novel approach for determining the optimal number of hidden layer neurons for fnn’s and its application in data mining. 5th International Conference on Information Technology and Applications. [130] Yingchareonthawornchai, S., Nguyen, D. N., Valapil, V. T., Kulkarni, S. S., and Demir- bas, M. (2016). Precision, recall, and sensitivity of monitoring partially synchronous dis- tributed systems. In International Conference on Runtime Verification, pages 420–435. Springer. [131] Zhang, J., Tu, H., Ren, Y., Wan, J., Zhou, L., Li, M., and Wang, J. (2018). An adaptive synchronous parallel strategy for distributed machine learning. IEEE Access, 6:19222–19230. [132] Zhang, T., Gebhard, P., and Sokolsky, O. (2016). Smedl: combining synchronous and In International Conference on Runtime Verification, pages asynchronous monitoring. 482–490. Springer. [133] Zheng, X., Julien, C., Podorozhny, R., Cassez, F., and Rakotoarivelo, T. (2016). Effi- cient and scalable runtime monitoring for cyber–physical system. IEEE Systems Journal, 12(2):1667–1678. [134] Zhou, Y., Zhang, Y., and Fang, Y. (2007). Access control in wireless sensor networks. Ad Hoc Networks, 5(1):3–13. 181