RUNTIME VERIFICATION OF DISTRIBUTED SYSTEMS

By

Ritam Ganguly

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2023

ABSTRACT

Given the broad scale of distribution and complexity of today's systems, an exhaustive model-checking algorithm is computationally costly and testing is not exhaustive enough. Runtime verification, on the other hand, analyzes a developing execution of the system, be it online or offline, in order to check the health of the system with respect to some specification. Runtime verification of distributed systems with respect to temporal specifications is both a critical and a challenging task. It is critical because it ensures the reliability of the system by detecting violations of system requirements. It is challenging because guaranteeing the absence of violations requires analyzing every possible ordering of system events, which is computationally expensive. In this dissertation, we focus on partially synchronous distributed systems, where the various components of the distributed system do not share a common global clock and a clock synchronization algorithm limits the maximum clock skew among processes to a constant. The main contributions of this dissertation are as follows:

• We introduce two monitoring techniques where the specification in linear temporal logic (LTL) is either represented by a deterministic finite automaton, or a progression-based formula rewriting technique is used to reduce the distributed runtime verification problem to an SMT problem.

• We introduce a progression-based formula rewriting scheme for monitoring metric temporal logic (MTL) specifications, which employs SMT-solving techniques with probabilistic guarantees.
• We introduce an (offline) SMT-based monitor synthesis algorithm that minimizes the size of monitoring messages for an automata-based synchronous monitoring algorithm coping with up to t monitor crash failures.

• We extend the stream-based specification language Lola for monitoring partially-synchronous systems and develop an (online) SMT-based decentralized monitoring technique for the same.

• All of our techniques have been tested by both extensive synthetic experiments and real-life case studies, such as the distributed database Cassandra; Orange4Home, an Internet-of-Things dataset of a house; Ethereum-based smart contracts; and Industrial Control Systems (ICS) such as Secure Water Treatment (SWaT).

This dissertation is dedicated to my grandparents, Rina Ganguly and Rama Prasad Ganguly

ACKNOWLEDGMENTS

First of all, I would like to thank my advisor, Dr. Borzoo Bonakdarpour, for offering me technical, financial, and moral support during the four years of my research. He introduced me to the area of runtime verification of distributed systems. Many of the results reported in this dissertation are inspired by my discussions with him about our ideas and about developing a general verification approach for a wide range of distributed systems with different system specifications. He helped me understand what research is and how to solve a problem. My dissertation guidance committee, comprising Dr. Borzoo Bonakdarpour, Dr. Sandeep Kulkarni, Dr. Eric Torng, and Dr. Shaunak D. Bopardikar, has been a source of great help, guidance, and encouragement. I would like to express gratitude to Dr. Sandeep Kulkarni and Dr. Gurpur Prabhu (from Iowa State University) for giving me the exposure and motivation to take up teaching as a career. It has been a great pleasure to work closely with Anik Momtaz (Michigan State University) and Yingjie Xue (Brown University).
They co-authored multiple papers with me on runtime verification of distributed systems with respect to LTL and MTL specifications, respectively. It is impossible to list their innumerable contributions to my work. I would like to truly thank the Department of Computer Science and Engineering, College of Engineering at Michigan State University and the Department of Computer Science at Iowa State University for offering me financial support through teaching assistantships for several semesters and travel grants for conference travel and registration. I would also like to thank my family, especially Ranjan Ganguly (baba), Molly Ganguly (ma), Ranjit Ganguly (jethu), and Rina Ganguly (amma), for their continuing encouragement and support. Additionally, the continuous encouragement from Saumitra Sinha (sinha-jethu) and Biman Ghosh (biman-jethu) has enabled me not only to have a pleasant stay but also to be inspired to travel to the USA to pursue my PhD. Special thanks go out to my colleagues at the Trustworthy and Reliable Technologies (TART) laboratory, Anik Momtaz, Eshita Zaman, Tzu-Han Hsu, and Oyendrila Dobe, for proofreading my papers. Finally, I would like to thank my friends Puja Agarwal, Aniket Banerjee, Abhratanu Dutta, Saptaparni Ghosh, Sayantani Ghosh, Aishwarya Mazumdar, Debrudra Mitra, and Soham Vanage; because of them my PhD journey has been enjoyable and memorable.

TABLE OF CONTENTS

LIST OF TABLES . . . x
LIST OF FIGURES . . . xi
LIST OF ALGORITHMS . . . xiv

Chapter 1 Introduction . . . 1
  1.1 Motivation . . . 1
  1.2 Technical Challenges of RV of Distributed Systems . . . 5
    1.2.1 Formal Specification . . . 9
  1.3 Thesis Statement . . . 10
  1.4 Contribution . . . 11
  1.5 Organization . . . 13

Chapter 2 Preliminary Concepts . . . 15
  2.1 Distributed System . . . 15
    2.1.1 Synchronous Distributed System . . . 16
    2.1.2 Partially-Synchronous Distributed System . . . 16
  2.2 Linear Temporal Logics (LTL) for RV . . . 17
    2.2.1 Infinite-trace Semantics of LTL . . . 18
    2.2.2 Finite-trace Semantics of LTL . . . 18
  2.3 Metric Temporal Logic . . . 20
  2.4 Hybrid Logical Clocks . . . 22
  2.5 Stream-based Specification Lola . . . 23

Chapter 3 Runtime Verification for Linear Temporal Specifications . . . 27
  3.1 Introduction . . . 27
    3.1.1 Problem Statement . . . 32
  3.2 Formula Progression for LTL . . . 33
  3.3 SMT-based Solution . . . 40
    3.3.1 Overall Idea . . . 40
    3.3.2 SMT Entities . . . 43
    3.3.3 SMT Constraints . . . 44
  3.4 Optimization . . . 46
    3.4.1 Segmentation of Distributed Computation . . . 46
    3.4.2 Parallelized Monitoring . . . 48
  3.5 Case Studies and Evaluation . . . 51
    3.5.1 Implementation and Experimental Setup . . . 51
    3.5.2 Analysis of Results – Synthetic Experiments . . . 53
    3.5.3 Case Study 1: Cassandra . . . 58
    3.5.4 Case Study 2: RACE . . . 61
  3.6 Summary and Limitation . . . 62

Chapter 4 Runtime Verification for Time-bounded Temporal Specifications . . . 64
  4.1 Introduction . . . 64
    4.1.1 Estimating Offset Distribution . . . 68
    4.1.2 Formal Problem Statement . . . 70
  4.2 Formula Progression for MTL . . . 72
  4.3 SMT-based Solution . . . 76
    4.3.1 SMT Entities . . . 76
    4.3.2 SMT Constraints . . . 77
    4.3.3 Segmentation of a Distributed Computation . . . 79
  4.4 Case Study and Evaluation . . . 80
    4.4.1 UPPAAL Benchmarks . . . 80
    4.4.2 Blockchain . . . 88
  4.5 Summary and Limitation . . . 99

Chapter 5 Fault Tolerant Runtime Verification of Synchronous Distributed Systems . . . 100
  5.1 Introduction . . . 100
  5.2 Model of Computation . . . 103
    5.2.1 Overall Picture . . . 103
    5.2.2 Detailed Description . . . 104
    5.2.3 Fault Model . . . 106
    5.2.4 Problem Statement . . . 106
  5.3 The General Idea and Motivating Example . . . 107
    5.3.1 Symbolic View µ . . . 107
    5.3.2 Computing LC . . . 108
    5.3.3 Motivating Example . . . 109
  5.4 Monitor Transformation Algorithm . . . 110
    5.4.1 The Challenge of Constructing Extended Monitors . . . 111
    5.4.2 Identifying the Minimum-size Split . . . 112
    5.4.3 The Complete Transformation Algorithm . . . 116
  5.5 Experimental Results . . . 122
    5.5.1 Synthetic Experiments . . . 122
    5.5.2 Orange4Home Dataset . . . 128
  5.6 Summary and Limitation . . . 131

Chapter 6 Decentralized Runtime Verification for Stream-based Specifications . . . 132
  6.1 Introduction . . . 132
  6.2 Partially Synchronous Lola . . . 134
    6.2.1 Distributed Streams . . . 135
    6.2.2 Partially Synchronous Lola . . . 136
  6.3 Decentralized Monitoring Architecture . . . 139
    6.3.1 Overall Picture . . . 139
    6.3.2 Detailed Description . . . 140
    6.3.3 Problem Statement . . . 142
  6.4 Calculating LS . . . 142
  6.5 SMT-based Solution . . . 146
    6.5.1 SMT Entities . . . 146
    6.5.2 SMT Constraints . . . 146
  6.6 Runtime Verification of Lola Specifications . . . 148
    6.6.1 Computing LC . . . 148
    6.6.2 Bringing it all Together . . . 149
  6.7 Case Study and Evaluation . . . 152
    6.7.1 Synthetic Experiments . . . 153
    6.7.2 Case Studies: Decentralized ICS and Flight Control RV . . . 157
  6.8 Summary and Limitation . . . 163

Chapter 7 Related Work . . . 164
  7.1 Lattice-theoretic Distributed Monitoring . . . 164
  7.2 Monitoring Distributed System . . . 165
  7.3 Monitoring Time-bounded Specification . . . 167
  7.4 Runtime Verification of Hyperproperties . . . 169
  7.5 Fault-tolerant Distributed Monitoring . . . 170
  7.6 Statistical Model Checking . . . 171
  7.7 Beyond Runtime Verification . . . 172

Chapter 8 Conclusion and Future Work . . . 174
  8.1 Summary . . . 174
  8.2 Contributions . . . 176
  8.3 Future Work . . . 177
    8.3.1 Distributed Systems . . . 177
    8.3.2 AI Safety . . . 178

BIBLIOGRAPHY . . . 183

LIST OF TABLES

Table 1.1: Summarized Publications . . . 14
Table 5.1: List of formulas used to check our algorithm . . . 125
Table 5.2: Formula from Orange4Home . . . 129

LIST OF FIGURES

Figure 1.1: Distributed computation . . . 6
Figure 1.2: Computation Lattice . . . 7
Figure 2.1: LTL3 monitor for ϕ = a U b . . . 19
Figure 2.2: HLC example . . . 23
Figure 3.1: Distributed computation . . . 28
Figure 3.2: Distributed computation . . . 29
Figure 3.3: Monitor automaton for formula ϕ . . . 30
Figure 3.4: Progression and segmentation . . . 31
Figure 3.5: Progression example . . . 36
Figure 3.6: Removing non-loop cycles in an LTL3 Monitor . . . 41
Figure 3.7: Reachability Matrix for a U b . . . 49
Figure 3.8: Reachability Tree for a U b . . . 49
Figure 3.9: Synthetic experiments – impact of different parameters . . . 55
Figure 3.10: Impact of parallelization on different data . . . 57
Figure 3.11: False Warnings for Synthetic Data . . . 57
Figure 3.12: Cassandra experiments . . . 59
Figure 4.1: Hedged Two-party Swap . . . 65
Figure 4.2: Progression Example . . . 66
Figure 4.3: Example of a Cumulative Density Function . . . 69
Figure 4.4: Different time interleaving of events . . . 70
Figure 4.5: A trace example divided into three segments . . . 75
Figure 4.6: Train model . . . 81
Figure 4.7: Gate model . . . 81
Figure 4.8: Fischer model . . . 82
Figure 4.9: Gossiping people model . . . 83
Figure 4.10: Different parameters' impact on runtime for synthetic data . . . 86
Figure 4.11: Different parameters' impact on statistical guarantee for synthetic data . . . 87
Figure 4.12: Results from the blockchain experiments . . . 99
Figure 5.1: LTL3 monitor for ϕ = ♦(a ∧ b) . . . 108
Figure 5.2: Extended LTL3 monitor for ϕ = ♦(a ∧ b) . . . 111
Figure 5.3: Splitting a transition to two . . . 118
Figure 5.4: Splitting a self-loop to two . . . 118
Figure 5.5: Crash distribution over a trace of length 100 . . . 124
Figure 5.6: Average # of rounds and total # of messages sent per situation for different read and crash distributions for flip-flop distributed trace for ϕ4 with l = 1 . . . 127
Figure 5.7: Impact of communicating after l states for various LTL formulas on synthetic data . . . 129
Figure 5.8: Impact of communicating after l states for various LTL formulas on data from the Orange4Home dataset . . . 130
Figure 6.1: Partially Synchronous LOLA . . . 133
Figure 6.2: Partially Synchronous Lola Example . . . 138
Figure 6.3: Dependency Graph Example . . . 139
Figure 6.4: Example of generating LS . . . 145
Figure 6.5: Impact of different parameters on runtime for synthetic data . . . 155
Figure 6.6: Impact of different parameters on message size for synthetic data . . . 156
Figure 6.7: False-Positives for ICS Case-Studies . . . 162
Figure 8.1: Decision boundary plot . . . 181

LIST OF ALGORITHMS

Algorithm 1: Non-Self Loop Cycle Removal Algorithm . . . 41
Algorithm 2: Always . . . 74
Algorithm 3: Eventually . . . 74
Algorithm 4: Until . . . 74
Algorithm 5: Behavior of Monitor Mi, for i ∈ [1, n] . . . 105
Algorithm 6: Updated behavior of Monitor Mi, for i ∈ [1, n] . . . 109
Algorithm 7: Function to determine whether a transition has to split . . . 113
Algorithm 8: Extended LTL3 Monitor Construction . . . 117
Algorithm 9: Behavior of a Monitor Mi, for i ∈ [1, |M|] . . . 140
Algorithm 10: Computation on Monitor Mi . . . 150

Chapter 1

Introduction

1.1 Motivation

As the world moves ahead, we find ourselves surrounded by technology. At the core of this technology today lie several intelligent, automated programs, as pointed out in [SSS16]. From self-driving cars to automated smart contracts for blockchain transactions, from keeping records efficiently in a data center to maneuvering aircraft in the sky, our health, safety, well-being, and finances are managed, directed, and often controlled by this 'intelligent' software. However, the very autonomy that makes this software unbiased also makes it vulnerable to attacks of different kinds. Since these systems work without any human intervention, we must verify them before deploying them. Any slight error in the development or deployment of this software can cause multi-million-dollar losses or even the loss of human lives, the very lives it was built to protect and benefit. Multiple examples of such faults can be seen in our world. As pointed out by [EP18], version 1.5 of the Parity Multisig Wallet smart contract [Tec17] included a vulnerability that led to the loss of 30 million US dollars. Thus, developing effective, safe, and fault-tolerant systems is both urgent and essential to protect against possible losses, both financial and human. Furthermore, critical infrastructure such as the manufacturing and distribution of power, gas, and water is often the target of such attacks, which on average cost the affected company around $5 million and 50 days of system downtime. A recent report [SP18] pointed out that such an attack often compromises the integrity of the generated data, thereby undermining the operator's ability to make sound decisions.
Moreover, as identified in [LLL+17, LLLG16, LHJ+14], distributed systems are prone to distributed concurrency (DC) bugs caused by the non-deterministic timing of distributed events. The results show that 63% of all DC bugs surface in the presence of hardware faults such as machine crashes, network delays, timeouts, and disk errors. Additionally, 53% of DC bugs lead to explicit local or global errors in widely deployed cloud-based distributed systems such as Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. In the past few decades, achieving system-wide dependability and reliability has benefited substantially from incorporating rigorous formal methods to verify and prove the correctness of safety-critical systems, as pointed out in [Bow93]. In the aviation industry, formal methods have been used to develop standards and are accepted as a part of the certification process [RTC22]. Tools such as Astrée [CCF+05] and Frama-C [KKP+15] were successfully employed to formally analyze portions of the code for several aircraft models, including the current largest passenger aircraft, the A380 [MLD+13, SWDD09]. In social media, Facebook internally runs the INFER tool to verify selected properties, such as memory safety errors and other common bugs, of its mobile apps, used by over a billion people [CDD+15]. These are some of the success stories of verification in building reliable and dependable systems identified in [HGM20]. Amazon Web Services (AWS) has included runtime threat detection coverage for Amazon Elastic Kubernetes Service (Amazon EKS) [Ser23a] nodes and containers within the AWS environment. EKS Runtime Monitoring uses a GuardDuty [Ser23b] security agent to add runtime visibility into individual EKS workloads, file access, process execution, and network connections. Reliability and dependability are especially critical in the domain of distributed systems, which inherently consist of complex algorithms and intertwined concurrent components.
Given the complexity of today's computing systems, deploying exhaustive verification techniques such as model checking and theorem proving comes at a high cost in terms of time, resources, and expertise. In many cases, formal verification is hard to scale to a realistic size to analyze the system's correctness. Moreover, exhaustive verification techniques may overlook bugs due to unanticipated stimuli from the environment, internal bugs in virtual machines or operating systems, as well as hardware faults. On the other side of the spectrum, testing is a best-effort method to examine correctness, which scrutinizes only a subset of the behaviors of the system. Due to its under-approximate nature, testing often does not reveal obscure corner cases that complex systems may reach at run time. In a distributed setting, the inherent uncertainty about an exponential number of orderings of events makes testing techniques often blind to concurrency bugs. Runtime verification (RV) is a popular lightweight technique, where a monitor or a set of monitors continually inspects the health of a system under consideration at run time with respect to a formally specified set of properties. The formal specification is normally given in some language with clear syntax and semantics, such as regular expressions or some form of temporal logic. RV acts as a crucial complement to costly model checking and non-exhaustive testing. It often bridges the gap between how a system was designed to perform and how the system actually performs in the presence of various external environmental factors. Compared to model checking and testing, runtime verification stands out because of its ability to verify the actual execution of the system, along with its ability to be aware of any external stimuli of the environment affecting the working of the system.
As the scale and application of distributed systems reach new heights, so does the complexity of verifying the correctness of these systems. To add to this complexity, we find added challenges in the form of the different clock synchronization schemes adopted by distributed systems. In other words, we can classify distributed systems according to the clock synchronization schemes they follow, which are mainly of two types: synchronous and asynchronous. In a synchronous system, all components of the distributed system share a common global clock. Although verifying such a system is comparatively easier, maintaining it is costly, as synchronization messages must be sent at very close intervals. On the other hand, asynchronous systems involve no synchronization messages. Although it is extremely cheap to maintain such a system, verifying it is extremely costly, since doing so involves checking all possible interleavings of the events. An efficient yet effective middle ground involves a clock synchronization algorithm that sends out clock synchronization messages after a certain time interval. This limits the clock skew between all pairs of components to a constant, thereby limiting the number of interleavings that need to be checked to verify such a system.

Motivating Examples: Consider a large, geographically separated distributed database consisting of two datasets: Student, containing details of the students enrolled in the university, and Enrollment, which keeps track of the classes each student has enrolled in. The distributed nature of the database makes maintaining a common global clock shared among all the components a challenge. Moreover, the distributed database does not maintain data normalization. This makes the data stored in the database vulnerable to replication and also allows unrelated data to be stored in the database. For example, an entry in the Student table reads (1234, "Leslie Lamport", "126 Spartan Drive, East Lansing, MI 48800"). This represents a student with the name Leslie Lamport and student identification number 1234, living at the corresponding residential address. On the other side, an entry in the Enrollment table reads (1234, "Edsger Dijkstra", "CSE 260: Discrete Mathematics"). This represents a student with the name Edsger Dijkstra and student identification number 1234, enrolled in the corresponding course. As can be seen, although the student identification numbers match, the names do not. In another example, we see an entry in the Enrollment table that reads (2345, "Andrew Tanenbaum", "CSE 410: Distributed Systems"). This represents a student with the name Andrew Tanenbaum and student identification number 2345, enrolled in the corresponding course, but no such entry exists in the Student table with the respective student identification number. Errors like these are common and lead to violations of the ACID (Atomicity, Consistency, Isolation, and Durability) properties. Model checking of such a distributed database would entail a large state space consisting of all possible combinations of entries in each of the datasets, along with their times of occurrence. Although it would be exhaustive and would succeed in determining the faults, it would involve a huge cost and considerable expertise, making it a non-preferred option. Testing, on the other hand, although cheap, does not guarantee detection of such an error. Additionally, considering the large size of the distributed database, designing test cases is tedious and depends heavily on the skill of the tester. Runtime verification achieves a balance: it is a lightweight technique that still guarantees the detection of such an error once it happens, making it one of the most preferred options.
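The consistency checks in this example amount to a referential-integrity scan across the two tables. The following is a minimal, hypothetical sketch (the tuple layout, function name, and example rows are our own illustration, not an actual database schema):

```python
# Hypothetical sketch: each Student row is (id, name, address) and each
# Enrollment row is (id, name, course). We flag the two error classes from
# the running example: name mismatches and dangling enrollments.
def check_enrollment_consistency(students, enrollments):
    names_by_id = {sid: name for (sid, name, _address) in students}
    violations = []
    for sid, name, _course in enrollments:
        if sid not in names_by_id:
            violations.append(f"no Student entry for id {sid} ({name!r})")
        elif names_by_id[sid] != name:
            violations.append(
                f"name mismatch for id {sid}: "
                f"Student has {names_by_id[sid]!r}, Enrollment has {name!r}")
    return violations

students = [(1234, "Leslie Lamport", "126 Spartan Drive, East Lansing, MI 48800")]
enrollments = [(1234, "Edsger Dijkstra", "CSE 260: Discrete Mathematics"),
               (2345, "Andrew Tanenbaum", "CSE 410: Distributed Systems")]
for violation in check_enrollment_consistency(students, enrollments):
    print(violation)
```

Both example violations are reported. In a real distributed database, the difficulty is not this check itself but obtaining a consistent global snapshot of the two tables on which to run it.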
1.2 Technical Challenges of RV of Distributed Systems

Monitoring distributed systems and distributed monitoring have recently gained traction [CGNM13, BF12, CF16, SVAG04, Gar02, SS95, OG07, YNV+16, VYK+17, VKTA20, BKZ12, BKMZ15, BKMZ13] as techniques to discover latent bugs in concurrent settings. Most of the above-mentioned approaches share the common assumption that the system under inspection is synchronous. All processes in a synchronous distributed system share a common global clock. As such, there exists a total ordering of the events taking place in each process, and finding the ordered trace of events is comparatively easier. The time of occurrence of each event, along with any message send and receive events, leads us toward the totally ordered trace. To give a better understanding of the challenges faced in the verification of distributed systems, Figure 1.1 represents a distributed computation consisting of two processes, P1 and P2. Each change in the local computation is represented by an event. For example, the events {e10, e11, e12, e13, e14} (resp. {e20, e21, e22, e23, e24}) are from the process P1 (resp. P2). Each event is either a message send, a message receive, or a local computation. A message-send event is represented by an outgoing arrow, whereas a message-receive event is represented by an incoming arrow.

[Figure 1.1: Distributed computation. The events of P1 occur at times 1, 2, 4, 6, and 7, and the events of P2 at times 1, 3, 4, 7, and 9; the labeled valuations are (1, p ∧ ¬r), (2, ¬p ∧ ¬r), and (7, p ∧ ¬r) on P1, and (1, ¬p ∧ ¬r), (4, ¬p ∧ ¬r), and (9, ¬p ∧ r) on P2.]

In Figure 1.1, events e21 and e13 are send events, and events e12 and e23 are the corresponding receive events. Additionally, each event is represented by a pair consisting of the time of occurrence and the valuation of the atomic propositions p and r. For example, event e14 is represented by (7, p ∧ ¬r), which denotes that the event occurred at time step 7 and that the atomic proposition p is true whereas the atomic proposition r is false.
Given a distributed computation with a synchronous clock, we can form a totally ordered set by observing the times of occurrence of the events. For the events in Figure 1.1, we can order the events as [{e10, e20}, {e11}, {e21}, {e12, e22}, {e13}, {e23, e14}, {e24}]. An interesting observation is that since the times of occurrence of the events e13 and e23 are 6 and 7 respectively, we list the event e13 as one that happened before the event e23. This is also because e13 is the send event of a message of which e23 is the receive event, and we know that a send operation strictly happens before the corresponding receive event. Given this trace, the monitor checks the satisfaction of the specification and generates the verdict for the given distributed computation. With the size and complexity of distributed systems growing, and with each component of a distributed system often located at a different geographical location, maintaining a common global clock is difficult. As a result, we often find ourselves with an asynchronous distributed system, one where each component has its own local clock with no relation to the others.

[Figure 1.2: Computation lattice for the distributed computation of Figure 1.1: (a) considering an asynchronous system; (b) considering a partially-synchronous system (ε = 2).]

Monitoring of asynchronous distributed systems, as seen in [MB15, BFR+16], does not scale well when verifying large systems. The lack of a global clock makes the time of occurrence of an event irrelevant in deciding the order of occurrence.
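The blowup pictured in Figure 1.2a can be reproduced by enumerating every interleaving of two local event sequences that respects the local order and the send-before-receive constraints. This is an illustrative sketch (the function name is ours, and the message pairs are chosen to mirror the running example):

```python
def linearizations(p1, p2, send_before):
    """Enumerate all totally ordered traces of two local event sequences
    p1 and p2 that respect (a) each process's local order and (b) the
    causal constraints in send_before, a set of (send, receive) pairs."""
    traces = []

    def can_place(placed, event):
        # A receive may only be placed after its send has been placed.
        return all(s in placed for (s, r) in send_before if r == event)

    def extend(i, j, trace):
        if i == len(p1) and j == len(p2):
            traces.append(tuple(trace))
            return
        if i < len(p1) and can_place(trace, p1[i]):
            extend(i + 1, j, trace + [p1[i]])
        if j < len(p2) and can_place(trace, p2[j]):
            extend(i, j + 1, trace + [p2[j]])

    extend(0, 0, [])
    return traces

p1 = ["e10", "e11", "e12", "e13", "e14"]
p2 = ["e20", "e21", "e22", "e23", "e24"]
messages = {("e21", "e12"), ("e13", "e23")}  # send -> receive pairs
print(len(linearizations(p1, p2, set())),      # no causal constraints
      len(linearizations(p1, p2, messages)))   # with message ordering
```

Without any message constraints, two processes with five events each already admit C(10, 5) = 252 interleavings; the message edges prune some of them, but the count still grows exponentially as processes and events are added, which is exactly why naive enumeration does not scale.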
Thus, we are left with partially-ordered events. The number of possible traces that can be formed from the computation grows exponentially with the number of processes in the system. Consequently, iterative monitoring of asynchronous distributed systems does not scale well. As seen in Figure 1.2a, from the computation lattice we are able to generate multiple traces, and each trace can yield a different verdict. Given the LTL specification ϕ = ◻(¬p → (¬p U r)) (read as 'whenever p does not hold, it must remain false until an r is observed'), we can obtain both true and false verdicts. The traces that consider event e20 to appear before e10 evaluate to false, because at event e10, ¬p evaluates to false, and we do not observe any r before that. Similarly, the traces that consider event e10 to happen before event e20 and event e24 to happen before event e14 satisfy the specification, thereby evaluating to a true verdict. This makes monitoring of asynchronous distributed systems an NP-complete problem [Gar02] in the number of processes in the setting. Thus, to come to a middle ground, asynchronous systems often use a clock synchronization algorithm (like NTP [Mil10]) that limits the maximum clock skew between any two processes in the system to a constant. This constant is known as the clock synchronization constant and is denoted by ε. There are two main ways of synchronizing the clocks:
• External clock synchronization: It uses a centralized time source, such as a GPS receiver, to keep all clocks in sync. This is the most accurate way to synchronize clocks, but it requires all devices to have access to the same time source.
• Internal clock synchronization: It uses peer-to-peer communication to adjust the clocks of each device relative to the others. This is less accurate than external clock synchronization, but it does not require all devices to have access to the same time source.
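The exponential blow-up mentioned above can be made concrete with a small counting sketch (ours, not the dissertation's): when k processes each contribute n fully independent events, every interleaving that respects the per-process orders is a possible trace, and the count is the multinomial (kn)! / (n!)^k.

```python
from math import comb

# Count the interleavings of k chains of n independent events each by choosing,
# for every process in turn, the positions of its events among those placed so far.
def interleavings(n, k):
    total, count = 0, 1
    for _ in range(k):
        count *= comb(total + n, n)
        total += n
    return count

print(interleavings(5, 2))  # 2 processes, 5 events each: 252 traces
print(interleavings(5, 4))  # 4 processes: over 11 billion traces
```

Communication shrinks these numbers somewhat (a send must precede its receive), but the growth in the number of processes remains exponential, which is the blow-up partial synchrony is meant to curb.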
In this dissertation, we utilize an external clock synchronization that limits, to a bound, the lattice blow-up experienced when monitoring an asynchronous system. Any two events from different processes that are more than ε time apart can be totally ordered using their times of occurrence. Any two events from different processes within ε time of each other are still considered concurrent. This reduces the computation lattice by a considerable amount. Figure 1.2b shows the computation lattice for the distributed computation in Figure 1.1 when considering partial synchrony for ε = 2. As can be observed, the computation lattice is considerably smaller. For the same LTL specification, ϕ = ◻(¬p → (¬p U r)), the verdict of the monitor is false. In both cases, (1) event e20 happened before event e10 and (2) event e10 happened before event e20, the times of occurrence of events e14 and e24 dictate that event e14 strictly happened before event e24, since the difference between their times of occurrence is not less than ε. This makes the monitor compute the single verdict false for the given computation under partial synchrony.

1.2.1 Formal Specification

A verification approach can only be as complete as the specification of the system properties. As identified in [Cli14], system specifications need to be mathematically precise and complete. Thus, we represent each event in the distributed computation by a set of predicates/propositions that reflects the values of the corresponding predicates/propositions at that event. In verification, we aim to check the conformance of these events against expected values. We express our expectations as specifications of the system. In [VYK+17, VKTA20], the authors propose a distributed predicate detection technique for partially-synchronous systems. Although predicate detection is useful to represent certain types of system specifications, it lacks the expressiveness that temporal logic offers.
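The partial-synchrony ordering rule can be sketched directly. The helper names and timestamps below are illustrative, not from the dissertation; the rule itself follows the definition used later (two events on different processes are ordered by local time only when they are more than ε apart, and are otherwise treated as concurrent).

```python
# Sketch of the partial-synchrony ordering rule with clock skew bound epsilon.
EPSILON = 2

def must_precede(t1, t2, eps=EPSILON):
    """True iff an event at local time t1 can be totally ordered before one at t2."""
    return t1 + eps < t2

def concurrent(t1, t2, eps=EPSILON):
    """Neither event is ordered before the other: the monitor explores both orders."""
    return not must_precede(t1, t2, eps) and not must_precede(t2, t1, eps)

print(must_precede(3, 9))  # True: more than epsilon apart, totally ordered
print(concurrent(4, 5))    # True: within epsilon, still considered concurrent
```

Only the pairs flagged as concurrent contribute branches to the computation lattice, which is why the lattice of Figure 1.2b is so much smaller than that of Figure 1.2a.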
Depending upon the type of system to be monitored, we decide on the specification logic to be used. It can be selected from a wide variety of options. For example, when monitoring for mutual separation of autonomous drones or race conditions in distributed memory, the logic of choice is propositional logic. On the other hand, when monitoring more complex distributed systems, such as read/write consistency in a database or a priority-based train-platform allocation system, a regular predicate is of little use. We need the more expressive Linear Temporal Logic (LTL) [Pnu77] in this case. Furthermore, when trying to monitor smart contracts involving a set of blockchains, transactions are usually time-bound. Such case studies require a time-bounded logic, such as Metric Temporal Logic (MTL) [AH92, AH94], where each temporal operator has a time bound attached to it. Additionally, Industrial Control Systems (ICS) require a more expressive specification language that can handle aggregate functions like count, average, etc., so that the Programmable Logic Controller (PLC) can take well-informed, sound decisions. For monitoring such systems, we use the stream-based specification language Lola [DSS+05].

1.3 Thesis Statement

The approaches discussed above play a major role in verifying distributed systems. However, in the face of the increasing size and complexity of distributed systems with evolving requirements, a real-time feasible runtime verification approach for partially-synchronous distributed systems is highly desirable. Runtime verification is an extremely attractive choice because its verdicts come with formal guarantees, it is lightweight compared to other formal verification approaches, and it remains observant of dynamic changes in the environment that affect the working of the distributed system. However, current runtime verification approaches fall short of making runtime verification practical in every distributed-system application.
We list the limitations of the present approaches and the corresponding approach we use to address them:
• Sharing a common global clock among the geographically separated components of a large distributed system is not realistic. To limit the exponential blow-up of the computational space due to asynchrony, the presence of a clock synchronization algorithm is a practical assumption: we consider a partially-synchronous distributed system.
• With changing requirements and more versatile distributed systems being developed, more expressive temporal logics are needed to express the specifications: we consider specifications as temporal properties in LTL, time-bounded temporal properties in MTL, and stream-based specifications in Lola.
• A robust monitoring approach should scale well with changing system properties: we employ an SMT-based monitoring approach that encodes the distributed system to check for satisfaction and violation of the system property.
• The monitoring approach should also be fault-tolerant; in other words, the verdict should be unaffected even if some components of the monitoring architecture fail: we study fault-tolerant monitoring for synchronous systems.
• The approach should be able to monitor the system at a pace similar to that at which events take place in the system under consideration: we propose an online decentralized stream-based runtime verification approach where each monitor broadcasts a partially-evaluated Lola associated equation to all other monitors.
With the above-mentioned motivation, we focus on developing a runtime verification approach that defends the following statement.

Runtime verification of a partially-synchronous distributed system in real time is feasible.

The contribution of our work validates the above statement.
Briefly, based on the type of specification used to represent the distributed system and the type of monitoring architecture in use, we classify our contribution into four cases: (1) the specification of the system can be represented in LTL; (2) the system is time-sensitive and, as a result, we use MTL to represent the specification; (3) we develop a fault-tolerant decentralized monitoring algorithm; and (4) we develop a decentralized runtime verification approach for Lola specifications.

1.4 Contribution

We list the major contributions of our work below, with the publications recorded in Table 1.1:
• Runtime verification of partially-synchronous distributed systems w.r.t. LTL specifications. We propose two sound and complete solutions to the problem of distributed runtime verification (RV) with respect to LTL formulas. Both of our solutions use a fault-proof central monitor and, in order to remedy the explosion of different interleavings, we make the practical assumption of the presence of a clock synchronization algorithm. The first approach is based on constructing an LTL3 monitor automaton for the LTL formula and constructing multiple SMT queries to determine which states of the monitor automaton are reachable for a given distributed computation. The other approach involves developing a formula progression technique. Specifically, given a finite trace α and an LTL formula ϕ, we define a function Pr such that Pr(α, ϕ) characterizes the progression of ϕ over α. Progression is defined as the rewritten formula for future extensions of α depending on what has been observed thus far, which returns either true, false, or an LTL formula. We test our approach not only through a set of rigorous synthetic experiments but also by monitoring a set of consistency conditions in Cassandra. We also put our approach to the test using a real-time airspace monitoring dataset (RACE) from NASA [MGS19].
• Runtime verification of partially-synchronous distributed systems w.r.t.
MTL specifications. We propose a sound and complete solution to the problem of distributed runtime verification (RV) with respect to MTL formulas. We deploy a fault-proof central monitor and, in order to remedy the explosion of different interleavings, we again make the practical assumption of the presence of a clock synchronization algorithm. We introduce a progression-based formula rewriting technique that is reduced to an SMT encoding over distributed computations, which takes into consideration the events observed thus far to rewrite the specifications for future extensions. Our monitoring algorithm accounts for all possible orderings of events without explicitly generating them when evaluating MTL formulas. We report on the results of rigorous experiments on monitoring synthetic data, using benchmarks in the tool UPPAAL [BDL04], as well as monitoring correctness, liveness, and conformance conditions for smart contracts on blockchains.
• Crash-resilient decentralized runtime verification of synchronous distributed systems w.r.t. LTL specifications. We assume that a set of monitors, subject to crash failures, is distributed over a synchronous communication network. Each monitor only has a partial view of the underlying system. In order to minimize the size of the transformed automaton, we formulate an offline optimization problem in satisfiability modulo theories (SMT). This limits the size of the monitoring messages to O(log(|Mϕ3|) · |AP|). We have evaluated our approach on a variety of LTL formulas, for traces generated using different random distributions as well as an IoT dataset, Orange4Home [CLRC17].
• Decentralized stream-based runtime verification of partially-synchronous distributed systems. We assume that a set of partially-synchronous monitors is distributed over a partially-synchronous communication network.
Each monitor only has a partial view of the entire system and utilizes message-passing communication to share the locally computed results with the other monitors. We first present a general technique for runtime monitoring of distributed applications whose behavior can be modeled as input/output streams with an internal computation module under the partially synchronous semantics, where an imperfect clock synchronization algorithm is assumed. Second, we propose a generalized stream-based decentralized runtime verification technique. We also rigorously evaluate our algorithm on extensive synthetic experiments, several Industrial Control Systems, and aircraft SBS message datasets.

1.5 Organization

The remainder of this report consists of seven chapters. Each chapter addresses a separate aspect of runtime verification.
• We present the preliminary concepts of distributed systems, linear temporal logic (LTL), metric temporal logic (MTL), etc. in Chapter 2.
• We introduce and discuss two solutions for monitoring partially synchronous distributed systems w.r.t. LTL specifications in Chapter 3.
• Next, we propose a monitoring solution with probabilistic guarantees for time-bounded temporal specifications in Chapter 4.
• In Chapter 5, we introduce a fault-tolerant decentralized monitoring approach for synchronous distributed systems.
• In Chapter 6, we propose a decentralized stream-based runtime verification technique for partially-synchronous distributed systems.

Chapter | Distributed System (clock) | Specification | Monitor      | Conference/Journal
3       | Partially-synchronous      | LTL           | Centralized  | Published in OPODIS 2020; minor revision in Springer FMSD
4       | Partially-synchronous      | MTL           | Centralized  | Published in IEEE ICDCS 2022; under review in Elsevier JPDC
5       | Synchronous                | LTL           | Decentralized| To appear in IEEE TDSC
6       | Partially-synchronous      | Lola          | Decentralized| Submitted to ACM EMSOFT 2023

Table 1.1: Summarized Publications.
• Finally, in Chapter 7 we present the related work in the literature of runtime verification of distributed systems, followed by the conclusion and a road map for future work in Chapter 8.

Chapter 2
Preliminary Concepts

In this chapter, we discuss and introduce the preliminary concepts we use in the course of this report.

2.1 Distributed System

A distributed system is a computing environment in which various components, often geographically separated, are spread across multiple computers (or other computing devices) on a network with the aim of achieving a common goal. These devices split up the work, coordinating their efforts to complete the job more efficiently than if a single device had been responsible for the same task. In the scope of this report, we classify distributed systems into two classes: one where the components of the distributed system (the processes) share a common global clock, known as a synchronous distributed system; and one where the components do not share a common global clock but are synchronized with the help of a clock synchronization algorithm (e.g., NTP [Mil10]), known as a partially synchronous distributed system. We assume a loosely coupled message-passing system, consisting of n processes, denoted by P = {P1, P2, . . . , Pn}, without any shared memory. Channels are assumed to be FIFO and lossless. In our model, each local state change is considered an event, and every message activity (send or receive) is also represented by a new event. Message transmission does not change the local state of processes, and the content of a message is immaterial to our purposes. We will need to refer to some global clock which acts as a 'real' time keeper.

2.1.1 Synchronous Distributed System

In a synchronous distributed system, all the processes share the global clock of the system. The local clock (or time) of a process Pi is the same as the global clock (or time) G.
Since all the processes share the global clock, the events can be easily ordered by looking at their times of occurrence. Any two events eiσ and ejσ′, occurring in processes i and j at times σ and σ′ respectively, can be ordered using Lamport's happened-before relation [Lam78] (⇝) as (σ < σ′) ↔ (eiσ ⇝ ejσ′) or (σ′ < σ) ↔ (ejσ′ ⇝ eiσ). Thus, the events can be arranged, depending upon their times of occurrence, in a unique ordering that forms a trace to be used for monitoring.

2.1.2 Partially-Synchronous Distributed System

A partially synchronous distributed system makes the practical assumption of partial synchrony. The local clock (or time) of a process Pi, where i ∈ [1, n], can be represented as an increasing function ci : R≥0 → R≥0, where ci(G) is the value of the local clock at global time G. Then, for any two processes Pi and Pj, we have ∀G ∈ R≥0 . |ci(G) − cj(G)| < ε, with ε > 0 being the maximum clock skew. The value ε is assumed to be fixed and known by the monitor in the rest of this dissertation. In the sequel, we make it explicit when we refer to 'local' or 'global' time. This assumption is met by using a clock synchronization algorithm, like NTP [Mil10], to ensure bounded clock skew among all processes. It is to be understood, however, that this global clock is a theoretical object used in definitions and is not available to the processes. An event in process Pi is of the form eiτ,σ, where σ is a logical time (i.e., a natural number) and τ is the local time at global time G, that is, τ = ci(G). We assume that for every two events eiτ,σ and eiτ′,σ′, we have (τ < τ′) ⇔ (σ < σ′). Definition 1.
A distributed computation on N processes is a tuple (E, ⇝), where E is a set of events partially ordered by Lamport's happened-before relation (⇝) [Lam78], subject to the partial synchrony assumption:
• In every process Pi, 1 ≤ i ≤ N, all events are totally ordered, that is, ∀τ, τ′ ∈ R+ . ∀σ, σ′ ∈ Z≥0 . (σ < σ′) → (eiτ,σ ⇝ eiτ′,σ′).
• If e is a message send event in a process, and f is the corresponding receive event in another process, then e ⇝ f.
• For any two processes Pi and Pj, and any two events eiτ,σ, ejτ′,σ′ ∈ E, if τ + ε < τ′, then eiτ,σ ⇝ ejτ′,σ′, where ε is the maximum clock skew.
• If e ⇝ f and f ⇝ g, then e ⇝ g.

Definition 2. Given a distributed computation (E, ⇝), a subset of events C ⊆ E is said to form a consistent cut iff, whenever C contains an event e, it also contains all events that happened before e. Formally, ∀e, f ∈ E . ((e ∈ C) ∧ (f ⇝ e)) → (f ∈ C). We denote the set of all consistent cuts by C. The frontier of a consistent cut C, denoted front(C), is the set of events that happen last in the cut; that is, front(C) consists of one event eilast for each i ∈ [1, |P|] with eilast ∈ C, where eilast is the last event of Pi in C, i.e., ∀eiτ,σ ∈ C . (eiτ,σ ≠ eilast) → (eiτ,σ ⇝ eilast).

2.2 Linear Temporal Logic (LTL) for RV

Let AP be a set of atomic propositions and Σ = 2^AP be the alphabet. We call each element of Σ an event. For example, for AP = {a, b}, the event s = {} means that both propositions a and b are false in s, and the event s′ = {a} means that only proposition a is true in s′. A trace is a sequence s0 s1 s2 · · · , where si ∈ Σ for every i ≥ 0. The set of all finite (respectively, infinite) traces over Σ is denoted by Σ∗ (respectively, Σω). Throughout, we denote finite traces by the letter α and infinite traces by the letter σ. For a finite trace α = s0 s1 · · · sn, by αi we mean the suffix si si+1 · · · sn of α.
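The consistency condition of Definition 2 can be checked directly. The helper below is a hypothetical illustration (not the dissertation's implementation): the happened-before relation is given as an explicit set of ordered pairs, and a cut is consistent iff it is closed under that relation.

```python
# Check Definition 2: C is consistent iff every event that happened before a
# member of C is itself a member of C.
def is_consistent(cut, happened_before):
    """happened_before: set of pairs (f, e) meaning f happened before e."""
    return all(f in cut for (f, e) in happened_before if e in cut)

hb = {('a1', 'a2'), ('a2', 'a3'),  # process A's events are totally ordered
      ('b1', 'b2'),                # process B's events are totally ordered
      ('a1', 'b2')}                # a1 is a send whose receive is b2

print(is_consistent({'a1', 'b1'}, hb))  # True
print(is_consistent({'b1', 'b2'}, hb))  # False: b2 is in, but a1 (a1 before b2) is not
```

For a faithful check on arbitrary inputs, the relation should be transitively closed first; the small example above is already closed for the pairs that matter.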
2.2.1 Infinite-trace Semantics of LTL

The syntax and semantics of linear temporal logic (LTL) [Pnu77, MP79] are defined for infinite traces. The syntax is defined by the following grammar:

ϕ ::= p | ¬ϕ | ϕ ∨ ϕ | ○ϕ | ϕ U ϕ

where p ∈ AP, and ○ and U are the 'next' and 'until' temporal operators, respectively. We view other propositional and temporal operators as abbreviations, that is, true = p ∨ ¬p, false = ¬true, ϕ → ψ = ¬ϕ ∨ ψ, ϕ ∧ ψ = ¬(¬ϕ ∨ ¬ψ), ◇ϕ = true U ϕ (eventually ϕ), and ◻ϕ = ¬◇¬ϕ (always ϕ). We denote the set of all LTL formulas by ΦLTL. The infinite-trace semantics of LTL is defined as follows. Let σ = s0 s1 s2 · · · ∈ Σω, i ≥ 0, and let |= denote the satisfaction relation:

σ, i |= p iff p ∈ si
σ, i |= ¬ϕ iff σ, i ⊭ ϕ
σ, i |= ϕ1 ∨ ϕ2 iff σ, i |= ϕ1 or σ, i |= ϕ2
σ, i |= ○ϕ iff σ, i + 1 |= ϕ
σ, i |= ϕ1 U ϕ2 iff ∃k ≥ i : σ, k |= ϕ2 and ∀j ∈ [i, k) : σ, j |= ϕ1

Also, σ |= ϕ holds if and only if σ, 0 |= ϕ holds.

2.2.2 Finite-trace Semantics of LTL

In the context of RV, the 3-valued LTL (LTL3 for short) [BLS11] evaluates LTL formulas for finite traces, but with an eye on possible future extensions, whereas finite LTL, or FLTL [MP95], only takes into consideration the current trace, with no eye towards the future.

Figure 2.1: LTL3 monitor for ϕ = a U b (the initial state q0, with verdict ?, loops on {a}, moves to q⊤ on {b} or {a, b}, and moves to q⊥ on {}; q⊤ and q⊥ are trap states).

In LTL3, the set of truth values is B3 = {⊤, ⊥, ?}, where ⊤ (resp., ⊥) denotes that the formula is permanently satisfied (resp., violated), no matter how the current finite trace extends, and '?' denotes an unknown verdict, i.e., there exists an extension that can violate the formula and another extension that can satisfy it. Let α ∈ Σ∗ be a non-empty finite trace. The truth value of an LTL3 formula ϕ with respect to α, denoted by [α |=3 ϕ], is defined as follows:

[α |=3 ϕ] = ⊤ if ∀σ ∈ Σω : ασ |= ϕ
            ⊥ if ∀σ ∈ Σω : ασ ⊭ ϕ
            ?  otherwise.

Definition 3.
The LTL3 monitor for a formula ϕ is the unique deterministic finite-state machine Mϕ = (Σ, Q, q0, δ, λ), where Q is the set of states, q0 is the initial state, δ : Q × Σ → Q is the transition function, and λ : Q → B3 is a function such that λ(δ(q0, α)) = [α |=3 ϕ] for every finite trace α ∈ Σ∗. For example, Fig. 2.1 shows the monitor automaton for the formula ϕ = a U b. The syntax of FLTL is identical to that of LTL, and its semantics is based on the truth values B2 = {⊤, ⊥}, where ⊤ (resp., ⊥) denotes that the formula is satisfied (resp., violated) given the current finite trace. For atomic propositions and Boolean operators, the semantics of FLTL is identical to that of LTL. Let ϕ, ϕ1, and ϕ2 be LTL formulas, α = s0 s1 . . . sn be a non-empty finite trace, and |=F denote the satisfaction relation in FLTL. The semantics of FLTL for the temporal operators is as follows:

[α |=F ○ϕ] = [α1 |=F ϕ] if α1 is non-empty
             ⊥ otherwise.

[α |=F ϕ1 U ϕ2] = ⊤ if ∃k ∈ [0, n] : ([αk |=F ϕ2] = ⊤) ∧ ∀l ∈ [0, k) : ([αl |=F ϕ1] = ⊤)
                  ⊥ otherwise.

In order to further illustrate the difference between LTL, FLTL, and LTL3, consider the formula ϕ = ◻p and a finite trace α = s0 s1 · · · sn. If p ∉ si for some i ∈ [0, n], then [α |=3 ϕ] = ⊥, that is, the formula is permanently violated, and so is the case in FLTL, where [α |=F ϕ] = ⊥. Now, consider the formula ϕ = ◇p. If p ∉ si for all i ∈ [0, n], then [α |=3 ϕ] = ?. This is because there exist infinite extensions of α that can satisfy or violate ϕ in the infinite semantics of LTL. But this is not the case in FLTL, where [α |=F ϕ] = ⊥, as no p was observed in the finite trace.

2.3 Metric Temporal Logic

Let I be a set of nonempty intervals over Z≥0. We define an interval I to be [start, end) ≜ {a ∈ Z≥0 | start ≤ a < end}, where start ∈ Z≥0, end ∈ Z≥0 ∪ {∞}, and start < end. We define AP as the set of all atomic propositions, and Σ = 2^AP as the set of all possible states.
A trace is represented by a pair consisting of a sequence of states, denoted by α = s0 s1 · · · , where si ∈ Σ for every i ≥ 0, and a sequence of non-negative numbers, denoted by τ̄ = τ0 τ1 · · · , where τi ∈ Z≥0 for all i ≥ 0. We represent the set of all infinite traces by the pair of infinite sets (Σω, Zω≥0). The trace sk sk+1 · · · (resp. τk τk+1 · · · ) is represented by αk (resp. τ̄k). For an infinite trace α = s0 s1 · · · and τ̄ = τ0 τ1 · · · , τ̄ is a non-decreasing sequence, meaning τi+1 ≥ τi for all i ≥ 0.

Syntax. The syntax of metric temporal logic (MTL) [AH92, AH94] for infinite traces is defined by the following grammar:

ϕ ::= p | ¬ϕ | ϕ1 ∨ ϕ2 | ϕ1 U_I ϕ2

where p ∈ AP and U_I is the 'until' temporal operator with time bound I. We also have true = p ∨ ¬p, false = ¬true, ϕ1 → ϕ2 = ¬ϕ1 ∨ ϕ2, ϕ1 ∧ ϕ2 = ¬(¬ϕ1 ∨ ¬ϕ2), ◇_I ϕ = true U_I ϕ ('eventually'), and ◻_I ϕ = ¬(◇_I ¬ϕ) ('always'). The set of all MTL formulas is denoted by ΦMTL.

Semantics. The semantics of metric temporal logic (MTL) is defined over α = s0 s1 · · · and τ̄ = τ0 τ1 · · · as follows:

(α, τ̄, i) |= p iff p ∈ si
(α, τ̄, i) |= ¬ϕ iff (α, τ̄, i) ⊭ ϕ
(α, τ̄, i) |= ϕ1 ∨ ϕ2 iff (α, τ̄, i) |= ϕ1 or (α, τ̄, i) |= ϕ2
(α, τ̄, i) |= ϕ1 U_I ϕ2 iff ∃j ≥ i . (τj − τi ∈ I) ∧ (α, τ̄, j) |= ϕ2 ∧ ∀k ∈ [i, j) . (α, τ̄, k) |= ϕ1

Also, (α, τ̄) |= ϕ holds if and only if (α, τ̄, 0) |= ϕ. In the context of RV, we introduce the notion of finite MTL. The truth values are represented by the set B2 = {⊤, ⊥}, where ⊤ (resp. ⊥) represents a formula that is satisfied (resp. violated) given a finite trace. We represent the set of all finite traces by the pair of finite sets (Σ∗, Z∗≥0). For a finite trace α = s0 s1 · · · sn with τ̄ = τ0 τ1 · · · τn, the only semantics that needs to be redefined is that of the time-bounded 'until', as follows:

[(α, τ̄, i) |=F ϕ1 U_I ϕ2] = ⊤ if ∃j ≥ i . (τj − τi ∈ I) ∧ ([(α, τ̄, j) |=F ϕ2] = ⊤) ∧ ∀k ∈ [i, j) : ([(α, τ̄, k) |=F ϕ1] = ⊤)
                             ⊥ otherwise.
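The finite-trace 'until' semantics can be prototyped in a few lines. The sketch below is ours, not the dissertation's: states are sets of propositions, the two subformulas are passed as predicate functions, and the interval is [lo, hi) over the trace's timestamps.

```python
# Evaluate [(alpha, tau, i) |=F phi1 U_[lo,hi) phi2] on a finite timed trace.
def finite_until(states, taus, phi1, phi2, lo, hi, i=0):
    for j in range(i, len(states)):
        if lo <= taus[j] - taus[i] < hi and phi2(states[j]):
            # first time-bounded witness for phi2; phi1 must hold at all
            # earlier positions (a failure there dooms every later witness too)
            return all(phi1(states[k]) for k in range(i, j))
    return False

p = lambda s: 'p' in s
q = lambda s: 'q' in s
true = lambda s: True

states = [{'p'}, {'p'}, {'q'}]
taus = [0, 2, 5]
print(finite_until(states, taus, p, q, 0, 10))    # True: q at time 5, p before it
print(finite_until(states, taus, true, q, 0, 4))  # False: no q within [0, 4)
```

With `phi1 = true` this is the timed 'eventually'; with the untimed interval [0, ∞) it degenerates to the FLTL 'until' of Section 2.2.2.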
In order to further illustrate the difference between MTL and finite MTL, consider the formula ϕ = ◇_I p and a trace α = s0 s1 · · · sn with τ̄ = τ0 τ1 · · · τn. We have [(α, τ̄) |=F ϕ] = ⊤ if for some j ∈ [0, n] we have τj − τ0 ∈ I and p ∈ sj, and ⊥ otherwise. Now, consider the formula ϕ = ◻_I p. We have [(α, τ̄) |=F ϕ] = ⊥ if for some j ∈ [0, n] we have τj − τ0 ∈ I and p ∉ sj, and ⊤ otherwise.

2.4 Hybrid Logical Clocks

A hybrid logical clock (HLC) [KDM+14] is a tuple (τ, σ, ω) for detecting one-way causality, where τ is the local time, σ ensures the order of send and receive events between two processes, and ω indicates causality between events. Thus, in the sequel, we denote an event by eiτ,σ,ω. More specifically, for a set E of events:
• τ is the local clock value of events, where for any process Pi and two events eiτ,σ,ω, eiτ′,σ′,ω′ ∈ E, we have τ < τ′ iff eiτ,σ,ω ⇝ eiτ′,σ′,ω′.
• σ stipulates the logical time, where:
– For any process Pi and any event eiτ,σ,ω ∈ E, τ never exceeds σ, and their difference is bounded by ε (i.e., σ − τ ≤ ε).
– For any two processes Pi and Pj, and any two events eiτ,σ,ω, ejτ′,σ′,ω′ ∈ E, where event eiτ,σ,ω receives a message sent by event ejτ′,σ′,ω′, σ is updated to max{σ, σ′, τ}. The maximum of the three values is chosen to ensure that σ remains updated with the largest τ observed so far. Observe that σ behaves similarly to τ, except that communication between processes has no impact on the value of τ for an event.
• ω : E → Z≥0 is a function that maps each event in E to the causality updates, where:
– For any process Pi and a send or local event eiτ,σ,ω ∈ E, if τ < σ, then ω is incremented. Otherwise, ω is reset to 0.
– For any two processes Pi and Pj and any two events eiτ,σ,ω, ejτ′,σ′,ω′ ∈ E, where event eiτ,σ,ω receives a message sent by event ejτ′,σ′,ω′, ω(eiτ,σ,ω) is updated based on max{σ, σ′, τ}.
– For any two processes Pi and Pj, and any two events eiτ,σ,ω, ejτ′,σ′,ω′ ∈ E, (τ = τ′) ∧ (ω < ω′) → eiτ,σ,ω ⇝ ejτ′,σ′,ω′.
In our implementation of HLC, we assume that it is fault-proof.

Figure 2.2: HLC example (three process timelines annotated with (τ, σ, ω) triples, and three cuts C0, C1, and C2).

Fig. 2.2 shows the HLC-annotated, partially synchronous concurrent timelines of three processes with ε = 10. Observe that the local times of all events in front(C1) are within ε of one another. Therefore, C1 is a consistent cut, but C0 and C2 are not.

2.5 Stream-based Specification Lola

A Lola [DSS+05] specification describes the computation of output streams given a set of input streams. A stream α of type T is a finite sequence of values t ∈ T. Let α(i), where i ≥ 0, denote the value of the stream at time stamp i. We denote a stream of finite length (resp. infinite length) by T∗ (resp. Tω).

Definition 4. A Lola specification is a set of equations over typed stream variables of the form:

s1 = e1(t1, · · · , tm, s1, · · · , sn)
...
sn = en(t1, · · · , tm, s1, · · · , sn)

where s1, s2, · · · , sn are called the dependent variables, t1, t2, · · · , tm are called the independent variables, and e1, e2, · · · , en are stream expressions over s1, · · · , sn, t1, · · · , tm. Typically, input streams are referred to as independent variables, whereas output streams are referred to as dependent variables. A stream expression is constructed as follows:
• If c is a constant of type T, then c is an atomic stream expression of type T.
• If s is a stream variable of type T, then s is an atomic stream expression of type T.
• If f : T1 × T2 × · · · × Tk → T is a k-ary operator and, for 1 ≤ i ≤ k, ei is an expression of type Ti, then f(e1, e2, · · · , ek) is a stream expression of type T.
• If b is a stream expression of type boolean and e1, e2 are stream expressions of type T, then ite(b, e1, e2) is a stream expression of type T, where ite is the abbreviated form of if-then-else.
• If e is a stream expression of type T, c is a constant of type T, and i is an integer, then e[i, c] is a stream expression of type T. Here, e[i, c] refers to the value of the expression e offset by i positions from the current position. In case the offset takes it beyond the end or before the beginning of the stream, the default value is c.
For example, consider the following Lola specification, where t1 and t2 are independent stream variables of type boolean and t3 is an independent stream variable of type integer.

s1 = true
s2 = t3
s3 = t1 ∨ (t3 ≤ 1)
s4 = ((t3)^2 + 7) mod 15
s5 = ite(s2, s4, s4 + 1)
s6 = ite(t1, t3 ≤ s4, ¬s3)
s7 = t1[+1, false]
s8 = t1[−1, true]
s9 = s9[−1, 0] + (t3 mod 2)
s10 = t2 ∨ (t1 ∧ s10[1, true])

Here, the stream expressions s7 and s8 refer to the stream t1 with an offset of +1 and −1, respectively. Furthermore, Lola can be used to compute incremental statistics, where, given a stream α, a function fα(v, u) computes a measure, with u representing the measure thus far and v the current value. Given a sequence of values v1, v2, · · · , vn, with a default value d, the measure over the data is given as

u = fα(vn, fα(vn−1, · · · , fα(v1, d)))

Examples of such functions include count, fcount(v, u) = u + 1; sum, fsum(v, u) = u + v; and max, fmax(v, u) = max{v, u}, among others. Aggregate functions like average can be defined using the two incremental functions count and sum. The semantics of Lola specifications is defined in terms of the evaluation model, which describes the relation between input and output streams.
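The incremental-statistics scheme above is simply a left fold of fα over the stream, starting from the default value d. The small sketch below (ours, for illustration) makes this explicit:

```python
from functools import reduce

# u = f(v_n, f(v_{n-1}, ..., f(v_1, d))) as a left fold over the stream values.
def incremental(f, values, d):
    return reduce(lambda u, v: f(v, u), values, d)

f_count = lambda v, u: u + 1
f_sum = lambda v, u: u + v
f_max = lambda v, u: max(v, u)

xs = [4, 1, 7, 3]
print(incremental(f_count, xs, 0))            # 4
print(incremental(f_sum, xs, 0))              # 15
print(incremental(f_max, xs, float('-inf')))  # 7
# average, derived from two incremental measures:
print(incremental(f_sum, xs, 0) / incremental(f_count, xs, 0))  # 3.75
```

Because each step only needs the running measure u and the current value v, a monitor can maintain such statistics in constant memory as the stream grows.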
Definition 5. Given a Lola specification ϕ over independent variables t1, · · · , tm of types T1, · · · , Tm, and dependent variables s1, · · · , sn of types Tm+1, · · · , Tm+n, let τ1, · · · , τm be streams of length N + 1, with τi of type Ti. The tuple ⟨α1, · · · , αn⟩ of streams of length N + 1 is called the evaluation model if, for every equation in ϕ,

si = ei(t1, · · · , tm, s1, · · · , sn)

⟨α1, · · · , αn⟩ satisfies the following associated equations:

αi(j) = υ(ei)(j) for (1 ≤ i ≤ n) ∧ (0 ≤ j ≤ N)

where υ(ei)(j) is defined as follows. For the base cases:

υ(c)(j) = c
υ(ti)(j) = τi(j)
υ(si)(j) = αi(j)

For the inductive cases, where f is a function (e.g., arithmetic):

υ(f(e1, · · · , ek))(j) = f(υ(e1)(j), · · · , υ(ek)(j))
υ(ite(b, e1, e2))(j) = if υ(b)(j) then υ(e1)(j) else υ(e2)(j)
υ(e[k, c])(j) = υ(e)(j + k) if 0 ≤ j + k ≤ N, and c otherwise

The set of all equations associated with ϕ is denoted by ϕα.

Definition 6. A dependency graph for a Lola specification ϕ is a weighted, directed graph G = ⟨V, E⟩, with vertex set V = {s1, · · · , sn, t1, · · · , tm}. An edge e : ⟨si, sk, w⟩ (resp. e : ⟨si, tk, w⟩) labeled with a weight w is in E iff the equation for αi(j) in ϕα contains αk(j + w) (resp. τk(j + w)) as a subexpression. Intuitively, an edge records that si at a particular position depends on the value of sk (resp. tk), offset by w positions. Given a set of synchronous input streams {α1, α2, · · · , αm} of respective types T = {T1, T2, · · · , Tm} and a Lola specification ϕ, we evaluate the Lola specification, written

(α1, α2, · · · , αm) |=S ϕ

under the above semantics, where |=S denotes synchronous evaluation.
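As a minimal, hand-written instance of the evaluation model of Definition 5 (not the dissertation's evaluator), consider a two-equation specification over one boolean input stream t1: s1 = t1 ∨ s1[−1, false] ("t1 has held at some point so far") and s2 = t1[+1, false] (the next value of t1, with the default at the stream end).

```python
# Evaluate the two associated equations position by position, resolving the
# offset expressions e[-1, false] and e[+1, false] per Definition 5.
def evaluate(t1):
    N = len(t1)
    s1, s2 = [None] * N, [None] * N
    for j in range(N):
        prev = s1[j - 1] if j - 1 >= 0 else False  # offset -1, default false
        s1[j] = t1[j] or prev
        s2[j] = t1[j + 1] if j + 1 < N else False  # offset +1, default false
    return s1, s2

s1, s2 = evaluate([False, True, False])
print(s1)  # [False, True, True]
print(s2)  # [True, False, False]
```

Note that s1 depends on itself with offset −1 and s2 on the input with offset +1; in the dependency graph of Definition 6 these are edges of weight −1 and +1, and the absence of a zero-weight cycle is what makes this left-to-right evaluation possible.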
Chapter 3

Runtime Verification for Linear Temporal Specifications

3.1 Introduction

The main challenge with distributed monitoring lies in the fact that, in the absence of a global clock, it is not always possible for the monitor to establish the correct order of occurrence of events across different processes. In fact, given the non-deterministic nature of distributed applications, it is perfectly foreseeable that a runtime monitor may produce different verdicts for the same distributed computation based on different orderings of events. In the case of complete asynchrony, this results in a combinatorial blow-up of possibilities that the monitor must explore at run time, which in turn makes the problem computationally expensive. However, state-of-the-art networks, such as Google Spanner, are augmented with clock synchronization techniques that result in partial synchrony [CDE+13]. These clock synchronization techniques guarantee a maximum clock skew of ε between any pair of processes. Having such a guarantee considerably limits the combinatorial blow-up, as events outside the window of ε can be ordered.

(Published) Ritam Ganguly, Anik Momtaz, and Borzoo Bonakdarpour, Distributed Runtime Verification Under Partial Synchrony, 24th International Conference on Principles of Distributed Systems (OPODIS 2020).
(Under minor revision) Ritam Ganguly, Anik Momtaz, and Borzoo Bonakdarpour, Runtime Verification of Partially-Synchronous Distributed System, Springer Formal Methods in System Design.

Figure 3.1: Distributed computation. [Two processes: P1 hosts events x1 = 0 and x1 = 1, and P2 hosts events x2 = 0 and x2 = 2, each tagged with local clock values between 0 and 9.]

To give an example of the blow-up experienced by the monitor, consider Figure 3.1, where we have two processes P1 and P2 hosting two discrete variables x1 and x2, respectively. Let us also consider the linear temporal logic (LTL) property ϕ = (x2 > x1) and a maximum clock skew, also known as the clock-synchronization constant, of ε = 2.
Events x1 = 1 and x2 = 0, as well as x1 = 0 and x2 = 2, are not considered concurrent, as the events in these pairs are more than ε time apart. However, events x1 = 1 and x2 = 2 are considered concurrent, as these events occurred within ε time of one another. Therefore, it is not possible to determine the exact ordering of these events without a global clock. Thus, the formula evaluates to both true and false, as both possible orderings of events must be taken into account. The number of possible orderings of events can increase dramatically as more events and processes are introduced. Handling concurrent events generally results in combinatorial enumeration of all possibilities and, hence, intractability of distributed RV. Existing distributed RV techniques operate in two extremes: they either assume a global clock [BF16b], which is unrealistic for large-scale distributed settings, or assume complete asynchrony [OG07, MB15], which does not scale well.

To further elaborate on our point, consider the processes P1 and P2 in Fig. 3.2, with events {e10, e11, e12, e13, e14, e15} on process P1 and events {e20, e21, e22, e23, e24} on process P2, divided into two segments, seg1 and seg2, and an LTL formula

ϕ = (r → (¬p U r)).

Figure 3.2: Distributed computation. [The events of P1 and P2, labeled with the propositions that hold in them, divided into segments seg1 and seg2.]

Observe that the predicate p (resp. r) is true at events e20 and e24 (resp. e14), and in the rest of the events both predicates are false, denoted by ∅. In the scenario where e20 happens before e10 and e14 happens before e24, the LTL property ϕ is satisfied. However, the scenario where e10 happens before e20 and e14 happens after e24 violates ϕ. Thus, following the above example, the main research problem we aim to tackle in this paper is the following. Given a finite distributed computation and an LTL formula, our objective is to design efficient algorithms that determine whether or not the computation satisfies the formula.
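Under partial synchrony, the ordering test itself is simple to state: two events on different processes can be definitively ordered only when their local timestamps differ by more than ε. A minimal sketch (the timestamps are illustrative, not read verbatim off Figure 3.1):

```python
EPS = 2  # clock-synchronization constant (maximum skew), as in the example

def may_be_concurrent(ta, tb, eps=EPS):
    """Events on different processes with local times ta and tb cannot be
    ordered (and must be treated as concurrent) iff |ta - tb| <= eps."""
    return abs(ta - tb) <= eps

print(may_be_concurrent(7, 9))   # True: both orderings must be explored
print(may_be_concurrent(3, 7))   # False: the order is fixed
```

Every pair for which this test returns true doubles, in the worst case, the number of interleavings the monitor must consider, which is the source of the blow-up described above.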
As shown above, the main obstacle in solving this problem is the explosion of interleavings at run time that need to be explored in order to monitor a computation.

Contributions. In order to address the combinatorial explosion of interleavings introduced by the absence of a global clock, our first design choice is a practical assumption, namely, a bounded skew of ε between the local clocks of each pair of processes, which is guaranteed by a clock synchronization mechanism (e.g., NTP [Mil10]). Our first technique is based on constructing the LTL3 [BLS11] monitor automaton of an LTL formula and constructing multiple SMT queries to determine which states of the monitor automaton are reachable for a given distributed computation. For example, Fig. 3.3 shows the monitor automaton for the formula ϕ mentioned earlier, and one has to construct 4 different SMT queries to determine the set of all possible reachable states at the end of the computation in Fig. 3.2. We transform our monitoring decision problem into an SMT solving problem. The SMT instance includes constraints that encode (1) our monitoring algorithm based on the 3-valued semantics of LTL [BLS11], (2) the behavior of communicating processes and their local state changes in terms of a distributed computation, and (3) the happened-before relation subject to the ε clock skew assumption. Then, it attempts to concretize an uninterpreted function whose evaluation provides the possible verdicts of the monitor with respect to the given computation.

Figure 3.3: Monitor automaton for formula ϕ. [States q0, q1, q2, q⊤, and q⊥, with transitions labeled by subsets of {p, r}.]

In order to make the verification problem tractable, we chop a computation into multiple segments and effectively reduce the search space of each SMT query (see Fig. 3.4). Thus, the result of monitoring each segment (the possible LTL3 states) should be carried to the next segment.
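For intuition, the reachability question that each SMT query answers can be phrased, for tiny instances only, as a brute-force search over interleavings. The monitor below is a hypothetical three-state DFA for ¬p U r, not the automaton of Fig. 3.3, and '-' marks an event where neither proposition holds:

```python
# Brute-force illustration of the decision problem: which monitor states
# (and hence verdicts) are reachable under some ordering of the events?

def interleavings(a, b):
    """All order-preserving merges of two local event sequences."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def verdicts(p1, p2, delta, q0, verdict_of):
    """Run the monitor DFA on every interleaving; collect the verdicts."""
    out = set()
    for trace in interleavings(p1, p2):
        q = q0
        for letter in trace:
            q = delta[(q, letter)]
        out.add(verdict_of[q])
    return out

delta = {('q0', 'p'): 'qF', ('q0', 'r'): 'qT', ('q0', '-'): 'q0'}
for x in ['p', 'r', '-']:          # qT and qF are trap (verdict) states
    delta[('qT', x)] = 'qT'
    delta[('qF', x)] = 'qF'
verdict_of = {'q0': '?', 'qT': 'true', 'qF': 'false'}

# p on one process, r on the other: the verdict depends on the ordering
print(sorted(verdicts(['p'], ['r'], delta, 'q0', verdict_of)))
```

The number of interleavings grows combinatorially with the number of events and processes, which is exactly why the dissertation replaces this enumeration with SMT queries per monitor path.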
Furthermore, given that distributed applications nowadays run on massive cloud services, we extend our solution to a parallel monitoring algorithm in order to utilize the available computing infrastructure and achieve better scalability.

The intuition behind our second monitoring technique is that, since (in the first approach) running SMT queries to test whether each state of the LTL3 monitor automaton is reachable is excessive, it should be sufficient to test whether temporal sub-formulas of an LTL formula hold in a distributed computation. Similar to the first approach, we utilize segmentation to break down the problem size. In the second approach, to carry the result of monitoring from one segment to the next, we also develop a formula progression technique. Specifically, given a finite trace α and an LTL formula ϕ, we define a function Pr such that Pr(α, ϕ) characterizes the progression of ϕ over α. Progression is defined as the rewritten formula for future extensions of α depending on what has been observed thus far; it returns either true, false, or an LTL formula.

Figure 3.4: Progression and segmentation. [The computation of Fig. 3.2, divided into seg1 and seg2, monitored against ϕ = (r → (¬p U r)).]

We emphasize that the main difference between our technique and the classic rewriting technique [HR01a] is that function Pr takes a finite trace as input, while the algorithm in [HR01a] rewrites the input LTL formula in a state-by-state manner. This means that in our setting, rewriting based on the fixed-point representation of temporal operators is not possible. Our motivation comes from the fact that when a given distributed computation is chopped into a number of segments, a state-by-state rewriting approach would incur too many SMT queries, making it unscalable. For example, in Fig. 3.4 (which is the computation of Fig. 3.2 chopped into two segments), our progression-based approach needs the same 4 SMT queries for seg1 (2 for each of the sub-formulas ◇r and □(¬p)).
The evaluation yields ¬(◇r) and (r → (¬p U r)) as the possible formulas; as a result, we only need to build 4 SMT queries in seg2, compared to 5 for the automata-based approach. Our method is fully implemented, and the datasets generated and/or analyzed during the current study are available at https://github.com/TART-MSU/dist-ltl-rv. We make a detailed comparison between the approaches proposed in this paper through not only a set of rigorous synthetic experiments, but also by monitoring the same set of consistency conditions in Cassandra. We also put our approach to the test using a real-time airspace monitoring dataset (RACE) from NASA [MGS19]. Our experiments show that the progression-based approach incurs 35% less overhead than the automata-based approach (see Section 3.5). In summary, the main contributions of this paper are as follows:

• We transform our monitoring decision problem into an SMT problem, making for an efficient yet correct approach to considering different interleavings. Given an LTL formula, our solution provides all possible verdicts on a given computation.
• We present two monitoring approaches to address the challenges (mentioned earlier) of distributed runtime verification with regard to LTL formulas under a partially synchronous setting. In our first approach, we keep track of the observed events and the possible future outcomes by employing an automata-based technique. In our second approach, we employ a more efficient progression-based technique, where we rewrite the given LTL specification based on the current observations. For both of our approaches, we consider a fault-proof central monitor.
• We divide a given computation into multiple segments in order to make the verification problem tractable and, as a result, significantly reduce the search space of each SMT query. Furthermore, we parallelize our monitoring technique in order to utilize the available computational resources and gain greater scalability.
• Finally, we explore and report on extensive comparisons between our automata-based approach and our progression-based approach in terms of runtime and complexity.

3.1.1 Problem Statement

Given a distributed computation (E, ⇝), a valid sequence of consistent cuts is of the form C0C1C2 · · ·, where for all i ≥ 0: (1) Ci denotes a set of events included in the consistent cut; (2) Ci is a subset of its succeeding consistent cut Ci+1, that is, Ci ⊂ Ci+1; and (3) Ci+1 has one additional event compared to its preceding consistent cut Ci, that is, |Ci| + 1 = |Ci+1|. Let C denote the set of all valid sequences of consistent cuts. We define the set of all traces of (E, ⇝) as follows:

Tr(E, ⇝) = { front(C0)front(C1) · · · | C0C1C2 · · · ∈ C }.

Now, for our automata-based approach (resp. progression-based approach), the evaluation of the LTL formula ϕ with respect to (E, ⇝) in the 3-valued semantics (resp. finite semantics) is the following:

[(E, ⇝) |=3 ϕ] = { [α |=3 ϕ] | α ∈ Tr(E, ⇝) }

and

[(E, ⇝) |=F ϕ] = { [α |=F ϕ] | α ∈ Tr(E, ⇝) },

respectively. This means that evaluating a distributed computation with respect to a formula results in a set of verdicts, as a computation may involve several traces.

3.2 Formula Progression for LTL

In a synchronous system, verification of a computation can be performed in a state-by-state manner due to the existence of a total order on events [BF16a]. However, in a partially synchronous system, no such total order of events is available. A distributed computation (E, ⇝) may exhibit different partial orders of events, dictated by different interleavings of events. Therefore, it is possible to obtain multiple verdicts on the same distributed computation (E, ⇝).
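For small computations, the definition of Tr(E, ⇝) can be made executable: valid sequences of consistent cuts C0 ⊂ C1 ⊂ · · · that grow by one event per step are exactly the linear extensions of the happened-before relation. A sketch, with illustrative event names and relation:

```python
# Hedged sketch: enumerate every trace of a tiny computation by listing
# the linear extensions of the happened-before pairs.

def linearizations(events, hb):
    """Yield every total order of `events` consistent with the
    happened-before pairs in hb (a set of (earlier, later) tuples)."""
    def rec(done, remaining):
        if not remaining:
            yield list(done)
            return
        for e in sorted(remaining):
            # e is enabled iff all its hb-predecessors were already taken
            if all(a in done for (a, b) in hb if b == e):
                yield from rec(done + [e], remaining - {e})
    yield from rec([], set(events))

hb = {('e10', 'e11')}                 # e10 happened before e11
traces = list(linearizations({'e10', 'e11', 'e20'}, hb))
# e20 is unordered with respect to both other events: three traces
```

Each such trace may evaluate a formula differently, which is precisely why the evaluations above return a set of verdicts rather than a single one.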
In order to explore these verdicts, we propose a monitoring approach based on formula progression that, if possible, partially evaluates a formula on the current computation and, based on the verdict, provides a rewritten formula that is to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored to be ϕ = (a → ◇b). Now, if in some trace in a computation the monitor observes a, then for the extensions of the computation it is enough to monitor the rewritten formula ϕ′ = ◇b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formulas progression, which we discuss at length in the following section.

Definition 7. A progression function Pr : Σ∗ × ΦLTL → ΦLTL is one where, for every finite trace α ∈ Σ∗, infinite trace σ ∈ Σω, and formula ϕ ∈ ΦLTL, we have: ασ |= ϕ if and only if σ |= Pr(α, ϕ).

We emphasize that the main difference between our technique and the classic rewriting technique [HR01a] is that function Pr takes a finite trace as input, while the algorithm in [HR01a] rewrites the input LTL formula in a state-by-state manner. This means that rewriting based on the fixed-point representation of temporal operators is not possible. The motivation for our approach comes from the fact that a given distributed computation is chopped into a number of segments, and verification of each segment is handled by an SMT query. A state-by-state approach would incur too many SMT queries, making it unscalable.

Remark 1. It is straightforward to see that for any α ∈ Σ∗ and ϕ ∈ ΦLTL, if a progression function returns a non-trivial formula, which we denote by Pr(α, ϕ) = ϕ′ for some ϕ′ ∈ ΦLTL, then the verdict of monitoring is unknown.

Atomic propositions. Let ϕ = p for some p ∈ AP. The verdict is provided depending upon whether or not p ∈ α(0).
This is the only case where the output of Pr cannot be a rewritten formula; the possible verdicts are either true or false:

Pr(α, ϕ) =
  true   if p ∈ α(0)
  false  if p ∉ α(0)

Negation. Let ϕ = ¬φ. We have Pr(α, ϕ) = ¬Pr(α, φ).

Disjunction. Let ϕ = ϕ1 ∨ ϕ2. If either sub-formula ϕ1 or ϕ2 evaluates to false, then the progression of ϕ becomes the other sub-formula (ϕ2 or ϕ1, respectively), since that sub-formula alone is responsible for the verdict on all future computations:

Pr(α, ϕ) =
  true        if Pr(α, ϕ1) = true ∨ Pr(α, ϕ2) = true
  false       if Pr(α, ϕ1) = false ∧ Pr(α, ϕ2) = false
  ϕ′2         if Pr(α, ϕ1) = false ∧ Pr(α, ϕ2) = ϕ′2
  ϕ′1         if Pr(α, ϕ2) = false ∧ Pr(α, ϕ1) = ϕ′1
  ϕ′1 ∨ ϕ′2   if Pr(α, ϕ1) = ϕ′1 ∧ Pr(α, ϕ2) = ϕ′2

Next operator. Let ϕ = ◯φ. The verdicts true, false, and φ′ can only be reached if α^1 is not an empty trace, that is, |α^1| ≠ 0. Otherwise, if we are at the last event in the trace, then the progression of ϕ becomes φ, implying that φ must hold at the beginning of the future extension:

Pr(α, ϕ) =
  true   if Pr(α^1, φ) = true ∧ |α^1| ≠ 0
  false  if Pr(α^1, φ) = false ∧ |α^1| ≠ 0
  φ′     if Pr(α^1, φ) = φ′ ∧ |α^1| ≠ 0
  φ      if |α^1| = 0

Always and eventually operators. Progression of the temporal operator 'always', □ (resp. 'eventually', ◇), may yield false (resp. true) or remain unchanged:

Pr(α, □φ) =
  false  if [α |=F □φ] = ⊥
  □φ     otherwise

Pr(α, ◇φ) =
  true   if [α |=F ◇φ] = ⊤
  ◇φ     otherwise

Note that the semantics of FLTL is not frequently used, due to LTL3 being generally more expressive, as shown in [BLS10b]. However, the expressiveness of LTL3 would actually be an issue if it were used to construct the progression rules.
To be more precise, the '?' (unknown) verdict in the LTL3 semantics would raise additional and unnecessary complications in the progression rules, as this verdict does not provide any additional information as far as our progression-based approach is concerned. Therefore, we use FLTL for specifying the progression rules without any loss of generality, as shown later in the proof of Lemma 1.

Until operator. Let ϕ = ϕ1 U ϕ2. Recall that ϕ1 U ϕ2 = ϕ2 ∨ (ϕ1 ∧ ◯(ϕ1 U ϕ2)). We divide the U formula into two parts, one with the globally operator (□ϕ1) and the other with the eventually operator (◇ϕ2). These sub-formulas are evaluated separately, and the verdict of each of them is used to define the progression for the U operator. However, for the case when both ϕ1 and ϕ2 occur in the same computation, we cannot come to a verdict without considering the order of satisfaction of these sub-formulas. That is, on a given finite trace α, if ϕ2 holds in α(i) (denoted ◯^i ϕ2) and ϕ1 holds in all states from α(0) to α(i−1) (denoted □^(i−1) ϕ1), then the progression of ϕ becomes true. If this is not the case and ϕ1 does not hold throughout α, the progression of ϕ becomes false, since this signifies a break in the streak of ϕ1 required for ϕ to hold. If it is neither of the above two cases and the evaluated verdict of Pr(α, ◇ϕ2) is ⊤, then this represents a case where we do not have enough information about ϕ1 to evaluate ϕ1 U ϕ2, making the progression solely dependent on ϕ1. The progression of ϕ remains unchanged if ϕ1 holds throughout α but ϕ2 does not hold anywhere:

Pr(α, ϕ) =
  true                   if ∃i ∈ [0, |α| − 1] . [α |=F ◯^i Pr(α, ϕ2)] = ⊤ ∧ [α |=F □^(i−1) Pr(α, ϕ1)] = ⊤
  false                  if [α |=F □ Pr(α, ϕ1)] = ⊥ ∧ not the first case
  Pr(α, ϕ1)              if [α |=F ◇ Pr(α, ϕ2)] = ⊤ ∧ not the second case
  Pr(α, ϕ1) U Pr(α, ϕ2)  if [α |=F □ Pr(α, ϕ1)] = ⊤ ∧ [α |=F ◇ Pr(α, ϕ2)] = ⊥

Figure 3.5: Progression example. [A trace divided into three segments α, α′, α″; proposition r holds in α′, and q then p hold in α″.]

Example.
Consider the formula ϕ = ◇r → (¬p U q), with sub-formulas ϕs = {◇r, ◇q, □¬p} according to our progression rules. Consider the trace in Fig. 3.5, divided into three segments. In the first segment α, none of p, q, or r is present, so by the progression rules defined above, ϕ remains unchanged for the next segment; i.e., Pr(α, ϕ) = ϕ. In the second segment α′, proposition r is observed; this satisfies sub-formula ◇r, and the progressed formula becomes ¬p U q; i.e., Pr(α′, ϕ) = ¬p U q. In the next segment α″, proposition q occurs before p. This falls under the first case of the until progression operator. Since q happens after a streak of ¬p, we arrive at the verdict true; i.e., Pr(α″, ¬p U q) = true. Put another way, Pr(αα′α″, ϕ) = true.

Lemma 1. Given an LTL formula ϕ and two finite traces α, σ ∈ Σ∗, the trace ασ satisfies ϕ if and only if σ satisfies Pr(α, ϕ). Formally,

[ασ |=F ϕ] ⇐⇒ [σ |=F Pr(α, ϕ)]

Proof. We distinguish the following cases:

Case 1: First, we consider the base case of this proof, where the formula is an atomic proposition, that is, ϕ = p.

(⇒) Let us first consider that p is observed in the first state of ασ. This implies that [ασ |=F ϕ] yields true and Pr(α, ϕ) yields ⊤. Therefore, [σ |=F Pr(α, ϕ)] must also yield true. Now, let us consider that p is not observed in the first state of ασ. This implies that [ασ |=F ϕ] yields false and Pr(α, ϕ) yields ⊥. Therefore, [σ |=F Pr(α, ϕ)] must also yield false.

(⇐) Let us first consider that [σ |=F Pr(α, ϕ)] yields true. This implies that Pr(α, ϕ) yields ⊤ and [ασ |=F ϕ] yields true. Therefore, p must have been observed in the first state of ασ. Now, let us consider that [σ |=F Pr(α, ϕ)] yields false. This implies that Pr(α, ϕ) yields ⊥ and [ασ |=F ϕ] yields false. Therefore, p must not have been observed in the first state of ασ.

Case 2: Assume that the proof has been established for the case when the formula is φ. Now, we consider the case where the formula is ϕ = ¬φ.
We can say that [ασ |=F ¬φ] is equivalent to ¬[ασ |=F φ], according to the finite-trace semantics of LTL. We can also say that [σ |=F Pr(α, ¬φ)] is equivalent to [σ |=F ¬Pr(α, φ)], since Pr(α, ¬φ) = ¬Pr(α, φ) is defined as a progression rule. Furthermore, [σ |=F ¬Pr(α, φ)] is equivalent to ¬[σ |=F Pr(α, φ)], according to the finite-trace semantics of LTL. Based on our assumption, the proof has already been established for [ασ |=F φ] ⇐⇒ [σ |=F Pr(α, φ)]. Therefore, ¬[ασ |=F φ] ⇐⇒ ¬[σ |=F Pr(α, φ)] and, by extension,

[ασ |=F ¬φ] ⇐⇒ [σ |=F Pr(α, ¬φ)]

Case 3: Assume that the proof has been established for the case when the formula is φ. Now, we consider the case where the formula is ϕ = ◯φ. Let us first consider the case where the length of the trace α is 1, that is, |α| = 1 and |α^1| = 0. In this particular case, [ασ |=F ◯φ] is equivalent to [σ |=F φ]. Furthermore, Pr(α, ◯φ) = φ, which implies that [σ |=F Pr(α, ◯φ)] is equivalent to [σ |=F φ]. Therefore, [ασ |=F ◯φ] ⇐⇒ [σ |=F Pr(α, ◯φ)]. Now, let us consider the case where the length of the trace α is greater than 1, that is, |α| > 1 and |α^1| ≥ 1. In this case, [ασ |=F ◯φ] is equivalent to [α^1σ |=F φ], and [σ |=F Pr(α, ◯φ)] is equivalent to [σ |=F Pr(α^1, φ)]. Based on our assumption, the proof has already been established for [α^1σ |=F φ] ⇐⇒ [σ |=F Pr(α^1, φ)]. Therefore, [ασ |=F ◯φ] ⇐⇒ [σ |=F Pr(α, ◯φ)].

Case 4: Assume that the proof has been established for the cases when the formulas are ϕ1 and ϕ2. Now, we consider the case where the formula is ϕ = ϕ1 ∨ ϕ2. Based on our assumption, the proof has already been established for [ασ |=F ϕ1] ⇐⇒ [σ |=F Pr(α, ϕ1)] and [ασ |=F ϕ2] ⇐⇒ [σ |=F Pr(α, ϕ2)]. Therefore, we can derive the following:

[ασ |=F (ϕ1 ∨ ϕ2)] ⇐⇒ [ασ |=F ϕ1] ∨ [ασ |=F ϕ2] ⇐⇒ [σ |=F Pr(α, ϕ1)] ∨ [σ |=F Pr(α, ϕ2)] ⇐⇒ [σ |=F Pr(α, ϕ1) ∨ Pr(α, ϕ2)] ⇐⇒ [σ |=F Pr(α, ϕ1 ∨ ϕ2)]

Case 5: Assume that the proof has been established for the cases when the formulas are ϕ1 and ϕ2.
Now, we consider the case where the formula is ϕ = ϕ1 U ϕ2. First, we prove the equivalence between ϕ = ϕ1 U ϕ2 and its corresponding SMT formula. That is,

[(σ, i) |=F ϕ1 U ϕ2] ⇐⇒ [∃k ≥ i . ◯^k ϕ2 ∧ □^(k−1) ϕ1]

To this end, we have:

[(σ, i) |=F ϕ1 U ϕ2]
⇐⇒ [(σ, i) |=F ϕ2 ∨ (ϕ1 ∧ ◯(ϕ1 U ϕ2))]
⇐⇒ [(σ, i) |=F ϕ2] ∨ [(σ, i) |=F ϕ1 ∧ ◯(ϕ1 U ϕ2)]
⇐⇒ [(σ, i) |=F ϕ2] ∨ ([(σ, i) |=F ϕ1] ∧ [(σ, i + 1) |=F ϕ1 U ϕ2])
⇐⇒ [(σ, i) |=F ϕ2] ∨ ([(σ, i) |=F ϕ1] ∧ [(σ, i + 1) |=F ϕ2 ∨ (ϕ1 ∧ ◯(ϕ1 U ϕ2))])
⇐⇒ [(σ, i) |=F ϕ2] ∨ ([(σ, i) |=F ϕ1] ∧ [(σ, i + 1) |=F ϕ2]) ∨ · · · ∨ [(σ, i + k) |=F ϕ1 U ϕ2] for some k ≥ 1.

Now, in order for [(σ, i) |=F ϕ1 U ϕ2] to yield true, there must be a k ≥ 1 such that [(σ, i) |=F ϕ1 ∧ · · · ∧ (σ, i + k − 1) |=F ϕ1 ∧ (σ, i + k) |=F ϕ2], that is,

[(σ, i) |=F ϕ1 U ϕ2]
⇐⇒ [∃k ≥ 1 . (σ, i) |=F ϕ1 ∧ · · · ∧ (σ, i + k − 1) |=F ϕ1 ∧ (σ, i + k) |=F ϕ2]
⇐⇒ [∃k ≥ 1 . (σ, i) |=F ◯^k ϕ2 ∧ (σ, i) |=F □^(k−1) ϕ1]

Now, we prove this case of the lemma as follows.

(⇒) Let us assume [ασ |=F ϕ1 U ϕ2] = ⊤ and [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊥. If [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊥, then either [α |=F □ϕ1] = ⊥, or [α |=F □ϕ1] = ⊤ ∧ [α |=F ◇ϕ2] = ⊥ ∧ [σ |=F Pr(α, ϕ1) U Pr(α, ϕ2)] = ⊥. However, neither of these two cases is possible, since in order for [ασ |=F ϕ1 U ϕ2] = ⊤ to hold, ϕ1 must hold until ϕ2 in ασ. Therefore, [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊤.

(⇐) Let us assume [ασ |=F ϕ1 U ϕ2] = ⊥ and [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊤. If [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊤, then either [α |=F ϕ1 U ϕ2] = ⊤, or [α |=F □ϕ1] = ⊤ ∧ [σ |=F Pr(α, ϕ1) U Pr(α, ϕ2)] = ⊤. However, neither of these two cases is possible, since in order for [ασ |=F ϕ1 U ϕ2] = ⊥ to hold, either ϕ1 must be violated in some state before ϕ2 is observed, or ϕ2 is never observed. Therefore, [σ |=F Pr(α, ϕ1 U ϕ2)] = ⊥.
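The progression rules and the example of Fig. 3.5 can be checked mechanically. The sketch below encodes formulas as nested tuples and implements the rules for the operators used in the example; since a single Python list is totally ordered, the third until case collapses into the others and is omitted. The segment contents are an illustrative reading of Fig. 3.5, and `ev` is a hypothetical FLTL evaluator, not the dissertation's SMT encoding:

```python
# Formulas: ('ap', p), ('not', f), ('or', f, g), ('F', f), ('G', f),
# ('U', f, g).  States are sets of the propositions that hold in them.

TRUE, FALSE = ('true',), ('false',)

def ev(trace, i, f):
    """Finite-trace (FLTL-style) evaluation of f at position i."""
    op = f[0]
    if op == 'ap':  return f[1] in trace[i]
    if op == 'not': return not ev(trace, i, f[1])
    if op == 'or':  return ev(trace, i, f[1]) or ev(trace, i, f[2])
    if op == 'F':   return any(ev(trace, j, f[1]) for j in range(i, len(trace)))
    if op == 'G':   return all(ev(trace, j, f[1]) for j in range(i, len(trace)))
    if op == 'U':   return any(ev(trace, k, f[2]) and
                               all(ev(trace, j, f[1]) for j in range(i, k))
                               for k in range(i, len(trace)))
    raise ValueError(op)

def pr(trace, f):
    """Progression Pr(trace, f), with constant simplification (a sketch)."""
    op = f[0]
    if op == 'ap':
        return TRUE if f[1] in trace[0] else FALSE
    if op == 'not':
        g = pr(trace, f[1])
        return FALSE if g == TRUE else TRUE if g == FALSE else ('not', g)
    if op == 'or':
        g1, g2 = pr(trace, f[1]), pr(trace, f[2])
        if TRUE in (g1, g2): return TRUE
        if g1 == FALSE and g2 == FALSE: return FALSE
        if g1 == FALSE: return g2
        if g2 == FALSE: return g1
        return ('or', g1, g2)
    if op == 'G':
        return f if ev(trace, 0, f) else FALSE   # false or unchanged
    if op == 'F':
        return TRUE if ev(trace, 0, f) else f    # true or unchanged
    if op == 'U':
        if ev(trace, 0, f):               return TRUE   # first until case
        if not ev(trace, 0, ('G', f[1])): return FALSE  # streak broken
        return f                                        # unchanged
    raise ValueError(op)

# Illustrative reading of the segments of Fig. 3.5
ALPHA  = [set(), set(), set(), set()]
ALPHA1 = [{'r'}, set()]
ALPHA2 = [{'q'}, {'p'}]

UNTIL = ('U', ('not', ('ap', 'p')), ('ap', 'q'))       # ¬p U q
PHI   = ('or', ('not', ('F', ('ap', 'r'))), UNTIL)     # encodes ◇r → (¬p U q)

assert pr(ALPHA, PHI) == PHI        # Pr(α, ϕ) = ϕ
assert pr(ALPHA1, PHI) == UNTIL     # Pr(α′, ϕ) = ¬p U q
assert pr(ALPHA2, UNTIL) == TRUE    # Pr(α″, ¬p U q) = true
```

The three assertions reproduce the three progression steps of the worked example above.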
3.3 SMT-based Solution

In this section, we elaborate on our solution for distributed monitoring using the two monitoring techniques mentioned before: (1) the automata-based approach and (2) the progression-based approach.

3.3.1 Overall Idea

Automata-based approach. Recall from Section 3.1 (Fig. 3.4) that monitoring a distributed computation may result in multiple verdicts, depending upon different orderings of events. In other words, given a distributed computation (E, ⇝) and an LTL formula ϕ, different orderings of events may reach different states in the monitor automaton Mϕ = (Σ, Q, q0, δ, λ) (as defined in Definition 3). In order to ensure that all possible verdicts are explored, we generate an SMT instance for (1) the distributed computation (E, ⇝) and (2) each possible path in the LTL3 monitor. Thus, the corresponding decision problem is the following: given (E, ⇝) and a monitor path q0 q1 · · · qm in an LTL3 monitor, can (E, ⇝) reach qm? If the SMT instance is satisfiable, then λ(qm) is a possible verdict. For example, for the monitor in Fig. 2.1, we consider two paths q0^∗ q⊥ and q0^∗ q⊤ (and, hence, two SMT instances). Thus, if both instances turn out to be unsatisfiable, then the resulting monitor state is q0, where λ(q0) = ?.

We note that LTL3 monitors may contain non-self-loop cycles. In order to simplify the SMT instance creation process (for each possible path in the LTL3 monitor), we collapse each non-self-loop cycle into one state with a self-loop labeled by the sequence of events in the cycle, using Algorithm 1.

Figure 3.6: Removing non-self-loop cycles in an LTL3 monitor. [(a) The original monitor, (b) after the necessary self-loops are added, and (c) after the non-self-loop cycles are eliminated.]

As an example, in Fig. 3.6, Algorithm 1 first takes an LTL3 monitor (Fig. 3.6a) and adds the necessary self-loops (Fig. 3.6b). Then it eliminates all
non-self-loop cycles by removing, in each cycle, the transitions from states with higher identifiers to states with lower identifiers (Fig. 3.6c). The non-deterministic nature of the final automaton ensures that all the transitions and the accepted language of the automaton are preserved.

Algorithm 1: Non-Self-Loop Cycle Removal Algorithm.
1: Input: Mϕ = (Σ, Q, q0, δ, λ)
2: Output: M′ϕ = (Σ, Q, q0, δ′, λ)
3: Let CP be the set of all possible paths containing cycles
4: δ′ ← δ
5: for each q ∈ Q do
6:   for each cycle q −sm→ · · · −sn→ q ∈ CP do
7:     δ′(q, sm · · · sn) ← q
8:   end for
9: end for
10: for each qm −s→ qn ∈ {qi −s→ qj | q −sm→ · · · qi −s→ qj · · · −sn→ q ∈ CP} do
11:   if m > n then
12:     δ′(qm, s) ← ∅
13:   end if
14: end for
15: return M′ϕ

Lemma 2. Let Mϕ = (Σ, Q, q0, δ, λ) be the monitor automaton for an LTL formula ϕ, and let M′ϕ = (Σ, Q, q0, δ′, λ) be the monitor automaton with no non-self-loop cycles, obtained by applying Algorithm 1 to Mϕ. Given a finite trace α = a1 a2 · · · an and an initial state q ∈ Q, we have λ(δ(q, α)) = λ(δ′(q, α)).

Proof. We distinguish the following cases:

Case 1 (⇒): First we show λ(δ(q, α)) → λ(δ′(q, α)), that is,

∀α, ∀q ∈ Q . λ(δ(q, α)) =⇒ λ(δ′(q, α))

Let α = a1 a2 · · · an, where ∀i ∈ [1, n] . ai ∈ Σ. Algorithm 1 removes non-self-loop cycles by removing transitions, so a transition δ′(q, ai) corresponding to δ(q, ai), where i ∈ [1, m], may not exist. This happens only when ∃k ∈ [1, i] . q′ −a(i−k)→ · · · q −ai→ q′, i.e., the removed transition lies on a cycle. That cycle corresponds to δ′(q′, a(i−k) · · · ai) = q′, which is one of the added self-loops. The rest of the transitions are maintained, such that δ(q, ai) = δ′(q, ai) for q ∈ Q and i ∈ [1, m].

Case 2 (⇐): Now we show λ(δ′(q, α)) → λ(δ(q, α)), that is,

∀α, ∀q ∈ Q . λ(δ′(q, α)) =⇒ λ(δ(q, α))

Let α = a1 a2 · · · an, where ∀i ∈ [1, n] . ai ∈ Σ. A self-loop in M′ϕ can be represented by ∃i ∈ [1, n], ∃k ∈ [1, n − i] . δ′(q, ai ai+1 · · · ai+k) = q.
In other words, there exists a path q −ai→ q′ −a(i+1)→ · · · −a(i+k)→ q in Mϕ. The rest of the non-self-loop transitions are the same, such that δ′(q, ai) = δ(q, ai) for q ∈ Q and i ∈ [1, m]. Thus, λ(δ(q, α)) = λ(δ′(q, α)).

Progression-based approach. In a synchronous system, verification of a computation can be performed state by state, due to the existence of a total order on events [BF16a]. However, in a partially synchronous system, no such ordering of events is available. A distributed computation (E, ⇝) may have different orderings of events, dictated by different interleavings. Therefore, it is possible to obtain multiple verdicts on the same distributed computation (E, ⇝). In order to explore these verdicts, we propose a monitoring approach based on formula progression that, if possible, partially evaluates a formula on the current computation and, based on the verdict, provides a rewritten formula that is to be evaluated on the extensions of the computation. As an example, let us consider the formula to be monitored to be ϕ = (a → ◇b). Now, if in some trace in a computation the monitor observes a, then for the extensions of the computation it is enough to monitor the rewritten formula ϕ′ = ◇b, as the final verdict is no longer dependent on the occurrence of a. We call this method of rewriting formulas progression, which we discuss at length later on. In the next two subsections, we present the SMT entities and constraints with respect to one monitor path and a distributed computation.

3.3.2 SMT Entities

SMT entities represent the sub-formulas of an LTL formula and a distributed computation. After the verdicts for all the sub-formulas are generated, we construct our rewritten formula by attaching these verdicts to their corresponding parent formulas in the parse tree and then performing an in-order traversal starting from the root of the parse tree.
At the end of the traversal, the resulting formula is, in fact, the progression for the next computation. We now introduce the entities that represent a path in an LTL3 monitor Mϕ = (Σ, Q, q0, δ, λ) for an LTL formula ϕ and a distributed computation (E, ⇝). It should be noted that the SMT entities in this subsection are used in both the automata-based and the progression-based approaches.

Monitor automaton. Let q0 −s0→ q1 −s1→ · · · (qj −sj→ qj)∗ · · · −s(m−1)→ qm be a path of the monitor Mϕ, which may or may not include a self-loop. We include a non-negative integer variable ki for each transition qi −si→ qi+1, where i ∈ [0, m − 1] and si ∈ Σ. This also holds for the self-loop qj −sj→ qj, for which we include a non-negative integer kj.

Distributed computation. In our SMT encoding, the set of events E is represented by a bit vector, where each bit corresponds to an individual event in the distributed computation (E, ⇝). We pre-process the distributed computation, during which we create an |E| × |E| matrix hbSet to incorporate the additional happened-before relations obtained from the clock-synchronization algorithm. Afterwards, we populate hbSet with 0's and 1's, such that hbSet[i][j] = 1 if E[i] ⇝ E[j], and hbSet[i][j] = 0 otherwise. We introduce a function µ : E × AP → {true, false} in order to establish a relation between each event and the atomic propositions that hold in it. In the event that other variables or constants are used in defining the predicates (e.g., x1 + x2 ≥ 2), µ is constructed accordingly. Finally, we introduce an uninterpreted function ρ : Z≥0 → 2^E that identifies a sequence of consistent cuts from ∅ to E for reaching a verdict, while satisfying a number of constraints explained in Section 3.3.3.

3.3.3 SMT Constraints

Once we have defined the necessary SMT entities, we move on to the SMT constraints.
We first define the SMT constraints over consistent cuts that are common to both the automata-based and the progression-based approaches. Afterwards, we define the SMT constraints that depend on the methodology.

Consistent cut constraints over ρ. In order to ensure that the uninterpreted function ρ identifies a sequence of consistent cuts, we enforce certain consistent cut constraints. The first constraint enforces that each element in the range of ρ is in fact a consistent cut:

∀i ∈ [0, m] . ∀e, e′ ∈ E . ((e′ ⇝ e) ∧ (e ∈ ρ(i))) → (e′ ∈ ρ(i))

Next, we enforce that the sequence of consistent cuts identified by ρ starts from an empty set of events and that each successive cut contains exactly one more event than its predecessor:

∀i ∈ [0, m] . |ρ(i + 1)| = |ρ(i)| + 1

Finally, we ensure that each successive consistent cut is immediately reachable in (E, ⇝) by enforcing a subset relation:

∀i ∈ [0, m] . ρ(i) ⊆ ρ(i + 1)

Once a sequence of consistent cuts has been generated, we check whether the sequence satisfies the specification. This is done using (1) the progression-based approach, where the LTL formula is represented by an SMT constraint, and (2) the LTL3 automata-based approach, where a path in the automaton is represented as an SMT constraint. This is repeated for all sub-formulas of the original LTL formula and all paths in the LTL3 automaton, respectively, as discussed below.

Constraints for the LTL3 automaton over ρ. These constraints are responsible for generating a valid sequence of consistent cuts, given a distributed computation (E, ⇝), that runs on the monitor path q1 −s1→ q2 · · · qj∗ · · · −s(m−1)→ qm. We begin with interpreting ρ(km) by requiring that running (E, ⇝) ends in monitor state qm.
The corresponding SMT constraint is:

µ(front(ρ(km)), s(m−1))

For every monitor state qi, where i ∈ [0, m − 1], if qi does not have a self-loop, the corresponding SMT constraint is:

µ(front(ρ(k(i+1) − 1)), si) ∧ (ki = k(i+1) − 1)

For every monitor state qj, where j ∈ [0, m − 1], suppose qj has a self-loop (recall that a cycle of r transitions in the monitor automaton is collapsed into a self-loop labeled by a sequence of r letters). Let us imagine that this self-loop executes z times for some z ≥ 0, and let us denote the sequence of letters in the self-loop by sj1 sj2 · · · sjr. The corresponding SMT constraint is:

⋀_{i=1}^{z} ⋀_{n=1}^{r} µ(front(ρ(kj + r(i − 1) + n)), sjn)

Again, since z is a free variable in the above constraint, the solver will identify some value z ≥ 0, which is exactly what we need. To ensure that the domain of ρ starts from the empty consistent cut (i.e., ρ(0) = ∅), we add: k0 = 0.

Finally, let C denote the conjunction of all the above constraints. Recall that this conjunction is with respect to only one monitor path from q0 to qm. Since there may be multiple paths in the monitor automaton that can reach qm from q0, we replicate the above constraints for each such path. Suppose there are n such paths and let C1, C2, . . . , Cn be the corresponding SMT constraints for these n paths. We include the following constraint:

C1 ∨ C2 ∨ C3 ∨ · · · ∨ Cn

This means that if the SMT instance is satisfiable, then computation (E, ⇝) can reach monitor state qm from q0.

Constraints for LTL progression over ρ. Given a distributed computation (E, ⇝), the aforementioned constraints may generate a valid sequence of consistent cuts that may yield different verdicts based on the ordering of the concurrent events. Therefore, in order to avoid false positives, all possible outcomes are explored when evaluating an LTL formula ϕ on (E, ⇝).
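What the disjunction C1 ∨ · · · ∨ Cn decides can be seen in the following brute-force sketch (hypothetical helper, ours): it asks whether any event ordering consistent with happened-before drives the monitor to a target state. The SMT encoding answers the same question without the explicit enumeration, which is exponential in the number of events.

```python
from itertools import permutations

def some_linearization_reaches(n, hb, delta, q0, target, letter):
    """n: number of events; hb: happened-before matrix; delta: monitor
    transition function (state, letter) -> state; letter(e): monitor letter
    of event e. Returns True iff some hb-consistent ordering reaches target."""
    for order in permutations(range(n)):
        pos = {e: i for i, e in enumerate(order)}
        if any(hb[i][j] and pos[i] > pos[j]
               for i in range(n) for j in range(n)):
            continue  # this ordering violates happened-before
        q = q0
        for e in order:
            q = delta[(q, letter(e))]
        if q == target:
            return True
    return False
```

For two concurrent events, both verdict states of a monitor may be reachable under different orderings, which is precisely why all orderings must be accounted for.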
We achieve this by checking for both satisfaction and violation in the sequence of consistent cuts C0 C1 C2 · · · Cm identified by the uninterpreted function ρ. Note that monitoring any LTL formula using our progression rules will result in monitoring sub-formulas with only atomic propositions and the globally and eventually temporal operators:

ϕ = p : front(ρ(i)) |= p, for p ∈ AP (satisfaction, i.e., ⊤)
ϕ = □φ : ∃i ∈ [0, m]. front(ρ(i)) ̸|= φ (violation, i.e., ⊥)
ϕ = ◇φ : ∃i ∈ [0, m]. front(ρ(i)) |= φ (satisfaction, i.e., ⊤)

The opposite cases result in a rewritten formula that progresses to the next segment. In general, the verdict for any LTL formula is derived using our progression rules in Section 3.2.

3.4 Optimization

3.4.1 Segmentation of Distributed Computation

RV is known to be an NP-complete problem in the number of processes in a distributed setting [Gar02]. The complexity exhibits an even larger exponential blowup when verifying formulas with nested temporal operators. In order to cope with this complexity, we divide our computation into smaller segments, (seg1, ⇝)(seg2, ⇝) · · · (seg(l/g), ⇝), to create smaller, albeit more, SMT problems. Given a distributed computation (E, ⇝) of length l, we divide it into l/g smaller segments of length g. The set of events in segment j, where j ∈ [1, l/g], is the following:

segj = { e^n_{τ,σ,ω} | σ ∈ [max{0, (j − 1) × g − ε}, j × g] ∧ n ∈ [1, |P|] }

Note that each segment (barring seg1) has to be constructed starting at ε time units before the previous segment's ending point. This creates an overlap of ε time units between each pair of adjacent segments. Doing so ensures that no pair of possibly concurrent events becomes non-concurrent due to the splits caused by segmentation. Therefore, dividing the actual computation into segments does not have any effect on the final verdict of the said computation. We also use parallelization to make our algorithm perform faster, while utilizing most of the computation power modern processors are capable of offering.
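The segment construction above can be sketched as follows (a minimal sketch, ours, assuming events carry a real-valued timestamp): segment j covers the window [max(0, (j − 1)·g − ε), j·g], so any event within ε of a boundary appears in both adjacent segments.

```python
def segment(events, g, eps, num_segments):
    """events: list of (process, timestamp) pairs; g: segment length;
    eps: maximum clock skew. Returns overlapping segments: the eps overlap
    ensures no pair of possibly-concurrent events is split apart."""
    segs = []
    for j in range(1, num_segments + 1):
        lo, hi = max(0, (j - 1) * g - eps), j * g
        segs.append([e for e in events if lo <= e[1] <= hi])
    return segs
```

An event stamped just before a boundary lands in both segments, so concurrency that spans the boundary (within the skew ε) is preserved.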
Lemma 3. A distributed computation (E, ⇝) of length l satisfies an LTL formula ϕ if and only if the distributed computation (E, ⇝), divided into l/g segments of length g, satisfies ϕ using the automata-based approach. That is,

[(E, ⇝) |=3 ϕ] ⇐⇒ [(seg1.seg2. · · · .seg(l/g), ⇝) |=3 ϕ]

Proof. Let us assume [(E, ⇝) |=3 ϕ] ≠ [(seg1.seg2. · · · .seg(l/g), ⇝) |=3 ϕ], that is, {α |=3 ϕ | α ∈ Tr(E, ⇝)} ≠ {α |=3 ϕ | α ∈ Tr(seg1.seg2. · · · .seg(l/g), ⇝)} (recall Section 3.1.1).

(⇒) Let Ck be a consistent cut such that Ck is in Tr(E, ⇝), but not in Tr(seg1.seg2. · · · .seg(l/g), ⇝) for some k ∈ [0, |E|]. This implies that the frontier of Ck satisfies front(Ck) ⊄ seg1 and front(Ck) ⊄ seg2 and · · · and front(Ck) ⊄ seg(l/g). However, this is not possible, as according to the segmentation construction, there must be a segj with 1 ≤ j ≤ l/g such that front(Ck) ⊆ segj. Therefore, such a Ck cannot exist, and {α |=3 ϕ | α ∈ Tr(E, ⇝)} ⊆ {α |=3 ϕ | α ∈ Tr(seg1.seg2. · · · .seg(l/g), ⇝)}. By extension,

[(E, ⇝) |=3 ϕ] ⇒ [(seg1.seg2. · · · .seg(l/g), ⇝) |=3 ϕ]

(⇐) Let Ck be a consistent cut such that Ck is in Tr(seg1.seg2. · · · .seg(l/g), ⇝), but not in Tr(E, ⇝) for some k ∈ [0, |E|]. This implies front(Ck) ⊆ segj and front(Ck) ⊄ E for some j ∈ [1, l/g]. However, this is not possible, due to the fact that ∀j ∈ [1, l/g]. segj ⊆ E. Therefore, such a Ck cannot exist, and {α |=3 ϕ | α ∈ Tr(seg1.seg2. · · · .seg(l/g), ⇝)} ⊆ {α |=3 ϕ | α ∈ Tr(E, ⇝)}. By extension,

[(seg1.seg2. · · · .seg(l/g), ⇝) |=3 ϕ] ⇒ [(E, ⇝) |=3 ϕ]

Therefore, [(E, ⇝) |=3 ϕ] ⇐⇒ [(seg1.seg2. · · · .seg(l/g), ⇝) |=3 ϕ].

Lemma 4. A distributed computation (E, ⇝) of length l satisfies an LTL formula ϕ if and only if the distributed computation (E, ⇝), divided into l/g segments of length g, satisfies ϕ using the progression-based approach. That is,

[(E, ⇝) |=F ϕ] ⇐⇒ [(seg1.seg2. · · · .seg(l/g), ⇝) |=F ϕ]

Proof. Using Lemma 1 and Lemma 3, we can trivially prove [(E, ⇝) |=F ϕ] ⇐⇒ [(seg1.seg2. · · · .seg(l/g), ⇝) |=F ϕ].

3.4.2 Parallelized Monitoring

Many cloud services use clusters of computers equipped with multiple processors and computing cores. This allows them to deal with high data rates and implement high-performance parallel/distributed applications. Monitoring such applications should likewise be able to exploit this massive infrastructure. To this end, we now discuss parallelization of our SMT-based monitoring technique.

Let G be a sequence of g segments G = seg1 seg2 · · · segg. Our idea is to create a job queue for each available computing core, and then distribute the segments evenly across all the queues to be monitored by their respective cores independently. However, simply distributing the segments across cores is not enough for obtaining a correct result. For example, consider formula ϕ = a U b and two segments, seg1 and seg2, assigned to two cores, Cr1 and Cr2, respectively. In order for the monitor running on Cr2 to give the correct verdict, it must know the result of the monitor running on Cr1. In a scenario where Cr1 observes one or more ¬a in seg1, a violation must be reported even if Cr2 observes no b and no ¬a.

Figure 3.7: Reachability matrix for a U b.

Figure 3.8: Reachability tree for a U b.

Generally speaking, the temporal order of events makes independent evaluation of segments impossible for LTL formulas. Of course, some formulas, such as safety (e.g., □p) and co-safety (e.g., ◇q) properties, are exceptions. For our automata-based approach, we address this problem in two steps. Let Mϕ = (Σ, Q, q0, δ, λ) be an LTL3 monitor.
Our first step is to create a three-dimensional reachability matrix RM by solving the following SMT decision problem: given a current monitor state qj ∈ Q and segment segi, can this segment reach monitor state qk ∈ Q, for all i ∈ [1, g] and j, k ∈ [0, |Q| − 1]? If the answer to the problem is affirmative, then we mark RM[i][j][k] with true, otherwise with false. This is illustrated in Fig. 3.7 for the monitor shown in Fig. 2.1, where the grey cells are filled with the answer to the SMT problem. This step can be made embarrassingly parallel, where each element of RM can be computed independently by a different computing core. One can optimize the construction of RM by omitting redundant SMT executions. For example, if RM[i][j][⊤] = true, then RM[i′][⊤][⊤] = true for all subsequent segments i′ > i. Likewise, if RM[i][j][⊥] = true, then RM[i′][⊥][⊥] = true for all subsequent segments i′ > i.

The second step is to generate a verdict reachability tree from RM. The goal of the tree is to check whether a monitor state qm ∈ Q can be reached from the initial monitor state q0. This is achieved by setting q0 as the root and generating all possible paths from q0 using RM. That is, if RM[i][k][j] = true, then we create a tree node with label qj and add it as a child of the node with label qk. Once the tree is generated, only if qm is one of the leaves can we say that qm is reachable from q0. In general, all leaves of the tree are possible monitoring verdicts. Note that the creation of the tree is achieved using a sequential algorithm. For example, Fig. 3.8 shows the verdict reachability tree generated from the matrix in Fig. 3.7.

For our progression-based approach, we adhere to a similar technique for parallelized monitoring as in our automata-based approach. The key difference is that the progression-based approach works with sub-formulas, whereas the automata-based approach works with monitor states.
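The two steps can be sketched as follows (ours; the deterministic single-run `step` stands in for the SMT decision over all interleavings of a segment):

```python
from itertools import product

def reach_matrix(segments, states, step):
    """RM[(i, j, k)] = True iff segment i can drive the monitor from state j
    to state k. step(state, segment) returns the set of reachable states.
    Each entry is an independent decision problem, so this loop is
    embarrassingly parallel."""
    return {(i, j, k): k in step(j, seg)
            for i, seg in enumerate(segments)
            for j, k in product(states, states)}

def verdict_leaves(rm, num_segments, states, q0):
    """Level-by-level construction of the verdict reachability tree rooted
    at q0; the frontier after the last segment is the set of leaves, i.e.,
    the possible monitoring verdicts."""
    frontier = {q0}
    for i in range(num_segments):
        frontier = {k for j in frontier for k in states if rm[(i, j, k)]}
    return frontier
```

For a monitor of a U b run over segments "aa" then "b", the only leaf is the satisfaction state, matching sequential monitoring of the whole trace.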
As an example, the previous formula ϕ = a U b will be broken into two sub-formulas, ϕ1 = a and ϕ2 = b, before creating the reachability matrix and then generating the verdict for both of these sub-formulas.

Lemma 5. A distributed computation (E, ⇝) of length l satisfies an LTL formula ϕ if and only if the parallelized monitoring technique satisfies ϕ. That is,

⊤ ∈ [(E, ⇝) |=3 ϕ] ⇐⇒ λ(q) = ⊤  and  ⊥ ∈ [(E, ⇝) |=3 ϕ] ⇐⇒ λ(q) = ⊥

where q ∈ Q is some leaf node in the verdict reachability tree generated from RM during the parallelized monitoring process and λ is the labeling function in Mϕ.

Base case: Let us first consider the case where there is only one segment, that is, l = g. (⇒) If ⊤ ∈ [(E, ⇝) |=3 ϕ] (resp., ⊥ ∈ [(E, ⇝) |=3 ϕ]), then according to the construction of the corresponding verdict reachability tree made from RM, the root node q0 must have a child q⊤ (resp., q⊥), such that λ(q⊤) = ⊤ (resp., λ(q⊥) = ⊥). This child is also a leaf node, as the height of a verdict reachability tree is 2 when there is only one segment. (⇐) We can trivially show that if λ(q⊤) = ⊤ (resp., λ(q⊥) = ⊥), that is, if q⊤ (resp., q⊥) is reachable from q0, then ⊤ ∈ [(E, ⇝) |=3 ϕ] (resp., ⊥ ∈ [(E, ⇝) |=3 ϕ]).

Inductive hypothesis: Let us assume the claim has been established for l = g × k. Now we consider l = g × (k + 1). (⇒) If ⊤ ∈ [(E, ⇝) |=3 ϕ] (resp., ⊥ ∈ [(E, ⇝) |=3 ϕ]), then according to our assumption, there must be at least one node at height k + 1 (the height of the leaf nodes when there are k segments), such that λ(q⊤) = ⊤ (resp., λ(q⊥) = ⊥). Now for k + 1 segments, according to the construction of the corresponding verdict reachability tree made from RM, the node q⊤ (resp., q⊥) can only have the child q⊤ (resp., q⊥). Therefore, there must be at least one node at height k + 2 (the height of the leaf nodes when there are k + 1 segments), such that λ(q⊤) = ⊤ (resp., λ(q⊥) = ⊥).
(⇐) We can trivially show that if λ(q⊤) = ⊤ (resp., λ(q⊥) = ⊥), that is, if q⊤ (resp., q⊥) is reachable from q0, then ⊤ ∈ [(E, ⇝) |=3 ϕ] (resp., ⊥ ∈ [(E, ⇝) |=3 ϕ]).

3.5 Case Studies and Evaluation

In this section, we focus on analyzing our SMT-based solution without digressing into other dimensions such as instrumentation, data collection, data transfer, monitoring, etc., since, given the distributed setting, SMT-solving runtime will be the dominant factor over any other kind of overhead. We evaluate our proposed technique using synthetic experiments, Cassandra (a distributed database), and the RACE dataset from NASA [MGS19].

3.5.1 Implementation and Experimental Setup

Each experiment can be divided into three phases: (1) data generation, (2) data collection, and (3) data verification. For data generation, we develop a synthetic program that randomly generates a distributed computation (i.e., the behavior of a set of programs in terms of their local computations and inter-process communication). Generating synthetic experimental data offers benefits that enable us to compare different parameters and their effects on the approach. For example, generating data for different values of ε is beneficial for studying its effect on the runtime and the number of false warning verdicts of our approach. When developing the synthetic distributed system as part of our experiment, we ensure a partially synchronous setting by including an HLC implementation. We use a uniform distribution over (0, 2) to define the type of each event (local computation, send message, and receive message) and a coin-flip distribution for computing the atomic propositions that are true at each local computation event. Although the events in our synthetic experiments in Section 3.5.2 are uniformly distributed over the length of the trace, the event distribution in the Cassandra experiments in Section 3.5.3 is affected by network latency and other external factors.
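The data-generation phase can be sketched as follows (a simplified stand-in for the synthetic generator, with our own names; HLC timestamping and message pairing are omitted):

```python
import random

def generate_computation(num_procs, duration, rate, props=("p", "q")):
    """Randomly generates a distributed computation: each process emits
    `rate` events per second for `duration` seconds; the event type is
    drawn uniformly from three kinds, and the truth of each atomic
    proposition at an event is decided by a coin flip."""
    kinds = ("local", "send", "receive")
    events = []
    for n in range(num_procs):
        for k in range(duration * rate):
            events.append((n, k / rate, random.choice(kinds),
                           {p: random.random() < 0.5 for p in props}))
    return events
```

Spacing events at 1/rate seconds per process gives the uniform event distribution mentioned above; real workloads (e.g., Cassandra) would perturb these timestamps.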
In addition, we assume that there is an external data collection program that keeps track of the data/states of the system under verification. It generates the trace logs that are used by the monitoring program to verify against the given LTL specifications listed in Figure 3.9b. For data verification, we consider the following parameters: (1) number of processes (|P|), (2) computation duration (l sec), (3) segment length (g), (4) event rate (r events/process/sec), (5) maximum clock skew (ε), (6) depth of the automaton (d), and (7) number of nested temporal operators (|φ|) for the LTL formula under monitoring. The main metric is the runtime of SMT solving for each configuration of the parameters. Note that the time axis is shown in log-scale in all the plots presented in this section. We analyze the effect of each parameter while holding the values of all the other parameters at relevant constant values. In all the graphs, we compare the runtime of our automata-based approach against the progression-based approach.

We use a MacBook Pro with an Intel i7-7567U (3.5GHz) processor, 16GB RAM, a 512GB SSD, and the g++ Apple clang version 12.0.5 (clang-1205.0.22.9) interface to the Z3 SMT solver [dMB08] to generate the traces. To evaluate our parallel algorithm, we use a server with 2x Intel Xeon Platinum 8180 (2.5GHz) processors, 768GB RAM, 112 vcores, and the g++ (GCC) 9.3.1 interface to the Z3 SMT solver [dMB08]. Unless specified otherwise, the system under consideration has |P| = 2, l = 2 sec, g = 250ms, r = 10 events/process/sec, ε = 250ms, and d = 3.

3.5.2 Analysis of Results – Synthetic Experiments

In this set of experiments, we vary all the available parameters and note how each affects SMT solving. We test each parameter individually to study its effect on runtime.
As our generated synthetic data does not depend on any external factors, we induce a delay not only to limit the number of events happening at every time unit, but also to ensure a uniform distribution of events over the execution of each process. We use a uniform distribution over (0, |Σ|) to assign a value to each local computation event in each process. We only use one CPU core for the following experimental results.

Overall, we notice an improvement of around 35% when the progression-based technique is compared to the automata-based approach. This improvement in performance owes to two main reasons: (1) compared to the automata-based approach, the LTL constraints in our progression-based approach are less demanding in terms of computational complexity, since each sub-formula consists of mostly one atomic proposition, as opposed to multiple atomic propositions in each path of the automaton, which in turn speeds up the overall verification process; and (2) the total number of SMT instances needed is smaller, due to the smaller number of sub-formulas compared to automaton paths for the same specification. We now analyze the results in detail.

Impact of predicate structure. In this experiment (Figure 3.9a), we consider different predicate distributions over AP for the formula ϕ1, i.e., how many processes are involved with a particular predicate. We consider different predicate structures: O(1), O(n), O(n²), and O(n³), which signify the order of the number of SMT encodings that need to be generated for the given distribution of predicates. As can be seen, the progression-based technique outperforms the automata-based technique overall by 35% on average. Having said that, during our experiments, when comparing the runtime of our monitoring approach for an increasing number of sub-formulas, we observe a slight decrease in the overall runtime efficiency of the progression-based approach compared to the automata-based approach.
Since the progression-based approach is based on evaluating each sub-formula, there exist LTL formulas where the number of sub-formulas is larger than the number of paths in the corresponding automaton, and thus the progression-based approach might not be as efficient as the automata-based approach in such a scenario. For example, consider the formula ϕ = ◇a ∨ ◇b ∨ ◇c, whose automaton has two states, which makes the number of paths 2. However, the progression involves 3 sub-formulas, which makes the progression-based approach less efficient than its automata counterpart. We would like to point out that the formula can be rewritten as ◇(a ∨ b ∨ c), which makes both approaches yield similar results. Thus, we hypothesize that for all LTL formulas, the progression-based approach will be at least as efficient as the automata-based approach.

Impact of LTL formula. Given an LTL formula, the depth of nested temporal operators plays an important role, as suggested by Fig. 3.9b. We experiment with the following LTL formulas, and the progression-based technique achieved an average improvement of 32.8% compared to the automata-based one.
ϕ1 = □p (d = 2, |φ| = 1)
ϕ2 = □(q → ◇p) (d = 3, |φ| = 2)
ϕ3 = □((q ∧ ◇r) → (¬p U r)) (d = 4, |φ| = 3)
ϕ4 = □((q ∧ ◇r) → (¬p U (r ∨ (s ∧ ¬p ∧ (¬p U t))))) (d = 5, |φ| = 8)
ϕ5 = □(r → ((s ∧ (¬r U t)) → (¬r U (t ∧ p)))) (d = 6, |φ| = 8)
ϕ6 = □((q ∧ ◇r) → ((s ∧ (¬r U t) → (¬r U (t ∧ p))) U r)) (d = 7, |φ| = 9)

Figure 3.9: Synthetic experiments – impact of different parameters: (a) predicate structure, (b) LTL formula, (c) epsilon, (d) event rate, (e) segment length, (f) computation duration.

Impact of partial synchrony.
Figure 3.9c shows an expected result where increasing the clock skew ε results in greater runtime, as the number of possibly concurrent events across processes increases exponentially. When compared with the automata-based approach, the progression-based technique yields an improvement of 33.36%.

Impact of event rate. Figure 3.9d shows that our approach breaks even with the computation duration for |P| = 3 at an event rate of 5 events/process/sec. However, increasing the event rate increases the search space for the SMT solver. Overall, we improve by 34.4% by using the progression-based technique compared to the automata-based technique.

Impact of segment length. Increasing the segment length increases the number of events to be worked with, and therefore exponentially increases the runtime of our approach. In Fig. 3.9e, we do not see much improvement for |P| = 1, 2, since the number of events is not large enough to make an impact. However, we see better performance with low segment lengths for higher numbers of processes. Note that the runtime also increases for very small segment lengths, since the time taken to generate a higher number of SMT encodings outweighs the performance gain from smaller segments. Here too, we notice an improvement of 32.6% for the progression-based technique over the automata-based technique.

Impact of computation duration. In this experiment (Fig. 3.9f), we increase the computation duration and measure its effect on runtime. With increasing computation duration, the number of segments needed to verify the longer computation increases, thereby resulting in a linear increase of the runtime. The progression-based approach improves the runtime by 33.1% when compared to the automata-based approach.

Impact of parallelization. Distributing the verification among multiple cores improves the performance of the approach by a considerable amount. As seen in Figure 3.10a, increasing the number of cores from 1 to 10 improves the performance by a huge margin.
However, increasing it further shows little improvement, as the time taken to generate the SMT encodings starts to dominate the time taken to solve them. An improvement of 33.8% is achieved by the progression-based approach when compared to the automata-based approach.

Figure 3.10: Impact of parallelization on different data: (a) synthetic data, (b) SBS data, (c) Google data.

Figure 3.11: False warnings for synthetic data.

Impact of ε on false warnings. As discussed in Section 2.4, since the monitor does not have access to a global clock, it can report events as concurrent when, in reality, one happened before the other in the system under observation. However, during this experiment, we keep track of the global clock values separately, which gives us full knowledge of the total ordering of all events, thus allowing us to study and report the real verdicts alongside the reported verdicts. We observe that the monitor sometimes reports false warnings, that is, it reports both verdicts (satisfaction and violation) when, in reality, only one has occurred. Note that the monitor never fails to report real verdicts. However, it may report false warnings alongside real verdicts on some occasions.
Although this does not change the correctness of the approach, it may still include false warnings as part of the set of evaluated results. In Figure 3.11, we observe that the number of false warnings increases with the maximum clock skew ε. This increase is attributed to the fact that as the value of ε increases, so does the number of events considered concurrent by the monitor. Additionally, we observe that the number of false warnings is greatly influenced by the predicate structure of the LTL formula, as evident from Figure 3.11. For O(n) conjunctive satisfaction formulas and O(n) disjunctive violation formulas, false warnings might occur if any one of the n sub-formulas is violated or satisfied, respectively; therefore, we see a higher number of false warnings. Similarly, for O(n) disjunctive satisfaction formulas and O(n) conjunctive violation formulas, false warnings might occur only if all of the n sub-formulas are violated or satisfied, respectively; therefore, we see a lower number of false warnings.

3.5.3 Case Study 1: Cassandra

Cassandra [LM10] is a No-SQL distributed database management system. We simulate a distributed database with two data centers: one cluster consisting of 4 nodes and the other cluster consisting of 3 nodes, with one node from each cluster serving as the seed node. All data is replicated among every node in both clusters. Each node runs on the Red Hat OpenStack Platform using 4 VCPUs, 4GB RAM, Ubuntu 18.04, Cassandra 3.11.6, and Java 1.8.0_252. We have also simulated a system of multiple processes where each process is responsible for the basic database operations (read, write, and update). These processes are also capable of inter-process communication, which allows them to inform the other processes in case of a write of a new entry to the database.
To make our simulated database realistic, we compared the latency of our system to those offered by Google Cloud, Microsoft Azure, and Amazon Web Services. Their fastest response was clocked at 41ms, compared to 100ms from our system. The reason behind such a high latency compared to the industry standard owes to slow bandwidth and infrastructure differences.

Figure 3.12: Cassandra experiments: (a) segment length, (b) computation duration.

We consider a latency of 100ms for all our experiments and fix the maximum clock skew ε at 250ms. We design the processes such that each process is capable of reading, writing, or updating the entries in the database. We use a uniform distribution over (0, 2) to select the type of operation that is to be performed by the process. Once there is any addition from a write operation, the change is communicated to the other processes using inter-process communication. We consider no loss of messages in transmission, and all messages are read by the receiving process immediately once they are received. In a database, the consistency level specifies the minimum number of replications that need to be performed for an operation in order to consider the operation successfully executed.
According to the recommendations from Cassandra, the sum of the read and write consistency levels should be more than the replication factor, so as to remove any chance of a read or write anomaly in the database. We aim to monitor and identify read/write anomalies in the database using runtime monitoring techniques. The corresponding LTL specification becomes:

ϕrw = ⋀_{i=0}^{n} □(write(i) → ◇read(i))

where n is the number of read/write operations.

One of the challenges of using a distributed database such as Cassandra is the lack of (database) normalization capabilities. Therefore, we aim to monitor a write reference check and a delete reference check. We introduce two tables:

Student(id, name)
Enrollment(id, course)

We enforce the write and delete reference checks on the tables above. A write in the Enrollment table should always be preceded by a write in the Student table with the same id. Similarly, a delete from the Student table should always be preceded by a delete from the Enrollment table with the same id. These enforce the absence of insertion and deletion anomalies, and therefore lead to the following LTL specifications:

ϕwrc = ¬(¬write(Student.id) U write(Enrollment.id))
ϕdrc = ¬(¬delete(Enrollment.id) U delete(Student.id))

Extreme load scenario. Figures 3.12b and 3.12a plot runtime vs. computation duration and runtime vs. segmentation frequency, respectively, under the full read/write load allowed by our network. When compared with the results of the synthetic experiments, these results are slightly noisier. This owes to the fact that in the synthetic experiments the events were evenly spread over the entire computation duration, whereas here they are not uniform. Database operations involving network communication (read, write, and update) take an average of 100ms; however, sending and receiving messages are inter-process communications and take about 10-15ms, making the overall event distribution non-uniform.
When compared with the automata-based approach, we do not see much improvement when monitoring ϕwrc or ϕdrc using the progression-based approach. However, when monitoring ϕrw, we observe an average improvement of 55.53%.

Moderate load scenario. In Figure 3.12b, we were able to break even for a number of processes as low as 2. To look for a real-life example with moderate database operations, we consider the Google Sheets API, which allows a maximum of 500 requests per 100 seconds per project and 100 requests per 100 seconds per user, i.e., on average 5 events/sec per project, while a user can only generate 1 event/sec. To evaluate how our approach performs in such a scenario, we increase the number of processes and the number of cores available to monitor such a system, and study the time taken to verify the generated trace. We plot our findings in Fig. 3.10c and notice that we break even for an event rate of 3 events/sec/user with the progression-based approach. This is a significant improvement over the automata-based approach, where we could only break even for an event rate of 2 events/sec/user. Our algorithm performs well when the number of processes is 7, 8, or 9, which is much more than what is permitted by Google. This makes us confident that our approach can pave the way for implementation in real-life settings.

3.5.4 Case Study 2: RACE

Runtime for Airspace Concept Evaluation (RACE) [MGS19] is a framework developed by NASA that is used to build event-based, reactive airspace simulations. We use a dataset developed using the RACE framework (https://github.com/NASARace/race-data). This dataset contains three sets of data collected on three different days. Each set was recorded at around 37°N latitude and 121°W longitude.
The dataset includes all 8 types of messages sent by the SBS unit, obtained by using a Telnet application to listen on port 30003, but we only use the messages with ID MSG-3, the Airborne Position Message, which includes a flight's latitude, longitude, and altitude, using which we verify the mutual separation of all pairs of aircraft.

On analyzing the dataset, we observe that the difference between the time a message was generated and the time it was logged is usually less than a second; thus, we consider ε = 1s over the time the message was generated. Furthermore, calculating the distance between two coordinates precisely is computationally expensive, as we need to factor in parameters such as the curvature of the earth. In order to speed up distance-related calculations, we consider a constant of 111.2km per degree of latitude and 87.62km per degree of longitude, at the cost of a negligible error margin. We multiply these constants by the differences in latitude and longitude, and factor in the altitude, to get the distance between two aircraft. We verify mutual separation by considering the minimum separation between every pair of aircraft to be 500m.

From the dataset, we observe that each aircraft generates a message at intervals of at least 1 sec. There are 3 separate datasets: sbs-1 consists of 293 aircraft and 168,283 messages spread over 3 hours, 28 minutes, and 58 seconds; sbs-2 consists of 110 aircraft and 64,218 messages spread over 1 hour, 1 minute, and 46 seconds; sbs-3 consists of 97 aircraft and 64,162 messages spread over 49 minutes and 42 seconds. In Fig. 3.10b, we compare our achieved runtime across the three datasets available from RACE (labelled sbs-1, sbs-2, and sbs-3). We monitor the data in real time, with 10s-long segments and an ε of 1s. We test our approach using the parallelization technique introduced in Section 3.4.2 by increasing the number of processor cores used, up to all available cores. Our results break even at 4 cores.
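The approximate distance computation described above can be sketched as follows (a flat-earth approximation using the stated constants; function names are ours):

```python
KM_PER_DEG_LAT = 111.2    # km per degree of latitude (constant from the text)
KM_PER_DEG_LON = 87.62    # km per degree of longitude near 37 N

def separation_km(a, b):
    """a, b: (latitude_deg, longitude_deg, altitude_m).
    Cheap flat-earth approximation of the distance between two aircraft."""
    dx = (a[0] - b[0]) * KM_PER_DEG_LAT
    dy = (a[1] - b[1]) * KM_PER_DEG_LON
    dz = (a[2] - b[2]) / 1000.0          # altitude difference in km
    return (dx * dx + dy * dy + dz * dz) ** 0.5

def mutually_separated(aircraft, min_km=0.5):
    """The monitored predicate: every pair of aircraft at least 500 m apart."""
    return all(separation_km(p, q) >= min_km
               for i, p in enumerate(aircraft)
               for q in aircraft[i + 1:])
```

The predicate is evaluated over every pair of aircraft at each MSG-3 position update, so keeping the per-pair distance computation cheap matters for real-time monitoring.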
This makes our approach desirable for aircraft monitoring and similar systems such as IoT.

3.6 Summary and Limitation

In this chapter, we propose two monitoring techniques that take an LTL formula and a distributed computation as input. We apply an automata-based and a progression-based formula-rewriting monitoring algorithm, implemented as an SMT decision problem, in order to verify the correctness of the distributed system with respect to the formula. We also conduct extensive synthetic experiments along with monitoring traces generated by Cassandra and the RACE dataset by NASA. The monitoring approach takes an LTL formula as specification, which is not very expressive in the sense that it fails to express specifications for systems with time-bounded execution. Additionally, as discussed in Section 3.5, the approach does not scale well when considering larger distributed systems. Currently, the monitoring runtime increases exponentially with an increase in the number of processes or events being monitored. This is a big limiting factor when designing a verification approach that can work in real time.

Chapter 4

Runtime Verification for Time-bounded Temporal Specifications

4.1 Introduction

In this chapter, we advocate for a runtime verification (RV) approach to monitor the behavior of a system of blockchains with respect to a set of temporal logic formulas. Applying RV to deal with multiple blockchains can be reduced to distributed RV, where a centralized or decentralized monitor observes the behavior of a distributed system in which processes do not share a global clock. Although RV deals with finite executions, the lack of a common global clock prohibits a total unique ordering of events in a distributed setting. Put another way, the monitor can only form a partial order of events, which may result in different verification verdicts.
(Published) Ritam Ganguly, Yingjie Xue, Aaron Jonckheere, Parker Ljung, Benjamin Schornstein, Borzoo Bonakdarpour, and Maurice Herlihy, "Distributed Runtime Verification of Metric Temporal Properties for Cross-Chain Protocols," IEEE 42nd International Conference on Distributed Computing Systems (ICDCS 2022).

(Under review) Ritam Ganguly, Yingjie Xue, Aaron Jonckheere, Parker Ljung, Benjamin Schornstein, Borzoo Bonakdarpour, and Maurice Herlihy, "Distributed Runtime Verification of Metric Temporal Properties," Elsevier Journal of Parallel and Distributed Computing.

[Figure 4.1: Hedged Two-party Swap — message sequence between Alice and Bob over the Apricot and Banana blockchains: Premium(pa + pb), Premium(pb), Escrow(h, tA), Escrow(h, tB), Redeem(alice), Redeem(bob).]

Enumerating all possible partial orderings of events at run time incurs an exponential blow-up, making the approach not scalable. To add to this already complex task, most specifications for verifying blockchain smart contracts come with a time bound. This means that not only is the partial ordering of the events at play when verifying, but the actual physical time of occurrence of the events also dictates the verification verdict. In this chapter, we propose an effective, sound, and complete solution to distributed RV for timed specifications expressed in metric temporal logic (MTL) [Koy90]. To present a high-level view of MTL, consider the two-party swap protocol [XH21] shown in Fig. 4.1. Alice and Bob, in possession of Apricot and Banana blockchain assets respectively, want to swap their assets without being a victim of a sore loser attack [XH21] (a sore loser attack is a type of attack in cross-blockchain commerce; it occurs when one party decides to halt participation partway through, leaving other parties' assets locked up for a long duration). There are a number of requirements that should be followed by the conforming parties to discourage any attack on themselves.
We use metric temporal logic (MTL) [Koy90] to express such requirements. One such requirement, that Bob should not be able to redeem his asset before Alice redeems hers within eight time units, can be represented by the MTL formula:

ϕspec = ¬Apr.Redeem(bob) U_[0,8) Ban.Redeem(alice).

We consider a fault-proof central monitor which has the complete view of the system but has no access to a global clock.

[Figure 4.2: Progression Example — events on the two blockchains Apricot (Apr): SetUp, Deposit(pb), Escrow(h, tA), Redeem(bob) at times 1, 3, 5, 7, and Banana (Ban): SetUp, Deposit(pa + pb), Escrow(h, tB), Redeem(alice) at times 1, 4, 6, 7, divided into segments seg1 and seg2.]

In order to limit the blow-up of states posed by the absence of a global clock, we make a practical assumption about the presence of a bounded clock skew ε between the local clocks of every pair of processes, guaranteed by a clock synchronization algorithm (e.g., NTP [Mil10]). This setting, where we do not assume the presence of a global clock and limit the impact of asynchrony to within the clock drift, is known as partial synchrony. Such an assumption limits the window of partial orders of events to within ε time units and significantly reduces the combinatorial blow-up caused by nondeterminism due to concurrency. Existing distributed RV techniques either assume a global clock when working with time-sensitive specifications [BKMZ15, WOH19] or use untimed specifications when assuming partial synchrony [GMB21, MBAB21]. As is often observed, the real clock skew between two processes is less than the maximum clock skew that is allowed by the system. As a part of the monitoring scheme, we want to take that into consideration when monitoring the distributed computation. We study the observed clock skew between every pair of processes and estimate the cumulative distribution function (cdf) that the clock skew follows.
Based on our estimated cdf, we quantify the time of occurrence of each event in the distributed system and are able to calculate the probabilistic guarantee for the verdict of the monitor. We introduce an SMT-based, progression-based formula-rewriting technique over distributed computations which takes into consideration the events observed thus far to rewrite the specification for future extensions. Our monitoring algorithm accounts for all possible orderings of events without explicitly generating them when evaluating MTL formulas. For example, in Fig. 4.2, we see the events and their times of occurrence in the two blockchains, Apricot (Apr) and Banana (Ban), divided into two segments, seg1 and seg2, for computational purposes. Considering a maximum clock skew ε = 2 and a clock skew cdf that returns 0.25, 0.75, and 1 for observed clock skews −1, 0, and 1, respectively, for the specification ϕspec, at the end of the first segment we have three possible rewritten formulas for the next segment, along with the statistical guarantee of each of them:

ϕspec1 = ¬Apr.Redeem(bob) U_[0,5) Ban.Redeem(alice); pr = 0.1875
ϕspec2 = ¬Apr.Redeem(bob) U_[0,4) Ban.Redeem(alice); pr = 0.5625
ϕspec3 = ¬Apr.Redeem(bob) U_[0,3) Ban.Redeem(alice); pr = 0.15

This is possible due to the different orderings and different times of occurrence of the events Deposit(pb) and Deposit(pa + pb). In other words, the possible time of occurrence of the event Deposit(pb) (resp. Deposit(pa + pb)) is either 2, 3, or 4 (resp. 3, 4, or 5) due to the maximum clock skew of 2. The probabilistic guarantee is calculated from the possible times of occurrence and the probability of each of them. To calculate the statistical guarantee of a verdict, we examine how it was reached. Here, ϕspec1 can be reached when the time of occurrence of Deposit(pb) (resp. Deposit(pa + pb)) is either 2 or 3 (resp. 3), making the probability 0.25 × 0.25 + 0.5 × 0.25 = 0.1875.
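The arithmetic behind pr = 0.1875 can be reproduced directly. The sketch below (variable names are our own) derives the skew probability mass function from the cdf values given above and combines the cases that yield ϕspec1:

```python
# cdf of the observed clock skew, as given in the example above
cdf = {-1: 0.25, 0: 0.75, 1: 1.0}

# derive the probability mass of each skew value from the cdf
pmf, prev = {}, 0.0
for skew in sorted(cdf):
    pmf[skew] = cdf[skew] - prev
    prev = cdf[skew]
# pmf == {-1: 0.25, 0: 0.5, 1: 0.25}

# phi_spec1 requires Deposit(pb) (local time 3) to occur at global
# time 2 or 3, and Deposit(pa + pb) (local time 4) at global time 3.
p_spec1 = (pmf[-1] + pmf[0]) * pmf[-1]   # 0.75 * 0.25 = 0.1875
```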
Similarly, we calculate the statistical guarantee of the other verdicts. Likewise, at the end of seg2, ϕspec1 and ϕspec2 evaluate to true, whereas ϕspec3 evaluates to false. This is because, even if we consider the scenario where Ban.Redeem(alice) occurs before Apr.Redeem(bob), a possible time of occurrence of Ban.Redeem(alice) is 8 (resp. 6), which makes ϕspec3 (resp. ϕspec1 and ϕspec2) evaluate to false (resp. true). The statistical guarantees of the verdicts true and false are 0.6875 and 0.6875, respectively. An interesting note here is that the sum of the guarantees for true and false is not 1. This is because of the case where both verdicts are equally likely: in Fig. 4.2, when the time of occurrence of both Ban.Redeem(alice) and Apr.Redeem(bob) is 6, either order of occurrence is possible, thereby making both verdicts equally likely.

We have fully implemented our technique (https://github.com/TART-MSU/rv-mtl-blockc) and report the results of rigorous experiments on monitoring synthetic data, using benchmarks in the tool UPPAAL [LPY97], as well as monitoring correctness, liveness, and conformance conditions for smart contracts on blockchains. We put our monitoring algorithm to the test, studying the effect of different parameters on the runtime, and report on each of them.

4.1.1 Estimating the Offset Distribution

We run a number of diagnostic tests on every pair of processes in the distributed computation. Since the distributed system considered allows message passing, the tests include sending and receiving messages. A client process sends a dummy message to a server process and, once the server process receives the message, it replies to the client.
Using the timestamps of the messages, we calculate the offset

Θ = ((t1 − t0) + (t2 − t3)) / 2

and the round-trip delay

δ = (t3 − t0) − (t2 − t1)

where t0 is the client's timestamp of the request packet transmission, t1 is the server's timestamp of the request packet reception, t2 is the server's timestamp of the response packet transmission, and t3 is the client's timestamp of the response packet reception. We derive the expression for the offset from the relations for the request packet (resp. response packet):

t0 + Θ + δ/2 = t1
t3 + Θ − δ/2 = t2

Solving for Θ yields the time offset. This procedure is repeated n times for each pair of processes, and a vector of offsets is collected that defines the system, (x1, x2, ..., xn). This vector of independent, identically distributed, bounded random numbers constitutes our sample. We assume that it follows a common cumulative distribution function F(x). The empirical distribution function is then defined by

F̂(x) = (number of elements in the sample ≤ x) / n = (1/n) Σ_{i=1}^{n} 1_{xi ≤ x}

where 1_a is the indicator of event a. This makes F̂(x) an unbiased estimator of F(x). Since the data in our setting is bounded by (−ε, ε), we have F̂(−ε) = 0 and F̂(ε) = 1. We break the entire range into h steps, where each step is of length 2ε/h. Thus, the estimated probability of a time offset t is given by p(t) = F̂(t) − F̂(t − 2ε/h). For example, for a vector of observed offsets (−4, −3, −3, −2, −2, −1, −1, −1, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4) and ε = 5, the estimated distribution function can be graphically represented by Figure 4.3, where h = 5. Thus, we calculate the estimated probability of a time offset as p(t = 2) = F̂(2) − F̂(0) = 0.9 − 0.65 = 0.25.

[Figure 4.3: Example of an empirical cumulative distribution function F̂(x) over x ∈ [−4, 4].]
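The offset/delay computation and the empirical distribution function can be sketched together; the sample below is the worked example from the text, while the function names are our own:

```python
def offset_and_delay(t0, t1, t2, t3):
    """NTP-style offset and round-trip delay from the four timestamps."""
    theta = ((t1 - t0) + (t2 - t3)) / 2.0
    delta = (t3 - t0) - (t2 - t1)
    return theta, delta

def ecdf(sample):
    """Empirical distribution function F_hat of a bounded sample."""
    n = len(sample)
    return lambda x: sum(1 for xi in sample if xi <= x) / n

# worked example from the text: 20 observed offsets, eps = 5, h = 5
offsets = [-4, -3, -3, -2, -2, -1, -1, -1, 0, 0, 0, 0, 0,
           1, 1, 1, 2, 2, 3, 4]
F = ecdf(offsets)
eps, h = 5, 5
step = 2 * eps / h            # each of the h bins has width 2*eps/h = 2
p = F(2) - F(2 - step)        # mass of the bin ending at t = 2
# F(2) = 0.9, F(0) = 0.65, so p = 0.25
```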
For a better understanding of how close our estimated distribution function F̂(t) is to the real distribution F(t), we run a hypothesis test for the mean and standard deviation with a p-value of ≤ 0.05. It is also to be noted that any other non-parametric density estimation method can be used, e.g., a kernel density estimator, a spectral density estimator, etc.

[Figure 4.4: Different time interleavings of events — process P1 observes a at time 1 and ¬a at time 4; process P2 observes a at time 2 and b at time 5.]

4.1.2 Formal Problem Statement

In a partially synchronous system, there are different possible orderings of events, and each unique ordering of events [BF12] might evaluate to a different RV verdict. Let (E, ⇝) be a distributed computation. A sequence of consistent cuts is of the form C0 C1 C2 ···, where for all i ≥ 0 we have (1) Ci ⊂ Ci+1, (2) |Ci| + 1 = |Ci+1|, and (3) C0 = ∅. The set of all sequences of consistent cuts is denoted by C. We note that in our view, the time interval I in the syntax of MTL represents the physical (global) time G. Thus, when deriving all the possible traces given the distributed computation (E, ⇝), we account for all different orders in which the events could possibly occur with respect to G. This involves replacing the local time of occurrence of an event e^i_σ with the set of events {e^i_σ' | σ' ∈ [max{0, σ − ε + 1}, σ + ε)}. This accounts for the maximum clock drift that is possible on the local clock of a process when compared to the global clock. For example, given the computation in Fig. 4.4, a maximum clock skew ε = 2, and an MTL formula ϕ = a U_[0,6) b, one has to consider all possible traces, including (a, 1)(a, 2)(b, 4)(¬a, 5) |= ϕ and (a, 1)(a, 2)(¬a, 4)(b, 5) ⊭ ϕ. In any typical system, the observed clock skew between any pair of processes is much less than the maximum clock skew that is allowed. To get a better understanding of the clock skew, we run some diagnostic tests (explained in Section 4.1.1) to help estimate the probability density function (pdf) defining the clock skew.
Here, let us assume that the pdf is represented by a function P(e^i_σ, π) such that, given an event e^i_σ and a global time π, the function gives us the probability that the event took place at global time π. Given a sequence of consistent cuts, it is evident that for all j > 0, |Cj − Cj−1| = 1, and the event in Cj − Cj−1 is the last event that was added onto the cut Cj. To translate monitoring of a distributed system into monitoring a trace with guarantees, we define a sequence of natural numbers π̄ = π0 π1 ···, where π0 = 0 and for each j ≥ 1 we have πj = σ, such that front(Cj) − front(Cj−1) = {e^i_σ}. To maintain time monotonicity, we only consider sequences where for all i ≥ 0, πi+1 ≥ πi. The set of all traces that can be formed from (E, ⇝) is defined as:

Tr(E, ⇝) = { front(C0) front(C1) ··· | C0 C1 ··· ∈ C }

In the sequel, we assume that every sequence α of frontiers in Tr(E, ⇝) is associated with a sequence π̄. Thus, to comply with the semantics of MTL, we refer to the elements of Tr(E, ⇝) by pairs (α, π̄). The statistical guarantee associated with a verdict is calculated as the product of the probabilities of each event occurring at the time considered when generating the trace:

Pr(α, π̄) = ∏_{∀j. Cj − Cj−1 = {e^i_σ}} P(e^i_σ, πj)

Thus, we evaluate an MTL formula ϕ with respect to a computation (E, ⇝) as follows:

[(E, ⇝) |=F ϕ] = { (α, π̄, 0) |=F ϕ | (α, π̄) ∈ Tr(E, ⇝) }

This boils down to having a set of verdicts and the corresponding probabilities of generating them, since a distributed computation may involve several traces and each trace might evaluate to a different verdict.

Overall idea of our solution. To solve the above problem (evaluating all possible verdicts), we propose a monitoring approach based on formula rewriting (Section 4.2) and SMT solving (Section 4.3).
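Before describing the solution, the brute-force alternative it avoids can be made concrete: the sketch below enumerates every timed interleaving of the small computation of Fig. 4.4 under ε = 2 and evaluates a U_[0,6) b on each. This exhaustive enumeration is for illustration only (the event list encodes Fig. 4.4; all names are ours) and is precisely what the SMT-based approach sidesteps:

```python
from itertools import permutations, product

EPS = 2
# Fig. 4.4: P1 observes a at time 1 and not-a at time 4;
#           P2 observes a at time 2 and b at time 5.
events = [("P1", {"a"}, 1), ("P1", set(), 4),
          ("P2", {"a"}, 2), ("P2", {"b"}, 5)]

def placements(sigma):
    """Possible global times of an event with local time sigma."""
    return range(max(0, sigma - EPS + 1), sigma + EPS)

def holds_until(trace, low, high):
    """Evaluate a U_[low,high) b on a timed word [(props, time), ...]."""
    t0 = trace[0][1]
    for props, t in trace:
        if low <= t - t0 < high and "b" in props:
            return True
        if "a" not in props:
            return False
    return False

verdicts = set()
for times in product(*(placements(s) for _, _, s in events)):
    if not (times[0] < times[1] and times[2] < times[3]):
        continue                     # per-process order of events
    for order in permutations(range(4)):
        ts = [times[i] for i in order]
        if any(ts[k] > ts[k + 1] for k in range(3)):
            continue                 # global time must be monotone
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue                 # interleaving respects process order
        verdicts.add(holds_until([(events[i][1], times[i]) for i in order],
                                 0, 6))
# both verdicts occur, matching the two example traces in the text
```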
Our approach involves iteratively (1) chopping a distributed computation into a sequence of smaller segments to reduce the problem size, (2) progressing the MTL formula of each segment to the next segment, which results in a new MTL formula, by invoking an SMT solver, and (3) calculating the sum of the probabilities of all traces that yield the same MTL formula. Since each computation/segment corresponds to a set of possible traces due to partial synchrony, each invocation of the SMT solver may result in a different verdict.

4.2 Formula Progression for MTL

We start describing our solution by explaining the formula progression technique.

Definition 8. A progression function is of the form Pr : Σ* × Z*≥0 × Φ_MTL → Φ_MTL and is defined for all finite traces (α, τ̄) ∈ (Σ*, Z*≥0), infinite traces (α', τ̄') ∈ (Σ^ω, Z^ω≥0), and MTL formulas ϕ ∈ Φ_MTL, such that (α.α', τ̄.τ̄') |= ϕ if and only if (α', τ̄') |= Pr(α, τ̄, ϕ).

Compared to the classic formula-rewriting technique in [HR01b], here the function Pr takes a finite trace as input, while the algorithm in [HR01b] rewrites the formula after every observed state. When monitoring a partially synchronous distributed system, multiple verdicts are possible because there is no unique ordering of events; as a result, the classical state-by-state formula-rewriting technique is of little use. The motivation for our approach comes from the fact that, for computational reasons, we chop the computation into smaller segments, and the verification of each segment is done through a single SMT query. A state-by-state approach would incur a huge number of SMT queries.

Let I = [start, end) denote an interval. By I − τ, we mean the interval I' = [start', end'), where start' = max{0, start − τ} and end' = max{0, end − τ}. Also, for two time instances τi and τ0, we let InInt(i) return true or false depending upon whether τi − τ0 ∈ I.

Progressing atomic propositions.
For an MTL formula of the form ϕ = p, where p ∈ AP, the result depends on whether or not p ∈ α(0). This serves as our base case for the other temporal and logical operators:

Pr(α, τ̄, ϕ) = true if p ∈ α(0); false if p ∉ α(0)

Progressing negation. For an MTL formula of the form ϕ = ¬φ, we have: Pr(α, τ̄, ϕ) = ¬Pr(α, τ̄, φ).

Progressing disjunction. Let ϕ = ϕ1 ∨ ϕ2. Apart from the trivial cases, the result of the progression of ϕ1 ∨ ϕ2 is based on the progression of ϕ1 and/or the progression of ϕ2:

Pr(α, τ̄, ϕ) =
  true          if Pr(α, τ̄, ϕ1) = true ∨ Pr(α, τ̄, ϕ2) = true
  false         if Pr(α, τ̄, ϕ1) = false ∧ Pr(α, τ̄, ϕ2) = false
  ϕ'2           if Pr(α, τ̄, ϕ1) = false ∧ Pr(α, τ̄, ϕ2) = ϕ'2
  ϕ'1           if Pr(α, τ̄, ϕ2) = false ∧ Pr(α, τ̄, ϕ1) = ϕ'1
  ϕ'1 ∨ ϕ'2     if Pr(α, τ̄, ϕ1) = ϕ'1 ∧ Pr(α, τ̄, ϕ2) = ϕ'2

Always and eventually operators. As shown in Algorithms 2 and 3, the progression for 'always' (□_I ϕ) and 'eventually' (◇_I ϕ) depends on the value of InInt(i) and the progression of the inner formula ϕ. In Algorithms 2 and 3, we divide the algorithm into three cases: (1) line 4 corresponds to the case where I is within the sequence τ̄; (2) line 6 corresponds to the case where I starts in the current trace but its end is beyond the boundary of the sequence τ̄; and (3) line 9 corresponds to the case where the entire interval I is beyond the boundary of the sequence τ̄. In Algorithm 2, we are only concerned about the progression of ϕ on the suffix (α^i, τ̄^i) if InInt(i) = true. In case InInt(i) = false, the consequent drops and the entire condition equates to true. In other words, evaluating over all i ∈ [0, |α|], we are only left with the conjunction of Pr(α^i, τ̄^i, ϕ) where InInt(i) = true. In addition to this, we add the initial formula with an updated interval for the next trace.
Similarly, in Algorithm 3, evaluating over all i ∈ [0, |α|], if InInt(i) = false the corresponding Pr(α^i, τ̄^i, ϕ) is disregarded, and the final formula is a disjunction of Pr(α^i, τ̄^i, ϕ) with InInt(i) = true.

Progressing the until operator. Let the formula be of the form ϕ1 U_I ϕ2. According to the semantics of until, ϕ1 should evaluate to true in all states leading up to some i ∈ I where ϕ2 evaluates to true. We start by progressing ϕ1 (resp. ϕ2) as □_[0,τi−τ0) ϕ1 (resp. ◇_[τi,τi+1) ϕ2) for some i ∈ I.

Algorithm 2: Always.
1: function Pr(α, τ̄, □_I ϕ)
2:   if I_start ≤ τ_|α| − τ_0 then
3:     if I_end ≤ τ_|α| − τ_0 then
4:       return ⋀_{i∈[0,|α|]} (InInt(i) → Pr(α^i, τ̄^i, ϕ))
5:     else
6:       return ⋀_{i∈[0,|α|]} (InInt(i) → Pr(α^i, τ̄^i, ϕ)) ∧ □_{I−(τ_|α|−τ_0)} ϕ
7:     end if
8:   else
9:     return □_{I−(τ_|α|−τ_0)} ϕ
10:  end if
11: end function

Algorithm 3: Eventually.
1: function Pr(α, τ̄, ◇_I ϕ)
2:   if I_start ≤ τ_|α| − τ_0 then
3:     if I_end ≤ τ_|α| − τ_0 then
4:       return ⋁_{i∈[0,|α|]} (InInt(i) ∧ Pr(α^i, τ̄^i, ϕ))
5:     else
6:       return ⋁_{i∈[0,|α|]} (InInt(i) ∧ Pr(α^i, τ̄^i, ϕ)) ∨ ◇_{I−(τ_|α|−τ_0)} ϕ
7:     end if
8:   else
9:     return ◇_{I−(τ_|α|−τ_0)} ϕ
10:  end if
11: end function

Algorithm 4: Until.
1: function Pr(α, τ̄, ϕ1 U_I ϕ2)
2:   if I_start ≤ τ_|α| − τ_0 then
3:     if I_end ≤ τ_|α| − τ_0 then
4:       return ⋀_{i∈[0,|α|]} ((τ_i < I_start + τ_0) → Pr(α^i, τ̄^i, ϕ1)) ∧ ⋁_{j∈[0,|α|]} (InInt(j) ∧ Pr(α, τ̄, □_{[0,τ_j−τ_0)} ϕ1) ∧ Pr(α^j, τ̄^j, ϕ2))
5:     else
6:       return ⋀_{i∈[0,|α|]} ((τ_i < I_start + τ_0) → Pr(α^i, τ̄^i, ϕ1)) ∧ (⋁_{j∈[0,|α|]} (InInt(j) ∧ Pr(α, τ̄, □_{[0,τ_j−τ_0)} ϕ1) ∧ Pr(α^j, τ̄^j, ϕ2)) ∨ ϕ1 U_{I−(τ_|α|−τ_0)} ϕ2)
7:     end if
8:   else
9:     return ⋀_{i∈[0,|α|]} Pr(α^i, τ̄^i, ϕ1) ∧ ϕ1 U_{I−(τ_|α|−τ_0)} ϕ2
10:  end if
11: end function

[Figure 4.5: A trace example divided into three segments — (α, τ̄): (∅, 1)(∅, 2)(∅, 3); (α', τ̄'): ({r}, 3)(∅, 4)(∅, 5); (α'', τ̄''): (∅, 6)({q}, 7)({p}, 7), with each segment indexed 0–2.]
Since we are only verifying the sub-formula ◇_[τj,τj+1) ϕ2 on the trace sequence (α, τ̄), it is equivalent to verifying the sub-formula ◇_[0,1) ϕ2 ≡ ϕ2 over the trace sequence (α^j, τ̄^j). Similar to Algorithms 2 and 3, in Algorithm 4 we need to consider three cases. In lines 4, 6, and 9, following the semantics of the until operator, we make sure that for all i ∈ [0, |α|], if τi < I_start + τ0, then ϕ1 is satisfied in the suffix (α^i, τ̄^i). In addition to this, there should be some j ∈ [0, |α|] for which, if InInt(j) = true, the trace satisfies the sub-formulas □_[0,τj−τ0) ϕ1 and ◇_[τj,τj+1) ϕ2. In lines 6 and 9, we also accommodate future traces satisfying the formula ϕ1 U_I ϕ2 with updated intervals.

Example. In Fig. 4.5, the time line shows propositions and their times of occurrence, for the formula ◇_[0,6) r → (¬p U_[2,9) q). The entire computation has been divided into 3 segments, (α, τ̄), (α', τ̄'), and (α'', τ̄''), and each state is represented by (s, τ):

• We start with segment (α, τ̄). First we evaluate ◇_[0,6) r, which requires evaluating Pr(α^i, τ̄^i, r) for i ∈ {0, 1, 2}, all of which return the verdict false, thereby rewriting the sub-formula as ◇_[0,4) r. Next, to evaluate the sub-formula ¬p U_[2,9) q, we need to evaluate (1) Pr(α^i, τ̄^i, ¬p) for i ∈ {0, 1}, since τi − τ0 < 2, both of which evaluate to true, (2) Pr(α, τ̄, □_[0,2) ¬p), which also evaluates to true, and (3) Pr(α^2, τ̄^2, q), which evaluates to false. Thereby, the rewritten formula after observing (α, τ̄) is ◇_[0,3) r → (¬p U_[0,6) q).

• Similarly, we evaluate the formula now with respect to (α', τ̄'), which makes the sub-formula ◇_[0,3) r evaluate to true at τ = 3, and the sub-formula ¬p U_[0,6) q (there is no such i ∈ {0, 1, 2} where τi − τ0 < 0, and for all j ∈ {0, 1, 2}, Pr(α'^j, τ̄'^j, q) = false) is rewritten as ¬p U_[0,4) q.

• In (α'', τ̄''), for j = 1, Pr(α'', τ̄'', □_[0,2) ¬p) = true and Pr(α''^j, τ̄''^j, q) = true, thereby rewriting the entire formula as true.
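The non-temporal progression rules above translate directly into code. This sketch uses our own formula representation (nested tuples), handles only the base cases (atomic propositions, negation, disjunction), and leaves the temporal operators to Algorithms 2–4; it is not the dissertation's implementation:

```python
# Formulas as nested tuples: ("ap", "p"), ("not", f), ("or", f1, f2).
TRUE, FALSE = ("true",), ("false",)

def progress(alpha, tau, phi):
    """Progress phi over a finite trace alpha (a list of sets of
    propositions); only the atomic base case inspects alpha(0)."""
    kind = phi[0]
    if kind == "ap":
        return TRUE if phi[1] in alpha[0] else FALSE
    if kind == "not":
        inner = progress(alpha, tau, phi[1])
        if inner == TRUE:
            return FALSE
        if inner == FALSE:
            return TRUE
        return ("not", inner)
    if kind == "or":
        left = progress(alpha, tau, phi[1])
        right = progress(alpha, tau, phi[2])
        if TRUE in (left, right):
            return TRUE
        if left == FALSE and right == FALSE:
            return FALSE
        if left == FALSE:
            return right
        if right == FALSE:
            return left
        return ("or", left, right)
    raise ValueError("temporal operators are handled by Algorithms 2-4")
```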
4.3 SMT-based Solution

4.3.1 SMT Entities

SMT entities are the variables used to represent the distributed computation. After we have the verdicts for each of the individual sub-formulas, we use the progression laws discussed in Section 4.2 to construct the formula for the future computations.

Distributed computation. We represent a distributed computation (E, ⇝) by a function f : E → {0, 1, ..., |E| − 1}. To represent the happen-before relation, we define an E × E matrix called hbSet, where hbSet[e^i_σ][e^j_σ'] = 1 represents e^i_σ ⇝ e^j_σ' for e^i_σ, e^j_σ' ∈ E. Also, if |σ − σ'| ≥ ε then hbSet[e^i_σ][e^j_σ'] = 1; otherwise hbSet[e^i_σ][e^j_σ'] = 0. This is done in the pre-processing phase of the algorithm, and in the rest of the chapter we represent events by the set E and the happen-before relation by ⇝ for simplicity. In order to represent the possible times of occurrence of an event, we define a function δ : E → Z≥0, where

∀e^i_σ ∈ E. ∃σ' ∈ [max{0, σ − ε + 1}, σ + ε − 1]. δ(e^i_σ) = σ'

Given an event, we map each possible time of occurrence of the event to the respective probability using a function p : E × Z≥0 → [0, 1], where p(e^i_σ, δ(e^i_σ)) is some real number in the range [0, 1] such that

∀e^i_σ ∈ E. ∀σ1, σ2 ∈ [max{0, σ − ε + 1}, σ + ε − 1]. (σ1 < σ2) → p(e^i_σ, σ1) ≤ p(e^i_σ, σ2)

and

∀e^i_σ ∈ E. p(e^i_σ, σ + ε − 1) = 1; p(e^i_σ, σ − ε + 1) = 0

To connect the events E and the propositions AP over which the MTL formula ϕ is constructed, we define a boolean function µ : AP × E → {true, false}. For formulas involving non-boolean variables (e.g., x1 + x2 ≤ 7), we can update the function µ accordingly. To represent a sequence of consistent cuts that starts from ∅ and ends in E, we introduce an uninterpreted function ρ : Z≥0 → 2^E to reach a verdict, given that it satisfies all the constraints explained in Section 4.3.2. Lastly, to represent the sequence of times associated with the sequence of consistent cuts, we introduce a function τ : Z≥0 → Z≥0.
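The hbSet pre-processing can be sketched as follows. The dictionary representation and the direction convention (an event precedes another when it is earlier on the same process, or when the other event's local timestamp is at least ε later) are our reading of the definition above:

```python
def build_hbset(events, eps):
    """events: dict of event name -> (process, local timestamp).
    Returns hb with hb[e][f] = 1 iff e happens-before f: same process
    and earlier, or f's timestamp is at least eps later than e's."""
    hb = {e: {f: 0 for f in events} for e in events}
    for e, (pe, te) in events.items():
        for f, (pf, tf) in events.items():
            if e != f and ((pe == pf and te < tf) or tf - te >= eps):
                hb[e][f] = 1
    return hb

events = {"e1": ("P1", 0), "e2": ("P1", 1), "e3": ("P2", 5)}
hb = build_hbset(events, eps=2)
```

Events on different processes whose timestamps are within ε of each other remain unordered, which is exactly the concurrency the SMT encoding must explore.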
4.3.2 SMT Constraints

Once we have the necessary SMT entities, we move on to the constraints, both for generating a sequence of consistent cuts and for representing the MTL formula as an SMT constraint.

Consistent cut constraints over ρ. In order to make sure that the sequence of cuts represented by the uninterpreted function ρ is a sequence of consistent cuts, i.e., that they follow the happen-before relations between events in the distributed system:

∀i ∈ [0, |E|]. ∀e, e' ∈ E. ((e' ⇝ e) ∧ e ∈ ρ(i)) → e' ∈ ρ(i)

Next, we make sure that in the sequence of consistent cuts, the number of events present in a consistent cut is one more than the number of events present in the consistent cut before it:

∀i ∈ [0, |E|). |ρ(i + 1)| = |ρ(i)| + 1

Next, we make sure that in the sequence of consistent cuts, each consistent cut includes all the events present in the consistent cut before it, i.e., each cut is a subset of the next cut in the sequence:

∀i ∈ [0, |E|). ρ(i) ⊂ ρ(i + 1)

The sequence of consistent cuts starts from ∅ and ends at E:

ρ(0) = ∅; ρ(|E|) = E

The sequence of times reflects the time of occurrence of the event that has just been added to the sequence of consistent cuts:

∀i ≥ 1. τ(i) = δ(e^i_σ), such that ρ(i) − ρ(i − 1) = {e^i_σ}

We make sure the monotonicity of time is maintained in the sequence of times:

∀i ∈ [0, |E|). τ(i + 1) ≥ τ(i)

Calculating the statistical guarantee over ρ. The statistical guarantee of a verdict is the same as the probability of generating the corresponding trace which yielded the respective verdict. To avoid an iterative process of generating all possible traces, we use a consolidated method which limits the number of traces to be verified.
For all i ≥ 1, if ρ(i) − ρ(i − 1) = {e^i_σ}, then we define two quantities:

σ_start = max{τ(i − 1), σ − ε + 1} and σ_end = δ(e^i_σ)

We define a function P : E × Z≥0 × Z≥0 → [0, 1] which calculates the probability of the range of times of occurrence of the event given by [σ_start, σ_end] as

P(e^i_σ, σ_start, σ_end) = p(e^i_σ, σ_end) − p(e^i_σ, σ_start)

The probability of generating the corresponding trace is given by

Pr(ρ, τ) = ∏_{∀i ∈ [0,|E|]} P(e^i_σ, σ_start, σ_end)

where we aim to maximize each τ(i).

Constraints for MTL formulas over ρ. These constraints make sure that ρ not only represents a valid sequence of consistent cuts but also that the sequence of consistent cuts satisfies the MTL formula. As is evident, a distributed computation can often yield two contradicting evaluations. Thus, we need to check for both satisfaction and violation of all the sub-formulas in the MTL formula provided. Note that monitoring any MTL formula using our progression rules will result in monitoring sub-formulas which are atomic propositions and 'eventually' and 'globally' temporal operators. Below we give the SMT constraint for each of the different sub-formulas. The violation (resp. satisfaction) constraint for an atomic proposition and 'eventually' (resp. 'globally') is the negation of the one given.

ϕ = p:      ⋁_{e ∈ front(ρ(0))} µ[p, e] = true, for p ∈ AP    (satisfaction, i.e., ⊤)
ϕ = □_I φ:  ∃i ∈ [0, |E|]. τ(i) − τ(0) ∈ I ∧ ρ(i) ⊭ φ        (violation, i.e., ⊥)
ϕ = ◇_I φ:  ∃i ∈ [0, |E|]. τ(i) − τ(0) ∈ I ∧ ρ(i) |= φ       (satisfaction, i.e., ⊤)

A satisfiable SMT instance denotes that the uninterpreted function was not only able to generate a valid sequence of consistent cuts, but also that the sequence satisfies the MTL formula given the computation. This result is then fed to the progression cases to generate the final verdict.
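A pure-Python checker for the consistent-cut constraints of Section 4.3.2 shows what a witness ρ must satisfy; it is a brute-force stand-in for what the SMT solver enforces, and all names are illustrative:

```python
def valid_cut_sequence(rho, events, hb):
    """Check a candidate sequence rho of sets of event names: starts
    empty, ends at E, grows by exactly one event per step, and every
    cut is downward-closed under the happen-before relation hb."""
    if len(rho) != len(events) + 1:
        return False
    if rho[0] != set() or rho[-1] != set(events):
        return False
    for i in range(len(events)):
        # strict subset growth by exactly one event
        if not (rho[i] < rho[i + 1] and len(rho[i + 1]) == len(rho[i]) + 1):
            return False
    for cut in rho:
        for e in cut:
            for f in events:
                if hb[f][e] and f not in cut:
                    return False     # a cause is missing from the cut
    return True

events = ["e1", "e2"]
hb = {"e1": {"e1": 0, "e2": 1}, "e2": {"e1": 0, "e2": 0}}
good = valid_cut_sequence([set(), {"e1"}, {"e1", "e2"}], events, hb)
bad = valid_cut_sequence([set(), {"e2"}, {"e1", "e2"}], events, hb)
```

Here `good` holds because e1 ⇝ e2 is respected, while `bad` fails: the cut {e2} contains an event without its cause.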
4.3.3 Segmentation of a Distributed Computation

We know that predicate detection, let alone runtime verification, is NP-complete [Gar02] in the size of the system (number of processes). This complexity grows to higher classes when working with nested temporal operators. To make the problem computationally viable, we chop the computation (E, ⇝) into g segments, (seg1, ⇝), (seg2, ⇝), ..., (segg, ⇝). This involves creating small SMT instances for each of the segments, which improves the runtime of the overall problem. In a computation of length l, if we were to chop it into g segments, each segment would be of length l/g + ε, and the set of events included in it is given by:

segj = { e^i_σ | σ ∈ [max{0, (j − 1) × l/g − ε}, j × l/g) ∧ i ∈ [1, |P|] }

Note that monitoring of a segment should include the events that happened within ε time of the segment actually starting, since the segment might include events that are concurrent with some other events in the system not accounted for in the previous segment.

4.4 Case Study and Evaluation

In this section, we analyze our SMT-based solution. We note that we are not concerned about data collection, data transfer, etc., since, given a distributed setting, the runtime of the actual SMT encoding will be the dominating aspect of the monitoring process. We evaluate our proposed solution using traces collected from benchmark models of the tool UPPAAL [LPY97] (UPPAAL is a model checker for networks of timed automata; the tool-set is accompanied by a set of benchmarks for real-time systems; here, we assume that the components of the network are partially synchronized) (Section 4.4.1) and a case study involving smart contracts over multiple blockchains (Section 4.4.2).

4.4.1 UPPAAL Benchmarks

Setup. Below we explain in detail how each of the UPPAAL models works. With respect to our monitoring algorithm, we consider multiple instances of each of the models as different processes.
Each event consists of the action that was taken along with the time of occurrence of the event. In addition to this, we assume a unique clock for each instance, synchronized by the presence of a clock synchronization algorithm with a maximum clock skew of ε.

The Train-Gate. It models a railway control system which controls access to a bridge for several trains. The bridge can be considered a shared resource and can be accessed by one train at a time. Each train is identified by a unique id, and whenever a new train appears in the system, it sends an appr message along with its id. The gate controller has two options: (1) send a stop message and keep the train in a waiting state, or (2) let the train cross the bridge. Once the train crosses the bridge, it sends a leave message signifying that the bridge is free for any other train waiting to cross.

[Figure 4.6: Train model — states Safe, Appr, Start, Stop, and Cross with transitions appr[id], stop[id], go[id], and leave[id].]

The gate keeps track of the state of the bridge; in other words, the gate acts as the controller of the bridge for the trains. If the bridge is currently not being used, the gate immediately offers any appearing train to go ahead; otherwise it sends a stop message. Once the gate is free again after a train leaves the bridge, it sends out a go message to any train that appeared in the meantime and was waiting in the queue.

[Figure 4.7: Gate model — states Free and Occ with transitions appr[e], go[front()], leave[id], and stop[tail()].]

ϕ1 = (⋀_{i∈P} ¬Train[i].Cross) U Train[1].Cross

ϕ2 = ⋀_{i∈P} (Train[i].Appr → (Gate.Occ U Train[i].Cross))

where P is the set of trains.

The Fischer's Protocol. It is a mutual exclusion protocol designed for n processes. A process always sends a request to enter the critical section (cs). On receiving the request, a unique pid is generated and the process moves to a wait state. A process can only enter the critical section when it has the correct id.
Upon exiting the critical section, the process resets the id, which enables other processes to enter the cs.

[Figure 4.8: Fischer model — states A, req, wait, and cs, with guards id = 0 and id == pid and the assignment id = pid.]

ϕ3 = □ (Σ_{i∈P} P[i].cs ≤ 1)

ϕ4 = ⋀_{i∈P} (P[i].req → ◇_I P[i].cs)

The Gossiping People. The model consists of n people, each having a private secret they wish to share with each other. Each person can Call another person, and after a conversation both persons mutually know all their secrets. With respect to our monitoring problem, we make sure that each person generates a new secret that needs to be shared among the others infinitely often.

[Figure 4.9: Gossiping people model — states Start, Call, and Listen with transitions start(), exchange(), talk(), and listen().]

ϕ5 = ◇_I (⋀_{i,j∈P} (i ≠ j) → Person[i].secret[j])

ϕ6 = ⋀_{i∈P} (□ ◇_I Person[i].secrets)

Each experiment involves three steps: (1) offset calculation for the given distributed system, (2) distributed computation/trace generation, and (3) trace verification. As stated earlier, the value of the offset ranges over (−ε, ε), with 0 signifying that there is no skew between the two processes. To study how the offset distribution affects the statistical guarantee of a verdict, we make use of five different distributions:

• A truncated normal distribution, TX1: (µ = 0, σ = ε/1.5)
• A truncated normal distribution, TX2: (µ = 0, σ = ε/5)
• A uniform distribution, U1: U(−ε, ε)
• A uniform distribution, U2: U(−ε/2, ε/2)
• A sum of two truncated normal distributions, TX3, with (µ = −ε/2, σ = ε/3) and (µ = ε/2, σ = ε/3).

The truncated normal distributions have limits of (−ε, ε). For each UPPAAL model, we consider each pair of consecutive events to be 0.1s apart, i.e., there are 10 events per second per process. For the verification step, our monitoring algorithm executes on the generated computation and verifies it against an MTL specification.
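The offset distributions above can be sampled with simple rejection sampling. In this sketch, TX3 is interpreted as an even mixture of the two truncated normals, and the seeds and sample sizes are arbitrary choices for illustration:

```python
import random

def truncated_normal(mu, sigma, eps, rng):
    """Rejection-sample a normal(mu, sigma) truncated to (-eps, eps)."""
    while True:
        x = rng.gauss(mu, sigma)
        if -eps < x < eps:
            return x

def tx3(eps, rng):
    """Even mixture of two truncated normals at -eps/2 and +eps/2."""
    mu = -eps / 2 if rng.random() < 0.5 else eps / 2
    return truncated_normal(mu, eps / 3, eps, rng)

rng = random.Random(42)
eps = 5.0
tx1 = [truncated_normal(0.0, eps / 1.5, eps, rng) for _ in range(1000)]
u1 = [rng.uniform(-eps, eps) for _ in range(1000)]
mix = [tx3(eps, rng) for _ in range(1000)]
```

All samples fall inside (−ε, ε) by construction, matching the bound assumed when estimating the empirical cdf in Section 4.1.1.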
We consider the following parameters: (1) the time synchronization constant (ε), (2) the MTL formula under monitoring, (3) the number of segments (g), (4) the computation length (l), (5) the number of processes in the system (P), (6) the event rate, and (7) the offset distribution. We study the runtime of our monitoring algorithm against each of these parameters. We use a machine with 2x Intel Xeon Platinum 8180 (2.5 GHz) processors, 768 GB of RAM, and 112 vcores, with gcc version 9.3.1.

Analysis: Runtime. We study each of the parameters individually and analyze how it affects the runtime of our monitoring approach. All results correspond to ε = 15 ms, |P| = 2, g = 15, l = 2 s, an event rate of 10 events/s, ϕ4 as the MTL specification, and U1 as the offset distribution, unless mentioned otherwise. We vary the number of processes in the system from 2 to 4, since in most cross-chain transactions the number of blockchains involved is small.

Impact of different formulas. Fig. 4.10a shows that the runtime of the monitor depends on two factors: the number of sub-formulas and the depth of nested temporal operators. Comparing ϕ3 and ϕ6, both of which consist of the same number of predicates, ϕ6 has nested temporal operators, so it takes more time to verify, and its runtime is comparable to that of ϕ1, which consists of two sub-formulas. This is because verification of the inner temporal formula often requires observing states in the next segment in order to come to the final verdict, which accounts for more monitor runtime.

Impact of epsilon. Increasing the value of the time synchronization constant (ε) increases the possible number of concurrent events that need to be considered. This increases the complexity of verifying the computation, thereby increasing the runtime of the algorithm. In addition, higher values of ε also correspond to more possible traces that should be taken into consideration.
We observe in Fig. 4.10b that the runtime increases exponentially with the value of the time synchronization constant. An interesting observation is that, with a longer segment length, the runtime increases at a higher rate than with a shorter segment length. This is because a longer segment length combined with a higher ε equates to a larger number of possible traces that the monitoring algorithm needs to take into consideration, which increases the overall runtime of the verification algorithm by a considerable amount and at a higher pace.

Impact of segment frequency. Increasing the segment frequency makes each segment shorter, and thus verifying each segment involves fewer events. We observe the effect of segment frequency on the runtime of our verification algorithm in Fig. 4.10c. With increasing segment frequency, the runtime decreases until it reaches a certain value (here, ≈ 0.6), after which the benefit of working with fewer events is outweighed by the time required to set up each SMT instance. Working with a higher number of segments equates to solving more SMT problems for the same computation length. Setting up an SMT problem requires a considerable amount of time, which is seen in the slight increase in runtime for higher values of segment frequency.

Impact of computation length. As can be inferred from the previous results, the runtime of our verification algorithm is largely dictated by the number of events in the computation. Thus, when working with a longer computation, keeping the maximum clock skew and the number of segments constant, we should see a longer verification time as well. The results in Fig. 4.10d support this claim.

Impact of number of truth values per segment.
In order to take into consideration all possible truth values of a computation, we execute the SMT problem multiple times, with the verdicts of all previous executions added to the SMT problem such that no verdict is repeated. In Fig. 4.10e we see that the runtime is affected linearly by an increasing number of distinct verdicts. This is because the complexity of the problem that the SMT solver is trying to solve does not change when evaluating a different solution.

[Figure 4.10: Different parameters' impact on runtime for synthetic data. Panels: (a) different formulas, (b) epsilon, (c) segment frequency, (d) computation length, (e) number of solutions per segment, (f) event rate.]

[Figure 4.11: Different parameters' impact on statistical guarantee for synthetic data. Panels: (a) epsilon, (b) predicate structure.]

Impact of event rate.
Increasing the event rate means more events need to be processed by our verification algorithm per segment, thereby increasing the runtime at an exponential rate, as seen in Fig. 4.10f. We also observe that with a higher number of processes, the rate at which the runtime of our algorithm increases is higher for the same increase in event rate.

Analysis: Statistical Guarantee. Next, we study the effect of different parameters on the statistical guarantee of the verdict computed by the monitor. All results correspond to ε = 20 ms, |P| = 2, g = 15, l = 2 s, an event rate of 10 events/s, and ϕ4 as the MTL specification, unless mentioned otherwise.

Impact of epsilon. As can be imagined, a larger clock skew has a negative impact on the verification result of the system. A larger clock skew leads to more events being considered concurrent, which in turn leads to more possible traces in which the correct order of events is compromised. This leads to a lower statistical guarantee for systems with a larger ε. In our case, as seen in Figure 4.11a, we receive a perfect score when ε = 10 ms, since this makes all the events perfectly ordered. Moreover, the guarantee slides uniformly with increasing values of ε. The other observation from Figure 4.11a is how the guarantee is affected by the different offset distributions. The smaller the standard deviation of the distribution, the closer the time of occurrence of an event is to the global clock. This makes the statistical guarantee of yielding a satisfiable result higher than when the time of occurrence is far from the global clock. Thus, TX2 yields a higher percentage of satisfiable results compared to TX1, which has a larger standard deviation.

Impact of type of logical operator. Here, we compare how the type of logical operator affects the probabilistic guarantee of a verdict.
As can be seen in Figure 4.11b, formulas whose sub-formulas are joined by disjunction have a higher probabilistic percentage than formulas whose sub-formulas are joined by conjunction. This can be explained by how a conjunction is evaluated compared to a disjunction. In the case of disjunction, any one sub-formula evaluating to true rewrites the entire formula to true, whereas in the case of conjunction, all the sub-formulas need to evaluate to true before we reach a verdict of true. This accounts for the satisfiability-percentage difference between formulas built from conjunctions and disjunctions.

4.4.2 Blockchain Setup

We implemented the following cross-chain protocols from [XH21]: two-party swap, multi-party swap, and auction. The protocols are written as smart contracts in Solidity and tested using Ganache, a tool that creates mocked Ethereum blockchains. Using a single mocked chain, we mimicked cross-chain protocols via several (discrete) tokens and smart contracts that do not communicate with each other.

Two-Party Swap Protocol: We use the hedged two-party swap example from [XH21] to describe our experiments. The implementations of the other two protocols are similar. Suppose Alice would like to exchange her apricot tokens for Bob's banana tokens, using the hedged two-party swap protocol shown in Fig. 4.1. This protocol provides protection for the parties compared to a standard two-party swap protocol [Nol13], in that if one party locks their assets for the exchange and the assets are refunded later, this party gets a premium as compensation for locking their assets. The protocol consists of six steps to be executed by Alice and Bob in turn. In our example, we let the amount of tokens they exchange be 100 ERC20 tokens, the premium pb be 1 token, and pa + pb be 2 tokens. We deploy two contracts, one on the apricot blockchain (the contract is denoted ApricotSwap) and one on the banana blockchain (denoted BananaSwap), by mimicking the two blockchains on Ethereum.
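The gap between conjunctive and disjunctive specifications can be reproduced with a toy Monte Carlo experiment. The following Python sketch is our own illustration (not the dissertation's tooling): it treats each sub-formula as an independent event that holds with probability p and estimates how often the combined formula is satisfied:

```python
import random

def sat_rate(n_subformulas, p, combine, trials=20000, seed=42):
    """Estimate how often a combination of n sub-formulas is satisfied,
    where each sub-formula independently holds with probability p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truths = [rng.random() < p for _ in range(n_subformulas)]
        if combine(truths):
            hits += 1
    return hits / trials

conj = sat_rate(4, 0.8, all)  # conjunction: all sub-formulas must hold, ~0.8**4
disj = sat_rate(4, 0.8, any)  # disjunction: one true sub-formula suffices
assert disj > conj
```

This mirrors the observation in Figure 4.11b: a disjunction needs only one true sub-formula, while a conjunction must wait for all of them.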
Denote the time at which the parties reach an agreement on the swap as startTime. ∆ is the maximum time for parties to observe the state changes of contracts made by others and take a step to make changes on the contracts. In our experiment, ∆ = 500 milliseconds. By the definition of the protocol, the execution should be:

• Step 1. Alice deposits 2 tokens as premium in BananaSwap before ∆ elapses after startTime.
• Step 2. Bob deposits 1 token as premium in ApricotSwap before 2∆ elapses after startTime.
• Step 3. Alice escrows her 100 ERC20 tokens to ApricotSwap before 3∆ elapses after startTime.
• Step 4. Bob escrows his 100 ERC20 tokens to BananaSwap before 4∆ elapses after startTime.
• Step 5. Alice sends the preimage of the hashlock to BananaSwap to redeem Bob's 100 tokens before 5∆ elapses after startTime. The premium is refunded.
• Step 6. Bob sends the preimage of the hashlock to ApricotSwap to redeem Alice's 100 tokens before 6∆ elapses after startTime. The premium is refunded.

If all parties are conforming, the protocol is executed as above. Otherwise, asset refund and premium redeem events are triggered to resolve the case where some party deviates. To avoid distraction, we do not provide details here. Each smart contract provides functions to let parties deposit premiums (DepositPremium()), escrow an asset (EscrowAsset()), send a secret to redeem assets (RedeemAsset()), refund the asset if it is not redeemed after a timeout (RefundAsset()), and counterparts for premiums (RedeemPremium() and RefundPremium()). Whenever a function is called successfully (meaning the transaction sent to the blockchain is included in a block), the blockchain emits an event that we then capture and log. The event interface is provided by the Solidity language. For example, when a party successfully calls DepositPremium(), the PremiumDeposited event is emitted on the blockchain.
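Checking a logged execution against these step deadlines amounts to comparing each event's timestamp with its cumulative deadline k·∆ after startTime. A hedged Python sketch follows; the event names and the dictionary log format are our simplification of the actual Solidity events:

```python
DELTA = 0.5  # seconds; the experiment uses delta = 500 ms

# Expected order of the six steps of the hedged two-party swap (our naming).
STEPS = [
    "ban.premium_deposited(alice)",
    "apr.premium_deposited(bob)",
    "apr.asset_escrowed(alice)",
    "ban.asset_escrowed(bob)",
    "ban.asset_redeemed(alice)",
    "apr.asset_redeemed(bob)",
]

def conforming(log, start_time, delta=DELTA):
    """log maps an event name to its timestamp.
    Step k (0-indexed) must occur before (k+1)*delta after start_time."""
    for k, event in enumerate(STEPS):
        if event not in log:
            return False          # step skipped
        if log[event] - start_time >= (k + 1) * delta:
            return False          # step taken too late
    return True

good = {e: 0.4 * (k + 1) for k, e in enumerate(STEPS)}  # every step in time
late = dict(good, **{STEPS[3]: 10.0})                   # Bob escrows too late
assert conforming(good, start_time=0.0)
assert not conforming(late, start_time=0.0)
```

The MTL specifications later in this section express exactly these bounded-deadline obligations with interval operators such as ◊[0,k∆).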
We then capture and log this event, allowing us to view the values of PremiumDeposited's declared fields: the time when it is emitted, the party that initiated DepositPremium(), and the amount of premium sent. Those values are later used by the monitor to check against the specification.

Three-Party Swap Protocol: The three-party swap example we implemented can be described as a digraph with directed edges between Alice, Bob, and Carol. For simplicity, we consider each party to transfer 100 assets. The transfer from Alice to Bob is called ApricotSwap, meaning Alice proposes to transfer 100 apricot tokens to Bob; the transfer from Bob to Carol is called BananaSwap, meaning Bob proposes to transfer 100 banana tokens to Carol; and the transfer from Carol to Alice is called CherrySwap, meaning Carol proposes to transfer 100 cherry tokens to Alice. The different tokens are managed by different blockchains (Apricot, Banana, and Cherry, respectively). We denote the time at which the parties reach an agreement on the swap as startTime. ∆ is the maximum time for parties to observe the state changes of contracts made by others and take a step to make changes on the contracts. According to the protocol, the execution should follow these steps:

• Step 1. Alice deposits 3 tokens as escrow premium in ApricotSwap before ∆ elapses after startTime.
• Step 2. Bob deposits 3 tokens as escrow premium in BananaSwap before 2∆ elapses after startTime.
• Step 3. Carol deposits 3 tokens as escrow premium in CherrySwap before 3∆ elapses after startTime.
• Step 4. Alice deposits 3 tokens as redemption premium in CherrySwap before 4∆ elapses after startTime.
• Step 5. Carol deposits 2 tokens as redemption premium in BananaSwap before 5∆ elapses after startTime.
• Step 6. Bob deposits 1 token as redemption premium in ApricotSwap before 6∆ elapses after startTime.
• Step 7. Alice escrows 100 ERC20 tokens to ApricotSwap before 7∆ elapses after startTime.
• Step 8.
Bob escrows 100 ERC20 tokens to BananaSwap before 8∆ elapses after startTime.
• Step 9. Carol escrows 100 ERC20 tokens to CherrySwap before 9∆ elapses after startTime.
• Step 10. Alice sends the preimage of the hashlock to CherrySwap to redeem Carol's 100 tokens before 10∆ elapses after startTime.
• Step 11. Carol sends the preimage of the hashlock to BananaSwap to redeem Bob's 100 tokens before 11∆ elapses after startTime.
• Step 12. Bob sends the preimage of the hashlock to ApricotSwap to redeem Alice's 100 tokens before 12∆ elapses after startTime.

If all parties are conforming, the protocol is executed as above. Otherwise, asset refund and premium redeem events will be triggered to resolve the case where some party deviates. To avoid distraction, we do not provide details here.

Liveness: A liveness property of a program asserts that something good eventually happens. In other words, a liveness property describes something that must happen during an execution. Below is the liveness specification, i.e., that all the steps of the protocol have been taken:

ϕ_liveness = ◊[0,∆) apr.depositEscrowPr(alice) ∧ ◊[0,2∆) ban.depositEscrowPr(bob) ∧ ◊[0,3∆) che.depositEscrowPr(carol) ∧ ◊[0,4∆) che.depositRedemptionPr(alice) ∧ ◊[0,5∆) ban.depositRedemptionPr(carol) ∧ ◊[0,6∆) apr.depositRedemptionPr(bob) ∧ ◊[0,7∆) apr.assetEscrowed(alice) ∧ ◊[0,8∆) ban.assetEscrowed(bob) ∧ ◊[0,9∆) che.assetEscrowed(carol) ∧ ◊[0,10∆) che.hashlockUnlocked(alice) ∧ ◊[0,11∆) ban.hashlockUnlocked(carol) ∧ ◊[0,12∆) apr.hashlockUnlocked(bob) ∧ ◊ assetRedeemed(alice) ∧ ◊ assetRedeemed(bob) ∧ ◊ assetRedeemed(carol) ∧ ◊ EscrowPremiumRefunded(alice) ∧ ◊ EscrowPremiumRefunded(bob) ∧ ◊ EscrowPremiumRefunded(carol) ∧ ◊ RedemptionPremiumRefunded(alice) ∧ ◊ RedemptionPremiumRefunded(bob) ∧ ◊ RedemptionPremiumRefunded(carol)

Safety: A safety property of a program asserts that nothing bad happens during execution.
In other words, a safety property describes something that must not happen during an execution. Below is the specification to check whether an individual party is conforming. If a party is found to be conforming, we ensure that there is no negative payoff for that party. Specification to check that Alice is conforming:

ϕ_alice_conf = ◊[0,∆) apr.depositEscrowPr(alice) ∧ (◊[0,3∆) che.depositEscrowPr(carol) → ◊[0,4∆) che.depositRedemptionPr(alice) ∧ (¬che.depositRedemptionPr(alice) U che.depositEscrowPr(carol))) ∧ (◊[0,6∆) apr.depositRedemptionPr(bob) → ◊[0,7∆) apr.assetEscrowed(alice) ∧ (¬apr.assetEscrowed(alice) U apr.depositRedemptionPr(bob))) ∧ (◊[0,9∆) che.assetEscrowed(carol) → ◊[0,10∆) che.hashlockUnlocked(alice) ∧ (¬che.hashlockUnlocked(alice) U che.assetEscrowed(carol)) ∧ (¬ban.hashlockUnlocked(carol) U che.hashlockUnlocked(alice)) ∧ (¬apr.hashlockUnlocked(bob) U che.hashlockUnlocked(alice)))

Specification to check that a conforming Alice does not have a negative payoff:

ϕ_alice_safety = ϕ_alice_conf → (Σ_{TransTo = alice} amount ≥ Σ_{TransFrom = alice} amount)

Hedged: Below is the specification to check that, if a party is conforming and its escrowed asset is refunded, then it gets a premium as compensation:

ϕ_alice_hedged = (ϕ_alice_conf ∧ apr.assetEscrowed(alice)) → (Σ_{TransTo = alice} amount ≥ Σ_{TransFrom = alice} amount + apr.redemptionPremium.amount)

Auction Protocol: In the auction example, we consider Alice to be the auctioneer, who would like to sell a ticket (worth 100 ERC20 tokens) on the ticket (tckt) blockchain, while Bob and Carol bid on the coin blockchain. The winner should get the ticket and pay the auctioneer what they bid, and the loser will be refunded. We denote the time at which the parties reach an agreement on the auction as startTime. ∆ is the maximum time for parties to observe the state changes of contracts made by others and take a step to make changes on the contracts.
Let TicketAuction be a contract managing the "ticket" on the ticket blockchain, and CoinAuction be a contract managing the bids on the coin blockchain. The protocol is briefed as follows.

• Setup. Alice generates two hashes, h(sb) and h(sc). h(sb) is assigned to Bob and h(sc) is assigned to Carol. If Bob is the winner, then Alice releases sb. If Carol is the winner, then Alice releases sc. If both sb and sc are released in TicketAuction, then the ticket is refunded. If both sb and sc are released in CoinAuction, then all coins are refunded. In addition, Alice escrows her ticket as 100 ERC20 tokens in TicketAuction and deposits 2 tokens as premiums in CoinAuction.
• Step 1 (Bidding). Bob and Carol bid before ∆ elapses after startTime.
• Step 2 (Declaration). Alice sends the winner's secret to both chains to declare a winner before 2∆ elapses after startTime.
• Step 3 (Challenge). Bob and Carol challenge if they see two secrets or one secret missing, i.e., Alice cheats, before 4∆ elapses after startTime. They challenge by forwarding the secret released by Alice using a path signature scheme [Her18].
• Step 4 (Settle). After 4∆ elapses after startTime, on CoinAuction, if only the hashlock corresponding to the actual winner is unlocked, then the winner's bid goes to Alice. Otherwise, the winner's bid is refunded. The loser's bid is always refunded. If the winner's bid is refunded, all bidders, including the loser, get 1 token as premium to compensate them. On TicketAuction, if only one secret is released, then the ticket is transferred to the party who is assigned the hash of that secret. Otherwise, the ticket is refunded.

Liveness: A liveness property of a program asserts that something good eventually happens. In other words, a liveness property describes something that must happen during an execution.
Below is the specification to check that, if all parties are conforming, the winner (Bob) gets the ticket and the auctioneer gets the winner's bid.

ϕ_liveness = ◊[0,∆) coin.bid(bob) ∧ ◊[0,2∆) coin.declaration(alice, sb) ∧ ◊[0,2∆) tckt.declaration(alice, sb) ∧ ◊(4∆,∞) coin.redeemBid(any) ∧ ◊(4∆,∞) coin.refundPremium(any) ∧ (coin.bid(carol) → (◊[0,∆) coin.refundBid(any) ∧ tckt.redeemTicket(any) ∧ ¬coin.challenge(any) ∧ ¬tckt.challenge(any)))

Safety: A safety property of a program asserts that nothing bad happens during execution. In other words, a safety property describes something that must not happen during an execution. Below is the specification to check that, if a party is conforming, this party does not end up worse off. Take Bob (the winner) for example. Specification to define that Bob is conforming:

ϕ_bob_conform = ◊[0,∆) coin.bid(bob)
∧ ((coin.declaration(alice, sc) ∨ coin.challenge(carol, sc)) → (tckt.declaration(alice, sc) ∨ tckt.challenge(carol, sc) ∨ tckt.challenge(bob, sc)))
∧ ((coin.declaration(alice, sb) ∨ coin.challenge(carol, sb)) → (tckt.declaration(alice, sb) ∨ tckt.challenge(carol, sb) ∨ tckt.challenge(bob, sb)))
∧ ((tckt.declaration(alice, sc) ∨ tckt.challenge(carol, sc)) → (coin.declaration(alice, sc) ∨ coin.challenge(carol, sc) ∨ coin.challenge(bob, sc)))
∧ ((tckt.declaration(alice, sb) ∨ tckt.challenge(carol, sb)) → (coin.declaration(alice, sb) ∨ coin.challenge(carol, sb) ∨ coin.challenge(bob, sb)))

Specification to define that Bob does not end up worse off:

ϕ_bob_safety = ϕ_bob_conform → ((coin.refundBid(any) ∧ coin.redeemPremium(any)) ∨ tckt.redeemTicket(any))

Hedged: Below is the specification to check that, if a party is conforming and its escrowed asset is refunded, then it gets a premium as compensation.
ϕ_bob_hedged = (ϕ_bob_conform ∧ (tckt.refundTicket(alice) ∨ tckt.redeemTicket(carol))) → (coin.refundBid(any) ∧ coin.redeemPremium(any))

Log Generation and Monitoring: Our tests simulate different executions of the protocols and generated 1024, 4096, and 3888 different sets of logs for the aforementioned protocols, respectively. We again use the hedged two-party swap as an example to show how we generate different logs to simulate different executions of the protocol. On each contract, we enforce the order in which those steps are executed. For example, Step 3, EscrowAsset() on the ApricotSwap, cannot be executed before Step 1 is taken, i.e., the premium is deposited. This enforcement in the contract restricts the number of possible different states of the contract. Assume we use a binary indicator to denote whether a step is attempted by the corresponding party: 1 denotes that a step is attempted, and 0 denotes that the step is skipped. If the previous step is skipped, then the later step does not need to be attempted, since it will be rejected by the contract. We use an array to denote whether each step is taken on each contract. On each contract, the different executions of those steps can be [1,1,1], meaning all steps are attempted, [1,1,0], meaning the last step is skipped, and so on. Each chain has 4 different executions. We take the Cartesian product of the arrays of the two contracts to simulate different combinations of executions on the two contracts. Furthermore, if a step is attempted, we also simulate whether the step is taken late or in time. Thus, we have 2^6 possibilities for those 6 steps. In summary, we succeeded in generating 4 · 4 · 2^6 = 1024 different logs. In our testing, after deploying the two contracts, we iterate over a 2D array of size 1024 × 12, each time taking one possible execution, denoted as an array of length 12, to simulate the behavior of the participants.
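The 4 · 4 · 2^6 counting can be reproduced mechanically. A minimal Python sketch of the enumeration (our own illustration, not the Truffle test harness):

```python
from itertools import product

def contract_executions(n_steps=3):
    """Per-contract step patterns: once a step is skipped, all later steps
    are skipped too -> [1,1,1], [1,1,0], [1,0,0], [0,0,0]."""
    return [[1] * k + [0] * (n_steps - k) for k in range(n_steps, -1, -1)]

def all_logs():
    """Cartesian product of the two contracts' patterns, times the 2^6
    in-time/late choices for the six steps."""
    per_chain = contract_executions()  # 4 patterns per chain
    logs = []
    for apricot, banana, timing in product(per_chain, per_chain,
                                           product([0, 1], repeat=6)):
        attempted = apricot + banana   # 6 attempt bits across both contracts
        # interleave: even index = attempted?, odd index = in time (1) or late (0)
        row = [b for pair in zip(attempted, timing) for b in pair]
        logs.append(row)
    return logs

logs = all_logs()
assert len(logs) == 4 * 4 * 2 ** 6 == 1024
assert all(len(row) == 12 for row in logs)
```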
For example, [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] stands for the first step being attempted but late, and all the following steps being attempted in time. Indexed from 0, each even index denotes whether a step is attempted or not, and the following odd index denotes whether that step is attempted in time or late. Based on the indicators in the array, we let the parties attempt to call a function of the contract or just skip it. In this way, we produce 1024 different logs containing the events emitted in each iteration. We check the policies mentioned in [XH21]: liveness, safety, and the ability to hedge against sore loser attacks. Liveness means that Alice should deposit her premium on the banana blockchain within ∆ from when the swap started (◊[0,∆) ban.premium_deposited(alice)), and then Bob should deposit his premium; then they escrow their assets to exchange, redeem their assets (i.e., the assets are swapped), and the premiums are refunded. In our testing, we always call a function to settle all assets in the contract if the asset transfer is triggered by a timeout. Thus, in the specification, we also check that all assets are settled:

ϕ_liveness = ◊[0,∆) ban.premium_deposited(alice) ∧ ◊[0,2∆) apr.premium_deposited(bob) ∧ ◊[0,3∆) apr.asset_escrowed(alice) ∧ ◊[0,4∆) ban.asset_escrowed(bob) ∧ ◊[0,5∆) ban.asset_redeemed(alice) ∧ ◊[0,6∆) apr.asset_redeemed(bob) ∧ ◊[0,5∆) ban.premium_refunded(alice) ∧ ◊[0,6∆) apr.premium_refunded(bob) ∧ ◊[6∆,∞) apr.all_asset_settled(any) ∧ ◊[5∆,∞) ban.all_asset_settled(any)

Safety is provided only for conforming parties, since if one party deviates and behaves unreasonably, it is out of the scope of the protocol to protect them. Alice should always deposit her premium first to start the execution of the protocol (◊[0,∆) ban.premium_deposited(alice)) and proceed if Bob proceeds with the next step.
For example, if Bob deposits his premium, then Alice should always go ahead and escrow her asset for the exchange (◊[0,2∆) apr.premium_deposited(bob) → ◊[0,3∆) apr.asset_escrowed(alice)). Alice should never release her secret if she does not redeem, which means Bob should not be able to redeem unless Alice redeems; this is expressed as ¬apr.asset_redeemed(bob) U ban.asset_redeemed(alice):

ϕ_alice_conform = ◊[0,∆) ban.premium_deposited(alice) ∧ (◊[0,2∆) apr.premium_deposited(bob) → ◊[0,3∆) apr.asset_escrowed(alice)) ∧ (◊[0,4∆) ban.asset_escrowed(bob) → ◊[0,5∆) ban.asset_redeemed(alice)) ∧ (¬apr.asset_redeemed(bob) U ban.asset_redeemed(alice))

By definition, safety means that a conforming party does not end up with a negative payoff. We track the assets transferred from parties and to parties in our logs. Thus, a conforming party, e.g., Alice, being safe is specified as ϕ_alice_safety:

ϕ_alice_safety = ϕ_alice_conform → (Σ_{TransTo = alice} amount ≥ Σ_{TransFrom = alice} amount)

To enable a conforming party to hedge against the sore loser attack if they escrow assets for the exchange and the assets are refunded in the end, our protocol should guarantee that this party gets a premium as compensation, which is expressed as ϕ_alice_hedged:

ϕ_alice_hedged = (ϕ_alice_conform ∧ apr.asset_escrowed(alice) ∧ apr.asset_refunded(any)) → (Σ_{TransTo = alice} amount ≥ Σ_{TransFrom = alice} amount + apr.premium.amount)

Analysis of Results: We put our monitor to the test on the traces generated by the Truffle-Ganache framework. To monitor the 2-party swap protocol, we do not divide the trace into multiple segments, due to the low number of events involved in the protocol. On the other hand, both the 3-party swap and the auction protocol involve a higher number of events, and thus we divide the trace into two segments (g = 2). In Fig. 4.12a, we show how the runtime of the monitor is affected by the number of events in each transaction log.
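The safety and hedged specifications above boil down to comparing ledger sums over the logged transfers. A minimal Python sketch (the transfer-tuple format and names are our simplification of the logged blockchain events):

```python
def payoff_safe(transfers, party, premium=0.0):
    """transfers: list of (sender, receiver, amount) from the logs.
    Safety: total received >= total sent (+ premium when hedging applies)."""
    received = sum(a for s, r, a in transfers if r == party)
    sent = sum(a for s, r, a in transfers if s == party)
    return received >= sent + premium

# Conforming run: Alice escrows 100 tokens and receives Bob's 100 in return.
swap = [("alice", "contract_apr", 100.0), ("contract_ban", "alice", 100.0)]
assert payoff_safe(swap, "alice")

# Sore-loser run: Alice's 100 are refunded and she also receives the premium.
refund = [("alice", "contract_apr", 100.0),
          ("contract_apr", "alice", 100.0),
          ("contract_apr", "alice", 1.0)]   # premium of 1 token
assert payoff_safe(refund, "alice", premium=1.0)
```

In the actual monitor, these sums are evaluated by the SMT backend as part of ϕ_alice_safety and ϕ_alice_hedged rather than by a standalone function.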
Additionally, we generate transaction logs with different values for the deadline (∆) and the time synchronization constant (ε) to put the safety of the protocol in jeopardy.

[Figure 4.12: Results from the blockchain experiments. Panels: (a) runtime vs. the number of events (∆ = 500 ms) for the 2-party swap (g = 1), 3-party swap (g = 2), and auction (g = 2); (b) statistical guarantee.]

We observe both true and false verdicts when ε ≈ ∆, as seen in Figure 4.12b. This is due to the nondeterministic timestamps owing to the assumption of a partially synchronous system. The observed timestamp of each event can be off by at most ε. Thus, we recommend using a value of ∆ that is strictly greater than the value of ε when designing the smart contract.

4.5 Summary and Limitation

In this chapter, we propose a monitoring technique that takes an MTL formula and a distributed computation as input. We apply a progression-based formula-rewriting monitoring algorithm, implemented as an SMT decision problem, in order to verify the correctness of the distributed system with respect to the formula. We also conduct extensive synthetic experiments on traces generated by the tool UPPAAL and on a set of blockchain smart contracts for cross-chain transactions. However, as discussed in Section 4.4, the approach does not scale well when considering larger distributed systems. Currently, the monitoring runtime increases exponentially with an increase in the number of processes or events being monitored. This is a big limiting factor when designing a verification approach that can work in real time.

Chapter 5 Fault-Tolerant Runtime Verification of Synchronous Distributed Systems

5.1 Introduction

In this chapter, we introduce an RV technique for fault-tolerant decentralized monitoring that inspects an underlying distributed system.
Our RV framework has the following features:

• We assume that a set of monitors is distributed over a synchronous communication network. The network is a complete graph, allowing all monitors to communicate with each other using point-to-point message passing in synchronous rounds.
• Each monitor is subject to crash failures. A crashed monitor halts permanently and never recovers.
• Each monitor has only a partial view of the underlying system. More specifically, given a set AP of atomic propositions that describe the global state of the system, each monitor reads only an arbitrary proper subset of AP.
• The formal specification language is the popular linear temporal logic (LTL) [MP79], where formulas are inductively constructed using the propositions in AP and operators that describe the temporal order of events.

(To appear: Ritam Ganguly, Shokufeh Kazemloo, and Borzoo Bonakdarpour, Crash-Resilient Decentralized Synchronous Runtime Verification, IEEE Transactions on Dependable and Secure Computing.)

Our goal is to design a distributed monitoring algorithm with the following properties:

• Soundness: Upon termination, all local monitors compute the same monitoring verdict as a centralized monitor that can atomically observe the global state of the system.
• Low overhead: One way for local monitors to share their observations of the underlying system is to communicate their readings of AP with each other in synchronous communication rounds. However, this incurs a message size of O(|AP|), which is exponential in the number of system variables. Thus, our goal is to find a more efficient way for local monitors to communicate their partial observations without compromising soundness.

Our main contribution in this chapter is a decentralized synchronous t-resilient RV algorithm, where t is the upper bound on the number of crash failures of monitors.
Given a new global state, each monitor process computes a symbolic representation of its reading of AP and starts t+1 rounds of synchronous communication with the other monitors in the network. The number of rounds is inspired by solutions to the consensus problem in synchronous networks, though in our problem, the monitors need to agree on a verdict that is not known a priori; they collaboratively compute the verdict during the rounds of communication. The symbolic representation is computed by employing a deterministic finite-state automaton for monitoring formulas in linear temporal logic (LTL). We show that the monitor automaton as constructed by the algorithm in [BLS11] cannot guarantee soundness in a distributed synchronous setting. Subsequently, we propose an algorithm that transforms the automaton into another one by adding a minimum number of extra states and transitions to address cases where local monitors run into indistinguishable states due to their partial observations. In order to minimize the size of the transformed automaton, we formulate an offline optimization problem in satisfiability modulo theories (SMT). The size of the SMT instance is expected to be small, as most practical LTL formulas are known to have at most a few nested temporal operators. Even if the size of the transformed monitor is not minimized, the size of each message will be O(log(|M³ϕ|) · |AP|), where M³ϕ denotes the finite-state automaton for monitoring an LTL formula ϕ in the 3-valued semantics, as constructed in [BLS11]. In short, our RV framework has message complexity

O(log(|M³ϕ|) · |AP| · n² · (t+1))

for evaluating each global state, where n is the number of distributed monitors and t is the bound on the number of crash failures.
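The t+1-round structure can be illustrated with a small flooding simulation. This is our own sketch of the standard synchronous crash-failure argument, not the dissertation's algorithm: in each round, alive monitors broadcast everything they know, and a crashing monitor may reach only a subset of its peers; with at most t crashes, some round is crash-free, after which all survivors share the same view.

```python
def flood(observations, t, crash_schedule=None):
    """Synchronous flooding for t+1 rounds under at most t crash failures.
    observations: dict monitor_id -> frozenset of partial observations.
    crash_schedule: dict round -> (crashing_id, set of ids it still reaches).
    Returns the knowledge of each surviving monitor."""
    crash_schedule = crash_schedule or {}
    alive = set(observations)
    know = {m: set(obs) for m, obs in observations.items()}
    for rnd in range(1, t + 2):                # t + 1 rounds
        inbox = {m: set() for m in alive}
        crash = crash_schedule.get(rnd)
        for sender in list(alive):
            if crash and sender == crash[0]:
                receivers = crash[1] & alive   # partial send, then crash
                alive.discard(sender)
            else:
                receivers = alive
            for r in receivers:
                inbox[r] |= know[sender]
        for m in alive:
            know[m] |= inbox[m]                # local computation step
    return {m: frozenset(know[m]) for m in alive}

# n = 4 monitors, t = 1: monitor 0 crashes in round 1, reaching only monitor 1,
# which relays monitor 0's observation to everyone in round 2.
obs = {m: frozenset({f"p{m}"}) for m in range(4)}
result = flood(obs, t=1, crash_schedule={1: (0, {1})})
assert len(set(result.values())) == 1          # all survivors agree on one view
```

Our actual algorithm exchanges symbolic automaton states rather than raw observation sets, which is what yields the logarithmic factor in the message complexity above.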
An important implication of our results is that unlike the asynchronous fault-prone setting, where one needs to increase the number of truth values in the specification language to design consistent distributed monitors [FRT14, FRRT14, BFR+16], in this chapter we show that in a fault-prone synchronous setting, the number of truth values is irrelevant for sound distributed monitoring. To enhance efficiency further, we limit the number of rounds to the maximum number of crashes that are still possible in the system at the current state, rather than keeping it constant at t, thus reducing the average number of rounds. Also, to limit the total number of messages sent between monitors, we let communication happen only after every l states; the partial observations of all previous l states are preserved for communication. This considerably decreases the number of messages sent for inter-monitor communication, at the cost of an increase in the average size of a message, due to the higher number of possible states the monitor automaton can be in. We have implemented and evaluated our approach on a variety of LTL formulas, for traces generated using different random distributions as well as an IoT dataset, Orange4Home [CLRC17]. We analyze the average number of rounds and the total number of messages sent in the system for different values of t and l. We also analyze the change in the average number of rounds, the total number of messages, and the average size of a message, along with the total number of monitor crashes in the system, for different lengths of execution traces.

5.2 Model of Computation

An LTL3 monitor as defined in Definition 3 can evaluate an LTL formula ϕ with respect to a finite execution, where each event represents the full view of the system under inspection. From now on, we refer to such events as global events, where the values of all propositions in the event are known. While this model is realistic in a centralized setting, it is too abstract in a distributed setting.
We now present our computation model.

5.2.1 Overall Picture

We consider a distributed monitoring system comprising a fixed number n of monitor processes M = {M1, M2, ..., Mn} that communicate with each other by sending and receiving messages through point-to-point bidirectional communication links. (To prevent confusion, we refer to monitors in M as 'monitor processes' and to the one defined in Definition 3 as the 'LTL3 monitor'.) We assume that the communication graph is synchronous and complete. Each communication link is reliable, that is, we assume no loss or alteration of messages. Each monitor process locally executes an identical sequential algorithm. Each run of a monitor process consists of a sequence of rounds that are identified by the successive positive integers 1, 2, etc. The round number is a global variable and its progress is ensured by the synchrony assumption [Lyn96]. Each round is made up of three consecutive steps: send, receive, and local computation. The principal property of the round-based synchronous model is that a message sent by a monitor Mi to another monitor Mj, for all i, j ∈ [1, n], during a round r is received by Mj in that very same round r. Each monitor process can start a new round when the current one is complete. Throughout this section, the system under inspection produces a finite trace α = s0 s1 ··· sk, and is inspected with respect to an LTL formula ϕ by a set of synchronous distributed monitor processes. Informally, our synchronous distributed monitoring architecture works as follows. For every j ∈ [0, k], between each two consecutive global events sj and sj+1, each monitor process Mi, where i ∈ [1, n] (we will generalize this event-by-event approach in Section 5.4): 1. reads the values of the propositions in sj that are visible to Mi, which results in a partial observation of sj; 2.
at every synchronous round, broadcasts a message containing its current observation of the underlying system, and then waits to receive similar messages from the other monitor processes; 3. based on the messages received in each round, updates its current observation by incorporating the partial observations of the other monitor processes, and composes the message to be sent in the next round; and 4. finally, after t + 1 rounds of communication, evaluates ϕ and emits a truth value from B3, where t is the upper bound on the number of monitor process crash failures.

5.2.2 Detailed Description

We now delve into the details of our computation model (see Algorithm 5). When an event sj is reached in a finite trace α = s0 s1 ··· sk, each monitor process Mi ∈ M, where i ∈ [1, n], attempts to read sj (Line 2 in Algorithm 5). Due to distribution, this results in Mi obtaining a partial view S_i^{sj}, defined next.

Definition 9. A partial view is a function S : AP → {true, false, \}, i.e., a mapping from the set of atomic propositions to the values true, false, or \. The latter denotes an unknown value for a proposition.

Notice that the unknown value '\' for a proposition is different from the unknown truth value '?' in the LTL3 semantics.

Definition 10. We say that a partial view S is consistent with a global event s ∈ Σ (denoted S ⊑ s) if for every atomic proposition p ∈ AP, we have:

(S(p) = true ⇒ p ∈ s) ∧ (S(p) = false ⇒ p ∉ s).

Hence, a partial view S is consistent with an event s if, whenever the value of an atomic proposition is not unknown, it is in agreement with s.

Algorithm 5: Behavior of Monitor Mi, for i ∈ [1, n].
Input: LTL formula ϕ and finite trace s0 s1 ··· sk
Output: a verdict from B3
1: for j = 0 to k do
2:   Let S_i^{sj} be the initial partial view of monitor Mi
3:   LS_i^1 ← µ(S_i^{sj}, ϕ)
4:   for r = 1 to t + 1 do
5:     Send: broadcast symbolic view LS_i^r
6:     Receive: Π_i^r ← {LS_j^r}_{j∈[1,n]}
7:     Computation: LS_i^{r+1} ← LC(Π_i^r)
8:   end for
9: end for
10: Emit a verdict from B3

Monitor processes observe the system under inspection by reading partial views. We denote the partial view of a monitor process Mi of an event s ∈ Σ by S_i^s and assume that S_i^s ⊑ s. This implies that two monitors Mi and Ml cannot have inconsistent partial views of the same global event. That is, for any event s, partial views S_i^s and S_l^s, and every p ∈ AP, we have:

(S_i^s(p) ≠ S_l^s(p)) ⇒ (S_i^s(p) = \ ∨ S_l^s(p) = \).

In Algorithm 5, one way for monitor processes to share their observation of the system is to communicate their partial views. This way, after several rounds of communication (due to the occurrence of faults), all monitor processes can construct the full global event. Although this idea works in principle, it is quite inefficient, as the size of each message will have to be at least |AP| bits. Our goal is to design a technique where monitor processes can communicate their observations without sending and receiving their partial views of atomic propositions. To this end, we introduce the notion of a symbolic view that is intended to represent the partial view of a monitor process Mi without losing information. We denote the symbolic view of a partial view S_i^s with respect to an LTL formula ϕ by LS_i = µ(S_i^s, ϕ) (see Line 3 in Algorithm 5). In Section 5.3, we will present a concrete way of computing µ. Let LS_i^r denote the symbolic view of monitor process Mi at the beginning of round r. In Line 5, each monitor process sends its current symbolic view to all other monitor processes and then receives the symbolic views of all monitor processes in Line 6.
Let Π_i^r = {LS_l^r}_{l∈[1,n]} be the set of all messages received by monitor process Mi during round r. (We note that if some monitor process crashes while another monitor is receiving messages in Line 6, the latter will not receive n messages as prescribed by the algorithm. In synchronous algorithms, by the synchrony assumption, a crash failure can easily be detected and hence the accurate number of messages to receive can be determined.) Then (Line 7), the monitor computes the new symbolic view from the messages it received, using a function LC (described in detail in Section 5.3). This new view will be broadcast during the next round. In order to achieve sound monitoring, we assume the full event in the system is observed by the set M of monitor processes. We call this assumption event coverage. More specifically, we say that a set of monitor processes covers a global event if and only if the collection of partial views of these monitor processes covers the values of all atomic propositions.

Definition 11. A set M = {M1, M2, ..., Mn} satisfies event coverage for an event s if and only if for every p ∈ AP, there exists Mi ∈ M such that S_i^s(p) ≠ \.

5.2.3 Fault Model

Each monitor process is subject to crash faults, i.e., it may halt and never recover. We assume that up to t monitor processes can crash, where t < |M|. A monitor process may crash at any round. To ensure event coverage, we assume that if there is a proposition p ∈ AP such that at round r monitor process Mi is the only monitor aware of p, then the message sent by Mi at round r must be received by at least one non-faulty monitor in round r. This is a reasonable assumption and can be implemented by including redundant monitors, that is, enough monitors to ensure event coverage (e.g., by using triple modular redundancy).

5.2.4 Problem Statement

Our formal problem statement is the termination requirement for Algorithm 5.
We require that when a non-faulty monitor process runs Algorithm 5 to the end, it emits the verdict that a centralized monitor with a global view of the system would compute:

∀i ∈ [1, n] : Mi is non-faulty ⇒ ν_i = [α |=3 ϕ]

where α ∈ Σ*, ϕ is an LTL formula, and ν_i is the truth value emitted by monitor Mi at the end of Algorithm 5. It is easy to see that our decentralized synchronous monitoring problem, where monitor processes are subject to crash faults, is in spirit similar to the uniform consensus problem [Lyn96]. The main difference is that in consensus, processes need to agree on one of the values that they own. In our problem, they should agree on the value [α |=3 ϕ], while none of the monitors necessarily has this value before the inner for-loop. In Section 5.4, we will show that, similar to synchronous consensus, if t monitors may fail, t + 1 rounds of communication are sufficient to agree on the final verdict.

5.3 The General Idea and Motivating Example

In Algorithm 5, we provided the skeleton of our synchronous monitoring algorithm. What remains to be done is identifying concrete functions µ and LC. Our general idea is described in the sequel and is reflected in Algorithm 6, which refines Algorithm 5.

5.3.1 Symbolic View µ

As mentioned in Section 5.2, sharing explicit partial views is not space efficient, as each message needs at least |AP| bits. To tackle this problem, our idea is that each monitor process employs an LTL3 monitor, as defined in Definition 3, and the symbolic view of a monitor process consists of the set of possible LTL3 monitor states that corresponds to its partial view. Formally, let q be the current state of the LTL3 monitor and S be the partial view of the monitor process. The set of possible next LTL3 monitor states can be computed as follows:

µ(S, q) = { q' | ∃s ∈ Σ. S ⊑ s ∧ δ(q, s) = q' }     (5.1)

Figure 5.1: LTL3 monitor for ϕ = ♦(a ∧ b).
Recall that δ denotes the transition function of LTL3 monitors. For example, consider the LTL formula ϕ = ♦(a ∧ b). The LTL3 monitor of this formula is shown in Fig. 5.1, where λ(q0) = ? and λ(q⊤) = ⊤. Let us imagine that (1) a monitor process M1 is currently in state q0, (2) the global event is s = {a, b}, and (3) the current partial view of M1 is S_1^s(a) = true and S_1^s(b) = true. This implies that monitor M1 considers q⊤ as the only possible next LTL3 monitor state, i.e., µ(S_1^s, q0) = {q⊤}. However, given another partial view S_1^s(a) = true and S_1^s(b) = \, monitor process M1 will have to consider {q0, q⊤} as the possible next LTL3 monitor states, because it has to account for both possibilities for proposition b. That is, µ(S_1^s, q0) = {q0, q⊤}. We use µ as defined in Equation (5.1) to compute the concrete symbolic view in Line 4 of Algorithm 6.

5.3.2 Computing LC

Given the sets of possible LTL3 monitor states computed by µ, in Line 7 of Algorithm 6, each monitor process receives a set of possible states from every other monitor, denoted LS_i^r for each monitor process Mi, where i ∈ [1, n], and each communication round r. Our idea for computing LC from these sets is to simply take their intersection. The intuition behind intersection is that it represents the conjunction of the partial views of all monitors. That is, in Line 8 of Algorithm 6, we have:

LC(Π_i^r) = ⋂_{l∈[1,n]} LS_l^r.     (5.2)

Algorithm 6: Updated behavior of Monitor Mi, for i ∈ [1, n].
Input: LTL3 monitor M^3_ϕ = ⟨Σ, Q, q0, δ, λ⟩, finite trace s0 s1 ··· sk
Output: verdict from B3
1: q_current ← q0
2: for j = 0 to k do
3:   Let S_i^{sj} be the initial partial view of the monitor
4:   LS_i^1 ← µ(S_i^{sj}, q_current)    ▷ Equation (5.1)
5:   for r = 1 to t + 1 do
6:     Send: broadcast symbolic view LS_i^r
7:     Receive: Π_i^r ← {LS_j^r}_{j∈[1,n]}
8:     Computation: LS_i^{r+1} ← LC(Π_i^r)
▷ Equation (5.2)
9:   end for
10:  q_current ← the unique state in LS_i^{t+2}
11: end for
12: return λ(q_current)

5.3.3 Motivating Example

The above general ideas for computing µ and LC have one problem. In Line 10, one final LTL3 monitor state should determine the final output, but in some cases the partial views of two monitors are too coarse, and applying intersection to them cannot compute the LTL3 monitor state that represents the aggregate knowledge of the monitors. For example, consider again the LTL3 monitor for the formula ♦(a ∧ b) in Fig. 5.1. Suppose that we have a global event s = {a, b}, two monitors M1 and M2, both at the initial state q0, and two partial views, where M1 knows the value of a and M2 knows the value of b. That is,

S_1^s(a) = true    S_1^s(b) = \
S_2^s(a) = \       S_2^s(b) = true

These monitors will compute µ as follows: µ(S_1^s, q0) = µ(S_2^s, q0) = {q0, q⊤}. Applying intersection to µ(S_1^s, q0) and µ(S_2^s, q0) results in the same set {q0, q⊤}. At this point, no matter how many times the monitor processes communicate, at the end of the inner for-loop LS will not become a singleton, and in Line 10 q_current cannot be determined properly. This scenario is particularly problematic, since the collective knowledge of M1 and M2 (i.e., the fact that a and b are both true) should result in reconstructing s = {a, b}. Surprisingly, this problem does not stem from the way we compute µ and LC. It is mainly due to the structure of the LTL3 monitor as defined in Definition 3. Although the definition works for centralized monitoring, it needs to be refined for distributed monitors that have only a partial view of the underlying system. In Section 5.4, we present a technique to transform an LTL3 monitor into an equivalent one capable of encoding enough information for monitor processes with partial views.

5.4 Monitor Transformation Algorithm

The discussion in Section 5.3 reveals that the source of the problem lies in the structure of the monitor in Fig. 5.1.
The self-loop on state q0 prescribes that state q0 is reachable by three events, {a}, {b}, or {}, while a partial view of {a, b} may intersect with both {a} and {b}, which are indistinguishable from each other. If we can somehow split q0 into two states to explicitly distinguish the cases where either a or b is true, then applying intersection will effectively solve the problem presented in Section 5.3.3. More specifically, consider the LTL3 monitor shown in Fig. 5.2 for the formula ϕ = ♦(a ∧ b), where state q0 is split into two states q01 and q02. State q02 is reached when a is true and b is false. Analogously, state q01 is reached when b is true, or when both a and b are false. Now, recall the two monitors M1 and M2 and their partial views in Section 5.3.3:

S_1^s(a) = true    S_1^s(b) = \
S_2^s(a) = \       S_2^s(b) = true

These monitors will compute µ as follows:

µ(S_1^s, q0) = {q02, q⊤}    µ(S_2^s, q0) = {q01, q⊤}

Applying intersection to µ(S_1^s, q0) and µ(S_2^s, q0) now results in the singleton {q⊤}, which is indeed the correct verdict for the global event {a, b}. We call the monitor shown in Fig. 5.2 an extended LTL3 monitor.

Figure 5.2: Extended LTL3 monitor for ϕ = ♦(a ∧ b).

In this section, we present an algorithm that takes as input an LTL3 monitor and generates as output an extended LTL3 monitor. We prove that by plugging an extended LTL3 monitor into the distributed RV Algorithm 6, it will produce a verdict identical to that of a centralized LTL3 monitor.

5.4.1 The Challenge of Constructing Extended Monitors

Let M^3_ϕ = ⟨Σ, Q, q0, δ, λ⟩ be the LTL3 monitor of an LTL formula ϕ. To simplify our notation, we denote transitions of δ by

q --L(q,q')--> q',

where the set L(q, q') of labels is formally defined as follows:

L(q, q') = { s ∈ Σ | δ(q, s) = q' }.

When it is clear from the context, we refer to the set of labels L(q, q') simply by L.
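Before turning to the construction, the effect of the split can be checked with a small sketch (hypothetical Python; the names delta3 and delta_ext encode the monitors of Figs. 5.1 and 5.2, and the state names mirror the figures):

```python
from itertools import combinations

AP = ["a", "b"]
# All global events, i.e., all subsets of AP.
EVENTS = [frozenset(c) for r in range(len(AP) + 1) for c in combinations(AP, r)]

def consistent(view, s):
    # Definition 10: every known proposition agrees with the event.
    return all(view[p] is None or view[p] == (p in s) for p in AP)

def mu(delta, view, q):
    # Equation (5.1): possible next monitor states under a partial view.
    return {delta(q, s) for s in EVENTS if consistent(view, s)}

def delta3(q, s):
    # Original LTL3 monitor of Fig. 5.1 for ♦(a ∧ b).
    return "qT" if q == "qT" or s == frozenset({"a", "b"}) else "q0"

def delta_ext(q, s):
    # Extended monitor of Fig. 5.2: q0 is split into q01 and q02
    # ({a} leads to q02; {b} and {} lead to q01).
    if q == "qT" or s == frozenset({"a", "b"}):
        return "qT"
    return "q02" if s == frozenset({"a"}) else "q01"

v1 = {"a": True, "b": None}   # M1 reads only a
v2 = {"a": None, "b": True}   # M2 reads only b

# Original monitor: the intersection stays {q0, qT} and cannot resolve.
assert mu(delta3, v1, "q0") & mu(delta3, v2, "q0") == {"q0", "qT"}
# Extended monitor: the intersection is the singleton {qT}.
assert mu(delta_ext, v1, "q01") & mu(delta_ext, v2, "q01") == {"qT"}
```

The sketch reproduces the motivating example: splitting q0 makes the two partial views land in different non-accepting states, so intersecting the symbolic views recovers the centralized verdict.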
Now, suppose that AP = {a, b, c}, an LTL3 monitor has a transition of the form

q0 --{a},{b,c},{a,c}--> q1,

the global event is s = {a, b, c}, and the partial view of each process Mi, where i ∈ [1, n], has the value of at most one atomic proposition (i.e., the values of the other propositions are unknown). It is straightforward to see that for any global event s ∈ Σ − {{a}, {b, c}, {a, c}}, the monitor state q1 appears in the symbolic view of every monitor process Mi, i.e., q1 ∈ µ(S_i^s, q0), and consequently it is impossible for LS_i to become a singleton. Note that q1 is not the correct verdict. Hence, we need to split q1 into two new states q11 and q12, which can be done in one of the following ways:

(1) q0 --{a},{b,c}--> q11 and q0 --{a,c}--> q12
(2) q0 --{a}--> q11 and q0 --{b,c},{a,c}--> q12
(3) q0 --{a},{a,c}--> q11 and q0 --{b,c}--> q12

In scenarios (1) and (2) above, we further need to split q11 and q12, respectively. But in scenario (3), there is no need to split q11 or q12. Thus, the choice of splitting the monitors' blind spot has an impact on the size of the extended LTL3 monitor. In order to minimize the number of new states that are added to the extended LTL3 monitor, we need to compute the minimum-size split. Finding the minimum-size split is a combinatorial optimization problem very similar to the set cover or the hitting set problems [GJ79]. In the next subsection, we present an SMT-based technique to obtain the minimum-size transition split.

5.4.2 Identifying the Minimum-size Split

Definition 12. We say that a transition q --L--> q' covers an event s ∈ Σ if and only if

∀p ∈ AP : ∃s' ∈ L : (p ∈ s ⇔ p ∈ s').

Observe that if a transition covers an event, it does not mean that the event is in the label set of the transition; it only means that all of its propositions are covered.

Definition 13. We say that an event s is opaque to a transition q --L--> q' if (1) s ∉ L, but (2) q --L--> q' covers s.
For example, the event {a, b} is opaque to the self-loop q0 --{a},{b},∅--> q0 in the LTL3 monitor in Fig. 5.1. It is easy to observe that two partial views of an event opaque to a transition may result in identical sets of possible LTL3 monitor states. When one monitor reads only a and another monitor reads only b, the resulting sets of possible states (i.e., {q0, q⊤}) are indistinguishable from each other, because both propositions a and b are in the event {a, b}. Indeed, this is the main source of ambiguity for distributed monitor processes with partial views, and such transitions need to be split in order to resolve possible ambiguities. The function SPLIT (see Algorithm 7) determines whether or not a transition should be split. The variable CV in the function counts the propositions whose value varies across the input label set, so that 2^CV is the number of events covered by the transition. In the above example, the value of 2^CV for the transition q0 --{a},{b},∅--> q0 is 4, which is strictly greater than |L| = 3. This means that the transition needs to be split.

Algorithm 7: Function to determine whether a transition has to split.
1: function SPLIT(L)
2:   CV ← 0
3:   for each p ∈ AP do
4:     if (∃s, s' ∈ L. p ∈ s ∧ p ∉ s') then
5:       CV ← CV + 1
6:     end if
7:   end for
8:   if (2^CV > |L|) then
9:     return true
10:  end if
11:  return false
12: end function

Our goal is to minimize the number of splits of a transition, as the number of splits determines the final size of the extended LTL3 monitor. Formally, given an event s ∈ Σ opaque to a transition q --L--> q', we aim at splitting the transition into transitions q --L1--> q1 to q --Ln--> qn such that (1) ⋃_{i∈[1,n]} Li = L, (2) s is opaque to none of these transitions, and (3) n is minimum. It is straightforward to see that this is a combinatorial optimization problem that involves generating all subsets of L to find the best choice of L1 to Ln, i.e., a bad choice can result in more future splits.
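The split test and the optimization problem can be illustrated with a hypothetical Python sketch (labels encoded as frozensets of the propositions that hold; the exhaustive search below is an illustrative stand-in for the SMT-based technique, feasible only for small label sets):

```python
from itertools import combinations

def varying(L, AP):
    # Propositions whose value is not uniform across the label set L.
    return [p for p in AP
            if any(p in s for s in L) and any(p not in s for s in L)]

def split_needed(L, AP):
    """Function SPLIT (Algorithm 7): with CV varying propositions, the
    transition covers 2**CV events; it must be split when it covers
    more events than it lists."""
    return 2 ** len(varying(L, AP)) > len(L)

def best_bipartition(L, AP):
    """Brute-force stand-in for the SMT instance: split L into
    non-empty L1, L2 minimizing the total number of varying
    propositions (the quantity the w-variables count)."""
    L = list(L)
    best = None
    for r in range(1, len(L)):
        for picked in combinations(range(len(L)), r):
            L1 = [L[i] for i in picked]
            L2 = [L[i] for i in range(len(L)) if i not in picked]
            cost = len(varying(L1, AP)) + len(varying(L2, AP))
            if best is None or cost < best[0]:
                best = (cost, L1, L2)
    return best

# The self-loop of Fig. 5.1 with labels {a}, {b}, {} covers 2^2 = 4
# events but lists only 3, so it must be split:
assert split_needed([frozenset({"a"}), frozenset({"b"}), frozenset()], ["a", "b"])

# For the transition of Section 5.4.1 with labels {a}, {b,c}, {a,c},
# the optimum isolates {b,c}, matching scenario (3): neither part
# then needs a further split.
cost, L1, L2 = best_bipartition(
    [frozenset({"a"}), frozenset({"b", "c"}), frozenset({"a", "c"})],
    ["a", "b", "c"])
assert not split_needed(L1, ["a", "b", "c"])
assert not split_needed(L2, ["a", "b", "c"])
```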
To solve this problem, we transform it into an SMT instance in order to utilize powerful SMT solvers. We now define the constants, variables, constraints, and the optimization objective of our SMT instance. The input is a transition q --L--> q' and the output is two transitions q --L1--> q1 and q --L2--> q2 such that a minimum number of global events is opaque to the transitions. In other words, L = L1 ∪ L2 and L1 ∩ L2 = ∅, such that we minimize the number of new states to be created.

Constants. For every atomic proposition p ∈ AP and every global event s ∈ L, we employ a Boolean constant a_s^p defined as follows:

a_s^p = true if p ∈ s, and a_s^p = false if p ∉ s.

Variables and functions. For every global event s ∈ L, we define two Boolean variables x_s^{L1} and x_s^{L2}, meaning that x_s^{L1} = true if s ∈ L1, and x_s^{L1} = false otherwise. Likewise, x_s^{L2} = true if s ∈ L2, and x_s^{L2} = false otherwise. We define an operator ◦ between a Boolean variable x and a constant a as follows:

x ◦ a = a if x = true, and x ◦ a = true if x = false.

For each atomic proposition p ∈ AP, we introduce two Boolean variables y_{L1}^p and y_{L1}^{¬p} with the following meaning:

y_{L1}^p = true if ∀s ∈ L1 : p ∈ s, and false otherwise;
y_{L1}^{¬p} = true if ∀s ∈ L1 : p ∉ s, and false otherwise.

Analogously, for each atomic proposition p ∈ AP, we introduce Boolean variables y_{L2}^p and y_{L2}^{¬p}. We also include two Boolean variables v_{L1}^p and v_{L2}^p, whose meaning is explained later in the set of SMT constraints. Finally, for each atomic proposition p ∈ AP, we define two binary integer variables w_{L1}^p and w_{L2}^p (for the purpose of counting and optimization) as follows:

w_{L1}^p = 0 if v_{L1}^p = true, and 1 otherwise;
w_{L2}^p = 0 if v_{L2}^p = true, and 1 otherwise.

Constraints. Informally, an event appears either in L1 or in L2.
Hence, we add the following constraint for each s ∈ L:

x_s^{L2} = ¬x_s^{L1}.

The constraints encoding the meaning of the variables y_{L1}^p and y_{L1}^{¬p} are as follows:

y_{L1}^p = ⋀_{s∈L} (x_s^{L1} ◦ a_s^p)
y_{L1}^{¬p} = ⋀_{s∈L} (x_s^{L1} ◦ a_s^{¬p})

It is easy to verify that y_{L1}^p evaluates to true if and only if for every event s ∈ L1 we have p ∈ s, and y_{L1}^{¬p} evaluates to true if and only if for every event s ∈ L1 we have p ∉ s. Likewise, for the variables y_{L2}^p and y_{L2}^{¬p}, we add the following constraints:

y_{L2}^p = ⋀_{s∈L} (x_s^{L2} ◦ a_s^p)
y_{L2}^{¬p} = ⋀_{s∈L} (x_s^{L2} ◦ a_s^{¬p})

Finally, we need to detect the propositions whose value is uniform within L1 (respectively, L2). Hence, we add the following assertions:

v_{L1}^p = y_{L1}^p ∨ y_{L1}^{¬p}
v_{L2}^p = y_{L2}^p ∨ y_{L2}^{¬p}

Optimization objective. Our objective is to minimize the total number of propositions whose value is not uniform across the transition labels L1 and L2:

min Σ_{p∈AP} (w_{L1}^p + w_{L2}^p)

We remark that although SMT solvers cannot directly handle optimization objectives such as the above, a common practice is to find the minimum of the above sum using a simple binary search over a coarse range.

5.4.3 The Complete Transformation Algorithm

We now know how to split a transition into two transitions with a minimum number of opaque events. All we need to do at this point is to design an algorithm that takes as input an LTL3 monitor M^3_ϕ = ⟨Σ, Q, q0, δ, λ⟩ and transforms it into an extended monitor M^e_ϕ = ⟨Σ, Q_e, q0_e, δ_e, λ_e⟩ as output, using the above SMT-based optimization technique. We now describe the details of this transformation in Algorithm 8:

• In Lines 2–29, we examine each outgoing transition of each state q of the input LTL3 monitor for splitting.

• If a transition does not need to be split, we simply add the original transition to the extended monitor (Lines 26 and 27).

• For each transition that should be split, we apply the SMT-based optimization technique described in Section 5.4.2.
We first add the new states to the set of states of the extended monitor (Line 7). Then we distinguish two cases:

– If the transition that needs to be split, say q --L--> q', is not a self-loop (Lines 10–13), then two transitions q --L1--> q1 and q --L2--> q2, with the labels returned by the SMT solver, are included in the extended monitor (see Fig. 5.3). We also add all the outgoing transitions of q' to q1 and q2 (Line 13).

– If the transition that needs to be split is a self-loop, say q --L--> q (Lines 15–20), then the transitions q1 --L1--> q1 and q1 --L2--> q2 (and symmetrically from q2), with the labels returned by the SMT solver, are included in the extended monitor (see Fig. 5.4). We also add all the outgoing transitions of q to q1 and q2 (Line 20), for the events not in the original self-loop.

Algorithm 8: Extended LTL3 Monitor Construction.
Input: M^3_ϕ = ⟨Σ, Q, q0, δ, λ⟩
Output: M^e_ϕ = ⟨Σ, Q_e, q0_e, δ_e, λ_e⟩
1: Q_e ← Q
2: for every q ∈ Q_e do
3:   L_q ← {L(q, q') | ∃q' ∈ Q. q --L--> q'}
4:   for every L(q, q') ∈ L_q do
5:     if SPLIT(L(q, q')) then
6:       {L(q, q1), L(q, q2)} ← SMT(L(q, q'))
7:       Q_e ← (Q_e ∪ {q1, q2}) − {q'}
8:       L_q ← L_q ∪ {L(q, q1), L(q, q2)}
9:       λ_e(q1), λ_e(q2) ← λ(q')
10:      if q ≠ q' then
11:        δ_e(q, s) ← q1 for all s ∈ L(q, q1)
12:        δ_e(q, s) ← q2 for all s ∈ L(q, q2)
13:        δ_e(q1, s), δ_e(q2, s) ← δ(q', s) for all s ∈ Σ
14:      end if
15:      if q = q' then
16:        δ_e(q1, s) ← q1 for all s ∈ L(q, q1)
17:        δ_e(q1, s) ← q2 for all s ∈ L(q, q2)
18:        δ_e(q2, s) ← q1 for all s ∈ L(q, q1)
19:        δ_e(q2, s) ← q2 for all s ∈ L(q, q2)
20:        δ_e(q1, s), δ_e(q2, s) ← δ(q', s) for every s ∈ Σ − L(q, q')
21:      end if
22:      for every q'' such that δ(q'', s) = q' do
23:        δ_e(q'', s) ← q1
24:      end for
25:    else
26:      δ_e(q, s) ← q' for every s ∈ L(q, q')
27:      λ_e(q') ← λ(q')
28:    end if
29:    L_q ← L_q − {L(q, q')}
30:  end for
31: end for

– Finally, we include the incoming transitions to each state (Lines 22–24) and remove the labels that have no opacity issues (Line 29).
• We repeat the loop until no transition needs to be split.

Figure 5.3: Splitting a transition into two.
Figure 5.4: Splitting a self-loop into two.

The reader can verify that running Algorithm 8 on the LTL3 monitor in Fig. 5.1 results in the extended LTL3 monitor in Fig. 5.2. We now show the soundness of Algorithm 6 (as defined in the problem statement in Section 5.2.4) when augmented with an extended LTL3 monitor as constructed by Algorithm 8.

Lemma 6. Let α ∈ Σ* be a finite trace and ϕ be an LTL formula with M^3_ϕ = ⟨Σ, Q, q0, δ, λ⟩ as the LTL3 monitor. Algorithm 8 yields M^e_ϕ = ⟨Σ, Q_e, q0_e, δ_e, λ_e⟩ such that

λ(δ(q0, α)) = λ_e(δ_e(q0_e, α)).

Proof. Let α = s0 s1 ··· sn. We prove that for any i ∈ [0, n], if q --si--> q1 is a transition of δ, then δ_e has a transition q --si--> q1' such that λ(q1) = λ_e(q1').

Case 1: q = q1. (⇒) Suppose q1 was split into multiple states, one of which is q1'. As can be seen in Lines 16–20 of Algorithm 8, the state q' is split into q1 and q2, and the self-loop is preserved by having loops within the states it was split into. Also, in Lines 22–24, all outgoing and incoming edges of q' are preserved, with the label of q' being transferred to both q1 and q2. Thus, λ(q1) = λ_e(q1'). (⇐) Trivial.

Case 2: q ≠ q1. (⇒) Suppose q1 was split into multiple states, one of which is q1'. As can be seen in Lines 10–13 of Algorithm 8, the state q' is split into q1 and q2, and the transitions are preserved by having a transition from q to each of q1 and q2. Also, in Lines 22–24, all outgoing and incoming edges of q' are preserved, with the label of q' being transferred to both q1 and q2. Thus, λ(q1) = λ_e(q1'). (⇐) Trivial.

Thus, λ(δ(q0, α)) = λ_e(δ_e(q0_e, α)).

Lemma 7. Let α ∈ Σ* be a finite trace and ϕ be an LTL formula.
The return value of Algorithm 6, augmented with an extended LTL3 monitor as constructed by Algorithm 8, is [α |=3 ϕ] at every monitor process, in the presence of up to t crash failures.

Proof. We prove Lemma 7 in three steps, similar to the proof technique for consensus in synchronous networks (e.g., the FloodSet algorithm) [Lyn96]. First, we prove that at the end of the inner for-loop, LS includes only one state. Then, we show that if no crash faults occur, in one round all monitors compute a monitor state q, where λ(q) is the same as what a centralized monitor that could read the global event in one atomic step would compute. Finally, we show that if up to t monitors crash, all active monitors return λ(q) as described in the previous step. We now delve into these three steps:

• Step 1. Let us assume that the monitor processes in M are evaluating event sj for some j ∈ [0, k]. Formally, we show that if no crash faults occur, then in Line 10 of Algorithm 6 we have |LS_i^1| = 1 for all i ∈ [1, n]. First, note that if no faults occur, all monitors send and receive all the messages in one clean round; thus, in the subsequent rounds all messages will be identical. We now prove this claim by contradiction. Suppose we have |LS_i^1| = 2 (the case for more than 2 can be trivially generalized). This means that at least two monitor processes sent a message containing two possible LTL3 monitor states, say {q1, q2}. This can be due to two scenarios:

– The first scenario is that q1 and q2 are possible LTL3 monitor states because the value of some atomic proposition p ∈ AP is unknown, i.e., S(p) = \. However, this scenario contradicts our assumption of event coverage (see Section 5.2) in our computation model.

– The second scenario is that q1 and q2 are possible LTL3 monitor states because sj is opaque to some outgoing transition of q_current in the LTL3 monitor. This case contradicts our construction of the extended LTL3 monitor in Algorithm 8.

• Step 2.
We prove this step by induction on the length of the finite input trace. The base case is that the monitors are evaluating event s0 and q_current = q0. From Step 1 of the proof, we know that |LS_i^1| = 1. We also know that |LS_i^r| = 1 (for all r ∈ [1, t + 1]) and that LS_i^r has the same content as LS_i^1. Let this content be an LTL3 monitor state q. Our goal is to show that λ(q) = [s0 |=3 ϕ]. The proof, again, is by contradiction. Suppose that the intersection of all possible monitor states yields a state q with q ≠ δ(q0, s0) and λ(q) ≠ [s0 |=3 ϕ]. This can happen only if, due to opacity, a wrong monitor state comes out of the intersection, which contradicts our construction of the extended LTL3 monitor in Algorithm 8. Hence, q is the monitor state that a centralized monitor would compute. The induction step is now trivial: it is straightforward to show that for any valid q_current and any sj, the next monitor state is the same as what a centralized monitor would compute.

• Step 3. From Steps 1 and 2, we know that if no faults occur, in one round all monitors compute one and only one LTL3 monitor state q, where λ(q) = [α |=3 ϕ]. Now we show that in a fault-prone scenario, in some round 1 ≤ r ≤ t + 1, any two active monitors Mi and Mj compute the same single monitor state LS_i^r = {q}, where λ(q) = [α |=3 ϕ]. Since there are at most t crash failures, there has to be some round r where no failures occur. Recall from Section 5.2 that we assume that if a monitor crashes and it is the only one aware of some proposition p ∈ AP, this monitor sends a message containing its set of possible monitor states before crashing. This assumption ensures event coverage. This means that in any round r ≤ r' ≤ t + 1, the values of all propositions are read. This in turn implies that all rounds r' are identical to a fault-free setting and, hence, Steps 1 and 2 hold.
These three steps prove the soundness of Algorithm 6 when augmented by an extended LTL3 monitor as constructed by Algorithm 8.

We now extend our technique to monitors that evaluate a formula every l ≥ 1 global states rather than after every global state. That is, the for-loop in Algorithm 6 iterates ⌊k/l⌋ times and, instead of a single partial view S_i^{s_j}, it evaluates a sequence of partial views S_i^{s_0} S_i^{s_1} ··· S_i^{s_{l−1}}, and so forth; hence, the monitors communicate every l states (rather than after every single state). To this end, let us recursively extend µ from a single partial view and a monitor-state transition (i.e., µ(S, q) as defined in Section 5.3) to a sequence of partial views S_i^{s_0} S_i^{s_1} ··· S_i^{s_{l−1}} and a set of monitor states Q′ ⊆ Q as follows (denoted µ^l):

µ^l(S_i^{s_0} S_i^{s_1} ··· S_i^{s_{l−1}}, Q′) = µ^1(S_i^{s_{l−1}}, µ^{l−1}(S_i^{s_0} S_i^{s_1} ··· S_i^{s_{l−2}}, Q′)).

Theorem 1. Let ϕ be an LTL formula, α ∈ Σ* with |α| = k, and l a natural number, where l ≤ k. Given the generalization of µ to µ^l, the output of Algorithm 6 for µ^l is [α |=3 ϕ].

Proof. We prove the theorem by induction over l. The base case (i.e., l = 1) trivially holds by Lemma 7. For the inductive step, let the statement of the theorem be true for l, meaning that the verdict of the algorithm for length l is the same as the verdict of an LTL3 monitor. We have to show that it also holds for l + 1. This case is also discharged by Lemma 7, since state-by-state evaluation results in the correct LTL3 evaluation.

Theorem 2. Let ϕ be an LTL formula and α ∈ Σ* be a finite trace. The message complexity of Algorithm 6 using an extended LTL3 monitor is

O( log(|M_ϕ^3| · |AP|) · n²(t + 1)|α| ),

where n is the number of distributed monitors.

Proof. We analyze the complexity of each part of Algorithm 6:

• The algorithm has a nested loop. The outer loop iterates exactly |α| times.
• The inner loop iterates exactly t + 1 times.
• In the inner loop, each monitor process sends n messages to all other monitors and receives n messages from all other monitors, i.e., n² messages per round.

This makes a total of |α|(t + 1)n² messages throughout the algorithm. We now focus on the size of each message. Let M_ϕ^3 = ⟨Σ, Q, q_0, δ, λ⟩ be an LTL3 monitor and M_ϕ^e = ⟨Σ, Q^e, q_0^e, δ^e, λ^e⟩ be its extended monitor constructed by Algorithm 8. The algorithm may split a transition at most |AP| times. Hence, we have |Q^e| ≤ 2|Q| · |AP|. Recall that each message contains the possible states of the extended LTL3 monitor. This means each message in Algorithm 6 needs

O( log(|Q| · |AP|) )

bits. Recall that the size of an LTL3 monitor is the number of its states, i.e., |M_ϕ^3| = |Q|. Hence, the message complexity is

O( log(|M_ϕ^3| · |AP|) · |α|(t + 1)n² ).

We note that if the distributed monitors verify the finite computation α every l states (see Theorem 1), then the |α| factor reduces to ⌈|α|/l⌉.

Theorem 3. Rather than going through t + 1 rounds of communication with peer monitors, each monitor needs to go through only k + 1 rounds, where k denotes the maximum number of monitor crashes that are still possible in a particular state, without loss of any information or correctness.

Proof. We first examine why t + 1 rounds are needed to reach a common conclusion in the first place: they accommodate monitor crashes during communication so that no information is lost. We need t + 1 rounds because the system can suffer at most t monitor crashes. Here, we consider a synchronous system, i.e., all the monitors share the same global clock; thus, whenever a monitor does not receive a message from another monitor, the former considers that the latter has crashed. This holds under our assumptions that a crashed monitor cannot revive itself and that the network is clean, i.e., all messages sent are received and none is lost in transmission.
For the first state, the maximum number of possible crashes is t. But for any subsequent state, the maximum number of possible monitor crashes depends on the number of crashes that have already taken place in the states leading up to it. For example, for the i-th state, the maximum number of possible monitor crashes is k = t − c, where c denotes the number of monitors that already crashed during the previous i − 1 states. Thus, we need to go through only k + 1 rounds, accounting for the maximum of k crashes that are possible in the present state.

5.5 Experimental Results

In this section, we present the results of our experiments on monitoring formulas with respect to a synthetic model of the system, and on monitoring correctness and behavioral specifications on the Orange4Home [CLRC17] dataset for IoT.

5.5.1 Synthetic Experiments

Setup. We evaluate our decentralized system using different LTL formulas generated from the specification patterns in [Dwy20]. The corresponding monitors are generated using LTL3 tools [BLS11]. Each of the following experiments was conducted on the following combinations of the total number of monitors in the system and the maximum number of crashes (t) that the system may suffer:

• # of Monitors = 10; t = 4, 5, 6, 7, 8
• # of Monitors = 20; t = 10, 12, 14, 16, 18
• # of Monitors = 30; t = 10, 15, 20, 25, 28

We also extend our setting of the system under observation by considering different probability distributions (uniform, Bernoulli (0.1), and Bernoulli (0.9)) for different aspects of the system, namely: the read distribution of an atomic proposition over the set of all monitors, and the crash distribution of a monitor given the execution state. The number of crashes per state is controlled by a right-skewed normal distribution N(µ = 0, σ = 1.5) where all samples are made positive and rounded to the nearest integer. A monitor may crash at two different points during its execution.
The first is immediately after having read the state of the system, and the second is while communicating. If a monitor crashes immediately after reading the state of the system, i.e., before communicating with the rest of the monitors, we assume that there exists at least one other monitor that read the same atomic propositions. This ensures that the value of an atomic proposition is not lost with the monitor that crashed. On the other hand, if a monitor crashes while communicating, we assume that it was able to send its partial observation to at least one other monitor that did not crash in the same round. This, too, ensures that information about the state of the execution is not lost when a monitor crashes.

As can be seen in Fig. 5.5, the distribution of monitor crashes for Bernoulli (0.9) is more left-skewed than for the uniform distribution. This is because under Bernoulli (0.9) the likelihood of a monitor crashing is higher than under the uniform distribution, where it is 0.5. A higher crash likelihood makes monitors crash earlier, until the system reaches the maximum number of crashes allowed. We also notice that the likelihood of a monitor crashing depends on the read distribution of the atomic propositions over the monitors: more monitors read an atomic proposition when reads are distributed uniformly than under Bernoulli (0.1).

Figure 5.5: Crash distribution over a trace of length 100, for the read/crash distribution pairs (uniform, uniform), (Bernoulli (0.1), uniform), (Bernoulli (0.1), Bernoulli (0.9)), and (uniform, Bernoulli (0.1)).

As mentioned earlier, a monitor only crashes if there exists another monitor that has read the same atomic propositions. Thus, the likelihood of a monitor crashing is higher for a uniform read distribution than for Bernoulli (0.1).
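The read-distribution setup can be sketched as follows. This is a Python sketch under our own simplifying assumptions: `p` plays the role of the Bernoulli parameter, and the patch step enforces the event-coverage assumption that every proposition is read by at least one monitor.

```python
import random

def assign_reads(props, monitors, rng, p=0.5):
    """Decide which monitors read each atomic proposition; every
    proposition must end up with at least one reader (event coverage)."""
    reads = {m: set() for m in monitors}
    for ap in props:
        readers = [m for m in monitors if rng.random() < p]
        if not readers:                      # coverage: force one reader
            readers = [rng.choice(monitors)]
        for m in readers:
            reads[m].add(ap)
    return reads
```

Regardless of how skewed the distribution is, the union of all monitors' read sets always equals the full set of propositions, which is exactly the coverage property the experiments rely on.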
The partial view of a monitor must be such that the global observation equals the union of the partial views of all the monitors taken together. If the global observation is denoted by GS_j, then the partial observation S_i^{s_j} of monitor i must satisfy:

GS_j = ⋃_{i=1}^{n} S_i^{s_j}

This condition is necessary, as it guarantees that the entire global observation is observed by all monitors taken together. Similar to the tool DECENT-MON [BF12], we test each system configuration on three different traces, where the probability of occurrence of an atomic proposition in a state is controlled by a uniform distribution and by Bernoulli distributions with parameters 0.1 and 0.9. In our experiments, we study and report on the following metrics:

• The average number of rounds needed to traverse the entire trace sequence.
• The number of messages, #msg., exchanged between monitors.
• The average size of a message, size (msg.), exchanged between the monitors.
• The number of monitor crashes the system was subjected to.

All of our experiments are repeated sufficiently many times to ensure a 95% confidence interval.
Table 5.1: List of formulas used to check our algorithm. For each of the 55 formulas, drawn from the specification-pattern classes Absence, Existence, Bounded Existence, Universality, Precedence, Response, Precedence Chain, Response Chain, and Constrained Chain of [Dwy20], the table reports the monitor size before and after the extension of Algorithm 8, and the relative change. Representative rows (the formulas discussed below): ϕ4: 4 → 15 (2.75×); ϕ17: 4 → 4 (0×); ϕ38: 5 → 18 (2.6×); ϕ51: 1 → 1 (0×).

Analysis of Results. As mentioned earlier, we have put our system to the test with respect to all the LTL formulas of the specification patterns in [Dwy20] under all the different scenarios explained above; to avoid redundancy among similar observations and for reasons of space, below we only discuss results for the following LTL formulas (the full list of formulas from [Dwy20] can be found in Table 5.1):

ϕ4 = ((q ∧ ¬r ∧ r) → (¬p U r))
ϕ17 = r → (p U r)
ϕ38 = (¬q) ∨ ((¬q) U (q ∧ (((s ∧ t)) → ((¬s) U p)))
ϕ51 = (p → (s ∧ ¬z ∧ (¬z U t)))

Impact of monitor crashes. As expected, a higher number of monitor crashes results in an increase in the average number of rounds when monitoring. In Fig. 5.6a, for LTL formula ϕ4, we observe that the average number of rounds improves significantly when accounting for only the number of crashes that are still possible in a given state of the execution. For example, in a system with t = 8 and with the read and crash distributions being binomial and uniform, respectively, the average number of rounds is only around 3 (reduced from the usual 8). In Fig.
5.6b, we see for ϕ4 that as the number of monitor crashes increases, the number of messages exchanged among the monitors increases as well: in each round, every monitor in the system shares its observation with the other monitors, making the total number of messages directly proportional to both the number of monitors present and the number of rounds. Following our setup described in Fig. 5.5, the distribution of crashes also has an effect on the average number of rounds and on the number of messages passed in the system. The more left-skewed the distribution of monitor crashes, the fewer rounds on average are required for the monitors to reach a consensus. This is because a left-skewed crash distribution means that the mean number of monitors present in the system is low, and thereby both the number of rounds and the number of messages are lower.

Figure 5.6: Average # of rounds (a) and total # of messages sent (b) for different read and crash distributions, for a flip-flop distributed trace for ϕ4 with l = 1.

Communication after l states: We test our algorithm on different values of l, starting from 1, when the communication between monitors takes place after every state, and going all the way to 50, when the monitors communicate only twice for a trace of length 100. As stated in Theorem 1, the correctness of the protocol is not affected by changing the value of l; however, as seen in Fig.
5.7, for different LTL specifications, the average number of rounds and the average number of messages decrease with increasing values of l. For lower l, communication takes place more often than for higher values of l, which accounts for the higher number of rounds and messages.

The average size of messages increases with an increase in the value of l. This is because the size of a message depends on the number of states present in the local observation of a monitor. With communication happening after l states, the local observation consists of more states than when communication happens after every state. This can be seen when comparing the results of Fig. 5.7c for the different LTL formulas. The size of messages for ϕ38 is substantially larger than that of the others, due to the larger number of states in its extended LTL3 monitor automaton along with its higher number of atomic propositions. We also see that increasing the value of l decreases the number of monitor crashes: with a larger l, communication happens only after every l states, which decreases the number of communication rounds and thereby the number of monitor crashes. Taking all the plots into consideration, we observe that the benefit from the lower number of rounds and messages outweighs the drawback of the increase in message size for any value of l ≥ 5.

5.5.2 Orange4Home Dataset

Orange4Home [CLRC17] is a dataset capturing routines of daily living in Amiqual4Home's smart home environment. It is the result of a joint work between Orange Labs and Inria. The dataset consists of around 180 hours of recordings of the activities of daily living of a single occupant, spanning 4 consecutive weeks of work days. The dataset contains recordings of a total of 236 sensors scattered throughout the apartment, covering 20 different classes of activities.
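The ADL specifications we monitor on this dataset are time-bounded (bounded-eventually over sensor propositions). A minimal sketch, under our own trace encoding (a finite trace as a list of dicts from sensor propositions to booleans), of evaluating a bounded eventually over the next k steps at every position:

```python
def eventually_within(trace, props, k):
    """Evaluate a bounded eventually at every position of a finite trace:
    out[j] is True iff some proposition in `props` holds at a state within
    the next k steps (positions j..j+k, truncated at the trace end)."""
    out = []
    for j in range(len(trace)):
        window = trace[j : j + k + 1]
        out.append(any(state.get(p, False) for state in window for p in props))
    return out
```

For instance, with props = ['cooktop', 'oven'] and k = 5 this checks, at each instant, whether a kitchen appliance becomes active within the next five time units.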
Figure 5.7: Impact of communicating after l states for various LTL formulas on synthetic data: (a) average # of rounds, (b) total # of messages sent, (c) average size of messages, (d) average number of crashes.

We divide all specifications into two categories: (1) behavioral correctness: monitoring the correctness of the different sensors; and (2) activity of daily living (ADL): monitoring the activity that the occupant is engaged in, using the values of the different sensors. In Fig. 5.8, we show the results for various values of l, keeping the read and crash distributions uniform and Bernoulli (0.1), respectively, and report on the number of rounds, the number of messages, the size of the messages, and the actual number of monitor crashes for a system with 30 monitors and t = 20, monitoring the following specifications:

ϕo4h_1 = (switch → (light U ¬switch))
ϕo4h_2 = ≤5 (cooktop ∨ oven)
ϕo4h_3 = ≤5 (kitchen sink ∨ kitchen fridge ∨ kitchen cupboard)
ϕo4h_4 = ≤5 (cooktop ∨ oven ∨ kitchen sink ∨ kitchen fridge ∨ kitchen cupboard ∨ kitchen dishwasher)

Formula   Size (Before)   Size (After)   Change (Times)
ϕo4h_1    3               3              1
ϕo4h_2    3               4              0.33
ϕo4h_3    3               6              1
ϕo4h_4    3               33             10

Table 5.2: Formulas from Orange4Home.

Figure 5.8: Impact of communicating after l states for various LTL formulas on data from the Orange4Home dataset: (a) average # of rounds, (b) total # of messages sent, (c) average size of messages, (d) average number of crashes.

First, we construct the equivalent LTL3 monitors using Algorithm 8. The change in the number of states of the final automata can be observed in Table 5.2. Monitoring ADL specifications involves the system keeping track of the passage of time, which is essential in monitoring a time-bounded specification, as is the case with ϕo4h_2 through ϕo4h_4. Apart from observations similar to those for the synthetic data for increasing values of l, we observe in Fig. 5.8 that specifications involving more atomic propositions incur a larger message size. The larger message size can be explained by Theorem 2, which shows that the message complexity when using an extended LTL3 monitor is directly proportional to |AP|. Additionally, a higher value of l decreases the number of communication rounds and thus accounts for fewer monitor crashes. In turn, fewer monitor crashes equate to more active monitors in the system, and therefore a higher number of rounds and messages.

5.6 Summary and Limitation

In this chapter, we proposed a runtime verification algorithm where a set of decentralized synchronous monitors, each with only a partial view of the underlying system, continually evaluate formulas in linear temporal logic (LTL). The non-deterministic nature of the evaluation procedure due to partial observations makes the current state of the execution ambiguous. Thus, we proposed an SMT-based transformation algorithm to obtain minimum-size LTL3 monitors. However, the synchronous nature of the distributed system limits the applicability of such an approach.
Also, as shown in Chapter 3, an automata-based approach often requires more results to be taken into consideration than needed. By contrast, a progression-based approach may reduce the number of automaton states that the monitors need to remember, thereby lowering the cost of communication by a considerable amount.

Chapter 6

Decentralized Runtime Verification for Stream-based Specifications

6.1 Introduction

In this chapter, we advocate for a runtime verification (RV) approach to monitor the behavior of a distributed system with respect to a formal specification. Applying RV to multiple components of an ICS can be viewed as the general problem of distributed RV, where centralized or decentralized monitors observe the behavior of a distributed system in which the processes do not share a global clock. Although RV deals with finite executions, the lack of a common global clock prevents a total ordering of events in a distributed setting. In other words, the monitor can only form a partial ordering of events, which may yield different evaluations. Enumerating all possible interleavings of the system at runtime incurs an exponential blow-up, making the approach not scalable. To add to this already complex task, a PLC often requires time-sensitive aggregation of data from multiple sources.

We propose an effective, sound, and complete solution to distributed RV for the popular stream-based specification language Lola [DSS+ 05].

(Submitted) Ritam Ganguly and Borzoo Bonakdarpour, Decentralized Runtime Verification of Stream-based Partially-Synchronous Distributed System, ACM SIGBED International Conference on Embedded Software (EMSOFT 2023).

Figure 6.1: Partially Synchronous LOLA — input streams x (values 3, 5, 6, 9) and y (values 1, 3, 5, 7), with the synchronous sums {4}, {8}, {11}, {16} and the partially synchronous value sets {4, 6, 8}, {8, 9, 10, 11}, {11, 13, 14, 16} induced by uncertainty windows of width 2(ε − 1).
Compared to other temporal logics, Lola can describe both correctness/failure assertions and statistical measures that can be used for system profiling and coverage analysis. As a high-level example of Lola, consider two input streams x and y and an output stream sum, as shown in Fig. 6.1. Stream x has the value 3 until time instance 2, when it changes to 5, and so on.

input x : uint
input y : uint
output sum := x + y

We consider a fault-free decentralized set of monitors, where each monitor has only a partial view of the system and no access to a global clock. In order to limit the blow-up of states caused by the absence of a global clock, we make the practical assumption of a bounded clock skew ε between all the local clocks, guaranteed by a clock synchronization algorithm (like NTP [Mil10]). This setting is known as partially synchronous. As can be seen in Fig. 6.1, any two events less than ε = 2 time units apart are considered concurrent, and thus the non-determinism of the time of occurrence of each event is restricted to ε − 1 on either side. When attempting to evaluate the output stream sum, we need to take into consideration all the possible times of occurrence of the values. For example, when evaluating the value of sum at time 1, we need to consider the value of x (resp. y) as 3 and 5 (resp. 1 and 3), which evaluates to 4, 6, and 8. The same can be observed for the evaluations across all time instances.

Our first contribution in this chapter is introducing a partially synchronous semantics for Lola. In other words, we define a partially synchronous Lola, which takes into consideration a clock skew of ε when evaluating a stream expression. Second, we introduce an SMT-based associated-equation rewriting technique over a partially observable distributed system, which takes into consideration the values observed by the monitor and rewrites the associated equations.
The monitors are able to communicate among themselves and resolve partially evaluated equations into completely evaluated ones. We have proved the correctness of our approach as well as upper and lower bounds on the message complexity. Additionally, we have fully implemented our technique and report the results of rigorous synthetic experiments, as well as of monitoring the correctness and aggregated results of several ICS. As identified in [ACZ20], most attacks on ICS components try to alter the values reported to the PLC in order to make the PLC behave erroneously. Through our approach, we were able to detect these attacks in spite of the clock asynchrony among the different components, with a deterministic guarantee. We also argue that our approach was able to evaluate system behavior aggregates that make studying these systems easier for a human operator. Unlike machine learning approaches (e.g., [PMA15b, PMA15a, BHBB+ 14]), our approach never raises false negatives. We put our monitoring technique to the test, studying the effects of different parameters on the runtime and on the size of the messages sent from one monitor to another, and report on each of them.

6.2 Partially Synchronous Lola

In this section, we extend the semantics of Lola to one that can accommodate reasoning about distributed systems.

6.2.1 Distributed Streams

Here, we refer to a global clock which acts as the "real" timekeeper. It is to be noted that this global clock exists only for theoretical reasons; it is not available to any of the individual streams. We assume a partially synchronous system of n streams, denoted by A = {α1, α2, · · · , αn}. For each stream αi, where i ∈ [1, |A|], the local clock can be represented as a monotonically increasing function ci : Z≥0 → Z≥0, where ci(G) is the value of the local clock at global time G.
Since we are dealing with discrete-time systems, for simplicity and without loss of generality, we represent time with the non-negative integers Z≥0. For any two streams αi and αj, where i ≠ j, we assume:

∀G ∈ Z≥0 . |ci(G) − cj(G)| < ε,

where ε > 0 is the maximum clock skew. The value of ε is constant and known (e.g., to a monitor). This assumption is met by the presence of an off-the-shelf clock synchronization algorithm, like NTP [Mil10], that ensures a bounded clock skew among all streams. The local state of stream αi at time σ is given by αi(σ), where σ = ci(G) is the local time of occurrence of the event at some global time G.

Definition 14. A distributed stream consisting of streams A = {α1, α2, . . . , αn} of length N + 1 is represented by the pair (E, ⇝), where E is the set of all local states (i.e., E = ∪i∈[1,n],j∈[0,N] αi(j)) partially ordered by Lamport's happened-before relation (⇝) [Lam78], subject to the partial-synchrony assumption:

• For every stream αi, 1 ≤ i ≤ |A|, all the events happening on it are totally ordered, that is, ∀i, j, k ∈ Z≥0 : (j < k) → (αi(j) ⇝ αi(k)).
• For any two streams αi and αj and two corresponding events αi(k), αj(l) ∈ E, if k + ε < l, then αi(k) ⇝ αj(l), where ε is the maximum clock skew.
• For events e, f, and g, if e ⇝ f and f ⇝ g, then e ⇝ g.

Definition 15. Given a distributed stream (E, ⇝), a subset of events C ⊆ E is said to form a consistent cut if and only if, whenever C contains an event e, it also contains all events that happened before e. Formally, ∀e, f ∈ E . (e ∈ C) ∧ (f ⇝ e) → f ∈ C.

The frontier of a consistent cut C, denoted front(C), is the set of events that happened last in each stream in the cut. That is, front(C) is the set of events αi(last) for each i ∈ [1, |A|] with αi(last) ∈ C, where αi(last) denotes the last event of αi in C, i.e., ∀αi(σ) ∈ C . (αi(σ) ≠ αi(last)) → (αi(σ) ⇝ αi(last)).
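These definitions translate directly into a small check. A Python sketch, under our own encoding: an event is a pair (stream index, local time), and the happened-before relation is the partial-synchrony one of Definition 14.

```python
def happened_before(e, f, eps):
    """Partial-synchrony happened-before between events e and f,
    each encoded as (stream index, local time)."""
    (i, k), (j, l) = e, f
    if i == j:
        return k < l          # same stream: total order
    return k + eps < l        # cross-stream: clock-skew rule

def is_consistent_cut(cut, events, eps):
    """C is consistent iff e in C implies every f with f ~> e is in C
    (Definition 15)."""
    return all(f in cut
               for e in cut
               for f in events
               if happened_before(f, e, eps))
```

For example, with ε = 2, the cut {(0, 0), (1, 0)} is consistent, while {(1, 3)} is not, because (0, 0) happened before (1, 3) but is missing from the cut.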
6.2.2 Partially Synchronous Lola

We define the semantics of Lola specifications for partially synchronous distributed streams in terms of the evaluation model. The absence of a common global clock among the stream variables, together with clock synchronization, means that an output stream can have multiple values at any given time instance. Thus, we update the evaluation model so that αi(j) and υ(ti)(j) are now defined as sets rather than single values. This is due to the nondeterminism caused by partial synchrony, i.e., the bounded clock skew ε.

Definition 16. Let ϕ be a Lola [DSS+ 05] specification over independent variables t1, · · · , tm of type T1, · · · , Tm and dependent variables s1, · · · , sn of type Tm+1, · · · , Tm+n, and let τ1, · · · , τm be streams of length N + 1, with τi of type Ti. The tuple of streams ⟨α1, · · · , αn⟩ of length N + 1 with corresponding types is called the evaluation model in the partially synchronous setting if, for every equation in ϕ:

si = ei(t1, · · · , tm, s1, · · · , sn),

⟨α1, · · · , αn⟩ satisfies the following associated equations:

αi(j) = { υ(ei)(k) | max{0, j − ε + 1} ≤ k ≤ min{N, j + ε − 1} }

where υ(ei)(j) is defined as follows. For the base cases:

υ(c)(j) = {c}
υ(ti)(j) = { τi(k) | max{0, j − ε + 1} ≤ k ≤ min{N, j + ε − 1} }
υ(si)(j) = αi(j)

For the inductive cases:

υ(f(e1, · · · , ep))(j) = { f(e′1, · · · , e′p) | e′1 ∈ υ(e1)(j), · · · , e′p ∈ υ(ep)(j) }
υ(ite(b, e1, e2))(j) = υ(e1)(j) if true ∈ υ(b)(j); υ(e2)(j) if false ∈ υ(b)(j)
υ(e[k, c])(j) = υ(e)(j + k) if 0 ≤ j + k ≤ N; {c} otherwise

input read : bool
input write : bool
output countRead := ite(read, countRead[-1,0] + 1, countRead[-1,0])
output countWrite := ite(write, countWrite[-1,0] + 1, countWrite[-1,0])
output check := (countWrite - countRead) <= 2

Example 1. Consider the above Lola specification, ϕ, over the independent boolean variables read and write. In Fig.
6.2, we have two input streams read and write, which denote the time instances at which the corresponding events take place. One can think of read and write as streams of type boolean with true values at time instances 4, 6, 7 and 2, 3, 5, 6, respectively, and false values at all other time instances. We evaluate the above Lola specification with a time synchronization constant ε = 2. The corresponding associated equations, ϕα, are:

countRead(j) = ite(read, 1, 0) if j = 0; ite(read, countRead(j − 1) + 1, countRead(j − 1)) for j ∈ [1, N)
countWrite(j) = ite(write, 1, 0) if j = 0; ite(write, countWrite(j − 1) + 1, countWrite(j − 1)) for j ∈ [1, N)
check(j) = (countWrite(j) − countRead(j)) ≤ 2

Figure 6.2: Partially Synchronous Lola Example — the occurrence times of read and write and the resulting sets of possible values of count(read), count(write), and check at each time instance.

Similar to the synchronous case, evaluation of a partially synchronous Lola specification involves creating the dependency graph.

Definition 17. A dependency graph for a Lola specification ϕ is a weighted directed multigraph G = ⟨V, E⟩, with vertex set V = {s1, · · · , sn, t1, · · · , tm}. An edge e : ⟨si, sk, w⟩ (resp. e : ⟨si, tk, w⟩) labeled with a weight w = {ω | p − ε < ω < p + ε} is in E iff the equation for αi(j) contains αk(j + p) (resp. τk(j + p)) as a sub-expression, for some j and offset p. Intuitively, the dependency graph records that the evaluation of si at a particular position depends on the value of sk (resp. tk), with an offset in w. It is to be noted that there can be more than one edge between a pair of vertices (si, sk) (resp. (si, tk)). Vertices labeled by ti do not have outgoing edges.

Example 2.
Consider the Lola specification over the independent integer variable a:

input a : uint
output b1 := b2[1, 0] + ite(b2[-1,7] <= a[1, 0], b2[-2,0], 6)
output b2 := b1[-1,8]

Its dependency graph, shown in Fig. 6.3 for ε = 2, has one edge from b1 to a with weight {0, 1, 2}. Similarly, there are three edges from b1 to b2 with weights {0, 1, 2}, {−2, −1, 0}, and {−3, −2, −1}, and one edge from b2 to b1 with weight {−2, −1, 0}.

Figure 6.3: Dependency Graph Example — vertices a, b1, b2; edge b1 → a with weight {0, 1, 2}; edges b1 → b2 with weights {0, 1, 2}, {−2, −1, 0}, {−3, −2, −1}; edge b2 → b1 with weight {−2, −1, 0}.

Given a set of partially synchronous input streams {α1, α2, · · · , α|A|} of respective types T = {T1, T2, · · · , T|A|} and a Lola specification ϕ, the evaluation of ϕ is given by

(α1, α2, · · · , α|A|) |=PS ϕ,

where |=PS denotes the partially synchronous evaluation.

6.3 Decentralized Monitoring Architecture

6.3.1 Overall Picture

We consider a decentralized online monitoring system comprising a fixed number |M| of reliable monitor processes M = {M1, M2, · · · , M|M|} that communicate with each other by sending and receiving messages through complete, point-to-point, bidirectional communication links. Each communication link is also assumed to be reliable, i.e., there is no loss or alteration of messages. Similar to the distributed system under observation, we assume that the clocks of the individual monitors are asynchronous, with clock synchronization constant εM. Throughout this section, we assume that the global distributed stream consisting of the complete observations of the |A| streams is only partially visible to each monitor. Each monitor process locally executes an identical sequential algorithm (we will generalize this approach in Section 6.6). An evaluation iteration of each monitor consists of the following steps:

1.
Reads a subset of the events in E (visible to M_i), along with the corresponding times and valuations of the events, which results in the construction of a partial distributed stream;

Algorithm 9: Behavior of a Monitor M_i, for i ∈ [1, |M|].
1: for j = 0 to N do
2:   Let (E_i, ⇝_i)_j be the partial distributed stream view of M_i
3:   LS_j ← [(E, ⇝) ⊨_PS ϕ_α]
4:   Send: broadcast symbolic view LS_j
5:   Receive: Π_j ← {LS^k_j | 1 ≤ k ≤ |M|}
6:   Compute: LS_{j+1} ← LC(Π_j)
7: end for

2. Each monitor evaluates the Lola specification ϕ given the partial distributed stream;

3. Every monitor broadcasts a message containing the rewritten associated equations of ϕ, denoted LS; and

4. Based on the received messages containing associated equations, each monitor amalgamates the observations of all the monitors to compose a set of associated equations.

After an evaluation iteration, each monitor will have the same set of associated equations to be evaluated on the upcoming distributed stream. The message sent from monitor M_i at time π to another monitor M_j, for all i, j ∈ [1, |M|], during an evaluation iteration is assumed to reach its destination at the latest by time π + ε_M. Thus, the length of an evaluation iteration can be adjusted to make sure the messages from all other monitors arrive before the start of the next evaluation iteration.

6.3.2 Detailed Description

We now explain the computation model in detail (see Algorithm 9). Each monitor process M_i ∈ M, where i ∈ [1, |M|], attempts to read each e ∈ E, given the distributed stream (E, ⇝). An event can either be observable or not observable. Due to distribution, this results in obtaining a partial distributed stream (E_i, ⇝), defined below.

Definition 18. Let (E, ⇝) be a distributed stream. We say that (E′, ⇝) is a partial distributed stream for (E, ⇝), and denote it by (E′, ⇝) ⊑ (E, ⇝), iff E′ ⊆ E (the happened-before relation is obviously preserved).
We now tie partial distributed streams to a set of decentralized monitors and the fact that decentralized monitors can only partially observe a distributed stream. First, every unobserved event is replaced by \, i.e., for all α_i(σ) ∈ E, if α_i(σ) ∉ E_i then E_i = E_i ∪ {α_i(σ) = \}.

Definition 19. Let (E, ⇝) be a distributed stream and M = {M_1, M_2, ..., M_|M|} be a set of monitors, where each monitor M_i, for i ∈ [1, |M|], is associated with a partial distributed stream (E_i, ⇝) ⊑ (E, ⇝). We say that these monitor observations are consistent if

• ∀e ∈ E. ∃i ∈ [1, |M|]. e ∈ E_i, and

• ∀e ∈ E_i. ∀e′ ∈ E_j. ((e = e′ ∧ e ≠ \) ⊕ (e = \ ∨ e′ = \)), where ⊕ denotes the exclusive-or operator.

In a partially synchronous system, there are different orderings of events, and each unique ordering of events might evaluate to different values. Given a distributed stream (E, ⇝), a sequence of consistent cuts is of the form C_0 C_1 C_2 ··· C_N, where for all i ≥ 0: (1) C_i ⊆ E, and (2) C_i ⊆ C_{i+1}.

Given the semantics of partially synchronous Lola, evaluation of an output stream variable s_i at time instance j requires events α_i(k), where i ∈ [1, |A|] and k ∈ {π | max{0, j − ε + 1} ≤ π ≤ min{N, j + ε − 1}}. To translate monitoring of a distributed stream to a synchronous stream, we make sure that the events in the frontier of a consistent cut C_j are these α_i(k). Let C denote the set of all valid sequences of consistent cuts. We define the set of all synchronous streams of (E, ⇝) as follows:

Sr(E, ⇝) = { front(C_0) front(C_1) ··· | C_0 C_1 ··· ∈ C }

Intuitively, Sr(E, ⇝) can be interpreted as the set of all possible "interleavings". The evaluation of the Lola specification ϕ with respect to (E, ⇝) is the following:

[(E, ⇝) ⊨_PS ϕ] = { (α_1, ..., α_n) ⊨_S ϕ | (α_1, ..., α_n) ∈ Sr(E, ⇝) }

This means that evaluating a partially synchronous distributed stream with respect to a Lola specification results in a set of evaluated results, as the computation may involve several streams.
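As an illustration of Definition 19, the two consistency conditions can be checked directly on concrete monitor views. The sketch below is our own illustration (not the dissertation's implementation): each view is encoded as a map from event identifiers to values, with \ standing for an unobserved event.

```python
HOLE = "\\"  # the dissertation's placeholder for an unobserved event


def consistent(views):
    """Check the two conditions of Definition 19 on a list of monitor views.

    Each view maps an event id to its observed value, or HOLE if the
    monitor did not observe that event.
    """
    all_events = set().union(*views)
    for e in all_events:
        vals = [v.get(e, HOLE) for v in views]
        # First condition: every event is observed by some monitor.
        if all(x == HOLE for x in vals):
            return False
        # Second condition: two monitors never observe different values
        # for the same event.
        if len({x for x in vals if x != HOLE}) > 1:
            return False
    return True
```

For instance, the views {a(1) = 3, b(1) = \} and {a(1) = \, b(1) = 5} are consistent, whereas two views that disagree on the value of a(1), or that both leave it unobserved, are not.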
This also enables reducing the problem from the evaluation of a partially synchronous distributed stream to the evaluation of multiple synchronous streams, each evaluating to unique values for the output stream, with message complexity O(ε^|A| · N · |M|²) in the worst case and Ω(N · |M|²) in the best case.

6.3.3 Problem Statement

The overall problem statement requires that, upon the termination of Algorithm 9, the verdict of all the monitors in the decentralized monitoring architecture is the same as that of a centralized monitor that has the global view of the system:

∀i ∈ [1, |M|] : Result_i = [(E, ⇝) ⊨_PS ϕ]

where (E, ⇝) is the global distributed stream, ϕ is the Lola specification, and Result_i is the result evaluated by monitor M_i.

6.4 Calculating LS

In this section, we introduce the rules for rewriting Lola associated equations given the evaluated results and observations of the system. In our distributed setting, evaluation of a Lola specification involves generating a set of synchronous streams and evaluating the given Lola specification on them (explained in Section 6.5). Here, we make use of this evaluation to form the local observation to be shared with the other monitors in the system. Given the set of synchronous streams (α_1, α_2, ..., α_|A|), the symbolic locally computed result LS (see Algorithm 9) consists of associated Lola equations that either need more information from other monitors before they can be evaluated (data was unobserved), or for which the concerned monitor needs to wait (positive offset). In either case, the associated Lola equations are shared with all other monitors in the system, as the missing data may have been observed by another monitor. We divide the rewriting rules into three cases, depending on the observability of the values of the independent variables required for evaluating the expression e_i, for all i ∈ [1, n]. Each stream expression is categorized as (1) completely observed, (2) completely unobserved, or (3) partially observed.
This can be done easily by going over the dependency graph and checking against the partial distributed stream read by the corresponding monitor.

Case 1 (Completely Observed). Formally, a completely observed stream expression s_i can be identified from the dependency graph G = ⟨V, E⟩ by checking that for all s_k (resp. t_k) with ⟨s_i, s_k, w⟩ ∈ E (resp. ⟨s_i, t_k, w⟩ ∈ E), the values s_k(j + w) ≠ \ (resp. t_k(j + w) ≠ \) are observed for time instance j. This signifies that all independent and dependent variables required to evaluate s_i(j) are observed by the monitor M, thereby evaluating

s_i(j) = e_i(s_1, ..., s_n, t_1, ..., t_m)

and rewriting s_i(j) to LS.

Case 2 (Completely Unobserved). Formally, we identify a completely unobserved stream expression s_i from the dependency graph G = ⟨V, E⟩ by checking that for all s_k (resp. t_k) with ⟨s_i, s_k, w⟩ ∈ E (resp. ⟨s_i, t_k, w⟩ ∈ E), the values s_k(j + w) = \ (resp. t_k(j + w) = \) are unobserved for time instance j. This signifies that the valuation of none of the variables is known to the monitor M. Thus, we rewrite the following stream expressions

s′_k(j) = s_k(j + w)   if 0 ≤ j + w ≤ N,   and default otherwise
t′_k(j) = t_k(j + w)   if 0 ≤ j + w ≤ N,   and default otherwise

for all ⟨s_i, s_k, w⟩ ∈ E and ⟨s_i, t_k, w⟩ ∈ E, and include the rewritten associated equation for evaluating s_i(j) as

s_i(j) = e_i(s′_1, ..., s′_n, t′_1, ..., t′_m)

It is to be noted that the default value of a stream variable s_k (resp. t_k) depends on the corresponding type T_k (resp. T_{m+k}) of the stream.

Case 3 (Partially Observed). Formally, we identify a partially observed stream expression s_i from the dependency graph G = ⟨V, E⟩ by checking that the s_k (resp. t_k) are partly observed and partly unobserved for time instance j. In other words, we can form a set V_o = {s_k | ∃ s_k(j + w) ≠ \} of all observed dependent stream variables and a set V_u = {s_k | s_k(j + w) = \} of all unobserved dependent stream variables, for all ⟨s_i, s_k, w⟩ ∈ E.
The sets can be expanded to include independent variables as well. All s_k ∈ V_u (resp. t_k ∈ V_u) that are unobserved are replaced by

s^u_k(j) = s_k(j + w)   if 0 ≤ j + w ≤ N,   and default otherwise
t^u_k(j) = t_k(j + w)   if 0 ≤ j + w ≤ N,   and default otherwise

and all s_k ∈ V_o (resp. t_k ∈ V_o) that are observed are replaced by their observed values:

s^o_k(j + w) = value
t^o_k(j + w) = value

thereby partially evaluating s_i(j) as

s_i(j) = e_i(s^o_1, ..., s^o_n, t^o_1, ..., t^o_m, s^u_1, ..., s^u_n, t^u_1, ..., t^u_m)

followed by adding the partially evaluated associated equation for s_i(j) to LS. It is to be noted that a consistent partial distributed stream makes sure that each s_k (resp. t_k) can only be either observed or unobserved, and not both or neither.

Example 3. Consider the Lola specification mentioned below and the stream input of length N = 6, divided into two evaluation rounds, with ε = 2, as shown in Fig. 6.4, with the monitors M_1 and M_2.

input a : uint
input b : uint
output c := ite(a[-1,0] <= b[1, 0], a[1,0], b[-1, 0])

time  1  2  3  4  5  6
a     1  7  5  4  4  7
b     3  5  9  3  5  1

Figure 6.4: Example of generating LS (time instances 1-3 form the first evaluation round and 4-6 the second).

The associated equation for the output stream is:

c(i) = ite(0 ≤ b(i + 1), a(i + 1), 0)                 if i = 1
c(i) = ite(a(i − 1) ≤ b(i + 1), a(i + 1), b(i − 1))   if 2 ≤ i ≤ N − 1
c(i) = ite(a(i − 1) ≤ 0, 0, b(i − 1))                 if i = N

Let the partial distributed stream read by monitor M_1 include {a, (1, 1), (3, 5)}, {b, (2, 5), (3, 9)}, and the partial distributed stream read by monitor M_2 include {a, (1, 1), (2, 7)}, {b, (1, 3), (3, 9)}. Monitor M_1 evaluates c(2) = 5 and partially evaluates c(1) and c(3). Thus LS^1_1 = {c(1) = a(2), c(2) = 5, c(3) = ite(a(2) ≤ b(4), a(4), 5)}. Monitor M_2 partially evaluates all of c(1), c(2) and c(3), and thus LS^2_1 = {c(1) = ite(0 ≤ b(2), a(2), 0), c(2) = a(3), c(3) = ite(7 ≤ b(4), a(4), b(2))}.
Let the partial distributed stream read by monitor M_1 include {a, (4, 4), (5, 4)}, {b, (4, 3), (6, 1)}, and the partial distributed stream read by monitor M_2 include {a, (5, 4), (6, 7)}, {b, (4, 3), (5, 5)}. Monitor M_1 evaluates c(4) = 9 and c(5) = 3 and partially evaluates c(6). Thus LS^1_2 = {c(4) = 9, c(5) = 3, c(6) = b(5)}. Monitor M_2 evaluates c(6) = 5 and partially evaluates c(4) and c(5), and thus LS^2_2 = {c(4) = ite(a(3) ≤ 5, 4, 9), c(5) = ite(a(4) ≤ b(6), 7, 3), c(6) = 5}.

It is to be noted that after the first round of evaluation, the corresponding local states LS^1_1 and LS^2_1 will be shared, which enables evaluating the output stream for some of the partially evaluated output stream expressions (this will be discussed in Section 6.6.1). These will be included in the local state of the following evaluation round.

Note that generating LS takes into consideration an ordered stream, one where the times of occurrence of events and their values are comparable. Generating the same for the distributed system involves generating it for all possible orderings of events. This will be discussed in detail in the following sections.

6.5 SMT-based Solution

6.5.1 SMT Entities

SMT entities represent (1) Lola equations, and (2) variables used to represent the distributed stream. Once we have generated a sequence of consistent cuts, we use the rules discussed in Section 6.4 to construct the set of all locally computed or partially computed Lola equations.

Distributed Stream. In our SMT encoding, the set of events E is represented by a bit vector, where each bit corresponds to an individual event in the distributed stream (E, ⇝). The length of the stream under observation is k, which makes |E| = k × |A|, and the length of the entire stream is N. We conduct a pre-processing of the distributed stream where we create an |E| × |E| matrix, hbSet, to incorporate the happened-before relations. We populate hbSet as hbSet[e][f] = 1 iff e ⇝ f, else hbSet[e][f] = 0.
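As a sketch of this pre-processing step (our own illustration, not the dissertation's implementation), the condition below is one common way to define happened-before under a clock-skew bound ε: two events on the same stream are ordered by their timestamps, while events on different streams are ordered only when their timestamps differ by at least ε.

```python
def hb_matrix(events, eps):
    """Build hbSet for events given as (stream, timestamp) pairs.

    hb[i][j] == 1 iff event i happened-before event j under skew bound eps.
    """
    n = len(events)
    hb = [[0] * n for _ in range(n)]
    for i, (si, ti) in enumerate(events):
        for j, (sj, tj) in enumerate(events):
            if i == j:
                continue
            if si == sj and ti < tj:
                hb[i][j] = 1  # same-stream (process) order
            elif si != sj and ti + eps <= tj:
                hb[i][j] = 1  # cross-stream order certain despite clock skew
    return hb
```

With ε = 2, an event on stream a at time 1 precedes a later event on the same stream, but remains concurrent with an event on stream b at time 2, since their timestamps are within the skew bound.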
In order to map each event to its respective stream, we introduce a function µ : E → A. We introduce a valuation function, υ : E → T (whatever the type is in the Lola specification), in order to represent the values of the individual events. Due to the partially synchronous assumption on the system, the possible time of occurrence of an event is defined by a function δ : E → Z≥0, where ∀α(σ) ∈ E. ∃σ′ ∈ [max{0, σ − ε + 1}, min{σ + ε − 1, N}]. δ(α(σ)) = σ′. We update the δ function when referring to events on output streams by updating the time synchronization constant to ε_M. This accounts for the clock skew between two monitors. Finally, we introduce an uninterpreted function ρ : Z≥0 → 2^E that identifies a sequence of consistent cuts for computing all possible evaluations of the Lola specification, while satisfying a number of constraints explained in Section 6.5.2.

6.5.2 SMT Constraints

Once we have defined the necessary SMT entities, we move on to the SMT constraints. We first define the SMT constraints for generating a sequence of consistent cuts, followed by the ones for evaluating the given Lola equations ϕ_α.

Constraints for consistent cuts over ρ: In order to make sure that the uninterpreted function ρ identifies a sequence of consistent cuts, we enforce certain constraints. The first constraint enforces that each element in the range of ρ is in fact a consistent cut:

∀i ∈ [0, k]. ∀e, e′ ∈ E.
(e ⇝ e′) ∧ (e′ ∈ ρ(i)) → (e ∈ ρ(i))

Next, we enforce that each successive consistent cut contains all events included in the previous consistent cut:

∀i ∈ [0, k − 1]. ρ(i) ⊆ ρ(i + 1)

Next, we make sure that the frontier of each consistent cut consists of events whose possible times of occurrence are in accordance with the semantics of partially synchronous Lola:

∀i ∈ [0, k]. ∀e ∈ front(ρ(i)). δ(e) = i

Finally, we make sure that every consistent cut consists of events from all streams:

∀i ∈ [0, k]. ∀α ∈ A. ∃e ∈ front(ρ(i)). µ(e) = α

Constraints for the Lola specification: These constraints evaluate the Lola specification; they make sure that ρ not only represents a valid sequence of consistent cuts, but also that the sequence of consistent cuts evaluates the Lola equations, given the stream expressions. As is evident, a distributed system can often evaluate to multiple values at each instance of time. Thus, we need to check both satisfaction and violation for logical expressions, and evaluate all possible values for arithmetic expressions. Note that monitoring any Lola specification can be reduced to evaluating expressions that are either logical or arithmetic. Below, we mention the SMT constraints for evaluating the different Lola equations at time instance j:

t_i[p, c] = υ(e) for the event e ∈ front(ρ(j + p)) with µ(e) = α_i,   if 0 ≤ j + p ≤ N;   and c otherwise

s_i(j) = true ⟺ front(ρ(j)) ⊨ ϕ_α    (logical expression, satisfaction)

s_i(j) = e_i(υ(e) for all e ∈ front(ρ(j)))    (arithmetic expression, evaluation)

The previously evaluated result is included in the SMT instance as an entity, and an additional constraint is added so that the instance only evaluates to a new unique value, in order to generate all possible evaluations. The SMT instance returns a satisfiable result iff there exists at least one new unique evaluation of the equation. This is repeated until we are unable to generate a sequence of consistent cuts given the constraints, i.e., until no unique values can be generated.
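Rather than solving for ρ, the four consistent-cut constraints above can also be checked against a concrete candidate sequence of cuts. The sketch below is our own illustration (hb, delta and mu are supplied as plain Python functions/maps) and mirrors the constraints one-to-one.

```python
def frontier(cut, delta, mu):
    """Latest event of each stream inside a cut."""
    front = {}
    for e in cut:
        s = mu[e]
        if s not in front or delta[front[s]] < delta[e]:
            front[s] = e
    return set(front.values())


def valid_cut_sequence(rho, events, hb, delta, mu, streams):
    for i, cut in enumerate(rho):
        # (1) downward closure under happened-before
        for f in cut:
            for e in events:
                if hb(e, f) and e not in cut:
                    return False
        # (2) successive cuts only grow
        if i + 1 < len(rho) and not cut <= rho[i + 1]:
            return False
        front = frontier(cut, delta, mu)
        # (3) frontier events occur at time i
        if any(delta[e] != i for e in front):
            return False
        # (4) frontier covers every stream
        if {mu[e] for e in front} != set(streams):
            return False
    return True
```

For two streams a and b with one event per time instance, the sequence {a0, b0}, {a0, a1, b0, b1} satisfies all four constraints, while a cut containing a1 without its predecessor a0 violates downward closure.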
It is to be noted that stream expressions of the form ite(s_i, s_k, s_j) can be reduced to a set of expressions where we first evaluate s_i as a logical expression, followed by evaluating s_k or s_j accordingly.

6.6 Runtime Verification of Lola Specifications

Now that both the rules for generating rewritten Lola equations (Section 6.4) and the workings of the SMT encoding (Section 6.5) have been discussed, we can finally bring them together in order to solve the problem introduced in Section 6.3.

6.6.1 Computing LC

Given the set of local states computed from the SMT encoding, each monitor process receives a set of rewritten Lola associated equations, denoted LS^i_j, where i ∈ [1, |M|], for the j-th computation round. Our idea for computing LC from these sets is to simply take a prioritized union of all the associated equations:

LC(Π_j) = ⨄_{i ∈ [1, |M|]} LS^i_j

The intuition behind the priority is that an evaluated Lola equation takes precedence over a partially evaluated/unevaluated Lola equation, and two partially evaluated Lola equations are combined to form an evaluated or partially evaluated Lola equation. For example, taking the locally computed LS^1_1 and LS^2_1 from Example 3, LC(LS^1_1, LS^2_1) is computed to be {c(1) = a(2), c(2) = 5, c(3) = ite(7 ≤ b(4), a(4), 5)} at monitor M_1 and {c(1) = 7, c(2) = 5, c(3) = ite(7 ≤ b(4), a(4), 5)} at monitor M_2. Subsequently, LC(LS^1_2, LS^2_2) is computed to be {c(4) = 9, c(5) = 3, c(6) = 5} at monitor M_1 and {c(4) = 9, c(5) = 3, c(6) = 5} at monitor M_2.

6.6.2 Bringing it all Together

As stated in Section 6.3.1, the monitors are decentralized and online. Since setting up an SMT instance is costly (as seen in our evaluation results in Section 6.7), we often find it more efficient to evaluate the Lola specification after every k time instances. This reduces the number of computation rounds to ⌈N/k⌉, as well as the number of messages transmitted over the network, at the cost of an increase in the size of the messages.
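Returning to the prioritized union of Section 6.6.1, a minimal sketch follows (our own encoding, not the dissertation's implementation): concrete values stand for evaluated equations, strings stand for partially evaluated ones, and an evaluated entry always wins over a partial one.

```python
def lc(local_states):
    """Prioritized union of local states: evaluated entries win."""
    merged = {}
    for ls in local_states:
        for key, val in ls.items():
            evaluated = not isinstance(val, str)  # strings = partial equations
            if key not in merged:
                merged[key] = val
            elif isinstance(merged[key], str) and evaluated:
                merged[key] = val  # replace a partial entry by an evaluated one
    return merged
```

Mirroring Example 3, combining a local state where c(1) is only the rewritten expression "a(2)" with one where c(1) has been evaluated to 7 yields the evaluated value 7, while already-evaluated entries such as c(2) = 5 are kept.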
We update Algorithm 9 to reflect our solution more closely in Algorithm 10. Each evaluation round starts by reading the r-th partial distributed stream, which consists of the events occurring between times max{0, (r − 1) × k} and min{N, r × k} (line 3). We assume that the partial distributed stream is consistent, in accordance with the assumption that each event has been read by at least one monitor. To account for any concurrency between the events of the (r − 1)-th computation round and those of the r-th computation round, we expand the window backwards by ε time instances, thereby making the r-th computation round span max{0, (r − 1) × k − ε + 1} to min{N, r × k}.

Next, we reduce the evaluation of the distributed stream problem into an SMT problem (line 7).

Algorithm 10: Computation on Monitor M_i.
1:  LS^i_1[0] = ∅
2:  for r = 1 to ⌈N/k⌉ do
3:    (E_i, ⇝_i)_r ← r-th consistent partial distributed stream
4:    j = 0
5:    do
6:      j = j + 1
7:      (α_1, α_2, ..., α_|A|) ∈ Sr(E_i, ⇝_i)
8:      LS^i_r[j] ← LS^i_r[j − 1] ∪ [(α_1, α_2, ..., α_|A|) ⊨_S ϕ_α]
9:    while (LS^i_r[j] ≠ LS^i_r[j − 1])
10:   Send: broadcast symbolic view LS^i_r[j]
11:   Receive: Π^i_r ← {LS^k_r | 1 ≤ k ≤ |M|}
12:   Compute: LS^i_{r+1}[0] ← LC(Π^i_r)    ▷ Section 6.6.1
13: end for
14: Result_i ← ⋃_{r ∈ [1, ⌈N/k⌉ + 1]} LS^i_r[0]

We represent the distributed system using SMT entities, and then, with the help of SMT constraints, we evaluate the Lola specification on the generated sequence of consistent cuts. Each sequence of consistent cuts presents a unique ordering of the events, which evaluates to a unique value for the stream expression (line 8). This is repeated until we can no longer generate a sequence of consistent cuts that evaluates ϕ_α to new unique values (line 9). Both the evaluated and the partially evaluated results are included in LS as associated Lola equations. This is followed by the communication phase, where each monitor shares its locally computed LS^i_r, for all i ∈ [1, |M|] and evaluation rounds r (lines 10-11).
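Assuming rounds of length k (as the round count ⌈N/k⌉ suggests), the window of the r-th evaluation round described above can be sketched as:

```python
def round_window(r, k, n, eps):
    """[start, end] of the r-th evaluation round: rounds of length k,
    widened backwards by eps - 1 instances to catch events that may be
    concurrent with the previous round."""
    start = max(0, (r - 1) * k - eps + 1)
    end = min(n, r * k)
    return start, end
```

With N = 100, k = 3 and ε = 3, round 1 covers [0, 3] and round 2 covers [1, 6]; the overlap [1, 3] holds the potentially concurrent events shared between the two rounds.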
Once the local states of all the monitors are received, we take a prioritized union of all the associated equations and include them in the set of associated equations LS^i_{r+1} (line 12). Following this, the computation shifts to the next computation round, and the above-mentioned steps repeat. Once we reach the end of the computation, all the evaluated values are contained in Result_i.

Lemma 8. Let A = {S_1, S_2, ..., S_n} be a distributed system and ϕ be a Lola specification. Algorithm 9 terminates when monitoring a terminating distributed system.

Proof. First, we note that our algorithm is designed for terminating systems; also note that a terminating program only produces a finite distributed computation. In order to prove the lemma, let us assume that the system sends out a stop signal to all monitor processes when it terminates. When such a signal is received by a monitor, the monitor starts evaluating the output stream expression using the terminal associated equations. This gives rise to two cases: either all the values required for the evaluation have been observed, or some of them have not. Although the termination of the monitor process in the first case is trivial, termination in the second case depends on replacing each unobserved stream value by the default value of the stream expression, thus terminating the monitor process eventually.

Theorem 4. Algorithm 10 solves the problem stated in Section 6.3.

Proof. We prove the soundness and correctness of Algorithm 10 by dividing the proof into three steps. In the first step, we prove that, given a Lola specification ϕ, the values of the output stream when computed over the distributed computation (E, ⇝) of length N are the same as when the distributed computation is divided into N/k computation rounds of length k each.
Second, we prove that for all time instances the stream equation is eventually evaluated after the communication round. Finally, we prove that the set of all evaluated results is consistent over all monitors in the system.

Step 1: From our approach, we see that the value of an output stream variable is evaluated on the events present in the consistent cut with time j. Therefore, we can reduce the proof to:

Sr(E, ⇝) = Sr(E_1.E_2 ··· E_{N/k}, ⇝)

• (⇒) Let C_k be a consistent cut such that C_k is in Sr(E, ⇝) but not in Sr(E_1.E_2 ··· E_{N/k}, ⇝), for some k ∈ [0, |E|]. This implies that the frontier of C_k satisfies front(C_k) ⊄ E_1 and front(C_k) ⊄ E_2 and ··· and front(C_k) ⊄ E_{N/k}. However, this is not possible, as according to the computation round construction in Section 6.6.2, there must be an E_i, where 1 ≤ i ≤ N/k, such that front(C_k) ⊆ E_i. Therefore, such a C_k cannot exist, and (α_1, α_2, ..., α_n) ∈ Sr(E, ⇝) =⇒ (α_1, α_2, ..., α_n) ∈ Sr(E_1.E_2 ··· E_{N/k}, ⇝).

• (⇐) Let C_k be a consistent cut such that C_k is in Sr(E_1.E_2 ··· E_{N/k}, ⇝) but not in Sr(E, ⇝), for some k ∈ [0, |E|]. This implies front(C_k) ⊆ E_i and front(C_k) ⊄ E for some i ∈ [1, N/k]. However, this is not possible, due to the fact that ∀i ∈ [1, N/k]. E_i ⊂ E. Therefore, such a C_k cannot exist, and (α_1, α_2, ..., α_n) ∈ Sr(E_1.E_2 ··· E_{N/k}, ⇝) =⇒ (α_1, α_2, ..., α_n) ∈ Sr(E, ⇝).

Therefore, Sr(E, ⇝) = Sr(E_1.E_2 ··· E_{N/k}, ⇝).

Step 2: Given an output stream expression s_i and the dependency graph G = ⟨V, E⟩, for each ⟨s_i, s_k, w⟩ ∈ E, when evaluating the value at time instance j ∈ [1, N], either α_k(j + w) ≠ \, or α_k(j + w) = \, or α_k(j + w) has not been observed.

• If α_k(j + w) ≠ \, then we evaluate the stream expression.

• If α_k(j + w) = \, there exists at least one other monitor where α_k(j + w) ≠ \.
That monitor thereby evaluates the stream expression and shares the evaluated result with all other monitors.

• If α_k(j + w) has not been observed, then at some future evaluation round and at some monitor α_k(j + w) ≠ \, thereby evaluating the stream expression s_i.

Similarly, it can be proved for ⟨s_i, t_k, w⟩ ∈ E.

Step 3: Each monitor in our approach is fault-free, with communication taking place between all pairs of monitors. We also assume all messages are eventually received by the monitors. This guarantees that all observations are either directly or indirectly read by each monitor. Together with Steps 1 and 2, the soundness and correctness of Algorithm 10 is proved.

Theorem 5. Let ϕ be a Lola specification and (E, ⇝) be a distributed stream consisting of |A| streams. The message complexity of Algorithm 10 with |M| monitors is O(ε^|A| · N · |M|²) in the worst case and Ω(N · |M|²) in the best case.

Proof. We analyze the complexity of each part of Algorithm 10. The algorithm has a nested loop. The outer loop iterates ⌈N/k⌉ times, that is, O(N). The inner loop is dependent on the number of unique evaluations of the stream expression.

• Upper bound: Due to our assumption of partial synchrony, each event's time of occurrence can be off by ε. This makes the maximum number of unique evaluations in the order of O(ε^|A|).

• Lower bound: The minimum number of unique evaluations is in the order of Ω(1).

In the communication phase, each monitor sends |M| messages to all other monitors and receives |M| messages from all other monitors, that is, |M|² messages in total. Hence the message complexity is O(ε^|A| · N · |M|²) in the worst case and Ω(N · |M|²) in the best case.

As a side note, we would like to mention that in the case of high readability of the monitors and evaluation of logical expressions, the complexity is closer to the lower bound, whereas with low readability and arithmetic expressions, the complexity is closer to the upper bound.

6.7 Case Study and Evaluation

In this section, we analyze our SMT-based decentralized monitoring solution.
We note that we are not concerned with data collection, data transfer, etc., as, given a distributed setting, the runtime of the actual SMT encoding will be the dominating aspect of the monitoring process. We evaluate our proposed solution using traces collected from synthetic experiments (Section 6.7.1) and case studies involving several industrial control systems and the RACE dataset (Section 6.7.2). The implementation of our approach can be found on Google Drive (https://tinyurl.com/2p6ddjnr).

6.7.1 Synthetic Experiments

Setup. Each experiment consists of two stages: (1) generation of the distributed stream and (2) verification. For data generation, we developed a synthetic program that randomly generates a distributed stream (i.e., the state of the local computation for a set of streams). We assume that streams are of type Float, Integer or Boolean. For the streams of type Float and Integer, the initial value is a random value s[0], and we generate the subsequent values by s[i-1] + N(0, 2), for all i ≥ 1. We also make sure that the value of a stream is always non-negative. On the other hand, for streams of type Boolean, we start with either true or false, and then for the subsequent values, we stay at the same value or alternate according to a Bernoulli distribution B(0.8), where true signifies keeping the same value and false denotes a change in value. For the monitors, we study the approach using the Bernoulli distributions B(0.2), B(0.5) and B(0.8) as the read distribution of the events. Higher readability allows each event to be read by a higher number of monitors. We also make sure that each event is read by at least one monitor, in accordance with the proposed approach. To test the approach with respect to different types of stream expressions, we use the following arithmetic and logical expressions.
input a1 : uint
input a2 : uint
output arithExp := a1 + a2
output logicExp := (a1 > 2) && (a2 < 8)

Results and Analysis. We study different parameters and analyze how they affect the runtime and the message size in our approach. All experiments were conducted on a 2017 MacBook Pro with a 3.5GHz Dual-Core Intel Core i7 processor and 16GB of 2133MHz LPDDR3 RAM. Unless specified otherwise, all experiments use number of streams |A| = 3, time synchronization constant ε_M = ε = 3s, number of monitors equal to the number of streams, computation length N = 100, k = 3, and read distribution B(0.8).

Time Synchronization Constant. Increasing the value of the time synchronization constant ε increases the number of possibly concurrent events that need to be considered. This increases the complexity of evaluating the Lola specification, and thereby the runtime of the algorithm. In addition, a higher value of ε corresponds to a higher number of possible streams that need to be considered. We observe in Fig. 6.5a that the runtime increases exponentially with increasing values of ε, as expected. An interesting observation is that with increasing values of k, the runtime increases at a higher rate until it reaches the threshold where k = ε. This is due to the fact that the number of streams to be considered increases exponentially but ultimately gets bounded by the number of events present in the computation. Increasing the value of the time synchronization constant is also directly proportional to the number of evaluated results at each instance of time. This is because each stream corresponds to a unique value being evaluated, until it gets bounded by the total number of possible evaluations, as can be seen in Fig. 6.6a. However, comparing Figs. 6.5a and 6.6a, we see that the runtime increases at a faster rate than the size of the message. This owes to the fact that initially an SMT instance evaluates unique values at all instances of time.
However, as we start reaching all possible evaluations for certain instances of time, only a fraction of the total time instances evaluate to unique values. This is the reason behind the size of the message reaching its threshold faster than the runtime of the monitor.

Figure 6.5: Impact of different parameters on runtime for synthetic data ((a) time synchronization constant ε, (b) number of streams |A|, (c) different Lola specifications; curves for k = 1, ..., 5 and read distributions B(0.2), B(0.5), B(0.8)).

Type of Stream Expression. Stream expressions can be divided into two major types: ones consisting of arithmetic operations and others involving logical operations. Arithmetic operations can evaluate to values in the order of O(|A| · ε), whereas logical operations can only evaluate to either true or false. When the monitors have high readability of the distributed stream, it is mostly the case that a monitor is able to evaluate the stream expression. Thus, we observe in Fig. 6.5c that the runtime grows exponentially for evaluating arithmetic expressions but is linear for logical expressions. However, with low readability of the computation, irrespective of the type of expression, both take exponential time, since neither can completely evaluate the stream expression, so each monitor has to generate all possible streams.
Figure 6.6: Impact of different parameters on message size for synthetic data ((a) time synchronization constant ε, (b) number of streams |A|, (c) different Lola specifications).

Similarly, for high readability and logical expressions, the message size is constant, given that the monitor was able to evaluate the stream expression. However, with low readability, the message size for evaluating logical expressions matches that of its arithmetic counterpart. This can be seen in Fig. 6.6c and is due to the fact that, with low readability, complete evaluation of the expression is not possible at a monitor, which thus needs to send the rewritten expression with the observed values to the other monitors, where it will be evaluated.

Number of Streams. As the number of streams increases, the number of events increases linearly, thereby making for an exponential increase in the number of possible synchronous streams (due to interleavings). This can be seen in Fig. 6.5b, where the runtime increases exponentially with an increase in the number of streams in the distributed stream. Similarly, in Fig. 6.6b, an increase in the number of streams linearly affects the number of unique values that the Lola expression can evaluate to, thereby increasing the size of the message.
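The data-generation stage of the synthetic setup above can be sketched as follows (our own illustration; the actual generator's seeding and bounds may differ):

```python
import random


def numeric_stream(n, rng):
    """Non-negative random walk: s[i] = max(0, s[i-1] + N(0, 2))."""
    s = [abs(rng.gauss(0.0, 2.0))]  # random non-negative initial value
    for _ in range(1, n):
        s.append(max(0.0, s[-1] + rng.gauss(0.0, 2.0)))
    return s


def boolean_stream(n, rng, stay=0.8):
    """Keep the previous value with probability `stay` (Bernoulli B(0.8))."""
    s = [rng.random() < 0.5]
    for _ in range(1, n):
        s.append(s[-1] if rng.random() < stay else not s[-1])
    return s
```

Each generated stream of length N = 100 is then split across the monitors according to the chosen read distribution before verification.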
6.7.2 Case Studies: Decentralized ICS and Flight Control RV

We put our runtime verification approach to the test on several industrial control system datasets, which include data generated by (1) a Secure Water Treatment plant (SWaT) [GAJM17], comprising six processes corresponding to different physical and control components; (2) a Power Distribution system [SLX16] that includes readings from four phasor measurement units (PMUs) that measure the electric waves on an electric grid; and (3) a Gas Distribution system [BBHB13] that includes messages to and from the PLC. In these ICS, we monitor for correctness of system properties. Additionally, we monitor for mutual separation between all pairs of aircraft in the RACE [MGS19] dataset, which consists of SBS messages from aircraft.

SWaT Dataset. Secure Water Treatment (SWaT) [GAJM17] utilizes a fully operational, scaled-down water treatment plant with a small footprint, producing 5 gallons/minute of doubly filtered water. It comprises six main processes corresponding to the physical and control components of the water treatment facility. It starts from process P1, where raw water is taken in and stored in a tank. The water is then passed through the pre-treatment process, P2, where its quality is assessed and maintained through chemical dosing. The water then reaches P3, where undesirable materials are removed using fine filtration membranes. Any remaining chlorine is destroyed in the dechlorination process in P4, and the water is then pumped into the Reverse Osmosis system (P5) to reduce inorganic impurities. Finally, in P6, water from the RO system is stored, ready for distribution.

The dataset classifies the different attacks on the system into four types, based on the point and stage of the attack: Single Stage-Single Point, Single Stage-Multi Point, Multi Stage-Single Point and Multi Stage-Multi Point.
For the scope of this dissertation, we are most interested in the attacks covering either multiple stages or multiple points. A few of the Lola specifications used are listed below.

    input FIT-101 : uint
    input MV-101 : bool
    input LIT-101 : uint
    input P-101 : bool
    input FIT-201 : uint
    output inflowCorr := ite(MV-101 == true, FIT-101 > 0, FIT-101 == 0)
    output outflowCorr := ite(P-101 == true, FIT-201 > 0, FIT-201 == 0)
    output tankCorr := ite(MV-101 == true || P-101 == true,
        LIT-101 = LIT-101[-1, 0] + FIT-101[-1, 0] - FIT-201[-1, 0])

where FIT-101 is the flow meter measuring inflow into the raw water tank, MV-101 is a motorized valve that controls water flow into the raw water tank, LIT-101 is the level transmitter of the raw water tank, P-101 is a pump that pumps water from the raw water tank to the second stage, and FIT-201 is the flow transmitter for the control dosing pumps. The above Lola specification checks the correctness of the inflow meter and valve pair (resp. outflow meter and pump pair) in the inflowCorr (resp. outflowCorr) output expressions. On the other hand, tankCorr checks whether the water level in the tank is consistent with the in-flow and out-flow meters.

    input AIT-201 : uint
    input AIT-202 : uint
    input AIT-203 : uint
    output numObv := numObv[-1, 0] + 1
    output NaClAvg := (NaClAvg[-1, 0] * numObv[-1, 0] + AIT-201) / numObv
    output HClAvg := (HClAvg[-1, 0] * numObv[-1, 0] + AIT-202) / numObv
    output NaOClAvg := (NaOClAvg[-1, 0] * numObv[-1, 0] + AIT-203) / numObv

where AIT-201, AIT-202 and AIT-203 represent the NaCl, HCl and NaOCl levels in the water, respectively, and NaClAvg, HClAvg and NaOClAvg keep track of the average levels of the corresponding chemicals in the water, whereas numObv keeps track of the total number of observations read by the monitor.

Power System Attack Dataset The Power System Attack Dataset [SLX16] consists of three datasets developed by Mississippi State University and Oak Ridge National Laboratory.
It consists of readings from four phasor measurement units (PMUs), or synchrophasors, that measure the electric waves on an electric grid. Each PMU measures 29 features, consisting of the voltage phase angle, voltage phase magnitude, current phase angle and current phase magnitude for Phases A-C, Pos., Neg. and Zero. It also measures the frequency for relays, the frequency delta for relays, status flags for relays, etc. Apart from these 116 PMU measurements, the dataset also consists of 12 control panel logs, snort alerts and relay logs of the 4 PMUs. The dataset classifies each record as either a natural event/no event or an attack event. A few of the Lola specifications used are listed below. The first attempts to detect a single-line-to-ground (1LG) fault.

    input R1-I : float
    input R2-I : float
    input R1-Relay : bool
    input R2-Relay : bool
    output R1-I-low := R1-I < 200
    output R1-I-high := R1-I > 1000
    output R2-I-low := R2-I < 200
    output R2-I-high := R2-I > 1000
    output 1LG := R1-I-high && R2-I-high && R1-Relay[+2, false] && R2-Relay[+2, false]
        && R1-I-low[+4, false] && R2-I-low[+4, false]

where R1-I and R2-I represent the current measured at the R1 and R2 PMUs, respectively. Additionally, R1-Relay and R2-Relay keep track of the state of the corresponding relay. As a part of the 1LG attack detection, we first categorize the measured current as either low or high, depending upon the amount of current measured. We categorize an attack as 1LG if both R1 and R2 detect a high current flow, followed by the relays tripping, followed by a low current.

    input R1-PA1-I : float
    input R1-PA2-I : float
    input R1-PA3-I : float
    output phaseBal := (R1-PA1-I - R1-PA2-I) <= 10 && (R1-PA2-I - R1-PA3-I) <= 10
        && (R1-PA3-I - R1-PA1-I) <= 10

where R1-PA1-I, R1-PA2-I and R1-PA3-I are the amounts of current measured by the R1 PMU at Phases A, B and C, respectively. The monitor helps us check whether the load on the three phases is equally balanced.
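The phaseBal check above can be rendered as a small stand-alone function. This is an illustrative Python sketch, not the Lola tooling itself: it flags any step where the cyclic pairwise differences of the three phase currents exceed the 10 A tolerance.

```python
def phase_balanced(pa, pb, pc, tol=10.0):
    """Mirror of: (PA1-PA2) <= tol && (PA2-PA3) <= tol && (PA3-PA1) <= tol.
    The three cyclic differences sum to zero, so any large imbalance forces
    at least one of them above the tolerance."""
    return (pa - pb) <= tol and (pb - pc) <= tol and (pc - pa) <= tol

# Balanced load: all three phases carry nearly equal current.
assert phase_balanced(101.2, 99.8, 100.5)
# A fault-style surge on one phase trips the check.
assert not phase_balanced(160.0, 100.0, 101.0)
```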
Gas Distribution System The Gas Distribution System dataset [BBHB13] is a collection of labeled Remote Terminal Unit (RTU) telemetry streams from a gas pipeline system in Mississippi State University's Critical Infrastructure Protection Center, in collaboration with Oak Ridge National Laboratory. The telemetry streams include messages to and from the Programmable Logic Controller (PLC), under normal operations and under attacks involving command injection and data injection. The feature set includes the pipeline pressure, setpoint value, command data from the PLC, response to the PLC, and the state of the solenoid, pump and RTU auto-control. One of the most common data injection attacks is Fast Change, where the reported pipeline pressure value is successively varied to create a lack of confidence in the correct operation of the system. The corresponding Lola specification monitoring against such an attack is given below:

    input PipePress : float
    input response : bool
    output fastChange := ite(response, mod(PipePress - PipePress[-1, 1000]) <= 10, true)

where PipePress records the measured pipeline pressure and response is a flag variable signifying a message to the PLC. Here we consider the default pressure to be 1000 psi and the permitted pressure change per unit time to be 10 psi (these can be changed according to the demands of the system). Similarly, we have Lola specifications monitoring other data injection attacks, such as Value Wave Injection, Setpoint Value Injection, Single Data Injection, etc., and command injection attacks, such as Illegal Setpoint, Illegal PID Command, etc.

RACE Dataset Runtime for Airspace Concept Evaluation (RACE) [MGS19] is a framework developed by NASA that is used to build event-based, reactive airspace simulations. We use a dataset developed using the RACE framework. This dataset contains three sets of data collected on three different days. Each set was recorded at around 37 N latitude and 121 W longitude.
The dataset includes all 8 types of messages sent by the SBS unit, obtained by using a Telnet application to listen on port 30003, but we only use the messages with ID 'MSG 3', the Airborne Position Message, which includes a flight's latitude, longitude and altitude, using which we verify the mutual separation of all pairs of aircraft. Calculating the exact distance between two coordinates is computationally expensive, as we need to factor in parameters such as the curvature of the earth. In order to speed up distance-related calculations, we consider a constant per-degree latitude distance of 111.2 km and longitude distance of 87.62 km, at the cost of a negligible error margin. The corresponding Lola specification is given below:

    input flight1_alt : float
    input flight1_lat : float
    input flight1_lon : float
    input flight2_alt : float
    input flight2_lat : float
    input flight2_lon : float
    output distDiff := sqrt(pow(flight1_alt - flight2_alt, 2)
        + pow((flight1_lon - flight2_lon)*87620, 2)
        + pow((flight1_lat - flight2_lat)*111200, 2))
    output check := distDiff > 500

For our setting, we assume each component has its own asynchronous local clock, with a varying time synchronization constant. Next, we discuss the results of verifying the different ICS with respect to Lola specifications.

Result Analysis We employed the same number of monitors as the number of components for each of the ICS case studies, and divided the entire airspace into 9 regions with one monitor responsible for each.

Figure 6.7: False-Positives for ICS Case-Studies. (Average % of false positives for SWaT, Power Distribution, Gas Distribution and RACE, plotted against the time synchronization constant.)

We observe that our approach never reports satisfaction of a system property when there has in reality been an attack on the system (a false negative).
However, due to the assumption of partial synchrony among the components, our approach may report false positives, i.e., it may report a violation of a system property even when there was no attack on the system. As can be seen in Fig. 6.7, with a decreasing time synchronization constant, the number of false positives decreases as well. This is due to the fact that, with decreasing ε, fewer events are considered concurrent by the monitors. This makes the partial ordering of events, as observed by the monitor, closer to the actual ordering of events taking place in the system. We get significantly better results for aircraft monitoring, with fewer false positives compared to the other datasets. This can be attributed to Air Traffic Controllers maintaining greater separation between two aircraft than the recommended minimum. As part of our monitoring of the other ICS, we report that our monitoring approach could successfully detect several attacks, including underflow and overflow of a tank and sudden changes in water quality in SWaT, differentiate manual tripping of the breaker from the breaker being tripped due to a short circuit in Power Distribution, and detect single-point data injection in Gas Distribution.

6.8 Summary and Limitation

In this chapter, we studied distributed runtime verification w.r.t. the popular stream-based specification language Lola. We proposed an online decentralized monitoring approach where each monitor takes a set of associated Lola specifications and a partial distributed stream as input. By assuming partial synchrony among all streams and by reducing the verification problem to an SMT problem, we were able to reduce the complexity of our approach so that it is no longer dependent on the time synchronization constant. We also conducted extensive synthetic experiments, verified system properties of large Industrial Control Systems, and performed airspace monitoring of SBS messages.
Compared to machine-learning-based approaches to verifying the correctness of these systems, our approach was able to produce sound and correct results with deterministic guarantees. As a better practice, one can also use our RV approach alongside machine-learning-based approaches, during training or as a safety net when detecting system violations. For future work, we plan to study monitoring of distributed systems where the monitors themselves are vulnerable to faults such as crash and Byzantine faults. This will let us design a technique whose fault and vulnerability model mimics a real-life monitoring system, thereby expanding the reach and application of runtime verification to more real-life safety-critical systems.

Chapter 7

Related Work

This chapter summarizes the extensive (though necessarily non-exhaustive) body of prior work that has influenced our work, beginning at the origins of distributed monitoring, followed by runtime verification of untimed and timed logics with different applications, and finally robust and sound verification approaches even with faulty monitors.

7.1 Lattice-theoretic Distributed Monitoring

Predicate detection is the problem of identifying states of a distributed computation that satisfy a predicate [Gar02, SS95]. The problem is in general NP-complete [MG01]. Computation slicing [MG05] is a technique for reducing the size of the computation and, hence, the number of global states to be analyzed for detecting a predicate. The slice of a computation with respect to a predicate is the sub-computation satisfying the following two conditions: (1) it contains all global states for which the predicate evaluates to true, and (2) among all computations that satisfy the first condition, it contains the least number of consistent cuts. In [MG05], the authors propose an algorithm for detecting regular predicates. This idea is then extended to a full-blown distributed algorithm for distributed monitoring [CGNM13].
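To make the consistent-cut machinery concrete, consider the following toy predicate-detection sketch. It is a hedged illustration only (a two-process computation, brute-force enumeration rather than slicing): a cut (i, j), meaning process P0 has executed its first i events and P1 its first j, is consistent iff no event inside the cut causally depends on an event outside it, which we encode with vector clocks; the detector then tests a conjunctive predicate on every consistent cut.

```python
from itertools import product

# Per-process local states after each event: (vector_clock, local_variable).
# P0's third event has clock (3, 1): it receives a message from P1's first event.
p0 = [((1, 0), 0), ((2, 0), 1), ((3, 1), 0)]
p1 = [((0, 1), 1), ((0, 2), 0)]

def consistent(i, j):
    """Is the cut with the first i events of P0 and first j of P1 consistent?"""
    for vc, _ in p0[:i]:
        if vc[1] > j:          # depends on a P1 event outside the cut
            return False
    for vc, _ in p1[:j]:
        if vc[0] > i:          # depends on a P0 event outside the cut
            return False
    return True

def detect(pred):
    """Return some consistent cut whose frontier satisfies the predicate."""
    for i, j in product(range(len(p0) + 1), range(len(p1) + 1)):
        if i > 0 and j > 0 and consistent(i, j) and pred(p0[i-1][1], p1[j-1][1]):
            return (i, j)
    return None

# The cut after P0's second event and P1's first event satisfies x == y == 1:
assert detect(lambda x, y: x == 1 and y == 1) == (2, 1)
```

Slicing avoids exactly this exhaustive enumeration by discarding the consistent cuts that cannot satisfy the predicate.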
One shortcoming of this line of work is that it does not address monitoring properties with temporal requirements. This shortcoming is partially addressed in [OG07] for a fragment of temporal operators. In [MB15], the authors propose the first sound method for runtime verification of asynchronous distributed programs for the 3-valued semantics of LTL specifications defined over the global state of the program. In the proposed setting, monitors are not subject to faults. The technique for evaluating LTL properties is inspired by the distributed computation slicing described above. The monitoring technique is fully decentralized. LTL formulas in this work are in terms of conjunctive predicates. Lattice-based techniques may suffer from the existence of too many concurrent states. To tackle this problem, in [YNV+16] the authors propose an algorithm and analytical bounds for the case where a combination of logical and physical clocks (called hybrid clocks) is used. This method is enriched with SAT-solving techniques in [VYK+17]. Other SMT-based predicate detection solutions include [PMSP20], where the authors build a tool, SPIDER, to detect race conditions in distributed systems. In [VKTA20], the authors propose a two-layered monitoring algorithm that combines an algorithm using Hybrid Logical Clocks (HLC), dependent on a parameter γ, with a monitoring algorithm that uses SMT solvers to perform predicate detection. This two-layered algorithm eliminates all false positives and, depending on γ, many or all false negatives, at a reduced cost. This makes monitoring a much faster procedure. A completely SMT-based approach is proposed with a focus on cyber-physical systems in [MBAB21], where the authors detect violations of predicates over distributed continuous-time and continuous-valued signals from cyber-physical systems.
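The SMT-based detectors above essentially ask a solver whether the events' timestamps can be ordered, within the clock-skew bound, such that the predicate holds. The following is a brute-force stand-in for that query (illustrative only; the real tools encode the constraints symbolically for an SMT solver): under partial synchrony with skew bound eps, two events on different processes may be concurrent iff their local timestamps differ by less than eps.

```python
def may_be_concurrent(t_a, t_b, eps):
    """Partial synchrony: events with local timestamps within eps of each
    other cannot be ordered by the monitor, so they may be concurrent."""
    return abs(t_a - t_b) < eps

def detect_violation(events_p, events_q, eps, pred):
    """events_*: (local_timestamp, value) pairs per process. Report a pair of
    possibly-concurrent events whose combined state satisfies the predicate."""
    for t_a, x in events_p:
        for t_b, y in events_q:
            if may_be_concurrent(t_a, t_b, eps) and pred(x, y):
                return (t_a, t_b)
    return None

p = [(1, 0), (5, 1), (9, 0)]
q = [(2, 0), (6, 1)]
# With eps = 2, the states x == 1 (t=5) and y == 1 (t=6) may coexist:
assert detect_violation(p, q, 2, lambda x, y: x and y) == (5, 6)
# With eps = 0.5, no such concurrent pair exists -- fewer (possibly false) alarms:
assert detect_violation(p, q, 0.5, lambda x, y: x and y) is None
```

The dependence of the verdict on eps is the same phenomenon behind the false-positive behavior observed in our experiments: a larger skew bound makes more interleavings plausible.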
7.2 Monitoring Distributed Systems

Monitoring of distributed systems can be broadly classified by the presence or absence of a global common clock among the processes. The algorithm in [BF16b] for monitoring synchronous distributed systems with respect to LTL formulas is designed such that satisfaction or violation of specifications can be detected by the local monitors alone. The framework employs a disjoint alphabet for each process in the system. Thus, a local monitor in [BF16b] can only evaluate subformulas that include its own propositions; if a subformula contains propositions of other processes, it sends a proof obligation to the corresponding monitor to resolve the obligation. This technique is called formula progression. It implies that if multiple proof obligations exist, the formula needs to be progressed by multiple monitors over a sequence of communication rounds. Each round may increase the size of the formula, to remember what happened in the past. A similar progression-based verification approach is studied for decentralized monitoring in [BF12]. An Internet-of-Things-based application of the above approach is discussed in [EHF22]. In [CF16], the authors introduce a way of organizing sub-monitors for LTL subformulas in a synchronous distributed system, called choreography. In particular, the monitors are organized as a tree across the distributed system, and each child feeds intermediate results to its parent in a manner similar to diffusing computation. They formalize choreography-based decentralized monitoring by showing how to synthesize a network from an LTL formula, and give a decentralized monitoring algorithm working on top of an LTL network. Verification is usually deployed for remote systems where the communication may be unreliable.
To study the effect of unreliable channels on monitoring, the authors in [KHF19] start by describing different types of mutations that may be introduced into an execution trace and examine their effects on program monitoring. They also propose a fixed-parameter tractable algorithm for determining the immunity of a finite automaton to a trace mutation, and show how it can be used to classify ω-regular properties as monitorable over channels with that mutation. (An ω-regular property generalizes the definition of regular properties to infinite words.) In [EHF18], the authors give a comprehensive overview of monitoring multi-threaded systems or, more specifically, the added challenges of monitoring asynchronous distributed systems. Some of the solutions discussed include Java PathExplorer (JPaX) [HR04], a tool designed for multi-threaded programs. It uses bytecode-level automata-based instrumentation to detect both race conditions and deadlocks in a multi-threaded program execution. To cover a wider range of applications for runtime verification, various stream runtime verification logics and algorithms have been developed; some notable ones are Striver [LSS+18] and TeSSLa [LSS+18]. In stream runtime verification, the monitor receives a stream of rich data from the processes, and the specifications include not only predicates but also aggregate functions, such as average, mean, median, etc. In [S´21], the authors discuss stream runtime verification for both synchronous and asynchronous systems.

7.3 Monitoring Time-bounded Specifications

Time-bounded logics can be of two types, depending upon the assumption of discrete or continuous time. For discrete (non-negative integer) time we have Metric Temporal Logic (MTL), and for continuous (non-negative real) time we have Signal Temporal Logic (STL).
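To make the discrete-time setting concrete, consider a minimal evaluator for the bounded-eventually operator F_[a,b] p over a finite timestamped trace (an illustrative sketch of the semantics only, not any cited algorithm): F_[a,b] p holds at time t iff p holds at some observation whose timestamp lies in [t+a, t+b].

```python
def eventually(trace, t, a, b, p):
    """trace: list of (timestamp, state) pairs; p: predicate on a state.
    Returns whether F_[a,b] p holds at time t over this finite trace."""
    return any(p(s) for (ts, s) in trace if t + a <= ts <= t + b)

trace = [(0, {'alarm': False}), (3, {'alarm': False}), (5, {'alarm': True})]
# F_[0,5] alarm holds at time 0 (the alarm fires at t = 5):
assert eventually(trace, 0, 0, 5, lambda s: s['alarm'])
# F_[0,3] alarm does not hold at time 0:
assert not eventually(trace, 0, 0, 3, lambda s: s['alarm'])
```

The interval bounds are the source of the extra difficulty over plain LTL: the verdict at time t depends on timestamps, not merely on the order of events.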
In [WOH19], the authors present a monitoring algorithm that does not store any information about the observed trace but is able to evaluate both future- and past-time MTL. They term the approach "resolve the past and derive the future". For past-time sub-formulas, the MTL formula is transformed into an equivalent formula with the property that it has no past-time-operator-rooted subformulas that are not guarded by other temporal operators. For future-time logic, on the other hand, the MTL formula is transformed into a new MTL formula with the property that the current formula holds before processing the newly received event if and only if the derived formula holds after processing the event. This is very close to the concept of progression we use in our monitoring algorithm, but in [WOH19] the authors work with a synchronous system. Other notable works on monitoring MTL formulas include [WOH19, FP07, BKMZ15, BKM10]. The authors in [BKMZ15, BKM10] extend general monitoring of MTL formulas to the more expressive Metric First-Order Temporal Properties, a first-order extension that quantifies over where in the trace a sub-formula should hold. In [WOH19], the authors introduce a trace-length-independent monitoring procedure for an extension of MTL with the same expressiveness as Monadic First-Order Logic of Order and Metric (FO[<, +1]). Domain-specific monitoring of time-bounded properties includes security vulnerabilities posed by blockchains in [AGCC+20, AEP21, APSS21, CPR18, PZS+18]. All of these works involve vulnerabilities of transactions involving smart contracts; however, they are not distributed, in the sense that they do not involve transactions over multiple blockchains. In order to monitor a system where components might crash or network failures can occur, the authors in [BKZ15] propose a runtime verification approach based on a 3-valued semantics of MTL.
The monitor uses the timestamps of the events to determine the elapsed time between observations and check whether real-time constraints are met. To efficiently resolve knowledge gaps and to compute verdicts, each monitor maintains an AND-OR graph whose edges express constraints for assigning a Boolean value to a node. If a monitor receives additional information about the system behavior, it updates its graph structure by adding and deleting nodes and edges, based on the message received. For monitoring of dense-time (signal) temporal logic (STL), the authors in [DFM13, DDG+17] propose monitoring approaches curated for use in cyber-physical systems. In [DDG+17], the authors formalize a semantics for robust online monitoring of partial traces, i.e., traces for which there might not be enough data to decide Boolean satisfaction or violation. Given a trace and a signal property, the approach maps them to an interval (l, v), where l is the greatest lower bound and v is the least upper bound on the quantitative semantics of the trace. The authors of [LSS+19] bring runtime verification of incomplete traces not only to monitoring data streams but also to timed events. They use TeSSLa [LSS+18] as the specification language for non-synchronized timed event streams and define an abstract event stream representing the set of all possible traces that could have occurred during the gaps in the input trace. They work under the assumptions that (1) for events with imprecise values, the monitor has an idea about the range of values, and (2) for data losses, the monitor knows the range of time between when it stopped getting information and when the trace becomes reliable again. In order to solve the problem, the authors extend the semantics of TeSSLa to incorporate incomplete traces and define an abstraction-based sliding window to monitor the traces.
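The interval semantics of [DDG+17] can be illustrated on the simplest atomic case (a simplified sketch of the idea only; the paper handles full STL): for G(signal > c) over a partial trace with missing samples, the monitor reports (l, v) bounds on the robustness min_t (x_t - c), assuming only that missing samples lie in a known range [lo, hi].

```python
def robustness_bounds(samples, c, missing_range):
    """samples: list of floats, with None marking a gap. Returns (l, v)
    bounds on min_t (x_t - c), the robustness of G(signal > c)."""
    lo, hi = missing_range
    lower = min((lo if x is None else x) for x in samples) - c  # worst-case gaps
    upper = min((hi if x is None else x) for x in samples) - c  # best-case gaps
    return lower, upper

# Two known samples and one gap, with missing values known to lie in [0, 10]:
l, v = robustness_bounds([4.0, None, 6.0], 3.0, (0.0, 10.0))
assert (l, v) == (-3.0, 1.0)
# l < 0 < v: the verdict on G(signal > 3) is still inconclusive.
```

Once every gap is filled, the two bounds coincide and the monitor can issue a definitive quantitative verdict.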
7.4 Runtime Verification of Hyperproperties

Monitoring for information-flow security policies requires expressing relations between multiple traces [CS08]. Thus, specifications are represented using Hyper Linear Temporal Logic (HyperLTL) [CFK+14, FRS15]. Runtime verification of HyperLTL specifications was first discussed in [BF16b], where the authors introduce the finite-trace semantics of HyperLTL. Later, in [AB16], the authors introduce a runtime verification technique for a subclass of hyperproperties, the k-safety properties: a hyperproperty is k-safety if every violation can be witnessed by a set of at most k finite traces. This restriction is essential for monitoring a system w.r.t. hyperproperties, since a system often generates an infinite number of traces and monitoring such a set of traces becomes difficult. The proposed monitoring approach introduces a procedure that aggregates a runtime progression logic and computes verdicts using an LTL3 monitor. In [BF18], the authors discuss the main challenges in verifying a distributed system whose specification is given in HyperLTL. The added challenge in verifying hyperproperties is that the monitor repeatedly model checks a growing Kripke structure, whereas for trace properties the monitor merely tracks the state of the specification. The authors report that for tree-shaped Kripke structures, the complexity is L-complete, independent of the number of quantifier alternations in the HyperLTL formula. For acyclic Kripke structures, on the other hand, the complexity is complete for the level of the polynomial hierarchy that corresponds to the number of quantifier alternations. However, they also report that the combined complexity in the size of the Kripke structure and the length of the HyperLTL formula is PSPACE-complete for both trees and acyclic Kripke structures.
Thus, the final conclusion is that the size and shape of both the Kripke structure and the formula have a significant impact on the complexity of the model checking problem. A number of versatile runtime verification approaches for different cases and system specifications are presented in [BSB17, FHST17, PSS18]. As mentioned before, monitoring hyperproperties involves monitors storing previously seen traces, which makes the monitor become slower and slower, until it inevitably runs out of memory. In [FHST17], the authors present techniques that reduce the set of traces that new traces must be compared against to a minimal subset. The techniques include exploiting properties of specifications, such as reflexivity, symmetry, and transitivity, to reduce the number of comparisons. In contrast, the authors in [BSB17] present a rewriting-based technique for runtime verification of alternation-free HyperLTL. The distinguishing feature of this technique is that it is independent of the number of trace quantifiers in a given HyperLTL formula. The authors in [PSS18] achieve efficient monitoring by reducing a hyperproperty to trace properties for deterministic systems, extracting the characteristic predicate for a given hyperproperty, and providing a parametric monitor taking the extracted predicate as a parameter.

7.5 Fault-tolerant Distributed Monitoring

In [FRT14, FRRT14], the authors show that if runtime monitors employ a sufficient number of opinions (instead of the conventional binary valuations), then it is possible to monitor distributed tasks in a consistent manner. Building on the work in [FRT13, FRT14, FRRT14], the authors in [BFR+16] show that employing the four-valued LTL [BLS10a] will result in inconsistent distributed monitoring for some formulas. They subsequently introduce a family of logics, called LTL2k+4, that refines the 4-valued LTL by incorporating 2k + 4 truth values, for each k ≥ 0.
The truth values of LTL2k+4 can be effectively used by each monitor to reach a consistent global set of verdicts for each given formula, provided k is sufficiently large. The authors in [FRT20] investigate the factors responsible for the size of the verdict set in a decentralized monitoring approach where the monitors are susceptible to faults. They consider a static system: each monitor reads an observation of the system as input, exchanges information, performs individual computation, and eventually outputs a verdict reflecting its perception of the validity of the system state, with no change in the state of the system while all this is happening. The main result of this approach is a tight lower bound on the size of the opinion set, which depends on the language of the property being monitored. They also prove that for every n ≥ 1 and every k ∈ [1, n), there exists a language with alternation number k that requires at least k opinions to be monitored with n monitors, and there exists a language with alternation number n that requires at least n + 1 opinions to be monitored with n monitors.

7.6 Statistical Model Checking

Statistical Model Checking (SMC) is a method used in the field of formal verification to check the correctness of probabilistic systems. It is particularly useful for systems that involve randomness or uncertainty, such as computer networks, communication protocols, and robotics. The idea behind SMC is to generate a large number of simulation traces of the system under consideration and compare the statistical properties of these traces with the expected behavior of the system. The statistical approach is even applicable to black-box systems, where the behavior is not fully understood or controllable [SVA04].
This is done by defining a set of quantitative properties, such as the probability of a particular event occurring or the expected time taken to reach a particular state, and then comparing the observed statistics with the expected values. SMC can be used to detect various types of errors in probabilistic systems, including deadlocks, livelocks, and other performance issues. It is often used in combination with other formal verification techniques, such as model checking and theorem proving, to provide a more comprehensive analysis of the system. Some popular tools for SMC include PRISM [KNP04], Storm [CDS+17], and Maude [CDE+02]. SMC is a useful technique for validating probabilistic systems, and it is increasingly becoming an essential tool in the development of critical systems. A large number of real-world systems are subject to hard requirements on time. To analyze such systems, researchers model them as timed automata and express requirements using variants of CTL that include operators with resource bounds as parameters. Tools and techniques then establish worst-case bounds on execution time and resource consumption and perform schedulability analysis. However, there may still be a need to choose among appropriate schedulers, preferring the one that provides the most attractive properties in the expected or average case. Moreover, several extensions of timed automata (priced timed automata, weighted CTL, etc.) are known to be undecidable [BBM06].

7.7 Beyond Runtime Verification

Looking ahead from runtime verification [Fal10], we have predictive runtime monitoring [JTS21] and runtime enforcement [RKG+19, FMRS18, PFJ+13] of properties. The main factor differentiating these from runtime verification is that they are able to predict a vulnerability before it has actually happened in the system, or are able to enforce a property on the system. In other words, they make sure that the system does not actually reach the vulnerable state.
But, in order to do so, information about the workings and behavior of the system is required. Thus, for predicting and enforcing specifications, a grey-box (a Discrete-Time Markov Chain or Markov Decision Process model of the system) or white-box (the implementing code) view of the system is required. The authors in [SBS+12] introduce a runtime verification approach using state estimation. The proposed approach is based on viewing event sequences as observation sequences of a Hidden Markov Model (HMM). The HMM is used to fill the gaps in observation sequences, by extending the classic forward algorithm for HMM state estimation to compute the probability that the property is satisfied by an execution of the program. However, the authors in [JTS21] show that this HMM-based state estimation does not scale well due to the combination of nondeterminism and probabilities. They model the system as a Markov Decision Process (MDP) to take into consideration both the nondeterminism and the probability in the data from imprecise sensors. To address the problem of partial or noisy observation, the authors in [CBP21] propose a neural-network-based predictive monitoring approach. The approach balances prediction accuracy, to avoid errors, against computational efficiency, to support fast execution at runtime. They employ a neural network classifier to predict reachability at any state. They devise two solutions: an end-to-end one, where a neural monitor directly operates on the raw observation, and a two-step approach, where a state estimator reconstructs the full history of states and a classifier then maps the sequence to a good/bad label.

Chapter 8

Conclusion and Future Work

In the previous chapters, we have developed the theoretical basis, and an extensive practical evaluation, of runtime verification of distributed systems. In this chapter, we first summarize our contributions and then explore a few possible future directions of the research.
8.1 Summary

In Chapter 3, our focus was on distributed runtime monitoring. Both of our proposed techniques take an LTL formula and a distributed computation as input and, by assuming a bounded clock skew among all processes, first chop the computation into multiple segments and then apply either the automata-based or the progression-based monitoring algorithm, implemented as an SMT decision problem, in order to verify the correctness of the given formula. We conducted rigorous synthetic experiments, as well as case studies on monitoring consistency conditions in Cassandra and a NASA air traffic control dataset. Our experiments demonstrate up to 35% improvement in performance of our progression-based algorithm over our automata-based algorithm. In Chapter 4, we studied distributed runtime verification of MTL specifications. We propose a technique that takes an MTL formula and a distributed computation as input. By assuming partial synchrony among all processes, we first chop the computation into several segments and then apply a progression-based formula rewriting monitoring algorithm, implemented as an SMT decision problem, in order to verify the correctness of the distributed system with respect to the formula. We conducted extensive synthetic experiments on traces generated by the tool UPPAAL and on a set of blockchain smart contracts. In Chapter 5, we propose a runtime verification algorithm in which a set of decentralized synchronous monitors, each with only a partial view of the underlying system, continually evaluate formulas in linear temporal logic (LTL). We assume that the communication network is a complete graph and that each monitor is subject to crash failures. Our algorithm is sound in the sense that, upon termination, all local monitors compute the same monitoring verdict as a centralized monitor that can atomically observe the global state of the system. The monitors do not share their full observation of the underlying system.
Rather, they communicate a symbolic representation of their partial observations without compromising soundness. This symbolic observation is the set of possible LTL3 monitor states. Since LTL3 monitors may not be able to resolve indistinguishable cases due to partial observations, we also proposed an SMT-based transformation algorithm to obtain minimum-size LTL3 monitors. For an LTL formula ϕ, our SMT-based algorithm increases the size of an LTL3 monitor Mϕ3 only by a factor of O(log |Mϕ3| · |AP|) (communicating explicit observations would require O(|AP|) bits), where AP is the set of atomic propositions that describe the global state of the underlying system. We put our approach through an extensive number of experiments with varying distributions modeling monitor crashes, atomic propositions distributed over the states, and the partial observation of each monitor. Through extensive experimentation, we learned that limiting the number of rounds to fewer than t, with communication between monitors happening only after every k states, considerably reduces the average number of rounds and the number of messages sent, while only slightly increasing the average message size. In Chapter 6, we studied distributed runtime verification with respect to the popular stream-based specification language Lola. We propose an online decentralized monitoring approach where each monitor takes a set of associated Lola specifications and a partially distributed stream as input. By assuming partial synchrony among all streams and by reducing the verification problem to an SMT problem, we were able to reduce the complexity of our approach so that it is no longer dependent on the time synchronization constant. We also conducted extensive synthetic experiments, verified system properties of large Industrial Control Systems, and monitored airspace via SBS messages.
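The flavor of Lola-style stream specifications used in these case studies can be sketched as follows; the evaluator, the stream names, and the threshold are toy assumptions for illustration, not the interface of the actual tool.

```python
# Toy synchronous-Lola-flavored evaluation (hypothetical names): an output
# stream is defined by an expression over input streams, evaluated offline
# position by position.

def lola_eval(inputs, spec, length):
    """spec maps each output stream name to f(inputs, outputs, i) -> value."""
    outputs = {name: [] for name in spec}
    for i in range(length):
        for name, f in spec.items():
            outputs[name].append(f(inputs, outputs, i))
    return outputs

# Two input streams from an ICS-like setting; 'alarm' fires when pressure
# exceeds a threshold while the valve is reported closed.
inputs = {"pressure": [3, 7, 9, 4], "valve_open": [True, False, False, True]}
spec = {
    "alarm": lambda ins, outs, i: ins["pressure"][i] > 6 and not ins["valve_open"][i],
}
out = lola_eval(inputs, spec, 4)
```

Here `out["alarm"]` flags exactly the positions where both conditions hold; the decentralized setting additionally has to reconcile the timestamps of the two input streams up to the clock-skew bound.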
Compared to machine-learning-based approaches to verifying the correctness of these systems, our approach was able to produce sound and correct results with deterministic guarantees. As a better practice, one can also use our RV approach alongside machine-learning-based approaches during training, or as a safety net when detecting system violations.

8.2 Contributions

The main results of this dissertation in the context of runtime verification of distributed systems are as follows:

• We introduce an automata-based and a progression-based approach for monitoring a partially synchronous distributed system with respect to linear temporal logic. Although both produce sound and complete results, when compared, we find the progression-based approach is often faster than the automata-based one.

• To monitor a partially synchronous distributed system with respect to time-bounded temporal properties, we introduce progression rules for MTL specifications. We also study the behavior of the system to estimate the actual offset distribution between the processes. This enables us to verify the system with a probabilistic guarantee.

• We introduce a fault-tolerant decentralized runtime verification technique for LTL specifications and an SMT-based automata extension method to remove the nondeterminism in the evaluated verdict caused by the monitors only being able to read a partial computation.

• To monitor a partially synchronous distributed stream, we introduce the semantics of partially synchronous Lola and propose a decentralized stream runtime verification approach where the monitors only read the partially distributed stream.

• We have also studied the effects of our approach on runtime and memory usage with respect to synthetic data as well as a wide range of real-life data.

8.3 Future Work

As introduced in Chapter 1, the future is distributed, with more and more applications opting for distributed/decentralized solutions.
However, checking for completeness, soundness, and compliance with system requirements remains a relatively untouched part of these solutions. In the next phase of my research, I intend to make runtime verification an effective, complete, and sound approach for two main areas of application: (1) general distributed systems and (2) AI safety.

8.3.1 Distributed Systems

Among all the approaches discussed in this report, we notice a common pattern. For the centralized monitoring approaches, as the number of events in the distributed system increases, the runtime of the approach increases exponentially. Nevertheless, the SMT-based solution was able to provide great robustness and certification for the correctness of the generated verdict. This limitation is addressed when we decentralize the monitors, which comes with a communication overhead. Moreover, real-life systems are often vulnerable to faults such as crash faults, Byzantine faults, network faults, etc. Verification approaches should be able to evaluate sound and complete verdicts in spite of these vulnerabilities. With evolving technologies like blockchain and cyber-physical systems (CPS), it remains to be seen what challenges emerging technology will present. Technologies like smart contracts in blockchains and CPS can be modeled as distributed systems. However, their sheer size makes model checking and testing unsuitable approaches for debugging, with runtime verification emerging as the obvious choice in these scenarios. Our future work on this topic involves a step towards enforcing properties in real-time asynchronous distributed systems. As discussed in [Fal10], verification still remains the core part of any enforcement algorithm. Using a Hidden Markov Model (HMM), we can attempt to fill the gaps in the observed behavior of the system and, as a result, extend the classic forward algorithm for HMM state estimation.
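Concretely, the forward algorithm at the heart of such state estimation can be sketched as follows; the two-state model and all probabilities below are toy assumptions for illustration.

```python
# Classic HMM forward algorithm: the building block that state-estimation
# monitors extend to fill gaps in an observation sequence and to weight a
# property verdict by the probability of the hidden state.

def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) and the unnormalized state distribution after each step."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            s: emit_p[s][o] * sum(prev[r] * trans_p[r][s] for r in states)
            for s in states
        })
    return sum(alpha[-1].values()), alpha

# Toy model: the process is either 'ok' or 'err', observed through a noisy
# sensor that reports 'good' or 'bad'.
states = ("ok", "err")
start = {"ok": 0.9, "err": 0.1}
trans = {"ok": {"ok": 0.95, "err": 0.05}, "err": {"ok": 0.3, "err": 0.7}}
emit = {"ok": {"good": 0.9, "bad": 0.1}, "err": {"good": 0.2, "bad": 0.8}}

p_obs, alpha = forward(["good", "bad", "bad"], states, start, trans, emit)
# Normalizing the final alpha gives the belief that the system is now in
# 'err'; a state-estimation monitor reports the verdict weighted by it.
belief_err = alpha[-1]["err"] / p_obs
```

After two noisy "bad" readings, the normalized belief that the system is in the error state dominates, which is exactly the quantity a probabilistic monitor would report in place of a crisp verdict.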
In [BGF18b, BGF18a], the authors propose runtime verification approaches using a state estimation/trace abstraction model with an HMM at its core. One of the major downsides of such approaches is the assumption of a synchronous system, which limits their broad applicability to asynchronous systems. Predictive runtime verification or runtime enforcement of a distributed system can only be achieved by having some information about the working of the system. In [JTS21], the authors model the system as a Markov Decision Process (MDP), which provides a mathematical framework for modeling decision processes in situations where the system is both nondeterministic and probabilistic. This property of an MDP can be used to model an asynchronous system, with the decision process ranging over the different happened-before relations arising from the different possible interleavings of events, along with their different probabilities. A future direction of this work is a learning-based predictive runtime verification or runtime enforcement approach that learns the working of the system from initial runs and forms an MDP model. This model can then be used in parallel with the trace logs being generated to achieve better prediction of faults in the system.

8.3.2 AI Safety

We find ourselves in a world where machines and AI are becoming increasingly prevalent and integrated into our daily lives. From automated systems in factories and warehouses to chatbots and virtual assistants on our phones and computers, technology is rapidly advancing and changing the way we live and work. With the advent of self-driving cars and drones, the possibilities of automation are limitless. With such wide application in often safety-critical systems, a verification or monitoring approach is essential to improve the reliability of these systems.
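One concrete shape such a monitoring approach can take is a runtime check wrapped around a learned component; the classifier interface and the specification in the following sketch are hypothetical toys, not any deployed system's API.

```python
# Runtime verification as a safety net around a learned perception component:
# the monitor checks each verdict against an explicit specification instead
# of trusting the model's confidence alone.

def monitored_classify(classify, frame, history, spec):
    """Run the classifier, but accept its verdict only if `spec` holds."""
    label, confidence = classify(frame)
    if not spec(label, confidence, history):
        return ("fallback", label)   # defer to a safe fallback policy
    history.append(label)
    return ("accept", label)

def spec(label, confidence, history):
    """Toy specification: reject low-confidence or oscillating verdicts."""
    if confidence < 0.7:
        return False
    if len(history) >= 2 and history[-1] != label and history[-2] == label:
        return False   # the verdict flipped back and forth between frames
    return True

# Hypothetical frames carrying their own (label, confidence) pairs.
history = []
verdicts = [monitored_classify(lambda f: f, frame, history, spec)
            for frame in [("car", 0.95), ("car", 0.9), ("pedestrian", 0.4)]]
```

The low-confidence third frame is routed to the fallback policy rather than acted upon, which is the division of labor between the learned component and the monitor that this section argues for.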
As identified in [NYC15], a Deep Neural Network-based classification approach categorized noisy images as those of a lion, peacock, starfish, etc. even when a human was not able to recognize them. The reason behind such behavior of a machine-learning-based approach is its lack of explainability. Formal verification acts as a perfect monitor for such systems, making sure that the system always works within the defined constraints. Applications of formal verification in this space can be categorized into two main lines of work: (1) safe learning and (2) monitoring. As artificial intelligence (AI) systems rapidly increase in size, acquire new capabilities, and are deployed in high-stakes settings, their safety becomes extremely important [FP18]. Ensuring system safety requires more than improving accuracy, efficiency, and scalability: it requires ensuring that systems are robust to extreme events and monitoring them for anomalous and unsafe behavior. While traditional machine learning systems are evaluated pointwise with respect to a fixed test set, such static coverage provides only limited assurance when the system is exposed to unprecedented conditions in high-stakes operating environments. Verifying that the learning components of such systems achieve safety guarantees for all possible inputs may be difficult, if not impossible. Instead, a system’s safety guarantees will often need to be established with respect to system-generated data from realistic (yet appropriately pessimistic) operating environments. Safety also requires resilience to “unknown unknowns”, which necessitates improved methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment.
In some instances, safety may further require new methods for reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone, and methods for improving performance by directly adapting the systems’ internal logic. Whatever the setting, any learning-enabled system’s end-to-end safety guarantees must be specified clearly and precisely. Any system claiming to satisfy a safety specification must provide rigorous evidence, through analysis corroborated empirically and/or with mathematical proof. Applications of learning-based systems include large cyber-physical systems, multi-agent systems, etc. [WOZ+20, CMK+21] Cyber-Physical Systems (CPS) integrate physical and computational components and are used in a wide range of applications, such as autonomous vehicles, medical devices, and smart homes. Multi-Agent Systems (MAS) consist of multiple agents that interact with each other to achieve a common goal and are used in applications such as intelligent transportation systems, robotics, and social networks. Verification of both is crucial to ensure their correctness, safety, and reliability, and involves the use of formal methods, simulation, and testing to validate the correctness of the system’s behavior. Formal methods, which apply mathematical techniques to verify the correctness of a system’s behavior, are commonly used to verify MAS; model checking and theorem proving are two common examples. In all of the above-mentioned scenarios, the system is subject to change and an unpredictable environment.
These changes in the environment often affect the behavior of the system, making runtime verification the obvious choice for maintaining system-level correctness. Neural network-based methods perform statistically better in a predictable environment or when the data is similar to the training data. However, it is the misclassifications in otherwise strong regions that pose a major vulnerability for the application system. Using runtime verification, we are able to check the correctness of the verdict given the system specifications. Taking an autonomous vehicle as our example, we see that it is able to maneuver the vehicle with high confidence in perfect weather conditions.

Figure 8.1: Decision boundary plot (brightness vs. saturation, correct and wrong classifications).

However, in cases where the sun is shining at a low angle, in foggy conditions, low visibility, rainy conditions, etc., any misclassification can have a catastrophic outcome. Runtime verification can act as a helping hand to the already highly efficient machine-learning approach in connecting the dots in these edge cases. This will not only enable the system to perform in critical, harsh environments but also make sure the outcomes come with a considerable formal guarantee. In Figure 8.1, we show a decision boundary plot for a possible classification algorithm. We notice that with low brightness of the pictures being classified, the misclassification rate increases. However, the point of worry is the misclassifications that were noticed even when the brightness was high enough. It is cases like these where formal verification should come in handy, the target being to convert the misclassifications of high-brightness images into correct classifications. Additionally, robotics and automated systems use reinforcement learning techniques to train a system, with rewards and punishments that make it work towards a goal.
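Runtime enforcement over such a reward-driven agent can be sketched as follows; the action names and the safety property are illustrative, and a real enforcer would be synthesized from a temporal-logic specification rather than hand-coded.

```python
# Minimal runtime-enforcement sketch: the monitor tracks a safety property
# ("an item, once picked up, is never put back on the floor") and blocks
# agent actions that would violate it, independently of the reward signal.

def make_enforcer():
    picked = set()   # items the agent has picked up at some point
    def allow(action, item):
        if action == "pickup":
            picked.add(item)
            return True
        if action == "drop_floor":
            return item not in picked   # block the reward-gaming move
        return True                     # e.g., "drop_dumpster" is always safe
    return allow

allow = make_enforcer()
trace = [("pickup", "can"), ("drop_floor", "can"), ("drop_dumpster", "can")]
decisions = [allow(a, i) for a, i in trace]
```

On this trace the enforcer permits the pickup and the dumpster drop but rejects the intermediate attempt to drop the trash back on the floor, cutting off the reward-farming loop without otherwise constraining the learned policy.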
We would like to design a system that is not too strict towards rewards, as this might yield a less optimized solution; at the same time, being too lenient makes the work of the system erroneous. Runtime verification acts as an important mediator, which enables the reinforcement learning technique to be neither too strict nor too lenient, with further behaviors enforced using runtime verification. Consider a garbage-collecting robot that is responsible for collecting trash around a house and throwing it away in the dumpster. It uses a reinforcement learning-based approach where each time the robot picks up and drops the trash in the dumpster, it receives a reward. This strategy often results in a vulnerability where the robot decides to put the trash back from the dumpster onto the floor only to pick it up again. Using runtime verification, we should be able to enforce that once the trash is picked up, it is not dropped back. With more and more real-life solutions involving distributed systems, cyber-physical systems, multi-agent systems, etc., runtime verification poses a very exciting as well as challenging application area. This would enable us to design and implement secure and correct-by-design systems in real life.

BIBLIOGRAPHY

[AB16] Shreya Agrawal and Borzoo Bonakdarpour. Runtime verification of k-safety hyperproperties in HyperLTL. In 2016 IEEE 29th Computer Security Foundations Symposium (CSF), pages 239–252, 2016.

[ACZ20] Tejasvi Alladi, Vinay Chamola, and Sherali Zeadally. Industrial control systems: Cyberattack trends and countermeasures. Computer Communications, 155:1–8, 2020.

[AEP21] Shaun Azzopardi, Joshua Ellul, and Gordon J. Pace. Runtime monitoring processes across blockchains. In Hossein Hojjat and Mieke Massink, editors, Fundamentals of Software Engineering, pages 142–156, Cham, 2021. Springer International Publishing.

[AGCC+20] Alberto Aranda García, María-Emilia Cambronero, Christian Colombo, Luis Llana, and Gordon J. Pace.
Runtime Verification of Contracts with Themulus, pages 231–246. Springer International Publishing, Cham, 2020.

[AH92] Rajeev Alur and Thomas A. Henzinger. Logics and models of real time: A survey. In J. W. de Bakker, C. Huizing, W. P. de Roever, and G. Rozenberg, editors, Real-Time: Theory in Practice, pages 74–106, Berlin, Heidelberg, 1992. Springer Berlin Heidelberg.

[AH94] Rajeev Alur and Thomas A. Henzinger. A really temporal logic. J. ACM, 41(1):181–203, January 1994.

[APSS21] Shaun Azzopardi, Gordon Pace, Fernando Schapachnik, and Gerardo Schneider. On the specification and monitoring of timed normative systems. In Lu Feng and Dana Fisman, editors, Runtime Verification, pages 81–99, Cham, 2021. Springer International Publishing.

[BBHB13] Justin M. Beaver, Raymond C. Borges-Hink, and Mark A. Buckner. An evaluation of machine learning methods to detect malicious SCADA communications. In 2013 12th International Conference on Machine Learning and Applications, volume 2, pages 54–59, 2013.

[BBM06] Patricia Bouyer, Thomas Brihaye, and Nicolas Markey. Improved undecidability results on weighted timed automata. Information Processing Letters, 98(5):188–194, 2006.

[BDL04] Gerd Behrmann, Alexandre David, and Kim G. Larsen. A tutorial on UPPAAL. In Formal Methods for the Design of Real-Time Systems: 4th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, SFM-RT 2004, pages 200–236, 2004.

[BF12] Andreas Bauer and Yliès Falcone. Decentralised LTL monitoring. In FM 2012: Formal Methods, pages 85–100, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[BF16a] Andreas Bauer and Yliès Falcone. Decentralised LTL monitoring. Formal Methods in System Design, 48(1):46–93, 2016.

[BF16b] B. Bonakdarpour and B. Finkbeiner. Runtime verification for HyperLTL. In Proceedings of the 16th International Conference on Runtime Verification, pages 41–45, 2016.

[BF18] Borzoo Bonakdarpour and Bernd Finkbeiner.
The complexity of monitoring hyperproperties. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 162–174, 2018.

[BFR+16] B. Bonakdarpour, P. Fraigniaud, S. Rajsbaum, D. A. Rosenblueth, and C. Travers. Decentralized asynchronous crash-resilient runtime verification. In Proceedings of the 27th International Conference on Concurrency Theory (CONCUR), pages 16:1–16:15, 2016.

[BGF18a] Reza Babaee, Arie Gurfinkel, and Sebastian Fischmeister. Predictive run-time verification of discrete-time reachability properties in black-box systems using trace-level abstraction and statistical learning. In Christian Colombo and Martin Leucker, editors, Runtime Verification, pages 187–204, Cham, 2018. Springer International Publishing.

[BGF18b] Reza Babaee, Arie Gurfinkel, and Sebastian Fischmeister. Prevent: A predictive run-time verification framework using statistical learning. In Einar Broch Johnsen and Ina Schaefer, editors, Software Engineering and Formal Methods, pages 205–220, Cham, 2018. Springer International Publishing.

[BHBB+14] Raymond C. Borges Hink, Justin M. Beaver, Mark A. Buckner, Tommy Morris, Uttam Adhikari, and Shengyi Pan. Machine learning for power system disturbance and cyber-attack discrimination. In 2014 7th International Symposium on Resilient Control Systems (ISRCS), pages 1–8, 2014.

[BKM10] David Basin, Felix Klaedtke, and Samuel Müller. Monitoring security policies with metric first-order temporal logic. In Proceedings of the 15th ACM Symposium on Access Control Models and Technologies, SACMAT ’10, pages 23–34, New York, NY, USA, 2010. Association for Computing Machinery.

[BKMZ13] David Basin, Felix Klaedtke, Srdjan Marinovic, and Eugen Zălinescu. Monitoring of temporal first-order properties with aggregations. In Axel Legay and Saddek Bensalem, editors, Runtime Verification, pages 40–58, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[BKMZ15] David Basin, Felix Klaedtke, Samuel Müller, and Eugen Zălinescu.
Monitoring metric first-order temporal properties. J. ACM, 62(2), May 2015.

[BKZ12] David Basin, Felix Klaedtke, and Eugen Zălinescu. Algorithms for monitoring real-time properties. In Sarfraz Khurshid and Koushik Sen, editors, Runtime Verification, pages 260–275, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[BKZ15] David Basin, Felix Klaedtke, and Eugen Zălinescu. Failure-aware Runtime Verification of Distributed Systems. In Prahladh Harsha and G. Ramalingam, editors, 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015), volume 45 of Leibniz International Proceedings in Informatics (LIPIcs), pages 590–603, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[BLS10a] A. Bauer, M. Leucker, and C. Schallhart. Comparing LTL Semantics for Runtime Verification. Journal of Logic and Computation, 20(3):651–674, 2010.

[BLS10b] Andreas Bauer, Martin Leucker, and Christian Schallhart. Comparing LTL semantics for runtime verification. Journal of Logic and Computation, 20(3):651–674, 2010.

[BLS11] A. Bauer, M. Leucker, and C. Schallhart. Runtime Verification for LTL and TLTL. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(4):14:1–14:64, 2011.

[Bow93] J. Bowen. Formal methods in safety-critical standards. In Proceedings 1993 Software Engineering Standards Symposium, pages 168–177, 1993.

[BSB17] Noel Brett, Umair Siddique, and Borzoo Bonakdarpour. Rewriting-based runtime verification for alternation-free HyperLTL. In Axel Legay and Tiziana Margaria, editors, Tools and Algorithms for the Construction and Analysis of Systems, pages 77–93, Berlin, Heidelberg, 2017. Springer Berlin Heidelberg.

[CBP21] Francesca Cairoli, Luca Bortolussi, and Nicola Paoletti. Neural predictive monitoring under partial observability. In Runtime Verification: 21st International Conference, RV 2021, Virtual Event, October 11–14, 2021, Proceedings, pages 121–141, Berlin, Heidelberg, 2021.
Springer-Verlag.

[CCF+05] Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival. The Astrée analyzer. In Mooly Sagiv, editor, Programming Languages and Systems, pages 21–30, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.

[CDD+15] Cristiano Calcagno, Dino Distefano, Jeremy Dubreil, Dominik Gabi, Pieter Hooimeijer, Martino Luca, Peter O’Hearn, Irene Papakonstantinou, Jim Purbrick, and Dulma Rodriguez. Moving fast with software verification. In Klaus Havelund, Gerard Holzmann, and Rajeev Joshi, editors, NASA Formal Methods, pages 3–11, Cham, 2015. Springer International Publishing.

[CDE+02] Manuel Clavel, Francisco Durán, Steven Eker, Patrick Lincoln, Narciso Martí-Oliet, José Meseguer, and Carolyn Talcott. Maude: Specification and programming in rewriting logic. In José Meseguer and Steven Eker, editors, Rewriting Logic and Its Applications, pages 76–95. Springer Berlin Heidelberg, 2002.

[CDE+13] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google’s globally distributed database. ACM Trans. Comput. Syst., 31(3), August 2013.

[CDS+17] Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A Storm is coming: A modern probabilistic model checker. In Computer Aided Verification (CAV), 2017.

[CF16] C. Colombo and Y. Falcone. Organising LTL monitors over distributed systems with a global clock. Formal Methods in System Design, 49(1-2):109–158, 2016.

[CFK+14] Michael R. Clarkson, Bernd Finkbeiner, Masoud Koleini, Kristopher K. Micinski, Markus N. Rabe, and César Sánchez.
Temporal logics for hyperproperties. In Martín Abadi and Steve Kremer, editors, Principles of Security and Trust, pages 265–284, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[CGNM13] H. Chauhan, V. K. Garg, A. Natarajan, and N. Mittal. A distributed abstraction algorithm for online predicate detection. In Proceedings of the 32nd IEEE Symposium on Reliable Distributed Systems (SRDS), pages 101–110, 2013.

[Cli14] William D. Clinger. Advantages of formal specifications. https://course.ccs.neu.edu/cs5500f14/Notes/Communication2/formalSpecs2.html, Fall 2014.

[CLRC17] Julien Cumin, Grégoire Lefebvre, Fano Ramparany, and James L. Crowley. A dataset of routine daily activities in an instrumented home. In Sergio F. Ochoa, Pritpal Singh, and José Bravo, editors, Ubiquitous Computing and Ambient Intelligence, pages 413–425, Cham, 2017. Springer International Publishing.

[CMK+21] Anthony Corso, Robert J. Moss, Mark Koren, Ritchie Lee, and Mykel J. Kochenderfer. A survey of algorithms for black-box safety validation of cyber-physical systems. Journal of Artificial Intelligence Research (JAIR), 72:377–428, 2021.

[CPR18] Xiaohong Chen, Daejun Park, and Grigore Roşu. A language-independent approach to smart contract verification. In Tiziana Margaria and Bernhard Steffen, editors, Leveraging Applications of Formal Methods, Verification and Validation. Industrial Practice, pages 405–413, Cham, 2018. Springer International Publishing.

[CS08] Michael R. Clarkson and Fred B. Schneider. Hyperproperties. In 2008 21st IEEE Computer Security Foundations Symposium, pages 51–65, 2008.

[DDG+17] Jyotirmoy V. Deshmukh, Alexandre Donzé, Shromona Ghosh, Xiaoqing Jin, Garvit Juniwal, and Sanjit A. Seshia. Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30, 2017.

[DFM13] Alexandre Donzé, Thomas Ferrère, and Oded Maler. Efficient robust monitoring for STL.
In Natasha Sharygina and Helmut Veith, editors, Computer Aided Verification, pages 264–279, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[dMB08] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 337–340, 2008.

[DSS+05] B. D’Angelo, S. Sankaranarayanan, C. Sanchez, W. Robinson, B. Finkbeiner, H. B. Sipma, S. Mehrotra, and Z. Manna. Lola: runtime monitoring of synchronous systems. In 12th International Symposium on Temporal Representation and Reasoning (TIME’05), pages 166–174, 2005.

[Dwy20] Matthew Dwyer. Property pattern mappings for LTL. [Website], 2020.

[EHF18] Antoine El-Hokayem and Yliès Falcone. Can we monitor all multithreaded programs? In Christian Colombo and Martin Leucker, editors, Runtime Verification, pages 64–89, Cham, 2018. Springer International Publishing.

[EHF22] Antoine El-Hokayem and Yliès Falcone. Bringing runtime verification home: a case study on the hierarchical monitoring of smart homes using decentralized specifications. International Journal on Software Tools for Technology Transfer, 24(2):159–181, 2022.

[EP18] Joshua Ellul and Gordon J. Pace. Runtime verification of Ethereum smart contracts. In 2018 14th European Dependable Computing Conference (EDCC), pages 158–163. IEEE, 2018.

[Fal10] Yliès Falcone. You should better enforce than verify. In Howard Barringer, Yliès Falcone, Bernd Finkbeiner, Klaus Havelund, Insup Lee, Gordon Pace, Grigore Roşu, Oleg Sokolsky, and Nikolai Tillmann, editors, Runtime Verification, pages 89–105, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[FHST17] Bernd Finkbeiner, Christopher Hahn, Marvin Stenger, and Leander Tentrup. Monitoring hyperproperties. In Shuvendu Lahiri and Giles Reger, editors, Runtime Verification, pages 190–207, Cham, 2017. Springer International Publishing.

[FMRS18] Yliès Falcone, Leonardo Mariani, Antoine Rollet, and Saikat Saha.
Runtime Failure Prevention and Reaction, pages 103–134. Springer International Publishing, Cham, 2018.

[FP07] Georgios E. Fainekos and George J. Pappas. Robust sampling for MITL specifications. In Jean-François Raskin and P. S. Thiagarajan, editors, Formal Modeling and Analysis of Timed Systems, pages 147–162, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[FP18] Nathan Fulton and André Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), April 2018.

[FRRT14] P. Fraigniaud, S. Rajsbaum, M. Roy, and C. Travers. The opinion number of set-agreement. In Principles of Distributed Systems - 18th International Conference (OPODIS), pages 155–170, 2014.

[FRS15] Bernd Finkbeiner, Markus N. Rabe, and César Sánchez. Algorithms for model checking HyperLTL and HyperCTL*. In Daniel Kroening and Corina S. Păsăreanu, editors, Computer Aided Verification, pages 30–48, Cham, 2015. Springer International Publishing.

[FRT13] P. Fraigniaud, S. Rajsbaum, and C. Travers. Locality and checkability in wait-free computing. Distributed Computing, 26(4):223–242, 2013.

[FRT14] P. Fraigniaud, S. Rajsbaum, and C. Travers. On the number of opinions needed for fault-tolerant run-time monitoring in distributed systems. In Runtime Verification (RV), pages 92–107, 2014.

[FRT20] Pierre Fraigniaud, Sergio Rajsbaum, and Corentin Travers. A lower bound on the number of opinions needed for fault-tolerant decentralized run-time monitoring. Journal of Applied and Computational Topology, 4(1):141–179, 2020.

[GAJM17] Jonathan Goh, Sridhar Adepu, Khurum Nazir Junejo, and Aditya Mathur. A dataset to support research in the design of secure water treatment systems. In Grigore Havarneanu, Roberto Setola, Hypatia Nassopoulos, and Stephen Wolthusen, editors, Critical Information Infrastructures Security, pages 88–99, Cham, 2017. Springer International Publishing.

[Gar02] V. K. Garg.
Elements of distributed computing. Wiley, 2002.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, 1979.

[GMB21] Ritam Ganguly, Anik Momtaz, and Borzoo Bonakdarpour. Distributed Runtime Verification Under Partial Synchrony. In 24th International Conference on Principles of Distributed Systems (OPODIS 2020), volume 184, pages 20:1–20:17, 2021.

[Her18] Maurice Herlihy. Atomic cross-chain swaps. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, pages 245–254, 2018.

[HGM20] Marieke Huisman, Dilian Gurov, and Alexander Malkis. Formal methods: From academia to industrial practice. A travel guide, 2020.

[HR01a] K. Havelund and G. Rosu. Monitoring Programs Using Rewriting. In Automated Software Engineering (ASE), pages 135–143, 2001.

[HR01b] Klaus Havelund and Grigore Rosu. Monitoring programs using rewriting. In Proceedings of the 16th IEEE International Conference on Automated Software Engineering, ASE ’01, page 135, USA, 2001. IEEE Computer Society.

[HR04] Klaus Havelund and Grigore Roşu. An overview of the runtime verification tool Java PathExplorer. Formal Methods in System Design, 24(2):189–215, 2004.

[JTS21] Sebastian Junges, Hazem Torfah, and Sanjit A. Seshia. Runtime monitors for Markov decision processes. In Alexandra Silva and K. Rustan M. Leino, editors, Computer Aided Verification, pages 553–576, Cham, 2021. Springer International Publishing.

[KDM+14] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone. Logical physical clocks. In Proceedings of the 18th International Conference on Principles of Distributed Systems, pages 17–32, 2014.

[KHF19] Sean Kauffman, Klaus Havelund, and Sebastian Fischmeister. Monitorability over unreliable channels. In Bernd Finkbeiner and Leonardo Mariani, editors, Runtime Verification, pages 256–272, Cham, 2019. Springer International Publishing.
[KKP+15] Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. Frama-C: A software analysis perspective. Formal Aspects of Computing, 27(3):573–609, 2015.

[KNP04] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM: Probabilistic symbolic model checker. International Journal on Software Tools for Technology Transfer, 6(2):128–142, 2004.

[Koy90] R. Koymans. Specifying Real-Time Properties with Metric Temporal Logic. Real-Time Systems, 2(4):255–299, 1990.

[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.

[LHJ+14] Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI ’14, pages 399–414, USA, 2014. USENIX Association.

[LLL+17] Haopeng Liu, Guangpu Li, Jeffrey F. Lukman, Jiaxin Li, Shan Lu, Haryadi S. Gunawi, and Chen Tian. DCatch: Automatically detecting distributed concurrency bugs in cloud systems. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’17, pages 677–691, New York, NY, USA, 2017. Association for Computing Machinery.

[LLLG16] Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 517–530, New York, NY, USA, 2016. Association for Computing Machinery.

[LM10] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[LPY97] K. G. Larsen, P. Pettersson, and W. Yi.
UPPAAL in a nutshell. International Journal on Software Tools for Technology Transfer, 1(1-2):134–152, 1997.
[LSS+18] Martin Leucker, César Sánchez, Torben Scheffel, Malte Schmitz, and Alexander Schramm. TeSSLa: Runtime verification of non-synchronized real-time streams. In ACM Symposium on Applied Computing (SAC), France, April 2018. ACM.
[LSS+19] Martin Leucker, César Sánchez, Torben Scheffel, Malte Schmitz, and Daniel Thoma. Runtime verification for timed event streams with partial information. In Bernd Finkbeiner and Leonardo Mariani, editors, Runtime Verification, pages 273–291, Cham, 2019. Springer International Publishing.
[Lyn96] N. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, San Mateo, CA, 1996.
[MB15] M. Mostafa and B. Bonakdarpour. Decentralized runtime verification of LTL specifications in distributed systems. In Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 494–503, 2015.
[MBAB21] Anik Momtaz, Niraj Basnet, Houssam Abbas, and Borzoo Bonakdarpour. Predicate monitoring in distributed cyber-physical systems. In Lu Feng and Dana Fisman, editors, Runtime Verification, pages 3–22, Cham, 2021. Springer International Publishing.
[MG01] Neeraj Mittal and Vijay K. Garg. On detecting global predicates in distributed computations. In Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS 2001), Phoenix, Arizona, USA, April 16-19, 2001, pages 3–10, 2001.
[MG05] N. Mittal and V. K. Garg. Techniques and applications of computation slicing. Distributed Computing, 17(3):251–277, 2005.
[MGS19] Peter Mehlitz, Dimitra Giannakopoulou, and Nastaran Shafiei. Analyzing airspace data with RACE. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pages 1–10, 2019.
[Mil10] D. Mills. Network time protocol version 4: Protocol and algorithms specification. RFC 5905, RFC Editor, June 2010.
[MLD+13] Yannick Moy, Emmanuel Ledinot, Hervé Delseny, Virginie Wiels, and Benjamin Monate. Testing or formal verification: DO-178C alternatives and industrial experience. IEEE Software, 30(3):50–57, 2013.
[MP79] Z. Manna and A. Pnueli. The modal logic of programs. In Proceedings of the 6th Colloquium on Automata, Languages and Programming (ICALP), pages 385–409, 1979.
[MP95] Zohar Manna and Amir Pnueli. Temporal Verification of Reactive Systems - Safety. Springer, 1995.
[Nol13] Tier Nolan. Alt chains and atomic transfers. https://bitcointalk.org/index.php?topic=193281.0, May 2013. Bitcoin Forum.
[NYC15] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, 2015.
[OG07] V. A. Ogale and V. K. Garg. Detecting temporal logic predicates on distributed computations. In Proceedings of the 21st International Symposium on Distributed Computing (DISC), pages 420–434, 2007.
[PFJ+13] Srinivas Pinisetty, Yliès Falcone, Thierry Jéron, Hervé Marchand, Antoine Rollet, and Omer Landry Nguena Timo. Runtime enforcement of timed properties. In Shaz Qadeer and Serdar Tasiran, editors, Runtime Verification, pages 229–244, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
[PMA15a] Shengyi Pan, Thomas Morris, and Uttam Adhikari. Classification of disturbances and cyber-attacks in power systems using heterogeneous time-synchronized data. IEEE Transactions on Industrial Informatics, 11(3):650–662, 2015.
[PMA15b] Shengyi Pan, Thomas Morris, and Uttam Adhikari. Developing a hybrid intrusion detection system using data mining for power systems. IEEE Transactions on Smart Grid, 6(6):3104–3113, 2015.
[PMSP20] João Carlos Pereira, Nuno Machado, and Jorge Sousa Pinto. Testing for race conditions in distributed systems via SMT solving. In Wolfgang Ahrendt and Heike Wehrheim, editors, Tests and Proofs, pages 122–140, Cham, 2020. Springer International Publishing.
[Pnu77] A. Pnueli.
The temporal logic of programs. In Symposium on Foundations of Computer Science (FOCS), pages 46–57, 1977.
[PSS18] Srinivas Pinisetty, Gerardo Schneider, and David Sands. Runtime verification of hyperproperties for deterministic programs. In 2018 IEEE/ACM 6th International FME Workshop on Formal Methods in Software Engineering (FormaliSE), pages 20–29, 2018.
[PZS+18] Daejun Park, Yi Zhang, Manasvi Saxena, Philip Daian, and Grigore Roşu. A formal verification tool for Ethereum VM bytecode. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, pages 912–915, New York, NY, USA, 2018. Association for Computing Machinery.
[RKG+19] Denise Ratasich, Faiq Khalid, Florian Geissler, Radu Grosu, Muhammad Shafique, and Ezio Bartocci. A roadmap toward the resilient internet of things for cyber-physical systems. IEEE Access, 7:13260–13283, 2019.
[RTC22] RTCA. DO-178C Software Considerations in Airborne Systems and Equipment Certification. [Website], 2022.
[Sán21] César Sánchez. Synchronous and asynchronous stream runtime verification. In Proceedings of the 5th ACM International Workshop on Verification and MOnitoring at Runtime EXecution, VORTEX 2021, pages 5–7, New York, NY, USA, 2021. Association for Computing Machinery.
[SBS+12] Scott D. Stoller, Ezio Bartocci, Justin Seyster, Radu Grosu, Klaus Havelund, Scott A. Smolka, and Erez Zadok. Runtime verification with state estimation. In Sarfraz Khurshid and Koushik Sen, editors, Runtime Verification, pages 193–207, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[Ser23a] Amazon Web Services. EKS runtime monitoring. https://docs.aws.amazon.com/guardduty/latest/ug/guardduty-eks-runtime-monitoring.html, As of 2023.
[Ser23b] Amazon Web Services. What is Amazon GuardDuty. https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html, As of 2023.
[SLX16] Chih-Che Sun, Chen-Ching Liu, and Jing Xie.
Cyber-physical system security of a power grid: State-of-the-art. Electronics, 5(3), 2016.
[SP18] Wolfgang Schwab and Mathieu Poujol. The state of industrial cybersecurity 2018. Trend Study Kaspersky Reports, 33, 2018.
[SS95] Scott D. Stoller and Fred B. Schneider. Verifying programs that use causally-ordered message-passing. Science of Computer Programming, 24(2):105–128, 1995.
[SSS16] Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Towards verified artificial intelligence, 2016.
[SVA04] Koushik Sen, Mahesh Viswanathan, and Gul Agha. Statistical model checking of black-box probabilistic systems. In Rajeev Alur and Doron A. Peled, editors, Computer Aided Verification, pages 202–215, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
[SVAG04] K. Sen, A. Vardhan, G. Agha, and G. Rosu. Efficient decentralized monitoring of safety in distributed systems. In Proceedings of the 26th International Conference on Software Engineering (ICSE), pages 418–427, 2004.
[SWDD09] Jean Souyris, Virginie Wiels, David Delmas, and Hervé Delseny. Formal verification of avionics software products. In Ana Cavalcanti and Dennis R. Dams, editors, FM 2009: Formal Methods, pages 532–546, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
[Tec17] Parity Technologies. https://github.com/paritytech/parity, As of 2017.
[VKTA20] Vidhya Tekken Valapil, Sandeep Kulkarni, Eric Torng, and Gabe Appleton. Efficient two-layered monitor for partially synchronous distributed systems. In 2020 International Symposium on Reliable Distributed Systems (SRDS), pages 123–132, 2020.
[VYK+17] V. T. Valapil, S. Yingchareonthawornchai, S. S. Kulkarni, E. Torng, and M. Demirbas. Monitoring partially synchronous distributed systems using SMT solvers. In Proceedings of the 17th International Conference on Runtime Verification (RV), pages 277–293, 2017.
[WOH19] James Worrell, Joël Ouaknine, and Hsi-Ming Ho. On the expressiveness and monitoring of metric temporal logic.
Logical Methods in Computer Science, 15, 2019.
[WOZ+20] H. Wu, A. Ozdemir, A. Zeljić, K. Julian, A. Irfan, D. Gopinath, S. Fouladi, G. Katz, C. Pasareanu, and C. Barrett. Parallelization techniques for verifying neural networks. In 2020 Formal Methods in Computer Aided Design (FMCAD), pages 128–137, 2020.
[XH21] Yingjie Xue and Maurice Herlihy. Hedging against sore loser attacks in cross-chain transactions. arXiv preprint arXiv:2105.06322, 2021.
[YNV+16] Sorrachai Yingchareonthawornchai, Duong N. Nguyen, Vidhya Tekken Valapil, Sandeep S. Kulkarni, and Murat Demirbas. Precision, recall, and sensitivity of monitoring partially synchronous distributed systems. In Runtime Verification - 16th International Conference, RV 2016, Madrid, Spain, September 23-30, 2016, Proceedings, pages 420–435, 2016.