TRACING DISTRIBUTED ALGORITHMS USING REPLAY CLOCKS

By Ishaan Kiran Lagwankar

A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Master of Science

2024

ABSTRACT

In this thesis, we introduce replay clocks (RepCl), a novel clock infrastructure that enables offline analyses of distributed computations. The replay clock provides a methodology to replay a computation as it happened, with the ability to represent concurrent events effectively. It builds on the structures introduced by vector clocks (VC) and the Hybrid Logical Clock (HLC), combining their infrastructures to provide efficient replay. With such a clock, a user can replay a computation while considering multiple paths of execution, and check for constraint violations and properties along potential pathways, especially in the presence of concurrent events. Specifically, if event e must occur before f, then the replay clock must ensure that e is replayed before f. On the other hand, if e and f could occur in any order, replay should not force an order between them. After identifying the limitations of existing clocks in providing the replay primitive, we present the RepCl structure and identify an efficient representation for it. We demonstrate that RepCl can be implemented with less than four integers for 64 processes for various system parameters if clocks are synchronized within 1 ms. Furthermore, the overhead of RepCl (for computing/comparing timestamps and for message size) is proportional to the size of the clock. Using simulations in a custom distributed system and NS-3, a state-of-the-art network simulator, we identify the expected overhead of RepCl based on the given system settings. We also show how a user can identify the feasibility region for RepCl: given the desired overhead of RepCl, it identifies the region where unabridged replay is possible.
Using the RepCl, we provide a tracer for distributed computations that allows any computation using the RepCl to be replayed efficiently. The visualization allows users to analyze specific properties and constraints in an online fashion, with the ability to consider concurrent paths independently. It provides per-process views and an overarching view of the whole computation based on the time recorded by the RepCl for each event.

Copyright by ISHAAN KIRAN LAGWANKAR 2024

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest appreciation to my committee for supporting me through this work in the course of my Master's program. I am deeply indebted to Dr. Sandeep S. Kulkarni, and the ideas he has put forth that allowed me to progress in this research. The completion of this thesis would not have been possible without his support. I am also deeply indebted to Dr. Li Xiao and Dr. Philip K. McKinley for providing guidance through the committee and feedback on the work done in this research. I would like to extend my sincere thanks to the Department of Computer Science and Michigan State University, with gratitude to Vincent Mattison and Brenda Hodge, for supporting me with administrative decisions and financial support for the ICDCN '24 conference in which this work was first published. I would also like to thank my peers in the Department for their constant support and feedback on the work I have done, without which this thesis would not be complete. Lastly, I express deep gratitude to my family for providing me with the opportunities that brought me to this stage, and I will be forever indebted to them for their constant nurturing and support throughout my academic career.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Contributions
CHAPTER 2 PRELIMINARIES
CHAPTER 3 REPLAY WITH CLOCKS
3.1 Limitations of Existing Clocks for Replay
3.2 Requirements of Replay Clock (RepCl)
CHAPTER 4 ALGORITHM FOR THE REPLAY CLOCK (RepCl)
4.1 Structure of RepCl Timestamp
4.2 Efficient traversal and lookup
4.3 Helper functions
4.4 Description of the RepCl Algorithm
4.5 Comparing RepCl Timestamps
4.6 Properties of RepCl
4.7 Effect of discretization and comparison with Hybrid Vector Clocks [1]
4.8 Representation of the RepCl and its Overhead
CHAPTER 5 SIMULATOR SETUP
5.1 CDES, A Custom Discrete Event Simulator
5.2 NS-3 Simulator
CHAPTER 6 SIMULATION RESULTS
6.1 Effect of Clock Skew (E)
6.2 Effect of Interval Size (I)
6.3 Effect of Message Delay (δ)
6.4 Feasibility Regions
CHAPTER 7 VISUALIZING TRACES WITH REPVIZ
7.1 Implementation
7.2 User View
CHAPTER 8 RELATED WORK AND DISCUSSION
8.1 Clocks in Distributed Systems
8.2 Visualizing Traces
8.3 Discussion
CHAPTER 9 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

According to the observer effect, when we try to measure something, we change it to some extent [2]. Therefore, precise measurement is never truly possible. Computer programs suffer from the same difficulty: when you try to measure something in a computation, it changes the underlying computation. In an ideal world, a program may want to make sure that every step it takes is correct with respect to any environmental changes. However, the time taken for performing these checks may cause the program to be incorrect. In other words, it is possible that adding excessive safety checks, or checks for guaranteeing fairness, may cause the system to spend substantial time on those checks, thereby violating system requirements, even though those requirements would never have been violated without those checks in the first place. This issue is even more complicated in distributed computing, where each process (component, node, etc.) relies on partial information. Hence, computing the required safety checks (or checking for the satisfaction of fairness requirements, etc.) would require processes to communicate with each other. In turn, the time for computing them would be even higher. As an illustration, consider two drones A and B that are cooperating to perform a task. Each drone may take independent actions based on some environment that the other drone cannot see.
It is required that the area covered by the drones remains at 50% or above at all times (100% of the time). It is also preferred that this area remains at 75% most of the time (75% of the time). Here, we would like to know (1) how frequently a given point x was covered by one of the drones, (2) how frequently a given point was covered by both drones, (3) what the minimum coverage was at any time, etc. (Assume that they are at two different altitudes, so safety measures such as collision avoidance are not necessary.) One possibility is to require A to notify B of its actions at all times, so that B can adjust its plan to account for what A is doing, and vice versa. However, this requires A to spend unnecessary time communicating with B, and vice versa. In other words, it may be necessary for A and B to move independently. Doing all these checks during the execution would require that A and B communicate with each other before they make any move. In turn, it would change the behavior of the drones completely, thereby preventing us from drawing any conclusion about their behavior in the absence of these checks. Furthermore, the problem would be even more complicated with a larger number of drones. One way to address this problem is to log the computation as it happens so we can evaluate it later for all properties of interest. These properties may include non-critical safety properties, desirable performance criteria, etc. To be beneficial, the amount of storage for the log, and the time to create it, should be small enough that the underlying computation is affected minimally. In other words, the measurement should not change the underlying computation substantially. At the same time, the log should capture the non-determinism that is inherently present in any distributed computation.
We also want to make sure that the creation of the logs is performed independently by each process, i.e., each process stores its local state whenever it changes, along with a timestamp (discussed next) that identifies when the change was made. We also assume that all messages are logged. We consider various approaches for storing the timestamps and their implications. The simplest approach is to let the timestamp be the physical time of the relevant process. Here, the storage and computation cost is very low. However, the physical clocks of processes often differ. Hence, drone A may send a message at time 50 (local time of A) but it is received by B at time 40 (B's local time). When we try to replay this log to evaluate the given properties, B will receive the message before A has sent it. This is unacceptable, as it violates the system's consistency. The next approach we can consider is vector clocks. Vector clocks introduce two concerns. First, their size of O(n), where n is the number of processes, may be too high. Second, vector clocks have no reference to the physical clock and do not account for communication outside the system. For example, it is possible that drone A activated a green LED at physical time t1 and B activated a white LED at time t2 where t2 >> t1. In other words, an external observer will know that the action of A occurred before that of B. However, if A and B did not communicate, then the corresponding events are concurrent [3]. Thus, when we replay the log, it is possible that the white LED event could be replayed before the green LED event. This is also unacceptable. Hybrid logical clocks (HLC) [4] combine logical clocks and physical clocks. Specifically, they rely on a system where physical clocks are synchronized within an acceptable limit of clock skew, E, and they guarantee that hlc.e < hlc.
๐‘“ if ๐‘’ happened before ๐‘“ or ๐‘๐‘ก.๐‘’ + E < ๐‘๐‘ก. ๐‘“ ([3]). Here, โ„Ž๐‘™๐‘.๐‘’ denotes the Hybrid Logical Clock of process ๐‘’, and ๐‘๐‘ก.๐‘’ denotes the physical time observed on process ๐‘’. In other words, โ„Ž๐‘™๐‘.๐‘’ < โ„Ž๐‘™๐‘. ๐‘“ if ๐‘“ causally depends upon ๐‘’ or ๐‘“ occurred substantially after ๐‘’. They eliminate the problem associated with physical clocks as HLC respects causality. They also eliminate the problem caused by vector clocks as the HLC timestamp of activating the green LED will be less than the ๐ป ๐ฟ๐ถ timestamp of activating the white LED. ๐ป ๐ฟ๐ถ does create another problem though. Consider the case where we have events ๐‘’ and ๐‘“ such that | ๐‘๐‘ก.๐‘’ โˆ’ ๐‘๐‘ก. ๐‘“ | < E and ๐‘’|| ๐‘“ , i.e., the events are causally concurrent and very close to each other in physical time. Without loss of generality, let โ„Ž๐‘™๐‘.๐‘’ < โ„Ž๐‘™๐‘. ๐‘“ . In this situation, when we replay the log, ๐‘’ will always occur before ๐‘“ . In other words, the log does not have the necessary information that could allow it to replay ๐‘“ before ๐‘’ even though they could have occurred in any order. An extension of HLC, hybrid vector clocks [1] reduces some of these issues. However, as we highlight in Section 4.6, this overhead is still quite high. Based on these limitations, in this thesis, we focus on building a new clock, the Replay Clock (RepCl), that combines hybrid logical clocks and vector clocks to eliminate their limitations. Our goal is to investigate scenarios under which RepCl permits efficient replay of events. To understand why we may need to limit RepCl to specific scenarios, observe that if the underlying system was asynchronous (unbounded clock drift) then it is required to have ๐‘‚ (๐‘›) vector clocks to enable replay of events. Systems that communicate frequently will need more information stored to replay events. 
Thus, we focus on the following problem: given the amount of permissible overhead for logging events, what are the scenarios where perfect replay of events is possible? Once we identify the scenarios in which perfect replay is available, we design a trace visualizer, named RepViz, that allows us to depict candidate traces with the ability to replay concurrent events in any order of execution. This visualizer takes in a RepCl-timestamped trace and generates the visualization depicted in Chapter 7. We provide per-process views with timelines of events to depict the events occurring while indicating causality between those events. The trace visualizer provides an interactive display with orderings of events, and allows the user to reorder concurrent events to view different candidate traces. The visualizer API is discussed in Chapter 7.

1.1 Contributions

• We present RepCl, a replay clock that enables the replay of events in a distributed system. It guarantees that if there is a causal relation between e and f [3], or if f occurred far later than e, then RepCl.e < RepCl.f, i.e., the replay will cause e to be replayed before f. On the other hand, if e and f are causally concurrent and occurred close in physical time, then they could be replayed in any order.

• By considering various system parameters, clock skew (E), message rate (α), and message delay (δ), we identify the feasibility region for RepCl.

• We implement an API for the RepCl for NS-3, a widely used distributed network simulator. It provides all the operations, along with documentation on how it integrates with the different network components available in NS-3 simulations.

• We design RepViz, a visualizer that generates traces from RepCl-timestamped logs of a distributed computation. The visualizer provides the user with various orders of replay, and allows the user to view different candidate traces and evaluate various constraints along those traces through the visualization.
Organization of the thesis: This thesis is organized as follows. In Chapter 2, we describe the model of computation for distributed systems, including the notions of causality and clock synchronization. We move on to the idea of the replay clock and the problems it solves in Chapter 3. In Chapter 4, we describe the algorithms associated with the RepCl and its properties. Additionally, we describe the representation of the RepCl and the various overheads that are characteristic of the design of the clock. Chapter 5 describes the design of the simulators we used to collect metrics for the RepCl. These metrics are discussed in Chapter 6, with analysis of the size of the clock and the feasibility of its implementation. Chapter 7 describes the design of RepViz, the visualization system for trace building using the RepCl. Chapter 8 discusses related work and identifies questions raised by RepCl. Finally, in Chapter 9, we conclude and discuss future work.

CHAPTER 2 PRELIMINARIES

A distributed system is a set of processes 1..n. Each process has three types of events: (1) send, where it sends a message to another process, (2) receive, where it receives a message from another process, and (3) local, where it performs some local computation. We define the happened-before (denoted by hb) relation [3] among the events in a distributed computation.

• If e and f happened on the same process and e occurred before f, then e hb f.

• If e was a send event and f was the corresponding receive event, then e hb f.

• The hb relation is transitive, i.e., if there exist events e, f, and g such that e hb g and g hb f, then e hb f.

We say that e || f iff ¬(e hb f) ∧ ¬(f hb e). In other words, e is concurrent with f iff e did not happen before f and f did not happen before e.
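The happened-before relation above can be computed mechanically as a transitive closure over process-order and send/receive edges. The following is a minimal illustrative sketch (not from the thesis; event names and edge lists are hypothetical):

```python
# Sketch of the happened-before (hb) relation: hb is the transitive
# closure of process-order edges and send/receive edges.

def happened_before(process_order, msg_pairs):
    """Return the set of pairs (e, f) such that e hb f."""
    hb = set(process_order) | set(msg_pairs)
    changed = True
    while changed:  # transitive closure by fixpoint iteration
        changed = False
        for (e, g) in list(hb):
            for (g2, f) in list(hb):
                if g == g2 and (e, f) not in hb:
                    hb.add((e, f))
                    changed = True
    return hb

def concurrent(hb, e, f):
    # e || f iff neither e hb f nor f hb e
    return (e, f) not in hb and (f, e) not in hb
```

For instance, with A before B on one process and a message from B received at C, transitivity gives A hb C, while an unrelated event D is concurrent with all three.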
A timestamping algorithm assigns a timestamp to every event e in the system as soon as the event is created. Additionally, the timestamping algorithm defines a < relation that identifies how two timestamps are compared. As an illustration, Lamport's logical clock [5] assigns an integer timestamp l.e to every event e. The < relation for Lamport's logical clocks is the standard < for integers. Likewise, the physical timestamping algorithm assigns pt.e to every event e, where pt.e is the physical time of the process where event e occurred at the moment it occurred, and the < relation is the same as that over integers. A vector clock [6, 7] assigns event e a timestamp vc.e, where vc.e is a vector that includes an entry vc.e.j for every process j. The < relation on two vector clocks vc.e and vc.f requires that each element in vc.e is less than or equal to the corresponding element in vc.f, and some element in vc.e is strictly less than the corresponding element in vc.f. In other words, vc.e < vc.f iff (∀j :: vc.e.j ≤ vc.f.j) ∧ (∃j :: vc.e.j < vc.f.j). Note that while the < relation is defined by the timestamping algorithm, the properties of the < relation vary. For example, logical clocks provide one-way causality information, i.e., e hb f ⇒ l.e < l.f. Vector clocks provide two-way causality information, i.e., e hb f ⇔ vc.e < vc.f. By contrast, (unsynchronized) physical clocks may not provide any guarantees. For example, it is possible that (e hb f) and pt.e ≮ pt.f hold simultaneously. We assume that each process j in the system is associated with a physical clock, pt.j.
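The vector clock < relation defined above can be sketched directly; this is an illustrative Python version (not the thesis implementation), with timestamps as plain lists:

```python
# Sketch of the vector clock "<" relation: vc.e < vc.f iff every entry
# of vc.e is <= the corresponding entry of vc.f and some entry is
# strictly smaller.

def vc_less(vc_e, vc_f):
    return (all(a <= b for a, b in zip(vc_e, vc_f))
            and any(a < b for a, b in zip(vc_e, vc_f)))

def vc_concurrent(vc_e, vc_f):
    # two-way causality: e || f iff neither timestamp is smaller
    return not vc_less(vc_e, vc_f) and not vc_less(vc_f, vc_e)
```

For example, [1, 2, 0] < [2, 2, 1], while [1, 0, 0] and [0, 1, 0] are incomparable, i.e., the corresponding events are concurrent.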
Clocks of processes are synchronized with a protocol such as NTP [8] such that the clocks of any two processes differ by at most E, where E is a parameter, i.e., ∀j, k :: |pt.j − pt.k| ≤ E. We also assume that individual clocks are monotonically increasing. We assume that messages are delivered with a minimum message delay of δ. We do not assume a maximum message delay; it could be ∞ if messages are permitted to be lost. Since our focus is on the replay of events, if a message is lost, it simply implies that the corresponding receive event is never replayed.

CHAPTER 3 REPLAY WITH CLOCKS

In this chapter, we focus on how clocks can be used to replay a given computation. We also discuss some of the limitations of using logical clocks and vector clocks in the replay process. Note that the goal of this chapter is only to illustrate the concept and the goals of replay; it does not focus on developing an efficient algorithm for the same. As discussed in the introduction, the goal of replay is to order the events so that we can evaluate various properties of interest. To replay a given computation, we begin with a set where each entry is of the form ⟨e, ts.e⟩, where e is the event (send/receive/local) and ts.e is the timestamp of e. To replay the given set of events, we first find the events e such that all events with smaller timestamps have already been replayed. In other words, we find the set {e | ¬(∃f : ts.f < ts.e)}. We replay one of these events, chosen randomly, and then remove it from the set. The process continues until all events are replayed. The algorithm for replay is shown in Algorithm 3.1.

Algorithm 3.1 ReplayEvents Operation
1. Input: S: set of events and timestamps
2. While S ≠ ∅ do
3.    FrontLine = {e | (e, ts.e) ∈ S ∧ ¬(∃f : (f, ts.f) ∈ S ∧ ts.
๐‘“ < ๐‘ก๐‘ .๐‘’)} Choose a random event ๐‘’ from ๐น๐‘Ÿ๐‘œ๐‘›๐‘ก ๐ฟ๐‘–๐‘›๐‘’ and replay it ๐‘† = ๐‘† โˆ’ {๐‘’} 6. end while 3.1 Limitations of Existing Clocks for Replay As an example, consider the execution in Figure 3.1. Here, we have four events ๐ด, ๐ต, ๐ถ, and ๐ท. Their physical timestamps and logical timestamps are shown in Figure 3.1. If we replay these events using physical clocks then the possible outcomes are ๐ถ๐ต๐ด๐ท or ๐ถ๐ต๐ท ๐ด. Note that both these outcomes are undesirable, as ๐ด should occur before ๐ต based on the causality (happened-before) relation. 8 Figure 3.1 Sample Execution Sequence and Application of Replay Algorithm. If we replay them by logical clocks, the possible outcome is ๐ถ ๐ด๐ต๐ท. However, there is no option to replay ๐ต before ๐ถ even though ๐ต||๐ถ. If we replay them with vector clocks, possible orderings are ๐ด๐ต๐ถ๐ท, ๐ด๐ถ๐ต๐ท, or ๐ถ ๐ด๐ต๐ท. If the clocks were synchronized to be within 5 time units then ๐ด๐ต๐ถ๐ท and ๐ด๐ถ๐ต๐ท are incorrect. 3.2 Requirements of Replay Clock (RepCl) In this thesis, we focus on a system where the physical clocks are synchronized to be within E, i.e., for any two processes ๐‘— and ๐‘˜, | ๐‘๐‘ก. ๐‘— โˆ’ ๐‘๐‘ก.๐‘˜ | โ‰ค E. The goal of RepCl is to assign a timestamp RepCl.๐‘’ to event ๐‘’ such that Requirement 1. If ๐‘’ happened before ๐‘“ then RepCl.๐‘’ < RepCl. ๐‘“ , i.e., ๐‘’ will always be replayed before ๐‘“ Requirement 2. If ๐‘“ occurred far after ๐‘’, i.e., ๐‘’ and ๐‘“ could not have occurred simultaneously under clock drift guarantee of E1 where E1 โ‰ˆ E then ๐‘’ will be replayed before ๐‘“ , i.e., RepCl.๐‘’ < RepCl. ๐‘“ Requirement 3. If ๐‘’ and ๐‘“ could have occurred in any order in a system where clocks were synchronized to be within E2, where E2 โ‰ˆ E then RepCl.๐‘’||RepCl. ๐‘“ (i.e., ยฌ(RepCl.๐‘’ < RepCl. ๐‘“ ) โˆง ยฌ(RepCl. 
๐‘“ < RepCl.๐‘’)) In the last two requirements, we have chosen E1 and E2 instead of E itself as it can permit more efficient implementation by allowing us to maintain a coarse-grained clock. We discuss this further in section 4.6. 9 With these requirements in mind, we show that the RepCl provides efficient replays of distributed computations, without suffering with the overhead that vector clocks impose, and the shortcomings of replay with logical clocks like the HLC. The RepCl combines the best of both these worlds, and provides a better mechanism to replay computations. 10 CHAPTER 4 ALGORITHM FOR THE REPLAY CLOCK (RepCl) In this chapter, we present our approach for RepCl. As discussed earlier, we assume that the physical clocks are synchronized to be within E. We discretize the process execution in terms of epochs, where each epoch corresponds to an increment of the clock by I, 0 < I โ‰ค E such that E = ๐œ– โˆ— I, where ๐œ– is an integer. The timeline of a process is split into epochs where each epoch is of size I(in the local process clock). In other words, the epoch of process ๐‘— is obtained by โŒŠ ๐‘๐‘ก. ๐‘— I โŒ‹. 4.1 Structure of RepCl Timestamp With such discretization, the timestamp of process ๐‘— (or event ๐‘’) is of the form โŸจ๐‘š๐‘ฅ. ๐‘—, bitmap. ๐‘— [], offset. ๐‘— [], counter. ๐‘— []โŸฉ, (4.1) where ๐‘š๐‘ฅ. ๐‘— is an integer for the approximation of the top-level ๐ป ๐ฟ๐ถ, and bitmap. ๐‘—, offset. ๐‘— and counter. ๐‘— are bitsets [9] that store at most one entry bitmap. ๐‘— .๐‘˜, offset. ๐‘— .๐‘˜ and counter. ๐‘— .๐‘˜ for process ๐‘˜. Each of these bitsets is treated as an array but are serialized as integers in packets for efficiency. The intuition behind these variables is as follows: โ€ข ๐‘š๐‘ฅ. ๐‘— denotes the maximum epoch process ๐‘— is aware of (either due to the value of ๐‘๐‘ก. ๐‘— or the value of epochs learned from messages it receives). โ€ข bitmap. 
j is an array of bits, where the bit at index k denotes whether offset.j.k is being stored. If the bit is 1, process j is actively maintaining offset.j.k. This comes in handy for efficient updates to the clock.

• mx.j − offset.j.k denotes the maximum epoch value of k that j has learnt (via a direct/indirect message from k, the clock drift assumption, etc.). If there exists an offset between two processes j and k, offset.j.k denotes the difference between mx.j and mx.k as seen by process j.

• Counters deal with the scenario where multiple events happen within the same epoch and have the same offsets. If two clocks that are not concurrent have the same mx and the same offset values, then the two clocks differ on the counters. The clock with the lower counter value is replayed first.

For example, the timestamp ⟨50, [1, 1, 1], [0, 1, 2], [4, 5, 6]⟩ denotes that this event is aware of epoch 50 of process 0 (as 50 − 0), epoch 49 of process 1 (as 50 − 1), and epoch 48 of process 2 (as 50 − 2). The counter values are 4, 5, and 6, respectively.

4.2 Efficient traversal and lookup

All computations are optimized using the bitmap. While the bitmap does not contribute to the timestamp ordering itself, it allows us to traverse and update the clock efficiently. Here we describe the traversal and lookup of offsets based on the bitmap. For brevity, we describe the rest of the algorithms as simple traversals, but it is important to note that each traversal takes O(number of 1s in the bitmap) time, and getters and setters take O(1) time. To describe these implementations, we use the integer representations of offset.j[] and counter.j[].

4.2.1 Traversal

The traversal operation is described in Algorithm 4.1, which iterates through the bitmap to find all processes for which an offset is being maintained.
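The bitmap traversal and bit extraction can be sketched with standard bit tricks; the following is an illustrative Python version (not the thesis code):

```python
# Sketch of bitmap traversal (cf. Algorithm 4.1) and bit extraction
# (cf. Algorithm 4.2). Clearing the lowest set bit each iteration makes
# the traversal O(number of 1s in the bitmap).

def traverse(bitmap):
    indices = []
    while bitmap > 0:
        lowest = bitmap & -bitmap                # isolate lowest set bit
        indices.append(lowest.bit_length() - 1)  # index of that bit
        bitmap &= bitmap - 1                     # clear lowest set bit
    return indices

def extract(number, k, p):
    # extract k bits of `number` starting at bit position p
    return ((1 << k) - 1) & (number >> p)
```

For example, traverse(0b10110) visits indices 1, 2, and 4, and extract(0b110101, 3, 2) yields 0b101.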
In the algorithm, we get every index that has a set bit in the bitmap. A set bit at position k indicates that offset.j.k is being stored by process j.

Algorithm 4.1 Traversal Operation
1. Define Traversal operation.
2. Traversal(ts)
3. While ts.bitmap > 0
4.    index = log2(((ts.bitmap ⊕ (ts.bitmap − 1)) + 1) >> 1)
5.    // Get or set any offset at index
6.    ts.bitmap = ts.bitmap ∧ (ts.bitmap − 1)
7. End While

4.2.2 Extract

The extract operation extracts k bits from position p. The algorithm is described in Algorithm 4.2.

Algorithm 4.2 Extract Operation
1. Define Extract operation.
2. Extract(number, k, p)
3. Return ((1 << k) − 1) ∧ (number >> p)

4.2.3 GetOffsetAtIndex

GetOffsetAtIndex is an operation that gets the offset stored at a particular index. This is an O(1) lookup operation, using either the index obtained from the bitmap traversal or a specific index that the clock may need. The algorithm is described in Algorithm 4.3. Here, τ denotes the maximum offset size allowed by the user (in bits); it is usually set to log(ε).

Algorithm 4.3 GetOffsetAtIndex Operation
1. Define GetOffsetAtIndex operation.
2. GetOffsetAtIndex(ts, index)
3. offset = Extract(offset.j[].ToInteger(), τ, τ · index)
4. Return offset

4.2.4 SetOffsetAtIndex

SetOffsetAtIndex is an operation that sets the offset at a particular index. This is an O(1) setter operation, using either the index obtained from the bitmap traversal or a specific index that the clock may need, like the GetOffsetAtIndex algorithm.
The algorithm is described in Algorithm 4.4.

Algorithm 4.4 SetOffsetAtIndex Operation
1. Define SetOffsetAtIndex operation.
2. SetOffsetAtIndex(ts, index, newoffset)
3. firstpart = Extract(offsets.j, τ · index, 0)
4. res |= firstpart
5. res |= newoffset << (index · τ)
6. lastpart = Extract(offsets.j, τ · n − (τ · (index + 1)), τ · (index + 1))
7. res |= lastpart << ((index + 1) · τ)
8. Return res

4.2.5 RemoveOffsetAtIndex

The RemoveOffsetAtIndex operation removes an offset given an index. This is an O(1) removal operation once the position is found by the traversal operation. The algorithm is described in Algorithm 4.5.

Algorithm 4.5 RemoveOffsetAtIndex Operation
1. Define RemoveOffsetAtIndex operation.
2. RemoveOffsetAtIndex(ts, index)
3. firstpart = Extract(offsets.j, τ · index, 0)
4. res |= firstpart
5. lastpart = Extract(offsets.j, τ · n − (τ · (index + 1)), τ · (index + 1))
6. res |= lastpart << ((index + 1) · τ)
7. Return res

4.3 Helper functions

Now that we have described the traversals and auxiliary operations, we move on to the clock helper algorithms.
For brevity, we assume all traversals and assignments use the algorithms described in the previous subsection. We omit the bitmap to make the algorithms that follow easier to understand. We discuss two helper functions, Shift and MergeSameEpoch, that will come in handy when we design the main clock processing algorithms.

Figure 4.1 Working of Shift() on Process 0. Here, ε = 15, and the shift is issued to advance to mx 20. Process 3's offset becomes 20, but since ε = 15, the offset is set to ε (due to clock skew limit guarantees).

4.3.1 Shift Operation

The Shift function allows us to change the value of mx. Since mx.j − offset.j.k denotes the knowledge j has about the epoch of k, if mx is changed to newmx without providing j any additional knowledge of the clock of k, then newmx − newoffset.j.k should remain the same as mx.j − offset.j.k. Hence, the Shift operation changes offset.j.k to offset.j.k + (newmx − mx). Furthermore, if this value exceeds ε, we reset it to ε, as guaranteed by the clock drift assumption. (Note that process j can learn about the clock of k via the clock synchronization assumption even if j and k do not communicate.) For example, shifting the timestamp ⟨12, [0, 2, 10]⟩ so that mx is changed to 20 results in ⟨20, [8, 10, 18]⟩. If ε = 15, this changes to ⟨20, [8, 10, ε]⟩ (cf. Figure 4.1).

4.3.2 MergeSameEpoch Operation

The MergeSameEpoch function takes two timestamps t1 and t2 with the same mx value and combines their offsets by setting offset.j.k to min(t1.offset.j.k, t2.offset.j.k). For example, merging ⟨50, [0, 1, 2]⟩ and ⟨50, [2, 0, 1]⟩ results in ⟨50, [0, 0, 1]⟩.
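The Shift and MergeSameEpoch helpers can be sketched on a simplified ⟨mx, offsets[]⟩ timestamp; this is an illustrative Python version (not the thesis code), with EPS standing in for the bound ε:

```python
# Sketch of Shift and MergeSameEpoch on a simplified <mx, offsets[]>
# timestamp. Offsets are capped at EPS per the clock drift assumption.

EPS = 15

def shift(mx, offsets, newmx):
    # keep mx - offset[k] invariant for every k, capping each at EPS
    return newmx, [min(o + (newmx - mx), EPS) for o in offsets]

def merge_same_epoch(offsets1, offsets2):
    # both timestamps share the same mx; take the element-wise minimum
    return [min(a, b) for a, b in zip(offsets1, offsets2)]
```

This reproduces the worked examples: shift(12, [0, 2, 10], 20) gives (20, [8, 10, 15]) with the last offset capped at ε = 15, and merge_same_epoch([0, 1, 2], [2, 0, 1]) gives [0, 0, 1].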
4.3.3 EqualOffset Operation
The EqualOffset function takes in two timestamps t1 and t2 and checks whether the offset arrays and mx values are the same. This is used particularly to update the counter array if the other values are equal.

Algorithm 4.6 Shift Operation
1. Define Shift operation.
2. Shift(ts, newmx)
3. For each k do
4.   ts.offset.k = ts.offset.k + (newmx − ts.mx)
5.   If ts.offset.k > ε then
6.     ts.offset.k = ε
7.   End If
8. End For
9. Output: ts

Algorithm 4.7 MergeSameEpoch Operation
1. Input: Timestamp t1, Timestamp t2
2. Timestamp ts = new Timestamp
3. For each k do
4.   ts.offset.j.k = min(t1.offset.j.k, t2.offset.j.k)
5. End For
6. Return ts

Algorithm 4.8 EqualOffset Operation
1. Input: Timestamp t1, Timestamp t2
2. If t1.mx ≠ t2.mx ∨ (∃ j : t1.offset.j ≠ t2.offset.j) then
3.   Return false
4. Else
5.   Return true
6. End If

4.4 Description of the RepCl Algorithm
In this section, we discuss the key clock processing algorithms. These algorithms update the clock based on the type of event observed on the process. We describe two key operations: Send/Local and Receive.

4.4.1 Local/Send event
Here, we describe how RepCl.j is updated when j sends a message. Let the current timestamp of j be

⟨mx.j, bitmap.j[], offset.j[], counter.j[]⟩ (4.2)

First, mx.j needs to be increased if the clock of j has advanced beyond epoch mx.j. Hence, we first compute newmx.j, which is equal to max(mx.j, epoch.j). When j sends a message, it does not learn any new information about the clock of process k. We consider two cases: The first case is for the scenario where the newly created event f is in the same interval as the previous event, e.
This will happen if mx remains unchanged and offset.j.j is unchanged. In this case, we increase counter.j.j. The second case deals with the scenario where f is in a new interval. Thus, the offset associated with each k is changed using the Shift operation. Note that the Shift operation computes the shift of all processes except j; offset.j.j should be based on the value of epoch.j. Hence, we set it equal to newmx − epoch.j. The Shift operation is illustrated in Algorithm 4.6.

4.4.2 Receive event
Next, we describe how RepCl is updated when j with timestamp

⟨mx.j, offset.j[], counter.j[]⟩ (4.3)

receives a message m with timestamp ⟨mx.m, offset.m[], counter.m[]⟩. First, we compute newmx, which is the maximum of mx.j, mx.m and pt.j. Timestamps of j and m are then shifted to newmx using the Shift operation. These timestamps are then merged to obtain the mx and offset values of the new event, say f.

Algorithm 4.9 Send Message
1. newmx = max(mx.j, pt.j)
2. new_offset = newmx − pt.j
3. If (mx.j = newmx ∧ offset.j.j = new_offset) then
4.   counter.j.j = counter.j.j + 1
5. Else
6.   ts.j = Shift(ts.j, newmx)
7.   offset.j.j = min(newmx − pt.j, ε)
8.   counter.j = [0, 0, . . . , 0]
9. End If

Now, we check if the knowledge that f has about epochs is the same as that of e (the previous event on j) or m. If all three are in the same epoch then, for each k, counter.j.k is set to the maximum of counter.j.k and counter.m.k, and counter.j.j is then incremented by 1. If only e and f are in the same epoch, counter.j.j is incremented by 1. If only m and f are in the same epoch, counter.
j is set to counter.m and the value of counter.j.j is incremented by 1. If none of these conditions apply then the counters are reset to 0.

4.5 Comparing RepCl Timestamps
The happens-before relation in RepCl is codified in Algorithm 4.11. In this algorithm, we check whether timestamp t1 happens-before timestamp t2. In this relation, we first compare the HLCs of the two RepCl timestamps. Since the HLC provides the top-level information of the physical clock of the message, it resolves ties between clocks having different HLCs. If the HLCs are equal, we move to the offsets. For t1 to strictly happen before t2, we follow the comparison used in traditional vector clocks, where vc.e < vc.f iff (∀ j :: vc.e.j ≤ vc.f.j) ∧ (∃ j :: vc.e.j < vc.f.j). If the offsets for the two timestamps are also equal, we check the counters. We say two events are concurrent by the definition introduced in Requirement 3, which states that if ¬(RepCl.e < RepCl.f) ∧ ¬(RepCl.f < RepCl.e), then RepCl.e || RepCl.f.

Algorithm 4.10 Receive Message
1. Input: Received Message m
2. newmx = max(mx.j, mx.m, pt.j)
3. ts.a = Shift(ts.j, newmx)
4. ts.b = Shift(ts.m, newmx)
5. ts.c = MergeSameEpoch(ts.a, ts.b)
6. If EqualOffset(ts.j, ts.c) ∧ EqualOffset(ts.m, ts.c) then
7.   For each k do
8.     counter.j.k = max(counter.j.k, counter.m.k)
9.   End For
10.  counter.j.j = counter.j.j + 1
11. End If
12. If EqualOffset(ts.j, ts.c) ∧ ¬ EqualOffset(ts.m, ts.c) then
13.   counter.j.j = counter.j.j + 1
14. If ¬ EqualOffset(ts.j, ts.c) ∧ EqualOffset(ts.m, ts.c) then
15.   counter.j = counter.m
16.   counter.j.j = counter.j.j + 1
17. If ¬ EqualOffset(ts.j, ts.c) ∧ ¬ EqualOffset(ts.m, ts.c) then
18.   counter.j = [0, 0, . . . , 0]

Algorithm 4.11 Compare Operation
1. Input: Timestamp t1, Timestamp t2
2. If t1.mx < t2.mx then
3.   Return True
4. Else If t1.mx > t2.mx then
5.   Return False
6. Else
7.   For i, j in t1.offsets, t2.offsets
8.     If t1.offsets.i > t2.offsets.j then
9.       Return False
10.    If t1.counters ≤ t2.counters then
11.      Return True
12.  End For
13.  Return False
14. End If

As an illustration, consider the execution of the program in Figure 3.1. Assuming that ε = 5, I = 1, and E = 5, the RepCl timestamps will be as shown in Figure 4.2. Here, event A has a physical time of 50. Since process P1 has not heard from anyone else so far, the offsets for P2 and P3 will be ε. The offset for process P1 will be 0. Regarding event C, the situation is similar except that the offset for process P3 is 0. When event B is created upon receiving message m1, process P2 is aware of time 50 from P1, and it is the maximum epoch it is aware of. Hence, the offsets are [0, 2, ε], respectively. When event D is created, process P2 is aware of epoch 52 (from P2) and epoch 50 (from P1). It is aware of timestamp 40 from P3. However, this information is overridden by the clock synchronization guarantee that says that the clock of P3 is at least 47. Thus, the offsets are set to [3, 2, ε]. Here, the permissible ordering is CABD. In this figure, if ε were 20 then the timestamp of D would be changed to [3, 2, 12]. Furthermore, B and C could be replayed in any order. Thus, the permissible replays would be CABD or ABCD or ACBD.

Figure 4.2 Replay of the Execution in Figure 3.1 with RepCl.
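The comparison and concurrency checks of this section can be sketched as follows. This is a simplified model under our own assumptions (tuple timestamps, an assumed ε, and illustrative names), not the bit-packed implementation; knowledge about process l is recovered as mx − offsets[l], and ties fall through to the vector-clock rule on offsets and then on counters.

```python
EPSILON = 5  # assumed clock skew bound in epochs

def knowledge(mx, offsets):
    # mx - offsets[l] is the epoch of process l that this event knew about.
    return [mx - o for o in offsets]

def happened_before(t1, t2):
    (mx1, off1, cnt1), (mx2, off2, cnt2) = t1, t2
    if abs(mx1 - mx2) > EPSILON:          # epochs far apart: mx decides
        return mx1 < mx2
    k1, k2 = knowledge(mx1, off1), knowledge(mx2, off2)
    if k1 != k2:                          # vector-clock rule on knowledge
        return all(a <= b for a, b in zip(k1, k2))
    # equal knowledge: fall through to the counters, vector-clock style
    return all(a <= b for a, b in zip(cnt1, cnt2)) and cnt1 != cnt2

def concurrent(t1, t2):
    return not happened_before(t1, t2) and not happened_before(t2, t1)
```

For instance, two events with timestamps (50, [0, 5, 5], [0, 0, 0]) and (50, [5, 0, 5], [0, 0, 0]), created on processes that have not heard from each other, come out concurrent, since neither knowledge vector dominates the other.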
4.6 Properties of RepCl
In this section, first, we define the < relation on two timestamps RepCl.e and RepCl.f. Then, we identify the properties of this < relation and the happened-before relation. Given timestamps RepCl.e = ⟨mx.e, offset.e[], counter.e[]⟩ and RepCl.f = ⟨mx.f, offset.f[], counter.f[]⟩, we say that RepCl.e < RepCl.f iff

  mx.f > mx.e + ε
  ∨ ( |mx.f − mx.e| ≤ ε
      ∧ ( ( ∀l : (mx.e − offset.e.l) ≤ (mx.f − offset.f.l)
            ∧ ∃l : (mx.e − offset.e.l) < (mx.f − offset.f.l) )
          ∨ ( ∀l : (mx.e − offset.e.l) = (mx.f − offset.f.l)
              ∧ ∀l : counter.e.l ≤ counter.f.l
              ∧ ∃l : counter.e.l < counter.f.l ) ) )

The above < relation first checks whether mx.f and mx.e are far apart. If that is the case, we define RepCl.e < RepCl.f. If they are close, i.e., |mx.f − mx.e| ≤ ε, then we compare the offsets. Since mx.e − offset.e.k identifies the knowledge e had about the epoch of process k, we use a comparison that is similar to vector clocks to determine if the < relation holds between RepCl.e and RepCl.f. Finally, if the offsets are also equal then we use the comparison of counters (again, in the same fashion as vector clocks).

We overload the || relation for comparing timestamps as well. Specifically, given timestamps RepCl.e and RepCl.f, we say that RepCl.e || RepCl.f iff

¬(RepCl.e < RepCl.f) ∧ ¬(RepCl.f < RepCl.e) (4.4)

From the construction of the timestamp algorithm, we have the following two lemmas:

Lemma 1: (e happened before f) ⇒ RepCl.e < RepCl.
f

Lemma 2: |mx.e − mx.f| ≤ ε ∧ (e || f) ⇒ RepCl.e || RepCl.f

Requirement 1 of RepCl: Observe that Lemma 1 satisfies the first requirement of RepCl; if e happened before f then e must be replayed before f.

Requirement 2 of RepCl: Now, we focus on the second requirement. Specifically, we show that by letting E1 = E + I, the second requirement is satisfied. Observe that in the RepCl algorithm, messages carry the epoch values of multiple processes. This allows a process to learn epoch information about other processes. For the subsequent discussion, imagine that the messages also carried the actual physical time as well. In this case, j will learn about the clock of a process k via such messages. Additionally, j will also learn about the clock of a process k based on the assumption of clock synchronization. Likewise, when event e is created, it will have some information about the clock of each process. Let mxph.e and mxph.f be the maximum clock (of any process) that e and f are aware of when they occurred. If mxph.f > mxph.e + E1 then f cannot occur before e under the clock synchronization guarantee of E1. Now, we show that in this situation, it is guaranteed that RepCl.e < RepCl.f. By definition of mx, mx.f = ⌊mxph.f / I⌋ and mx.e = ⌊mxph.e / I⌋. Additionally, we have

  −1 < (x − ⌊x⌋) − (y − ⌊y⌋) < 1
  ⟹ −1 < (mxph.f/I − ⌊mxph.f/I⌋) − (mxph.e/I − ⌊mxph.e/I⌋) < 1
  ⟹ −1 < (mxph.f/I − mx.f) − (mxph.e/I − mx.e) < 1
  ⟹ −1 < ((mxph.f − mxph.e)/I) − (mx.f − mx.e) < 1
  ⟹ −I ≤ (mxph.
๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’) โˆ’ (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’)I โ‰ค I Now, if ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ > E + I then we can rewrite the second inequality as ( E+I I โˆ’ (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’) โ‰ค 1. Using the fact that E = ๐œ– โˆ— I, we have ๐œ– < (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’). In other words, ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ > E + I โ‡’ (๐‘š๐‘ฅ. ๐‘“ > ๐‘š๐‘ฅ.๐‘’ + ๐œ–). which gives us RepCl.๐‘’ < RepCl. ๐‘“ . In other words, we have 23 Lemma 3: If ๐‘“ occurred far after ๐‘’, i.e., ๐‘“ could not have occurred before ๐‘’ in a system that guarantees that clocks are synchronized within E1 = E + I then ๐‘’ will be replayed before ๐‘“ , i.e., RepCl.๐‘’ < RepCl. ๐‘“ . Requirement 3 of RepCl: Next, we consider the case where ๐‘’ and ๐‘“ could have occurred in any order if the underlying system guaranteed that clocks were synchronized to be within E2 = E โˆ’ I. Letting the maximum clock that event ๐‘’ (respectively, ๐‘“ ) was aware of to be ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ (respectively, ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ ), we observe that |๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ | โ‰ค E2. Furthermore, ๐‘’ and ๐‘“ must be causally concurrent. Under this scenario, we show that RepCl.๐‘’||RepCl. ๐‘“ . If |๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ | โ‰ค E โˆ’ I, we have =โ‡’ | =โ‡’ |โŒŠ ๐‘š ๐‘๐‘ก.๐‘’ I โˆ’ ๐‘š ๐‘๐‘ก.๐‘’ I ๐‘š ๐‘๐‘ก. ๐‘“ I | โ‰ค ๐œ– โˆ’ 1 // since E = ๐œ– โˆ— I โŒ‹ โˆ’ โŒŠ ๐‘š ๐‘๐‘ก. ๐‘“ I โŒ‹| โ‰ค ๐œ– since |(๐‘ฅ โˆ’ โŒŠ๐‘ฅโŒ‹) โˆ’ (๐‘ฆ โˆ’ โŒŠ๐‘ฆโŒ‹)| < 1 =โ‡’ |๐‘š๐‘ฅ.๐‘’ โˆ’ ๐‘š๐‘ฅ. ๐‘“ | โ‰ค ๐œ– by definition of ๐‘š๐‘ฅ Now, from Lemma 2, RepCl.๐‘’||RepCl. ๐‘“ . In other words, Lemma 4: If ๐‘’ and ๐‘“ could have occurred in any order in a system where clocks were synchro- nized to be within E โˆ’ I then RepCl.๐‘’||RepCl. ๐‘“ . 
4.7 Effect of discretization and comparison with Hybrid Vector Clocks [1]
We note that the discretization of the clock via I has caused the bounds used for clock synchronization in Lemmas 1 and 2 to be different. We could have eliminated this if we had not discretized the clocks. (Discretization with I was not done in [1].) However, without discretization, the values of the offsets would be very large: we would need to rely on just the physical clocks, which have a granularity of under 1 nanosecond. Now, if E = 1 ms then the value of an offset could be as large as 10^6. By discretizing the clock, it is possible to keep the offsets very small. We expect that the discretization will not seriously impact the replay. For example, if E = 1 ms and I = 0.1 ms then our algorithm will guarantee that causally concurrent events within 0.9 ms can be replayed in any order. And, events that could not occur simultaneously under a clock synchronization guarantee of 1.1 ms will be replayed in only one order. Additionally, if e happened before f then e will always be replayed before f.

Figure 4.3 RepCl representation.

4.8 Representation of the RepCl and its Overhead
In this section, we identify how RepCl can be stored to permit efficient computation. As written, RepCl will require 2n + 1 integers. However, a more compact representation is possible when we account for the fact that it is being used in a system where the clocks are synchronized to be within E. Thus, if j does not hear from k (directly or indirectly) for a long time then the knowledge j would have about the clock of k is the same as that provided by the clock synchronization assumption. In this case, offset.j.k = ε. It follows that there is no need to store this value if we interpret no information about the offset of k to mean that offset.j.k = ε. With this intuition, we represent RepCl.
j as shown in Figure 4.3. Here, the value of mx.j (represented by the first word) is 50. The second word is a bitmap that identifies whether offset.j.k is stored for process k; each offset is stored with a fixed number of bits in the subsequent word(s). Since the bit corresponding to process 1 is 0, it implies that offset.j.1 = ε. The offset for process 2 is 10 (the first 4 bits of the offset word) and counter.j.2 is 2 (the first 2 bits of the counter word). We note that the bits for each offset and counter are hard-coded based on the system parameters (cf. Chapter 6).

Next, we show that this representation allows us to reduce the cost of storage as well as the cost of computing timestamps or comparing them (using the < relation). Specifically, all these costs are proportional to the number of processes that have communicated with j recently. With the representation in Figure 4.3, we first note that finding the locations of the 1s in a given bitmap can be done in time proportional to the number of 1s in the bitmap (n − (n & (n − 1)) returns the number with only the rightmost 1 set). Thus, we have

Observation 1: Shift and MergeSameEpoch can be implemented in O(x) time, where x is the number of bits set to 1 in the bitmap.

Note that this means that the time to compute the timestamp for a send/receive at process j does not depend upon the number of processes in the system, but only upon the processes that have recently communicated with process j. In turn, this means that

Observation 2: Send and Receive can be implemented in O(x) time, where x is the number of bits set to 1 in the bitmap.

Observation 3: Given two timestamps, RepCl.e and RepCl.f, we can determine if RepCl.e < RepCl.f in O(x) time, where x is the number of bits set to 1 in e and f.
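The rightmost-1 trick underlying Observations 1–3 can be sketched in Python; bit positions stand for process ids, and the function name is our own. Each loop iteration isolates and clears one set bit, so the loop costs O(x) for x set bits, independent of the number of processes.

```python
def set_bit_positions(bitmap: int):
    """Return the indices of the 1 bits of `bitmap`, lowest first."""
    positions = []
    while bitmap:
        lowest = bitmap - (bitmap & (bitmap - 1))  # only the rightmost 1
        positions.append(lowest.bit_length() - 1)  # its bit index
        bitmap &= bitmap - 1                       # clear the rightmost 1
    return positions

print(set_bit_positions(0b101101))  # [0, 2, 3, 5]
```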
It follows that the number of bits that are 1 in a given timestamp identifies not only the storage cost of the timestamps but also the time to compute these timestamps at run time. Effectively, this also identifies the overhead of the timestamps that enable the replay of the computation. Hence, in Chapter 6, we focus on identifying scenarios where the cost of storing these offsets is within the limits identified by the user.

We note that the above approach will work as long as the number of processes is less than the number of bits in a word (typically, 64 in today's systems). We expect that this will be more than sufficient for many systems in practice. If there are more than 64 processes, we expect that process j communicates with only a subset of these processes. And, if process j does not communicate with some process, there is no need to store an offset for it. Thus, this approach can be extended to the case where the number of processes is larger. However, the details are out of the scope of this thesis.

CHAPTER 5 SIMULATOR SETUP

In this chapter, we discuss the construction of a custom discrete event simulator, to serve as a validation baseline for state-of-the-art simulators available for research. We first design a baseline simulator built to natively support the RepCl infrastructure. We call this the Custom Discrete Event Simulator, or CDES. The goal of the RepCl was to serve as a plug-and-play structure that allows replays to be supported natively by virtue of the clock's algorithm. In other words, the clock should be self-contained: replay should only require RepCl-timestamped logs to function correctly, and any simulator should be able to incorporate the visualization just by implementing the clock in its infrastructure. Details of this simulator are further discussed in Section 5.1. However, building a simulator for the RepCl introduces a bias in the design.
To validate the results against more real-world scenarios, we considered using a state-of-the-art (SOTA) simulator, NS-3 [10]. NS-3 proved to be an effective simulator in which to implement the RepCl structure. However, NS-3 posed challenges in modeling node-local noisy clocks, which is discussed further in Section 5.2. For this reason, we revised NS-3 to implement structures that approximate what a node-local clock would look like. Since the NS-3 team has expressed the desire to enhance NS-3 with node-local noisy clocks, we decided to build a custom implementation of the same. We designed the node-local clock to permit any arbitrary node-level clock (e.g., HLC [11], Vector clocks [6][7], Logical Clocks [5], etc.).

This chapter is organized as follows. Section 5.1 describes the custom simulator we designed to validate the results obtained by any generic simulator, and to identify key differences in the results. During our research, we chose to implement the custom simulator (CDES) first, as the choice of the SOTA simulator was not apparent. Additionally, the SOTA simulator should produce the same results as the CDES, as the CDES was built solely for the RepCl. The CDES served as a ground-truth system, and any architecture we chose would be validated against the results of this simulator. Section 5.2 talks about NS-3, and the revisions that needed to be made to implement our clock infrastructure. We discuss the application implemented, and the complexities that the application had to handle to provide correct results.

5.1 CDES, A Custom Discrete Event Simulator
In this section, we detail the design of our own custom discrete event simulator (CDES). We modeled processes containing a physical clock pt and the RepCl. The design of the CDES is discussed below. Each process maintained a vector msg_queue that queued messages sent to that process.
The messages contained the RepCl of the sending process, along with the time the message was to be processed by the receiving process. This receiving time was configured by the sending process by reading the clock of the receiver and adding the message delay to that time. When the receiving process obtained this message, it would compare its physical time pt with the receiving time, and process the message. Each process had a skewing node-local physical clock. We implemented this by randomly advancing the clock of a process based on a seed, sometimes choosing not to advance the clock at all. We maintained the invariant that for no two processes i and j is |pt.i − pt.j| > E ∗ I, the same as in the NS-3 case. To test the clock, we used the same parameters for n, the number of processes, E, I, δ and α. The message delay is modeled as the average delay experienced in the simulated network, and can vary by some nonzero Δ in production. In the simulation, at every clock tick, a process delivers any messages it is expected to deliver at that clock tick. It also sends a message to other processes based on the message rate α. When a message is sent, the corresponding receive event is added to the receiver's queue based on the value of δ. We also compute the actual value of the maximum clock skew observed in the simulation to ensure that if E = 1 ms then the worst-case clock skew is indeed 1 ms. The simulation was initialized such that each process started with the same starting clock, and at each microsecond time step, each process made a decision to send a message. The process first generated a random number in the range [0, 100], and if this number was lower than α, the process elected to send a message to any other process in the simulation or perform a local event.
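The per-tick behavior described above can be abstracted as follows. This is a hedged sketch under our own assumptions: the constants ALPHA and DELTA, the queue layout, and the function tick are illustrative, and the real CDES additionally updates RepCl timestamps and the skewing node-local clocks.

```python
import random

ALPHA = 40   # send threshold in [0, 100], i.e., the message rate (assumption)
DELTA = 8    # constant message delay in ticks (assumption)

def tick(pid, now, processes, queues, rng):
    """One simulation step for process `pid` at time `now`."""
    # Deliver every queued message whose receive time has arrived.
    delivered = [m for m in queues[pid] if m["recv_at"] <= now]
    queues[pid] = [m for m in queues[pid] if m["recv_at"] > now]
    # Decide whether to send, based on the message rate alpha.
    if rng.randint(0, 100) < ALPHA:
        dest = rng.choice([p for p in processes if p != pid])
        queues[dest].append({"src": pid, "recv_at": now + DELTA})
    return delivered
```

A driver loop would call tick for every process at every time step, feeding the delivered messages into the receive-event handling of the clock.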
Each process in the simulation had a uniform chance to send a message with constant delay δ to any other process in the system. The total messages sent varied with α, within the range of [140, 10208] messages over 10,000 steps of the simulation. The parameters we varied are detailed in Table 5.1.

Parameter | Minimum Value | Maximum Value | Increments
N | 32 processes | 64 processes | 32 processes
E | 10 units | 1000 units | 50 units
I | 100 microseconds | 1000 microseconds | 50 microseconds
δ | 1 microsecond | 8 microseconds | 2x microseconds
α | 10 messages/s | 160 messages/s | 2x messages/s

Table 5.1 Parameter configurations for the NS-3 Simulations. We only selected the configurations where E ∗ I % 1000 == 0 to give us acceptable clock skew limits of [1 ms, 6 ms].

5.2 NS-3 Simulator
Network Simulator-3 (NS-3) [10] provides a generic discrete event simulator that works on top of different devices implementing applications in different topologies. What makes NS-3 an attractive option for testing clock infrastructures is its versatility in the types of nodes available, its ease of use in topology configuration and application design, and, most importantly, the configurability it offers in designing simulations. The NS-3 infrastructure describes a generic Simulator that allows different network configurations to be supported and tested by changing a few parameters of this Simulator class. These factors made NS-3 an attractive option to test and incorporate the RepCl. However, NS-3 posed a few challenges. NS-3, as of the time of writing this thesis, does not provide support for node-local noisy clocks. Clocks in NS-3 are synchronized with the top-level Simulator, and nodes do not contain their own clock implementations. There have been attempts at creating a node-local noisy clock, but they have not been incorporated into the NS-3 infrastructure. Another challenge stems from this: due to the absence of node-local noisy clocks, NS-3 does not handle clock drifts.
Due to the absence of clock drifts, the RepCl would not store any offsets, as all processes would tick in sync with the Simulator class. Hence, we devised an API to overcome these key challenges. We implemented a node-local noisy clock, which approximates a node reading its physical time, with an added value δ to approximate the noise produced by skewing physical clocks. The algorithm for the node-local noisy clock is described in Algorithm 5.1. Here, nt.i denotes the node-local time of process i.

Algorithm 5.1 Node-Local Noisy Clock: Get Operation
1. Input: SimulatorTime
2. nt.i = random(nt.i, SimulatorTime + (E ∗ I))
3. Return nt.i

As described in Algorithm 5.1, we receive a clock that maintains the relation that for no two processes i and j is |pt.i − pt.j| > E ∗ I, but produces clocks that skew with respect to each other. This helps us produce offsets between different processes in a dynamic fashion, and allows us to handle clock drifts.

Using this node-local clock, we design an application in NS-3 called the ReplaySimulatorApplication, with nodes implementing the node-local clocks and the RepCl. The RepCl uses the local clock to perform updates on itself. The ReplaySimulatorApplication picks a random candidate node for each node and sends a message at intervals defined by the message rate α. The channels implemented provide a maximum data rate of 500 Mbps, and a message delay defined by δ. We also provided the option to choose E and I for the purposes of testing the clock sizes. In a more real-world implementation, only I would be changeable by the user. All other parameters would be specified by the distributed system's operating constraints.
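Algorithm 5.1 can be modeled with a small Python class; the class name, the seeding, and the parameter skew_bound (playing the role of E ∗ I) are our own illustrative choices, not NS-3 API.

```python
import random

class NoisyClock:
    """Node-local noisy clock modeled after Algorithm 5.1: each read returns
    a value between the previous reading and SimulatorTime + E * I."""

    def __init__(self, skew_bound, seed=0):
        self.skew_bound = skew_bound        # E * I, in simulator time units
        self.nt = 0.0                       # last node-local time returned
        self.rng = random.Random(seed)

    def get(self, simulator_time):
        # random(nt.i, SimulatorTime + (E * I)) from Algorithm 5.1
        self.nt = self.rng.uniform(self.nt, simulator_time + self.skew_bound)
        return self.nt
```

Successive reads are monotone and never run ahead of the simulator time by more than the skew bound, which is how different nodes end up with clocks that drift with respect to each other.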
We simulated a distributed environment to test the clock with five parameters: the number of processes (n), the maximum allowed clock skew (E), the interval size (I), the message rate (α), and the message delay in microseconds (δ). We collected results for about 20 seconds for each run. Table 5.1 defines the variation statistics of each of these parameters.

A key change we made in the NS-3 simulator is the counter storage. We store only the sum of the counters in the counter array space. This is explained further in Section 7.3.1. The key idea is that there are very few events that actually store counters for all processes, and storing just the sum of the counters of all processes helps us condense the information needed, without loss of generality in most cases. We do risk missing a few orderings, but it is an acceptable tradeoff, as the cases where many counters are stored are rare.

CHAPTER 6 SIMULATION RESULTS

As demonstrated in Section 4.6, the overhead of RepCl depends upon the number of offsets/counters that need to be stored. And, this value depends upon the number of processes that communicate with a given process within the time E. In other words, the system parameters will determine the size of RepCl. In this section, we evaluate the overhead of RepCl via simulation. For the purposes of the results, we denote the offset array size as θ and the counter array size as σ. In the following sections, we outline the effects of changing different parameters both in the custom simulator and in NS-3. Note that the results for the CDES are reported in 32-bit word lengths, and the results for NS-3 are reported in bits, due to the different data collection techniques used for each.

6.1 Effect of Clock Skew (E)
In this section, we measure the trends in E while varying the other parameters to see how θ and σ are affected. We compare each parameter pair-wise with E to see the effect of the parameter on the clock skew trend with θ.
6.1.1 Analysis of Varying Interval Size (I) with the CDES
Here, we keep δ and α constant to see how θ and σ change with E.

• In the case of the θ vs E curve, we notice that the value of I has little bearing on θ. As expected, as the value of E increases, θ increases with it. This is true for all values of α, and we consistently store more offsets as α increases. For any given value of α, however, the value of I can be chosen to set the granularity of the user's choice, which allows more flexibility in the clock information. Regardless of the choice of I by the user, the offset sizes increase with roughly the same trend. These trends are illustrated in Figure 6.1.

• In the case of the σ vs E curve, we do not see much variation in σ, as most events reach a different epoch. On average, we do not see many events storing counters; roughly 0.78% of events store counters, and the values of such counters do not exceed 5 in most cases. Hence, we would need very little space to store these counters. These trends are illustrated in Figure 6.2. Since this observation is true for all simulations in this thesis, we do not discuss the analysis for σ in the subsequent sections.

Figure 6.1 Custom Simulator: θ vs E when varying I, δ = 8 μs. Panels (a)–(f): α = 20, 40, 160 msgs/s for n = 32 and n = 64.

Figure 6.2 Custom Simulator: σ vs E when varying I, δ = 8 μs, α = 160 msgs/s. Panels: (a) n = 32, (b) n = 64.

6.1.2 Analysis of Varying Interval Size (I) with NS-3

• In the case of the θ vs E curve in the NS-3 simulation, we observe a similar trend of θ increasing with E. This is in agreement with our findings from the custom simulator, and is true for all values of α. These trends are illustrated in Figure 6.3.
While we see a decrease in the number of bits stored as I decreases, the difference is significant only in some cases, notably at lower message rates. At higher message rates, the gap narrows. This is due to the amount of communication happening: most processes tend to store close to the acceptable limit of the number of offsets we want to store.

• In the case of the σ vs E curve in the NS-3 simulation, we store only the sum of the counters. Hence, we see some variations in counter size. As E increases, we store slightly higher counters, and this does not change with variation in I. However, the sizes of the counters vary only from 0.03 bytes in the lowest case of n = 32 to 0.1 bytes in the highest case. Hence, the number stored as the counter value is not too large. This is true even for the case of n = 64, and is depicted in Figure 6.4.

Figure 6.3 NS-3 Simulator: θ vs E when varying I, δ = 1 μs. Panels (a)–(f): α = 20, 40, 160 msgs/s for n = 32 and n = 64.

Figure 6.4 NS-3 Simulator: σ vs E when varying I, δ = 1 μs, α = 20 msgs/s. Panels: (a) n = 32, (b) n = 64.

Since the total size of RepCl depends upon the number of bits for each offset and the total number of offsets, we consider a specific example here. For Figure 4.3, the number of offsets is 2 and the size of each offset is 4 bits. Therefore, one word is sufficient to store the offsets. Likewise, one word is enough for the counters. Thus, we need a total of 4 words to store this timestamp. (Note that the counters can be stored in the same amount of memory, as we have a number of extraneous bits in this representation, specifically in the max epoch word, if we elect to store the sum of all these counters there.
By doing this, we would lose some information, but considering that the number of events that record meaningful counters is low, this may be an acceptable trade-off.) It is straightforward to observe that the size of RepCl grows linearly with the number of offsets in it. Additionally, computing the total size of RepCl requires the use of the floor function to identify the number of words necessary to store it. Since the floor operation loses some of the relevant data, we present the value of θ in this section.

6.1.3 Analysis of Varying Message Delay (δ) for the CDES

Here, we fix I and α and, for different δ values, we identify how θ changes with E. As E increases, we see higher values of θ, implying a higher number of offsets stored on average. We observe that higher values of δ produce a lower number of offsets in each case, barring some noise. This is expected, as an increase in δ implies that messages reach in a delayed fashion, leading processes to set other processes' offsets to ε due to non-receipt of messages. As E increases, a process hears from more processes (directly or indirectly) within time E. Hence, the number of offsets increases. This is illustrated in Figure 6.5.

Figure 6.5 Custom Simulator: θ vs E when varying δ, I = 8 μs, α = 40 msgs/s; panels (a) n = 32, (b) n = 64.

6.1.4 Analysis of Varying Message Delay (δ) for NS-3

As in the case of the custom simulator, we see higher values of θ as E increases. We do not see much of a difference as δ varies, however: the number of bits stored for each δ value is within ±1 bit. The increase in offset size is roughly linear as E increases, the same as observed in the analysis of varying I. The δ variations are not pronounced because the clocks implicitly synchronize when messages are exchanged.
A message sent from far back in the past effectively does not change the clock of the receiver. Conversely, if a process receives a clock from the future (in its local observation), it pushes its own clock forward to this future timestamp to guarantee the acceptable E limit. This is why different δ values show close to no variation, as illustrated in Figure 6.6. It is important to note, however, that when δ exceeds the E limit, no process stores any offsets.

Figure 6.6 NS-3 Simulator: θ vs E when varying δ, I = 20 μs, α = 160 msgs/s; panels (a) n = 32, (b) n = 64.

6.1.5 Analysis of Varying Message Rate (α) for the CDES

Here, we fix I and δ and, for different α values, we identify how θ changes with E. As expected, for lower values of α we consistently store fewer offsets, as communication between processes is sporadic. As E increases, θ increases linearly until the bound of n is reached. This is for the same reason mentioned earlier: as the bound lengths on epochs grow, even sporadic messages tend to store more offsets on other processes, causing the overall value of θ to increase. This is illustrated in Figure 6.7.

Figure 6.7 Custom Simulator: θ vs E when varying α, I = 4 μs, δ = 8 μs; panels (a) n = 32, (b) n = 64.

6.1.6 Analysis of Varying Message Rate (α) for NS-3

Our results from the custom simulator are confirmed by the experiments in NS-3, where lower values of α store fewer offsets due to less communication. In NS-3, higher values of α store many more offsets than our desired upper limit (about one word of offsets stored per clock); staying within that limit is guaranteed by lower values of α. This is illustrated in Figure 6.8.

6.2 Effect of Interval Size (I)

In this section, we observe the trends in I with respect to δ and α.
6.2.1 Analysis of Varying Message Delay (δ) for the CDES

Here, we fix E and α and check how θ changes with I.

Figure 6.8 NS-3 Simulator: θ vs E when varying α, I = 20 μs, δ = 4 ms; panels (a) n = 32, (b) n = 64.

From Figure 6.9, we observe that the value of I does not have a significant effect on θ (note that the Y axis of this figure varies only from 1.2 to 1.5). This means that the selection of I does not affect the number of offsets maintained by a process. However, it affects the size of each offset. Specifically, the maximum value of an offset is ε = E/I, and the number of bits required for each offset is log2(ε). Hence, a larger value of I is better for reducing the size of the RepCl. However, with larger I, the guarantees provided by RepCl are weaker. Specifically, Lemma 3 shows that some unforced reordering may occur when events e and f differ by time E + I. Users should therefore choose the value of I based on the desired guarantees of RepCl or the maximum desirable offset.

Figure 6.9 Custom Simulator: θ vs I when varying δ, E = 2 ms, α = 20 msgs/s; panels (a) n = 32, (b) n = 64.

6.2.2 Analysis of Varying Message Delay (δ) for NS-3

In the case of the NS-3 simulator, we observed that the size of offsets decreased linearly with an increase in I. This is attributed to more information being stored in counters as the length of the interval increases, and fewer offsets being stored. The likelihood that all processes are in the same epoch increases as I increases, leading to a lower number of offsets stored. The variation with δ is negligible, and seemingly random, which is confirmed by the custom simulator results.

Figure 6.10 NS-3 Simulator: θ vs I when varying δ, E = 3 ms, α = 20 msgs/s; panels (a) n = 32, (b) n = 64.

6.2.3 Analysis of Varying Message Rate (α) for the CDES

For every point in the I-θ trend, lower α values produce lower θ.
This is consistent with our observations so far: communication is infrequent, and processes tend not to hear from other processes, subsequently not storing their offsets. As I increases, θ remains the same, since δ remains the same. When δ and E are constant, a message either remains in the same epoch (mx) as it would under the previous choice of I, or a message that spanned different intervals under a smaller I now falls within the same interval. Overall, this does not change the total θ. This is illustrated in Figure 6.11.

6.2.4 Analysis of Varying Message Rate (α) for NS-3

As observed in the custom simulator, lower α values produce lower θ, due to infrequent communication. We additionally observe that θ remains constant with increasing I, consistent with our results from the CDES in the previous section. This is illustrated in Figure 6.12.

Figure 6.11 Custom Simulator: θ vs I when varying α, E = 2 ms, δ = 8 μs; panels (a) n = 32, (b) n = 64.

Figure 6.12 NS-3 Simulator: θ vs I when varying α, E = 3 ms, δ = 2 ms; panels (a) n = 32, (b) n = 64.

6.3 Effect of Message Delay (δ)

In this section, we observe the effect of δ on θ while fixing E and I.

6.3.1 Analysis of Varying Message Delay (δ) for the CDES

Here, we observe that the value of δ has minimal effect on θ. Specifically, as shown in Figure 6.13, the value of θ increases as the value of α increases; however, for a fixed value of α, θ remains the same. When E and I are fixed, the only way θ would decrease is if processes sent messages that were received after the E limit. Because we enforce the limit δ ≤ E, this limit is never exceeded, and θ hence remains the same. This is illustrated in Figure 6.13.
Figure 6.13 Custom Simulator: θ vs δ when varying α, E = 4 ms, I = 16 μs; panels (a) n = 32, (b) n = 64.

6.3.2 Analysis of Varying Message Delay (δ) for NS-3

We gather similar results in the case of NS-3 as we did with the CDES, which validates our observations. This is illustrated in Figure 6.14. We also notice that for higher values of α, θ increases drastically. Hence, it is important to note the feasibility of the RepCl: mainly, that it is advantageous to use it in a setting of low α (message rates). This is covered further in the following section.

Figure 6.14 NS-3 Simulator: θ vs δ when varying α, E = 3 ms, I = 20 μs; panels (a) n = 32, (b) n = 64.

6.4 Feasibility Regions

In this section, we review the simulations to define the notion of feasible regions. As discussed earlier, the goal of RepCl is to enable the replay of a distributed computation with a small overhead. Here, we consider the case where the user identifies the expected overhead of RepCl to identify scenarios under which RepCl can be used to provide a perfect replay that meets all the requirements from Section 4.6. Since the overhead of the counters remains virtually unchanged, we focus only on the overhead of the number of offsets, i.e., the value of θ.

For θ = 8, the feasibility regions are shown in Figure 6.15a. Here, the blue dots identify the data points where θ = 8 is feasible and the red dots represent the data points where θ = 8 is not feasible. The green line identifies the bounds within which θ = 8 is feasible. We find that the size of the feasible region remains fairly unchanged with the value of n. However, it shrinks when the value of E is increased. This is expected based on how θ changes with E. We note that the feasibility region only identifies the case where perfect replay meets all the requirements from Section 4.6. If the user needs to utilize RepCl in an infeasible region, the user can obtain partial replay.
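The feasibility classification behind Figure 6.15 can be sketched as follows. This is an illustrative reconstruction rather than part of RepCl itself: `measured_theta` is a hypothetical lookup standing in for the θ values measured in the CDES/NS-3 runs, and `theta_max` is the user's offset budget (8 in Figure 6.15).

```python
# Illustrative sketch of the feasibility test in Section 6.4.
# measured_theta is a hypothetical stand-in for the simulation results;
# theta_max is the user's permissible average offset count (e.g., 8).

def feasible(alpha, delta, theta_max, measured_theta):
    """A parameter point (alpha, delta) is feasible if the measured
    average number of offsets stays within the budget."""
    return measured_theta(alpha, delta) <= theta_max

def feasibility_region(alphas, deltas, theta_max, measured_theta):
    """Classify every grid point, as plotted in Figure 6.15
    (feasible points in blue, infeasible ones in red)."""
    return {(a, d): feasible(a, d, theta_max, measured_theta)
            for a in alphas for d in deltas}
```

A point on the green boundary of Figure 6.15 is then simply the largest α (for a given δ) that this predicate still accepts.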
To understand this, consider the case where the actual value of E is 4 ms but the user specifies it to be 2 ms while constructing RepCl. In this case, if e and f are within 2 ms, then RepCl will allow them to be replayed in any order. However, if f occurred 3 ms after e, then e will always be replayed before f. We anticipate that even in a system where the clock skew is 4 ms, the actual clock skew at a given moment is likely to be smaller than 4 ms. This implies that the forced order between e and f will be quite infrequent. Hence, we anticipate that RepCl will be applicable even in domains where the system parameters cause it to fall in an infeasible region; this is discussed further in Section 8.3.

Figure 6.15 Feasibility regions for α and δ settings, θ = 8: (a) E = 1 ms, n = 32; (b) E = 2 ms, n = 32; (c) E = 1 ms, n = 64; (d) E = 2 ms, n = 64.

CHAPTER 7 VISUALIZING TRACES WITH REPVIZ

We have now seen the various intricacies of the RepCl infrastructure, and we now describe how to apply it to visualization systems. The goal of a visualization is simple: to show the user a view of the computation both from an overall system perspective and a per-process view of how the computation looked on that specific node. Various implementations of visualizers (cf. Section 8.2) achieve this, but in all of them the underlying timestamping algorithm enforces an order between concurrent events. We therefore need a visualizer that allows the user to choose the order of concurrent events and view multiple execution traces simultaneously.

In this chapter, we introduce RepViz, the third infrastructure component of this work. RepViz is a visualizer that works on top of a log generated by the RepCl. It takes in a RepCl-timestamped log and generates a web-based visualization of the traces the algorithm generates.
The user has the option to replay events that are concurrent in any order, while the other events are ordered according to the RepCl. Once the user has selected a replay order, a web visualization is displayed for that trace. The web visualization is a work in progress, but a sample representation is depicted in Figure 7.1. In the following sections, we go over the different methods that went into implementing RepViz.

7.1 Implementation

The visualizer defines a top-level component, called Tracer. This component contains the following functions and members:

• SortEvents(): Sorts all events according to their RepCl timestamp.

• RunReplay(): The top-level function that provides an interactive view to the user to replay events.

• EventList: The list of events obtained from a RepCl-timestamped log.

A Tracer object contains a set of Event objects. Each Event object has the following properties:

• EventID: The message ID.

• EventType: The type of event, i.e., Send, Local, or Recv.

• EventTime: The RepCl timestamp of this event.

• Sender: The IP address of the sender.

• Receiver: The IP address of the receiver.

The Sender and Receiver fields are populated as (Sender, Receiver) for a Send/Local event, and as (Receiver, Sender) for a Recv event. Now, we go into detail on how each of the Tracer methods is implemented.

7.1.1 SortEvents

The Tracer first orders all the events according to the RepCl timestamp. Events are sorted based on their RepCl timestamps fed by the logs of the algorithm. The sorting algorithm sorts all timestamps by the happens-before (hb) relation discussed in Chapter 2. Once the ordering is set, the events are given to the matching function. The sorting rules are described in Algorithm 4.11. The algorithm compares two RepCl timestamps t1 and t2, and returns True if t1 < t2 and False otherwise. The other comparisons can be implicitly derived from the same algorithm.
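As a concrete illustration, the event model and SortEvents() described above can be sketched in Python. This is a minimal sketch, not the RepViz implementation: the field names follow the text, but the comparison rule only approximates Algorithm 4.11 (which is not reproduced here), ordering timestamps by their max epoch and then their counters.

```python
from dataclasses import dataclass
from functools import cmp_to_key

# Minimal sketch of the Tracer's event model and SortEvents().
# The lt() rule below is an approximation of Algorithm 4.11.

@dataclass
class RepClTime:
    node_id: str          # NodeId, e.g. "10.1.1.3"
    hlc: int              # max epoch known to the process
    offsets: list         # per-process offsets from the max epoch
    counters: int         # tie-breaking counter within an epoch

@dataclass
class Event:
    event_id: int         # EventID: the message ID
    event_type: str       # "SEND", "RECV", or "LOCAL"
    event_time: RepClTime
    sender: str
    receiver: str

def lt(t1: RepClTime, t2: RepClTime) -> bool:
    """Approximation of 'returns True if t1 < t2': compare epochs first,
    then counters; anything else is treated as concurrent."""
    if t1.hlc != t2.hlc:
        return t1.hlc < t2.hlc
    return t1.counters < t2.counters

def sort_events(events):
    """SortEvents(): order events by the hb relation induced by lt()."""
    def cmp(e1, e2):
        if lt(e1.event_time, e2.event_time):
            return -1
        if lt(e2.event_time, e1.event_time):
            return 1
        return 0              # concurrent: relative order left open
    return sorted(events, key=cmp_to_key(cmp))
```

Since Python's sort is stable, mutually concurrent events (where cmp returns 0) keep their log order, leaving their final order to the user during replay.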
7.1.2 RunReplay

Once the events are ordered, the Tracer runs the replay according to the event timeline generated by the sorter. Every concurrent event pool is given to the user as a choice to replay any one of the outstanding events waiting to be replayed. Once an event has been replayed, it is removed from the replay pool. This follows Algorithm 3.1.

7.2 User View

In the current iteration of RepViz, the prototype runs on an ASCII terminal. Here is a brief output of a sample trace snippet generated through an NS-3 simulation:

[(EventID=1, EventType=SEND, EventTime=[(NodeId=10.1.1.3, HLC=21, Offsets=[-15, -15, 0, -15, -15], Counters=0)], Sender=10.1.1.3, Receiver=10.1.1.4)]
[(EventID=2, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=42, Offsets=[0, -15, -15, -15, -15], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.2)]
[(EventID=3, EventType=SEND, EventTime=[(NodeId=10.1.1.4, HLC=44, Offsets=[-15, -15, -15, 0, -15], Counters=0)], Sender=10.1.1.4, Receiver=10.1.1.5)]
[(EventID=4, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=55, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
[(EventID=4, EventType=RECV, EventTime=[(NodeId=10.1.1.3, HLC=55, Offsets=[-15, 0, 0, -15, -15], Counters=1)], Sender=10.1.1.3, Receiver=10.1.1.2)]
[(EventID=5, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=57, Offsets=[0, -15, -15, -15, -15], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.2)]
[(EventID=6, EventType=SEND, EventTime=[(NodeId=10.1.1.3, HLC=59, Offsets=[-15, 4, 0, -15, -15], Counters=0)], Sender=10.1.1.3, Receiver=10.1.1.4)]
[(EventID=7, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=61, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]
[(EventID=7, EventType=RECV, EventTime=[(NodeId=10.1.1.1, HLC=61, Offsets=[0, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.5)]
[(EventID=8, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=61, Offsets=[0, -15, -15, -15, 0],
Counters=1)], Sender=10.1.1.1, Receiver=10.1.1.2)]

Concurrent events detected!
0. [(EventID=9, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
1. [(EventID=10, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=62, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]
2. [(EventID=2, EventType=RECV, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[0, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.1)]
Please choose the event to replay: 0
[(EventID=9, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
Please choose the event to replay: 2
[(EventID=2, EventType=RECV, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=1)], Sender=10.1.1.2, Receiver=10.1.1.1)]
Please choose the event to replay: 1
[(EventID=10, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=62, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]

The above log transforms into the visualization on a WebUI as shown in Figure 7.1. In Figure 7.1a, the user does not have a choice in the replay, as all events are ordered by the hb relation detailed in Section 4.5 in the absence of concurrency. The user simply presses the right arrow key to move forward in the replay. At some point in the replay, the user encounters concurrent events, depicted in Figure 7.1b. Here, the user elects to replay event 1 by providing an input of 1 through the keyboard. The event marked 1 is added to the replay. Next, the user has to choose between events 2 and 3, depicted in Figure 7.1c. The user chooses event 2, and the last event left to replay is event 3, which is replayed in Figure 7.1d. With this visualization, it is easy for a user to try different combinations of replay and analyse how parameters change with the chosen event order.
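The prompt-and-choose loop shown in the transcript above can be sketched as follows. This is an illustrative sketch, not the RepViz code: `happens_before` stands in for the RepCl comparison of Algorithm 4.11, and `choose` models the user's keyboard input when a concurrent pool is detected.

```python
# Sketch of RunReplay()'s replay loop: events with no pending predecessor
# form the current pool; if the pool has more than one member, the
# concurrent events are offered to the user, mirroring the
# "Please choose the event to replay" prompt above.

def run_replay(events, happens_before, choose):
    pending = list(events)
    replayed = []
    while pending:
        pool = [e for e in pending
                if not any(happens_before(o, e)
                           for o in pending if o is not e)]
        pick = pool[0] if len(pool) == 1 else choose(pool)
        pending.remove(pick)      # replayed events leave the pool
        replayed.append(pick)
    return replayed
```

Any function can play the role of `choose`, which is what makes it possible to explore every admissible interleaving of the same trace.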
The visualization can also generate exhaustive logs of all possible replays, should the user require it.

Figure 7.1 Sample Visualization: (a) no concurrency conflicts, the user presses the right arrow key; (b) concurrency conflicts detected, event 1 is replayed first; (c) event 1 replayed, event 2 next; (d) event 2 replayed, event 3 next. The user uses arrow keys to replay events that have no concurrency conflicts, and number keys to input which events to replay in which order.

CHAPTER 8 RELATED WORK AND DISCUSSION

8.1 Clocks in Distributed Systems

Logical clocks were proposed in 1978 by Leslie Lamport [3] to trace the ordering of events in a distributed system. Vector time was designed independently by multiple researchers [12][7][13], who proposed the idea of representing time in a distributed system as a set of n-dimensional non-negative integer vectors. According to [14], vector clocks are defined by three properties: isomorphism, strong consistency, and event counting. Isomorphism suggests that if two events x and y have timestamps vh and vk, respectively, then x → y ⇔ vh < vk. Here, → denotes a partial ordering between a set of events. Strong consistency implies that by examining the vector timestamps of two events, we can determine the causal relationship between the two events. Event counting suggests that if d is always 1 in rule R1, then the ith component of the vector clock at process pi, vt_i[i], denotes the number of events that have occurred at pi until that instant. There have been several prior implementations of vector clocks, including the Singhal-Kshemkalyani differential technique [15] and the Fowler-Zwaenepoel direct dependency technique [16].
While vector clocks are bounded by O(n) complexity in both time and memory, past implementations have tried to reduce this complexity and generate more efficient representations, with some success. The Singhal-Kshemkalyani differential technique relies on piggybacking only the entries changed since the last send, without transmitting the entire vector clock. This method relies on the assumption that even though the number of processes is large, only a few key processes in a system interact frequently by passing messages. A benefit of this method is that it cuts down the storage overhead at each process to O(n). However, it does not make a substantial contribution to reducing the time complexity incurred when updating the vector clock, as it relies on piggybacking to work.

The Fowler-Zwaenepoel direct-dependency technique cuts down storage complexity by reducing the message size during transmission, transmitting only a scalar value in each message. Here, a process only maintains information regarding direct dependencies on other processes. The downside of this method is its high computational overhead, as it has to trace dependencies and update the vector clock, especially in systems where a few key processes may have a large number of events.

Clock synchronization using the Network Time Protocol (NTP) uses the offset delay estimation method to ensure physical clocks are synchronized across the internet. Clock offsets and delays are calculated, and timestamps are issued between different machines within a system accordingly. The system then attempts to establish a causal relationship by using the corrected timestamps. This, however, can be computationally expensive and is open to error, as the delay estimation may not always be accurate and may result in a violation of the causal relationship between processes issued by different machines.
One existing limitation is the difficulty of reconciling vector clocks, which represent logical time, with physical clock synchronization. To overcome this challenge, hybrid logical clocks were introduced by Kulkarni et al. [4] to capture the causality relationship of a logical clock with the characteristics of a physical clock embedded into it. Another variant of the hybrid clock is the Hybrid Vector Clock [17], [1], which, unlike the Hybrid Logical Clock, can provide all possible/potential consistent snapshots for a given time. For this work, we use the Hybrid Vector Clock design presented by Yingchareonthawornchai et al. [1], as it provides desirable characteristics for building our visualization framework.

8.2 Visualizing Traces

Mattern [7] discusses how distributed systems use the concept of global state to communicate information, and the need to characterize this global state. The author observes that a process can only approximate the global view of the system, and that no process can have, at any given instant, a consistent view of the global state. To verify a distributed system, the author compares three key approaches: simulating a synchronous distributed system given an asynchronous system, simulating a common clock, or simulating a global state. The author highlights the need for a vector clock system to provide a consistent snapshot of the global state, as each process having a clock that stores only its own state is not enough to describe the global state of the computation.

PARAVER [18] uses the PVM message-passing library to analyze traces generated from a computation. PVM primarily uses parallel message passing, and PARAVER analyzes these parallel traces using data analytics and provides a graphical description of the analysis. This was one of the earliest visualization works on distributed systems, simulated only for parallel traces.
It used a parser to go through the logs of the PVM-generated traces and analyze CPU activity, communications, and user events. This, however, required adding functionality to PVM itself and was not generalized to any distributed system interface. It also did not provide generic support and incurred a larger overhead to profile system resources while the computation went on.

VAMPIR [12] provided analysis of MPI programs by generating timeline traces from profiled MPI applications. It used different visualization metrics to show whether processes were still active. It also provided views of system activities and aggregated statistics about the system itself. However, it was made specifically for MPI applications and added to the profiling interfaces of MPI.

D3S [19] allowed developers to specify predicates on distributed properties of the system. These predicates can vary depending on what consistency checks one requires of the distributed system. The authors modeled the tracing as a consistency checker and generated traces of predicate evaluation. The predicates are injected dynamically at compile time into the system and are evaluated based on the customization provided by the user. However, we believe this approach adds overhead to running the distributed computation due to complex predicate checking.

Zinsight [20] provides hierarchies of tasks and aggregated metrics to show timeline visualizations of events. It also allows users to change the granularity of the metric they want to see, with sequences of computations per process.

Trumper et al. [21] present a dynamic analysis tool that uses boundary tracing and post-processing to analyze system behavior through a distributed computation. These are task-based visualizations, where tasks are mapped to memory resources. However, this mapping may not hold where processes share the same memory, as in the case of OpenMP-based infrastructures.
Dapper [22] is Google's tracing software for distributed systems, providing low overhead, application-level transparency, and scalability. Dapper uses annotations and spans to generate traces through RPCs. However, the authors mention that Dapper cannot correctly point to causal history, as it uses annotations in non-standard control primitives that may mislead the causality calculations. Our approach overcomes this, as causality is enforced through a lattice of clocks rather than the events themselves.

Isaacs et al. [23] provide a comprehensive survey of distributed monitoring and tracing tools of the past decade, with detailed descriptions and categorizations based on task parallelism, causality information, and so on. Isaacs, Bremer et al. [24] design a trace visualization system relying purely on logical clocks and then transposing those clocks back to real-time clocks in the visualization. Processes are also clustered based on logical behavior. However, this incurs more overhead than our solution and may cause conflicts in enforcing causality, due to the usage of a standard logical clock.

Verdi [25] lets developers choose the fault system to diagnose and verify the implementation of the system. It is a formal verification system that provides the developer with an idealized fault model; once this is verified, it applies the correctness to a more realistic fault model.

ShiViz [26] uses vector clocks to generate distributed system traces using happens-before relationships. By using vector clocks, it provides a verifiable and accurate notion of causality. However, since it uses traditional vector clocks, it incurs a higher complexity than our proposed model.

8.3 Discussion

In Chapter 6, we identified feasible regions for the given permissible overhead of RepCl. Thus, the natural question is: what can a user do if the given system parameters fall into the infeasible region?
Here, observe that E provides one way to reduce the overhead if we accept some imperfect replay. To explain this, consider a system whose actual clock skew is E_a. If the user implements RepCl with E < E_a, then the resulting replay will still satisfy requirements 1 and 2 (cf. Section 4.6). Requirement 3 will be satisfied with E2 = E - I.

Figure 8.1 Effect of using E instead of E_a in RepCl.

Looking at this situation closely, we observe that the clock skew between two processes follows the structure shown in Figure 8.1. Specifically, at a given instant, the clocks of two processes j and k differ by some amount that is less than E_a. However, the actual clock difference at a fixed point in time (which is not visible to either j or k) is often less than E_a. Hence, if e and f occurred at the same global time, the probability that the respective clocks differ by less than E depends upon the area of the shaded part (cf. Figure 8.1). In this case, e and f would still be replayed in arbitrary order. Only if the clocks fall in the non-shaded area will the replay force an order between e and f. In other words, even if the system parameters fall in the infeasible region, it is possible to use a RepCl that provides a valid replay; it just will not be able to reproduce all possible replays.

CHAPTER 9 CONCLUSION AND FUTURE WORK

In this thesis, we focused on the problem of replay clocks in systems where clocks are synchronized to be within E. The purpose of these clocks is to reproduce a distributed computation with all its certainties and uncertainties. By certainty, we mean that if event e must have happened before f, then the replay must ensure that e is replayed before f. Specifically, this requires that if e happened before f (as defined in [3]), or f occurred E1 ≈ E time after e, then e must occur before f.
And, by uncertainty, we mean that if e and f could occur in any order, then the replay should not force an order between them. Specifically, if e || f (as defined in [3]) and e and f occurred within time E2 ≈ E, then the replay permits them to be replayed in any order. We presented RepCl to solve the replay problem with E1 = E + I and E2 = E - I, where I is a parameter to RepCl. We analyzed RepCl for various system parameters (clock skew E, message rate α, message delay δ). We find that for various system parameters, the size of RepCl and the overhead to create and/or compare timestamps are small.

For the purpose of replay, RepCl provides several advantages over existing approaches. For example, unlike logical clocks, it does not force unneeded event orderings. It has a significantly lower overhead than vector clocks. Also, it does not generate the illegitimate replays that can occur with the use of vector clocks. The overhead of RepCl at a process j depends upon the number of processes that communicate with j (directly or indirectly) within an E window. This differs from vector clocks, where the overhead is always proportional to the number of processes in the system.

With the design of the RepCl, we ensured that the clock size is not a leading factor in slowing down computation. The RepCl is a non-invasive method to ensure that causality is maintained in the presence of skewing clocks, which is particularly useful in various applications. To facilitate ease of development with this clock, we provide an API and a sample implementation of the API in NS-3, a widely used distributed network simulator. With the help of NS-3, we illustrate the various invariants our clock provides, such as how its size scales while varying various parameters. We also identify feasibility regions that would provide perfect replay through the clock.
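The guarantees E1 = E + I and E2 = E - I recalled above can be sketched for two causally unrelated events, assuming we know their true physical occurrence times (something no process can observe directly; the sketch is illustrative only):

```python
# Sketch of the replay guarantees for two concurrent events e and f,
# given their (hypothetically known) physical times t_e and t_f.

def replay_order(t_e, t_f, E, I):
    """'forced'  : f is more than E1 = E + I after e, so e replays first;
       'free'    : the events are within E2 = E - I, either order is allowed;
       'depends' : the band in between, where interval alignment decides."""
    gap = t_f - t_e
    if gap > E + I:
        return "forced"
    if abs(gap) < E - I:
        return "free"
    return "depends"
```

The width of the middle band, 2I per side, is exactly the price paid for coarsening clock values into intervals of size I.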
For systems outside this feasible region, we additionally provide techniques to approximate a replay that is acceptable.

We have utilized RepCl to enable users to visualize a distributed computation using our tool, RepViz. The goal of the visualization is to allow the user to identify an event f where a failure occurred. Then, they can use a replay of events just preceding f to determine whether the error goes away. If it does, that would imply a synchronization error. Likewise, a user can replay some portion of the computation; since the replay of events may occur in a different order, this helps identify potential synchronization errors.

RepCl is designed mainly for offline analysis, where the event data is stored during execution and analyzed at a later time. However, RepCl can be used for run-time monitoring/analysis as well: if the data related to the timestamps is sent to a monitor, that monitor could analyze it for potential properties of interest, provided the analysis can be done quickly. A key challenge in this context is whether the run-time monitors can keep up with the execution of the system.

There are several future directions for RepCl. If the size of RepCl needed for perfect replay is too large, the user can reduce the size of RepCl by choosing a lower value of E. In this case, the resulting replay will force some ordering between concurrent events. One future work is to identify the effect of reducing E in this manner.

Another potential extension is the ability to evaluate different veins of execution for different properties. Currently, it is difficult to compare and contrast different traces of execution for a specific invariant. With the help of RepViz, users can identify key characteristics of how data moves between processes, and whether there is an efficient way to coordinate movement.
Another possible characteristic that could be measured is resource utilization and how it is affected by different veins of execution. Specifically, we aim to answer the question: are there stark differences when one event is replayed before another, based on the amount of work needed to perform that event? This would give better methodologies for evaluating data-movement operations. Furthermore, we intend to extend this clock structure beyond 64 processes. A potential avenue is to implement it on a hierarchical structure. If a network is structured as a network of switches, with each switch connected to a cluster of nodes, we can implement a replay clock at each level independently. All we would need is a mechanism to merge the clocks coming from the cluster into the switch and have the switch relay clock information from its cluster to the other clusters on the network.

BIBLIOGRAPHY

[1] S. Yingchareonthawornchai, D. N. Nguyen, S. S. Kulkarni, and M. Demirbas, "Analysis of bounds on hybrid vector clocks," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 9, pp. 1947–1960, 2018.

[2] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan, "Observer effect and measurement bias in performance analysis," Computer Science Technical Reports CU-CS-1042-08, University of Colorado, Boulder, 2008.

[3] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, July 1978.

[4] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in Principles of Distributed Systems: 18th International Conference, OPODIS 2014, Cortina d'Ampezzo, Italy, December 16–19, 2014, Proceedings 18, pp. 17–32, Springer, 2014.

[5] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978.

[6] C. J.
Fidge, "Timestamps in message-passing systems that preserve the partial ordering," in Proceedings of the 11th Australian Computer Science Conference (ACSC) (K. Raymond, ed.), pp. 56–66, 1988.

[7] F. Mattern et al., Virtual Time and Global States of Distributed Systems. Univ., Department of Computer Science, 1988.

[8] D. L. Mills, "Internet time synchronization: the network time protocol," IEEE Transactions on Communications, vol. 39, no. 10, pp. 1482–1493, 1991.

[9] H. Schildt, "C++ complete reference," 1998.

[10] G. F. Riley and T. R. Henderson, "The ns-3 network simulator," in Modeling and Tools for Network Simulation, pp. 15–34, Springer, 2010.

[11] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in International Conference on Principles of Distributed Systems, pp. 17–32, Springer, 2014.

[12] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," 1996.

[13] F. B. Schmuck, "The use of efficient broadcast protocols in asynchronous distributed systems," tech. rep., Cornell University, 1988.

[14] A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, 2011.

[15] M. Singhal and A. Kshemkalyani, "An efficient implementation of vector clocks," Information Processing Letters, vol. 43, no. 1, pp. 47–52, 1992.

[16] J. Fowler and W. Zwaenepoel, "Causal distributed breakpoints," in Proceedings of the Tenth International Conference on Distributed Computing Systems, 1990.

[17] M. Demirbas and S. Kulkarni, "Beyond TrueTime: Using AugmentedTime for improving Spanner," Aug, vol. 23, pp. 1–5, 2013.

[18] V. Pillet, J. Labarta, T. Cortes, and S. Girona, "Paraver: A tool to visualize and analyze parallel code," in Proceedings of WoTUG-18: Transputer and Occam Developments, vol. 44, pp. 17–31, 1995.

[19] X. Liu, Z. Guo, X. Wang, F. Chen, X. Lian, J.
Tang, M. Wu, M. F. Kaashoek, and Z. Zhang, "D3S: Debugging deployed distributed systems," in NSDI, 2008.

[20] W. De Pauw and S. Heisig, "Zinsight: A visual and analytic environment for exploring large event traces," in Proceedings of the 5th International Symposium on Software Visualization, pp. 143–152, 2010.

[21] J. Trümper, J. Bohnet, and J. Döllner, "Understanding complex multithreaded software systems by using trace visualization," in Proceedings of the 5th International Symposium on Software Visualization, pp. 133–142, 2010.

[22] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," 2010.

[23] K. E. Isaacs, A. Giménez, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, B. Hamann, and P.-T. Bremer, "State of the art of performance visualization," EuroVis (STARs), 2014.

[24] K. E. Isaacs, P.-T. Bremer, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, and B. Hamann, "Combing the communication hairball: Visualizing parallel execution traces using logical time," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2349–2358, 2014.

[25] J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock, X. Wang, M. D. Ernst, and T. Anderson, "Verdi: A framework for implementing and formally verifying distributed systems," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 357–368, 2015.

[26] I. Beschastnikh, P. Wang, Y. Brun, and M. D. Ernst, "Debugging distributed systems," Communications of the ACM, vol. 59, no. 8, pp. 32–37, 2016.