TRACING DISTRIBUTED ALGORITHMS USING REPLAY CLOCKS

By Ishaan Kiran Lagwankar

A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Master of Science

2024

ABSTRACT

In this thesis, we introduce replay clocks (RepCl), a novel clock infrastructure that enables offline analyses of distributed computations. The replay clock provides a methodology to replay a computation as it happened, with the ability to represent concurrent events effectively. It builds on the structures introduced by vector clocks (VC) and the Hybrid Logical Clock (HLC), combining their infrastructures to provide efficient replay. With such a clock, a user can replay a computation while considering multiple paths of execution, and check for constraint violations and properties along potential pathways, especially in the presence of concurrent events. Specifically, if event e must occur before f, then the replay clock must ensure that e is replayed before f. On the other hand, if e and f could occur in any order, replay should not force an order between them. After identifying the limitations of existing clocks in providing the replay primitive, we present the RepCl structure and identify an efficient representation for it. We demonstrate that RepCl can be implemented with less than four integers for 64 processes for various system parameters if clocks are synchronized within 1 ms. Furthermore, the overhead of RepCl (for computing/comparing timestamps and for message size) is proportional to the size of the clock. Using simulations in a custom distributed system and NS-3, a state-of-the-art network simulator, we identify the expected overhead of RepCl based on the given system settings. We also show how a user can identify the feasibility region for RepCl: given the desired overhead of RepCl, it identifies the region where unabridged replay is possible.
Using the RepCl, we provide a tracer for distributed computations that allows any computation using the RepCl to be replayed efficiently. The visualization allows users to analyze specific properties and constraints in an online fashion, with the ability to consider concurrent paths independently. It provides per-process views and an overarching view of the whole computation based on the time recorded by the RepCl for each event.

Copyright by ISHAAN KIRAN LAGWANKAR 2024

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest appreciation to my committee for supporting me through this work in the course of my Master's program. I am deeply indebted to Dr. Sandeep S. Kulkarni, and the ideas he has put forth that allowed me to progress in this research. The completion of this thesis would not have been possible without his support. I am also deeply indebted to Dr. Li Xiao and Dr. Philip K. McKinley for providing guidance through the committee and feedback on the work done in this research. I would like to extend my sincere thanks to the Department of Computer Science and Michigan State University, with gratitude to Vincent Mattison and Brenda Hodge, for supporting me with administrative decisions and financial support for the ICDCN '24 conference in which this work was first published. I would also like to thank my peers in the Department for their constant support and feedback on the work I have done, without which this thesis would not be complete. Lastly, I express deep gratitude to my family for providing me with the opportunities that brought me to this stage, and I will be forever indebted to them for their constant nurturing and support throughout my academic career.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Contributions
CHAPTER 2 PRELIMINARIES
CHAPTER 3 REPLAY WITH CLOCKS
3.1 Limitations of Existing Clocks for Replay
3.2 Requirements of Replay Clock (RepCl)
CHAPTER 4 ALGORITHM FOR THE REPLAY CLOCK (RepCl)
4.1 Structure of RepCl Timestamp
4.2 Efficient traversal and lookup
4.3 Helper functions
4.4 Description of the RepCl Algorithm
4.5 Comparing RepCl Timestamps
4.6 Properties of RepCl
4.7 Effect of discretization and comparison with Hybrid Vector Clocks [1]
4.8 Representation of the RepCl and its Overhead
CHAPTER 5 SIMULATOR SETUP
5.1 CDES, A Custom Discrete Event Simulator
5.2 NS-3 Simulator
CHAPTER 6 SIMULATION RESULTS
6.1 Effect of Clock Skew (E)
6.2 Effect of Interval Size (I)
6.3 Effect of Message Delay (δ)
6.4 Feasibility Regions
CHAPTER 7 VISUALIZING TRACES WITH REPVIZ
7.1 Implementation
7.2 User View
CHAPTER 8 RELATED WORK AND DISCUSSION
8.1 Clocks in Distributed Systems
8.2 Visualizing Traces
8.3 Discussion
CHAPTER 9 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

According to the observer effect, when we try to measure something, we change it to some extent [2]. Therefore, precise measurement is never truly possible. Computer programs suffer from the same difficulty: when you try to measure something in a computation, it changes the underlying computation. In an ideal world, a program may want to make sure that every step it takes is correct with respect to any environmental changes. However, the time taken for performing these checks may cause the program to be incorrect. In other words, it is possible that adding excessive safety checks, or checks for guaranteeing fairness, may cause the system to spend substantial time on those checks, thereby violating system requirements, even though those requirements would never have been violated without those checks in the first place. This issue is even more complicated in distributed computing, where each process (component, node, etc.) relies on partial information. Hence, computing the required safety checks (or checking for the satisfaction of fairness requirements, etc.) would require processes to communicate with each other. In turn, the time for computing them would be even higher. As an illustration, consider two drones A and B that are cooperating to perform a task. Each drone may take independent actions based on some environment that the other drone cannot see.
It is required that the area covered by the drones remains at 50% or above at all times (100% of the time). It is also preferred that this area remains at 75% most of the time (75% of the time). Here, we would like to know (1) how frequently a given point x was covered by one of the drones, (2) how frequently a given point was covered by both drones, (3) what the minimum coverage was at any time, etc. (Assume that they are at two different altitudes, so safety measures such as collision avoidance are not necessary.) One possibility is to require A to notify B of its actions at all times, so that B can adjust its plan to account for what A is doing, and vice versa. However, this requires A to spend unnecessary time communicating with B, and vice versa. In other words, it may be necessary for A and B to move independently. Doing all these checks during the execution would require that A and B communicate with each other before they make any move. In turn, it would change the behavior of the drones completely, thereby preventing us from drawing any conclusion about their behavior in the absence of these checks. Furthermore, the problem would be even more complicated with a larger number of drones. One way to address this problem is to log the computation as it happens so we can evaluate it later for all properties of interest. These properties may include non-critical safety properties, desirable performance criteria, etc. To be beneficial, the amount of storage for the log, and the time to create it, should be small enough that the underlying computation is affected minimally. In other words, the measurement should not change the underlying computation substantially. At the same time, the log should capture the non-determinism that is inherently present in any distributed computation.
We also want to make sure that the creation of the logs is performed independently by each process, i.e., each process stores its local state whenever it changes, along with a timestamp (discussed next) that identifies when the change was made. We also assume that all messages are logged. We consider various approaches for storing the timestamps and their implications. The simplest approach is to let the timestamp be the physical time of the relevant process. Here, the storage and computation cost is very low. However, the physical clocks of processes often differ. Hence, drone A may send a message at time 50 (local time of A) but it is received by B at time 40 (B's local time). When we try to replay this log to evaluate the given properties, B will receive the message before A has sent it. This is unacceptable, as it violates the system's consistency. The next approach we can consider is vector clocks. Vector clocks introduce two concerns. First, their size of O(n), where n is the number of processes, may be too high. Second, vector clocks have no reference to the physical clock and do not account for communication outside the system. For example, it is possible that drone A activated a green LED at physical time t1 and B activated a white LED at time t2 where t2 >> t1. In other words, an external observer will know that the action of A occurred before that of B. However, if A and B did not communicate, then the corresponding events are concurrent [3]. Thus, when we replay the log, it is possible that the white LED event could be replayed before the green LED event. This is also unacceptable. Hybrid logical clocks (HLC) [4] combine logical clocks and physical clocks. Specifically, they rely on a system where physical clocks are synchronized within an acceptable limit of clock skew, E, and they guarantee that hlc.e < hlc.
๐‘“ if ๐‘’ happened before ๐‘“ or ๐‘๐‘ก.๐‘’ + E < ๐‘๐‘ก. ๐‘“ ([3]). Here, โ„Ž๐‘™๐‘.๐‘’ denotes the Hybrid Logical Clock of process ๐‘’, and ๐‘๐‘ก.๐‘’ denotes the physical time observed on process ๐‘’. In other words, โ„Ž๐‘™๐‘.๐‘’ < โ„Ž๐‘™๐‘. ๐‘“ if ๐‘“ causally depends upon ๐‘’ or ๐‘“ occurred substantially after ๐‘’. They eliminate the problem associated with physical clocks as HLC respects causality. They also eliminate the problem caused by vector clocks as the HLC timestamp of activating the green LED will be less than the ๐ป ๐ฟ๐ถ timestamp of activating the white LED. ๐ป ๐ฟ๐ถ does create another problem though. Consider the case where we have events ๐‘’ and ๐‘“ such that | ๐‘๐‘ก.๐‘’ โˆ’ ๐‘๐‘ก. ๐‘“ | < E and ๐‘’|| ๐‘“ , i.e., the events are causally concurrent and very close to each other in physical time. Without loss of generality, let โ„Ž๐‘™๐‘.๐‘’ < โ„Ž๐‘™๐‘. ๐‘“ . In this situation, when we replay the log, ๐‘’ will always occur before ๐‘“ . In other words, the log does not have the necessary information that could allow it to replay ๐‘“ before ๐‘’ even though they could have occurred in any order. An extension of HLC, hybrid vector clocks [1] reduces some of these issues. However, as we highlight in Section 4.6, this overhead is still quite high. Based on these limitations, in this thesis, we focus on building a new clock, the Replay Clock (RepCl), that combines hybrid logical clocks and vector clocks to eliminate their limitations. Our goal is to investigate scenarios under which RepCl permits efficient replay of events. To understand why we may need to limit RepCl to specific scenarios, observe that if the underlying system was asynchronous (unbounded clock drift) then it is required to have ๐‘‚ (๐‘›) vector clocks to enable replay of events. Systems that communicate frequently will need more information stored to replay events. 
Thus, we focus on the following problem: given the amount of permissible overhead for logging events, what are the scenarios where perfect replay of events is possible? Once we identify the scenarios in which perfect replay is available, we design a trace visualizer, named RepViz, that allows us to depict candidate traces with the ability to replay concurrent events in any order of execution. This visualizer takes in a RepCl-timestamped trace and generates the visualization depicted in Chapter 7. We provide per-process views with timelines of events to depict the events occurring while indicating causality between those events. The trace visualizer provides an interactive display with orderings of events, and allows the user to reorder concurrent events to view different candidate traces. The visualizer API is discussed in Chapter 7.

1.1 Contributions

• We present RepCl, a replay clock that enables the replay of events in a distributed system. It guarantees that if there is a causal relation between e and f [3], or if f occurred far later than e, then RepCl.e < RepCl.f, i.e., the replay will cause e to be replayed before f. On the other hand, if e and f are causally concurrent and occurred close in physical time, then they could be replayed in any order.

• By considering various system parameters, clock skew (E), message rate (α), and message delay (δ), we identify the feasibility region for RepCl.

• We implement an API for the RepCl for NS-3, a widely used distributed network simulator. It provides all the operations, along with documentation on how it integrates with the different network components available in NS-3 simulations.

• We design RepViz, a visualizer that generates traces from RepCl-timestamped logs of a distributed computation. The visualizer provides the user with various orders of replay, and allows the user to view different candidate traces and evaluate various constraints along those traces through the visualization.
Organization of the thesis: This thesis is organized as follows. In Chapter 2, we describe the model of computation for distributed systems, including the notions of causality and clock synchronization. We move on to the idea of the replay clock and the problems it solves in Chapter 3. In Chapter 4, we describe the algorithms associated with the RepCl and its properties. Additionally, we describe the representation of the RepCl and the various overheads that are characteristic of the design of the clock. Chapter 5 describes the design of the simulators we used to collect metrics for the RepCl. These metrics are discussed in Chapter 6, with analysis of the size of the clock and the feasibility of its implementation. Chapter 7 describes the design of RepViz, the visualization system for trace building using the RepCl. Chapter 8 discusses related work and identifies questions raised by RepCl. Finally, in Chapter 9, we conclude and discuss future work.

CHAPTER 2 PRELIMINARIES

A distributed system is a set of processes 1..n. Each process has three types of events: (1) send, where it sends a message to another process, (2) receive, where it receives a message from another process, and (3) local, where it performs some local computation. We define the happened-before (denoted by hb) relation [3] among the events in a distributed computation.

• If e and f happened on the same process and e occurred before f, then e hb f.

• If e was a send event and f was the corresponding receive event, then e hb f.

• The hb relation is transitive, i.e., if there exist events e, f, and g such that e hb g and g hb f, then e hb f.

We say that e || f iff ¬(e hb f) ∧ ¬(f hb e). In other words, e is concurrent with f iff e did not happen before f and f did not happen before e.
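The happened-before relation above can be computed mechanically as a transitive closure over process-order and send/receive edges. The following is a minimal illustrative sketch (not from the thesis; event names and edge lists are hypothetical):

```python
# Sketch of the happened-before (hb) relation: hb is the transitive
# closure of process-order edges and send/receive edges.

def happened_before(process_order, msg_pairs):
    """Return the set of pairs (e, f) such that e hb f."""
    hb = set(process_order) | set(msg_pairs)
    changed = True
    while changed:  # transitive closure by fixpoint iteration
        changed = False
        for (e, g) in list(hb):
            for (g2, f) in list(hb):
                if g == g2 and (e, f) not in hb:
                    hb.add((e, f))
                    changed = True
    return hb

def concurrent(hb, e, f):
    # e || f iff neither e hb f nor f hb e
    return (e, f) not in hb and (f, e) not in hb
```

For instance, with A before B on one process and a message from B received at C, transitivity gives A hb C, while an unrelated event D is concurrent with all three.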
A timestamping algorithm assigns a timestamp to every event e in the system as soon as the event is created. Additionally, the timestamping algorithm defines a < relation that identifies how two timestamps are compared. As an illustration, Lamport's logical clock [5] assigns an integer timestamp l.e to every event e. The < relation for Lamport's logical clocks is the standard < for integers. Likewise, the physical timestamping algorithm assigns pt.e to every event e, where pt.e is the physical time of the process where event e occurred at the moment it occurred, and the < relation is the same as that over integers. A vector clock [6, 7] assigns event e a timestamp vc.e, where vc.e is a vector that includes an entry vc.e.j for every process j. The < relation on two vector clocks vc.e and vc.f requires that each element in vc.e is less than or equal to the corresponding element in vc.f, and some element in vc.e is strictly less than the corresponding element in vc.f. In other words, vc.e < vc.f iff (∀j :: vc.e.j ≤ vc.f.j) ∧ (∃j :: vc.e.j < vc.f.j). Note that while the < relation is defined by the timestamping algorithm, the properties of the < relation vary. For example, logical clocks provide one-way causality information, i.e., e hb f ⇒ l.e < l.f. Vector clocks provide two-way causality information, i.e., e hb f ⇔ vc.e < vc.f. By contrast, (unsynchronized) physical clocks may not provide any guarantees. For example, it is possible that (e hb f) and pt.e ≮ pt.f hold simultaneously. We assume that each process j in the system is associated with a physical clock, pt.j.
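The vector clock < relation defined above can be sketched directly; this is an illustrative Python version (not the thesis implementation), with timestamps as plain lists:

```python
# Sketch of the vector clock "<" relation: vc.e < vc.f iff every entry
# of vc.e is <= the corresponding entry of vc.f and some entry is
# strictly smaller.

def vc_less(vc_e, vc_f):
    return (all(a <= b for a, b in zip(vc_e, vc_f))
            and any(a < b for a, b in zip(vc_e, vc_f)))

def vc_concurrent(vc_e, vc_f):
    # two-way causality: e || f iff neither timestamp is smaller
    return not vc_less(vc_e, vc_f) and not vc_less(vc_f, vc_e)
```

For example, [1, 2, 0] < [2, 2, 1], while [1, 0, 0] and [0, 1, 0] are incomparable, i.e., the corresponding events are concurrent.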
Clocks of processes are synchronized with a protocol such as NTP [8] such that the clocks of any two processes differ by at most E, where E is a parameter, i.e., ∀j, k :: |pt.j − pt.k| ≤ E. We also assume that individual clocks are monotonically increasing. We assume that messages are delivered with a minimum message delay of δ. We do not assume a maximum message delay; it could be ∞ if messages are permitted to be lost. Since our focus is on the replay of events, if a message is lost, it simply implies that the corresponding receive event is never replayed.

CHAPTER 3 REPLAY WITH CLOCKS

In this chapter, we focus on how clocks can be used to replay a given computation. We also discuss some of the limitations of using logical clocks and vector clocks in the replay process. Note that the goal of this chapter is only to illustrate the concept and the goals of replay; it does not focus on developing an efficient algorithm for the same. As discussed in the introduction, the goal of replay is to order the events so that we can evaluate various properties of interest. To replay a given computation, we begin with a set where each entry is of the form ⟨e, ts.e⟩, where e is the event (send/receive/local) and ts.e is the timestamp of e. To replay the given set of events, we first find the events e such that all events with smaller timestamps have already been replayed. In other words, we find the set {e | ¬(∃f : ts.f < ts.e)}. We replay one of these events, chosen randomly, and then remove it from the set. The process continues until all events are replayed. The algorithm for replay is shown in Algorithm 3.1.

Algorithm 3.1 ReplayEvents Operation
1. Input: S: set of events and timestamps
2. While S ≠ ∅ do
3.    FrontLine = {e | (e, ts.e) ∈ S ∧ ¬(∃f : (f, ts.f) ∈ S ∧ ts.
๐‘“ < ๐‘ก๐‘ .๐‘’)} Choose a random event ๐‘’ from ๐น๐‘Ÿ๐‘œ๐‘›๐‘ก ๐ฟ๐‘–๐‘›๐‘’ and replay it ๐‘† = ๐‘† โˆ’ {๐‘’} 6. end while 3.1 Limitations of Existing Clocks for Replay As an example, consider the execution in Figure 3.1. Here, we have four events ๐ด, ๐ต, ๐ถ, and ๐ท. Their physical timestamps and logical timestamps are shown in Figure 3.1. If we replay these events using physical clocks then the possible outcomes are ๐ถ๐ต๐ด๐ท or ๐ถ๐ต๐ท ๐ด. Note that both these outcomes are undesirable, as ๐ด should occur before ๐ต based on the causality (happened-before) relation. 8 Figure 3.1 Sample Execution Sequence and Application of Replay Algorithm. If we replay them by logical clocks, the possible outcome is ๐ถ ๐ด๐ต๐ท. However, there is no option to replay ๐ต before ๐ถ even though ๐ต||๐ถ. If we replay them with vector clocks, possible orderings are ๐ด๐ต๐ถ๐ท, ๐ด๐ถ๐ต๐ท, or ๐ถ ๐ด๐ต๐ท. If the clocks were synchronized to be within 5 time units then ๐ด๐ต๐ถ๐ท and ๐ด๐ถ๐ต๐ท are incorrect. 3.2 Requirements of Replay Clock (RepCl) In this thesis, we focus on a system where the physical clocks are synchronized to be within E, i.e., for any two processes ๐‘— and ๐‘˜, | ๐‘๐‘ก. ๐‘— โˆ’ ๐‘๐‘ก.๐‘˜ | โ‰ค E. The goal of RepCl is to assign a timestamp RepCl.๐‘’ to event ๐‘’ such that Requirement 1. If ๐‘’ happened before ๐‘“ then RepCl.๐‘’ < RepCl. ๐‘“ , i.e., ๐‘’ will always be replayed before ๐‘“ Requirement 2. If ๐‘“ occurred far after ๐‘’, i.e., ๐‘’ and ๐‘“ could not have occurred simultaneously under clock drift guarantee of E1 where E1 โ‰ˆ E then ๐‘’ will be replayed before ๐‘“ , i.e., RepCl.๐‘’ < RepCl. ๐‘“ Requirement 3. If ๐‘’ and ๐‘“ could have occurred in any order in a system where clocks were synchronized to be within E2, where E2 โ‰ˆ E then RepCl.๐‘’||RepCl. ๐‘“ (i.e., ยฌ(RepCl.๐‘’ < RepCl. ๐‘“ ) โˆง ยฌ(RepCl. 
๐‘“ < RepCl.๐‘’)) In the last two requirements, we have chosen E1 and E2 instead of E itself as it can permit more efficient implementation by allowing us to maintain a coarse-grained clock. We discuss this further in section 4.6. 9 With these requirements in mind, we show that the RepCl provides efficient replays of distributed computations, without suffering with the overhead that vector clocks impose, and the shortcomings of replay with logical clocks like the HLC. The RepCl combines the best of both these worlds, and provides a better mechanism to replay computations. 10 CHAPTER 4 ALGORITHM FOR THE REPLAY CLOCK (RepCl) In this chapter, we present our approach for RepCl. As discussed earlier, we assume that the physical clocks are synchronized to be within E. We discretize the process execution in terms of epochs, where each epoch corresponds to an increment of the clock by I, 0 < I โ‰ค E such that E = ๐œ– โˆ— I, where ๐œ– is an integer. The timeline of a process is split into epochs where each epoch is of size I(in the local process clock). In other words, the epoch of process ๐‘— is obtained by โŒŠ ๐‘๐‘ก. ๐‘— I โŒ‹. 4.1 Structure of RepCl Timestamp With such discretization, the timestamp of process ๐‘— (or event ๐‘’) is of the form โŸจ๐‘š๐‘ฅ. ๐‘—, bitmap. ๐‘— [], offset. ๐‘— [], counter. ๐‘— []โŸฉ, (4.1) where ๐‘š๐‘ฅ. ๐‘— is an integer for the approximation of the top-level ๐ป ๐ฟ๐ถ, and bitmap. ๐‘—, offset. ๐‘— and counter. ๐‘— are bitsets [9] that store at most one entry bitmap. ๐‘— .๐‘˜, offset. ๐‘— .๐‘˜ and counter. ๐‘— .๐‘˜ for process ๐‘˜. Each of these bitsets is treated as an array but are serialized as integers in packets for efficiency. The intuition behind these variables is as follows: โ€ข ๐‘š๐‘ฅ. ๐‘— denotes the maximum epoch process ๐‘— is aware of (either due to the value of ๐‘๐‘ก. ๐‘— or the value of epochs learned from messages it receives). โ€ข bitmap. 
j is an array of bits, where the bit at index k denotes whether offset.j.k is being stored. If the bit is 1, process j is actively maintaining offset.j.k. This comes in handy for efficient updates to the clock.

• mx.j − offset.j.k denotes the maximum epoch value of k that j has learnt (via a direct/indirect message from k, the clock drift assumption, etc.). If there exists an offset between two processes j and k, offset.j.k denotes the difference between mx.j and mx.k as seen by process j.

• Counters deal with the scenario where multiple events happen within the same epoch and have the same offsets. If two clocks that are not concurrent have the same mx and the same offset values, then the two clocks differ on the counters. The clock with the lower counter value is replayed first.

For example, the timestamp ⟨50, [1, 1, 1], [0, 1, 2], [4, 5, 6]⟩ denotes that this event is aware of epoch 50 of process 0 (as 50 − 0), epoch 49 of process 1 (as 50 − 1), and epoch 48 of process 2 (as 50 − 2). The counter values are 4, 5, and 6, respectively.

4.2 Efficient traversal and lookup

All computations are optimized using the bitmap. While the bitmap does not contribute to the timestamp ordering itself, it allows us to traverse and update the clock efficiently. Here we describe the traversal and lookup of offsets based on the bitmap. For brevity, we describe the rest of the algorithms as simple traversals, but it is important to note that each traversal takes O(number of 1s in the bitmap) time, and getters and setters take O(1) time. To describe these implementations, we use the integer representations of offset.j[] and counter.j[].

4.2.1 Traversal

The traversal operation is described in Algorithm 4.1, which iterates through the bitmap to find all processes for which an offset is being maintained.
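The bitmap traversal and bit extraction can be sketched with standard bit tricks; the following is an illustrative Python version (not the thesis code):

```python
# Sketch of bitmap traversal (cf. Algorithm 4.1) and bit extraction
# (cf. Algorithm 4.2). Clearing the lowest set bit each iteration makes
# the traversal O(number of 1s in the bitmap).

def traverse(bitmap):
    indices = []
    while bitmap > 0:
        lowest = bitmap & -bitmap                # isolate lowest set bit
        indices.append(lowest.bit_length() - 1)  # index of that bit
        bitmap &= bitmap - 1                     # clear lowest set bit
    return indices

def extract(number, k, p):
    # extract k bits of `number` starting at bit position p
    return ((1 << k) - 1) & (number >> p)
```

For example, traverse(0b10110) visits indices 1, 2, and 4, and extract(0b110101, 3, 2) yields 0b101.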
In the algorithm, we get every index that has a set bit in the bitmap. A set bit at position k indicates that offset.j.k is being stored by process j.

Algorithm 4.1 Traversal Operation
1. Define Traversal operation.
2. Traversal(ts)
3. While ts.bitmap > 0
4.    index = log2(((ts.bitmap ⊕ (ts.bitmap − 1)) + 1) >> 1)
5.    // Get or set any offset at index
6.    ts.bitmap = ts.bitmap ∧ (ts.bitmap − 1)
7. End While

4.2.2 Extract

The extract operation extracts k bits from position p. The algorithm is described in Algorithm 4.2.

Algorithm 4.2 Extract Operation
1. Define Extract operation.
2. Extract(number, k, p)
3. Return ((1 << k) − 1) ∧ (number >> p)

4.2.3 GetOffsetAtIndex

GetOffsetAtIndex is an operation that gets the offset stored at a particular index. This is an O(1) lookup operation, using either the index obtained from the bitmap traversal or a specific index that the clock may need. The algorithm is described in Algorithm 4.3. Here, τ denotes the maximum offset size allowed by the user (in bits); it is usually set to log(ε).

Algorithm 4.3 GetOffsetAtIndex Operation
1. Define GetOffsetAtIndex operation.
2. GetOffsetAtIndex(ts, index)
3. offset = Extract(offset.j[].ToInteger(), τ, τ · index)
4. Return offset

4.2.4 SetOffsetAtIndex

SetOffsetAtIndex is an operation that sets the offset at a particular index. This is an O(1) setter operation, using either the index obtained from the bitmap traversal or a specific index that the clock may need, like the GetOffsetAtIndex algorithm.
The algorithm is described in Algorithm 4.4.

Algorithm 4.4 SetOffsetAtIndex Operation
1. Define SetOffsetAtIndex operation.
2. SetOffsetAtIndex(ts, index, newoffset)
3. firstpart = Extract(offsets.j, τ · index, 0)
4. res |= firstpart
5. res |= newoffset << (index · τ)
6. lastpart = Extract(offsets.j, τ · n − (τ · (index + 1)), τ · (index + 1))
7. res |= lastpart << ((index + 1) · τ)
8. Return res

4.2.5 RemoveOffsetAtIndex

The RemoveOffsetAtIndex operation removes an offset given an index. This is an O(1) removal operation once the position is found by the traversal operation. The algorithm is described in Algorithm 4.5.

Algorithm 4.5 RemoveOffsetAtIndex Operation
1. Define RemoveOffsetAtIndex operation.
2. RemoveOffsetAtIndex(ts, index)
3. firstpart = Extract(offsets.j, τ · index, 0)
4. res |= firstpart
5. lastpart = Extract(offsets.j, τ · n − (τ · (index + 1)), τ · (index + 1))
6. res |= lastpart << ((index + 1) · τ)
7. Return res

4.3 Helper functions

Now that we have described the traversals and auxiliary operations, we move on to the clock helper algorithms.
For brevity, we assume all traversals and assignments use the algorithms described in the previous subsection. We omit the bitmap to make the algorithms that follow easier to understand. We discuss two helper functions, Shift and MergeSameEpoch, that will come in handy when we design the main clock processing algorithms.

Figure 4.1 Working of Shift() on Process 0. Here, ε = 15, and the shift is issued to advance to mx 20. Process 3's offset becomes 20, but since ε = 15, the offset is set to ε (due to clock skew limit guarantees).

4.3.1 Shift Operation

The Shift function allows us to change the value of mx. Since mx.j − offset.j.k denotes the knowledge j has about the epoch of k, if mx is changed to newmx without providing j any additional knowledge of the clock of k, then newmx − newoffset.j.k should remain the same as mx.j − offset.j.k. Hence, the Shift operation changes offset.j.k to offset.j.k + (newmx − mx). Furthermore, if this value exceeds ε, we reset it to ε, as guaranteed by the clock drift assumption. (Note that process j can learn about the clock of k via the clock synchronization assumption even if j and k do not communicate.) For example, shifting the timestamp ⟨12, [0, 2, 10]⟩ so that mx is changed to 20 results in ⟨20, [8, 10, 18]⟩. If ε = 15, this changes to ⟨20, [8, 10, ε]⟩ (cf. Figure 4.1).

4.3.2 MergeSameEpoch Operation

The MergeSameEpoch function takes two timestamps t1 and t2 with the same mx value and combines their offsets by setting offset.j.k to min(t1.offset.j.k, t2.offset.j.k). For example, merging ⟨50, [0, 1, 2]⟩ and ⟨50, [2, 0, 1]⟩ results in ⟨50, [0, 0, 1]⟩.
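The Shift and MergeSameEpoch helpers can be sketched on a simplified ⟨mx, offsets[]⟩ timestamp; this is an illustrative Python version (not the thesis code), with EPS standing in for the bound ε:

```python
# Sketch of Shift and MergeSameEpoch on a simplified <mx, offsets[]>
# timestamp. Offsets are capped at EPS per the clock drift assumption.

EPS = 15

def shift(mx, offsets, newmx):
    # keep mx - offset[k] invariant for every k, capping each at EPS
    return newmx, [min(o + (newmx - mx), EPS) for o in offsets]

def merge_same_epoch(offsets1, offsets2):
    # both timestamps share the same mx; take the element-wise minimum
    return [min(a, b) for a, b in zip(offsets1, offsets2)]
```

This reproduces the worked examples: shift(12, [0, 2, 10], 20) gives (20, [8, 10, 15]) with the last offset capped at ε = 15, and merge_same_epoch([0, 1, 2], [2, 0, 1]) gives [0, 0, 1].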
4.3.3 EqualOffset Operation
The EqualOffset function takes in two timestamps t1 and t2 and checks whether the offset arrays and mx values are the same. This is used particularly to update the counter array if the other values are equal.

Algorithm 4.6 Shift Operation
1. Define Shift operation.
2. Shift(ts, newmx)
3. For each k do
4.   ts.offset.k = ts.offset.k + (newmx − ts.mx)
5.   If ts.offset.k > ε then
6.     ts.offset.k = ε
7.   End If
8. End For
9. Output: ts

Algorithm 4.7 MergeSameEpoch Operation
1. Input: Timestamp t1, Timestamp t2
2. Timestamp ts = new Timestamp
3. For each k do
4.   ts.offset.j.k = min(t1.offset.j.k, t2.offset.j.k)
5. End For
6. Return ts

Algorithm 4.8 EqualOffset Operation
1. Input: Timestamp t1, Timestamp t2
2. If t1.mx ≠ t2.mx ∨ (∃ j : t1.offset.j ≠ t2.offset.j) then
3.   Return false
4. Else
5.   Return true
6. End If

4.4 Description of the RepCl Algorithm
In this section, we discuss the key clock processing algorithms. These algorithms update the clock based on the type of event observed on the process. We describe two key operations: Send/Local and Receive.

4.4.1 Local/Send event
Here, we describe how RepCl.j is updated when j sends a message. Let the current timestamp of j be

⟨mx.j, bitmap.j[], offset.j[], counter.j[]⟩ (4.2)

First, mx.j needs to be increased if the clock of j has advanced beyond epoch mx.j. Hence, we first compute newmx.j, which is equal to max(mx.j, epoch.j). When j sends a message, it does not learn any new information about the clock of process k. We consider two cases: The first case is for the scenario where the newly created event f is in the same interval as the previous event, e.
This will happen if mx remains unchanged and offset.j.j is unchanged. In this case, we increase counter.j.j. The second case deals with the scenario where f is in a new interval. Thus, the offset associated with each k is changed using the Shift operation. Note that the Shift operation computes the shift of all processes except j; offset.j.j should be based on the value of epoch.j. Hence, we set it equal to newmx − epoch.j. The Shift operation is illustrated in Algorithm 4.6.

4.4.2 Receive event
Next, we describe how RepCl is updated when j with timestamp

⟨mx.j, offset.j[], counter.j[]⟩ (4.3)

receives a message m with timestamp ⟨mx.m, offset.m[], counter.m[]⟩. First, we compute newmx, which is the maximum of mx.j, mx.m and pt.j. Timestamps of j and m are then shifted to newmx using the Shift operation. These timestamps are then merged to obtain the mx and offset values of the new event, say f.

Algorithm 4.9 Send Message
1. newmx = max(mx.j, pt.j)
2. new_offset = newmx − pt.j
3. If (mx.j = newmx ∧ offset.j.j = new_offset) then
4.   counter.j.j = counter.j.j + 1
5. Else
6.   ts.j = Shift(ts.j, newmx)
7.   offset.j.j = min(newmx − pt.j, ε)
8.   counter.j = [0, 0, . . . , 0]
9. End If

Now, we check if the knowledge that f has about epochs is the same as that of e (the previous event on j) or m. If all three are in the same epoch then, for each k, counter.j.k is set to the maximum of counter.j.k and counter.m.k, and counter.j.j is then incremented by 1. If only e and f are in the same epoch, counter.j.j is incremented by 1. If only m and f are in the same epoch, counter.
j is set to counter.m and the value of counter.j.j is incremented by 1. If none of these conditions apply then the counters are reset to 0.

4.5 Comparing RepCl Timestamps
The happens-before relation in RepCl is codified in Algorithm 4.11. In this algorithm, we check whether timestamp t1 happens-before timestamp t2. In this relation, we first compare the HLCs of the two RepCl timestamps. Since the HLC provides the top-level information of the physical clock of the message, it resolves ties between clocks having different HLCs. If the HLCs are equal, we move to the offsets. For t1 to strictly happen before t2, we follow the comparison used in traditional vector clocks, where vc.e < vc.f iff (∀ j :: vc.e.j ≤ vc.f.j) ∧ (∃ j :: vc.e.j < vc.f.j). If the offsets for the two timestamps are also equal, we check the counters. We say two events are concurrent by the definition introduced in Requirement 3, which states that if ¬(RepCl.e < RepCl.f) ∧ ¬(RepCl.f < RepCl.e), then RepCl.e || RepCl.f.

Algorithm 4.10 Receive Message
1. Input: Received Message m
2. newmx = max(mx.j, mx.m, pt.j)
3. ts.a = Shift(ts.j, newmx)
4. ts.b = Shift(ts.m, newmx)
5. ts.c = MergeSameEpoch(ts.a, ts.b)
6. If EqualOffset(ts.j, ts.c) ∧ EqualOffset(ts.m, ts.c) then
7.   For each k do
8.     counter.j.k = max(counter.j.k, counter.m.k)
9.   End For
10.  counter.j.j = counter.j.j + 1
11. End If
12. If EqualOffset(ts.j, ts.c) ∧ ¬ EqualOffset(ts.m, ts.c) then
13.   counter.j.j = counter.j.j + 1
14. If ¬ EqualOffset(ts.j, ts.c) ∧ EqualOffset(ts.m, ts.c) then
15.   counter.j = counter.m
16.   counter.j.j = counter.j.j + 1
17. If ¬ EqualOffset(ts.j, ts.c) ∧ ¬ EqualOffset(ts.m, ts.c) then
18.   counter.j = [0, 0, . . . , 0]

Algorithm 4.11 Compare Operation
1. Input: Timestamp t1, Timestamp t2
2. If t1.mx < t2.mx then
3.   Return True
4. Else If t1.mx > t2.mx then
5.   Return False
6. Else
7.   For i, j in t1.offsets, t2.offsets
8.     If t1.offsets.i > t2.offsets.j then
9.       Return False
10.    If t1.counters ≤ t2.counters then
11.      Return True
12.  End For
13.  Return False
14. End If

As an illustration, consider the execution of the program in Figure 3.1. Assuming that ε = 5, I = 1, and E = 5, the RepCl timestamps will be as shown in Figure 4.2. Here, event A has a physical time of 50. Since process P1 has not heard from anyone else so far, the offsets for P2 and P3 will be ε. The offset for process P1 will be 0. Regarding event C, the situation is similar except that the offset for process P3 is 0. When event B is created upon receiving message m1, process P2 is aware of time 50 from P1, and it is the maximum epoch it is aware of. Hence, the offsets are [0, 2, ε], respectively. When event D is created, process P2 is aware of epoch 52 (from P2) and epoch 50 (from P1). It is aware of timestamp 40 from P3. However, this information is overridden by the clock synchronization guarantee that says that the clock of P3 is at least 47. Thus, the offsets are set to [3, 2, ε]. Here, the permissible ordering is CABD. In this figure, if ε were 20 then the timestamp of D would be changed to [3, 2, 12]. Furthermore, B and C could be replayed in any order. Thus, the permissible replays would be CABD or ABCD or ACBD.

Figure 4.2 Replay of the Execution in Figure 3.1 with RepCl.
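The comparison and concurrency checks of this section can be sketched as follows. This is a simplified model under our own assumptions (tuple timestamps, an assumed ε, and illustrative names), not the bit-packed implementation; knowledge about process l is recovered as mx − offsets[l], and ties fall through to the vector-clock rule on offsets and then on counters.

```python
EPSILON = 5  # assumed clock skew bound in epochs

def knowledge(mx, offsets):
    # mx - offsets[l] is the epoch of process l that this event knew about.
    return [mx - o for o in offsets]

def happened_before(t1, t2):
    (mx1, off1, cnt1), (mx2, off2, cnt2) = t1, t2
    if abs(mx1 - mx2) > EPSILON:          # epochs far apart: mx decides
        return mx1 < mx2
    k1, k2 = knowledge(mx1, off1), knowledge(mx2, off2)
    if k1 != k2:                          # vector-clock rule on knowledge
        return all(a <= b for a, b in zip(k1, k2))
    # equal knowledge: fall through to the counters, vector-clock style
    return all(a <= b for a, b in zip(cnt1, cnt2)) and cnt1 != cnt2

def concurrent(t1, t2):
    return not happened_before(t1, t2) and not happened_before(t2, t1)
```

For instance, two events with timestamps (50, [0, 5, 5], [0, 0, 0]) and (50, [5, 0, 5], [0, 0, 0]), created on processes that have not heard from each other, come out concurrent, since neither knowledge vector dominates the other.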
4.6 Properties of RepCl
In this section, first, we define the < relation on two timestamps RepCl.e and RepCl.f. Then, we identify the properties of this < relation and the happened-before relation. Given timestamps RepCl.e = ⟨mx.e, offset.e[], counter.e[]⟩ and RepCl.f = ⟨mx.f, offset.f[], counter.f[]⟩, we say that RepCl.e < RepCl.f iff

  mx.f > mx.e + ε
  ∨ ( |mx.f − mx.e| ≤ ε
      ∧ ( ( ∀l : (mx.e − offset.e.l) ≤ (mx.f − offset.f.l)
            ∧ ∃l : (mx.e − offset.e.l) < (mx.f − offset.f.l) )
          ∨ ( ∀l : (mx.e − offset.e.l) = (mx.f − offset.f.l)
              ∧ ∀l : counter.e.l ≤ counter.f.l
              ∧ ∃l : counter.e.l < counter.f.l ) ) )

The above < relation first checks whether mx.f and mx.e are far apart. If that is the case, we define RepCl.e < RepCl.f. If they are close, i.e., |mx.f − mx.e| ≤ ε, then we compare the offsets. Since mx.e − offset.e.k identifies the knowledge e had about the epoch of process k, we use a comparison that is similar to vector clocks to determine if the < relation holds between RepCl.e and RepCl.f. Finally, if the offsets are also equal then we use the comparison of counters (again, in the same fashion as vector clocks).

We overload the || relation for comparing timestamps as well. Specifically, given timestamps RepCl.e and RepCl.f, we say that RepCl.e || RepCl.f iff

¬(RepCl.e < RepCl.f) ∧ ¬(RepCl.f < RepCl.e) (4.4)

From the construction of the timestamp algorithm, we have the following two lemmas:

Lemma 1: (e happened before f) ⇒ RepCl.e < RepCl.
f

Lemma 2: |mx.e − mx.f| ≤ ε ∧ (e || f) ⇒ RepCl.e || RepCl.f

Requirement 1 of RepCl: Observe that Lemma 1 satisfies the first requirement of RepCl; if e happened before f then e must be replayed before f.

Requirement 2 of RepCl: Now, we focus on the second requirement. Specifically, we show that by letting E1 = E + I, the second requirement is satisfied. Observe that in the RepCl algorithm, messages carry the epoch values of multiple processes. This allows a process to learn epoch information about other processes. For the subsequent discussion, imagine that the messages also carried the actual physical time as well. In this case, j will learn about the clock of a process k via such messages. Additionally, j will also learn about the clock of a process k based on the assumption of clock synchronization. Likewise, when event e is created, it will have some information about the clock of each process. Let mxph.e and mxph.f be the maximum clock (of any process) that e and f are aware of when they occurred. If mxph.f > mxph.e + E1 then f cannot occur before e under the clock synchronization guarantee of E1. Now, we show that in this situation, it is guaranteed that RepCl.e < RepCl.f. By definition of mx, mx.f = ⌊mxph.f / I⌋ and mx.e = ⌊mxph.e / I⌋. Additionally, we have

  −1 < (x − ⌊x⌋) − (y − ⌊y⌋) < 1
  ⟹ −1 < (mxph.f/I − ⌊mxph.f/I⌋) − (mxph.e/I − ⌊mxph.e/I⌋) < 1
  ⟹ −1 < (mxph.f/I − mx.f) − (mxph.e/I − mx.e) < 1
  ⟹ −1 < ((mxph.f − mxph.e)/I) − (mx.f − mx.e) < 1
  ⟹ −I ≤ (mxph.
๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’) โˆ’ (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’)I โ‰ค I Now, if ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ > E + I then we can rewrite the second inequality as ( E+I I โˆ’ (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’) โ‰ค 1. Using the fact that E = ๐œ– โˆ— I, we have ๐œ– < (๐‘š๐‘ฅ. ๐‘“ โˆ’ ๐‘š๐‘ฅ.๐‘’). In other words, ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ > E + I โ‡’ (๐‘š๐‘ฅ. ๐‘“ > ๐‘š๐‘ฅ.๐‘’ + ๐œ–). which gives us RepCl.๐‘’ < RepCl. ๐‘“ . In other words, we have 23 Lemma 3: If ๐‘“ occurred far after ๐‘’, i.e., ๐‘“ could not have occurred before ๐‘’ in a system that guarantees that clocks are synchronized within E1 = E + I then ๐‘’ will be replayed before ๐‘“ , i.e., RepCl.๐‘’ < RepCl. ๐‘“ . Requirement 3 of RepCl: Next, we consider the case where ๐‘’ and ๐‘“ could have occurred in any order if the underlying system guaranteed that clocks were synchronized to be within E2 = E โˆ’ I. Letting the maximum clock that event ๐‘’ (respectively, ๐‘“ ) was aware of to be ๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ (respectively, ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ ), we observe that |๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ | โ‰ค E2. Furthermore, ๐‘’ and ๐‘“ must be causally concurrent. Under this scenario, we show that RepCl.๐‘’||RepCl. ๐‘“ . If |๐‘š๐‘ฅ ๐‘โ„Ž.๐‘’ โˆ’ ๐‘š๐‘ฅ ๐‘โ„Ž. ๐‘“ | โ‰ค E โˆ’ I, we have =โ‡’ | =โ‡’ |โŒŠ ๐‘š ๐‘๐‘ก.๐‘’ I โˆ’ ๐‘š ๐‘๐‘ก.๐‘’ I ๐‘š ๐‘๐‘ก. ๐‘“ I | โ‰ค ๐œ– โˆ’ 1 // since E = ๐œ– โˆ— I โŒ‹ โˆ’ โŒŠ ๐‘š ๐‘๐‘ก. ๐‘“ I โŒ‹| โ‰ค ๐œ– since |(๐‘ฅ โˆ’ โŒŠ๐‘ฅโŒ‹) โˆ’ (๐‘ฆ โˆ’ โŒŠ๐‘ฆโŒ‹)| < 1 =โ‡’ |๐‘š๐‘ฅ.๐‘’ โˆ’ ๐‘š๐‘ฅ. ๐‘“ | โ‰ค ๐œ– by definition of ๐‘š๐‘ฅ Now, from Lemma 2, RepCl.๐‘’||RepCl. ๐‘“ . In other words, Lemma 4: If ๐‘’ and ๐‘“ could have occurred in any order in a system where clocks were synchro- nized to be within E โˆ’ I then RepCl.๐‘’||RepCl. ๐‘“ . 
4.7 Effect of discretization and comparison with Hybrid Vector Clocks [1]
We note that the discretization of the clock via I has caused the bounds used for clock synchronization in Lemmas 1 and 2 to be different. We could have eliminated this if we had not discretized the clocks. (Discretization with I was not done in [1].) However, without discretization, the values of the offsets would be very large: we would need to rely on just the physical clocks, which have a granularity of under 1 nanosecond. Now, if E = 1 ms then the value of an offset could be as large as 10^6. By discretizing the clock, it is possible to keep the offsets very small. We expect that the discretization will not seriously impact the replay. For example, if E = 1 ms and I = 0.1 ms then our algorithm will guarantee that causally concurrent events within 0.9 ms can be replayed in any order. And, events that could not occur simultaneously under a clock synchronization guarantee of 1.1 ms will be replayed in only one order. Additionally, if e happened before f then e will always be replayed before f.

Figure 4.3 RepCl representation.

4.8 Representation of the RepCl and its Overhead
In this section, we identify how RepCl can be stored to permit efficient computation. As written, RepCl will require 2n + 1 integers. However, a more compact representation is possible when we account for the fact that it is being used in a system where the clocks are synchronized to be within E. Thus, if j does not hear from k (directly or indirectly) for a long time then the knowledge j would have about the clock of k is the same as that provided by the clock synchronization assumption. In this case, offset.j.k = ε. It follows that there is no need to store this value if we interpret no information about the offset of k to mean that offset.j.k = ε. With this intuition, we represent RepCl.
j as shown in Figure 4.3. Here, the value of mx.j (represented by the first word) is 50. The second word is a bitmap that identifies whether offset.j.k is stored for process k; each offset is stored with a fixed number of bits in the subsequent word(s). Since the bit corresponding to process 1 is 0, it implies that offset.j.1 = ε. The offset for process 2 is 10 (the first 4 bits of the offset word) and counter.j.2 is 2 (the first 2 bits of the counter word). We note that the bits for each offset and counter are hard-coded based on the system parameters (cf. Chapter 6).

Next, we show that this representation allows us to reduce the cost of storage as well as the cost of computing timestamps or comparing them (using the < relation). Specifically, all these costs are proportional to the number of processes that have communicated with j recently. With the representation in Figure 4.3, we first note that finding the locations of the 1s in a given bitmap can be done in time proportional to the number of 1s in the bitmap (n − (n & (n − 1)) returns the number with only the rightmost 1 set). Thus, we have

Observation 1: Shift and MergeSameEpoch can be implemented in O(x) time, where x is the number of bits set to 1 in the bitmap.

Note that this means that the time to compute the timestamp for a send/receive at process j does not depend upon the number of processes in the system, but only upon the processes that have recently communicated with process j. In turn, this means that

Observation 2: Send and Receive can be implemented in O(x) time, where x is the number of bits set to 1 in the bitmap.

Observation 3: Given two timestamps, RepCl.e and RepCl.f, we can determine if RepCl.e < RepCl.f in O(x) time, where x is the number of bits set to 1 in e and f.
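The rightmost-1 trick underlying Observations 1–3 can be sketched in Python; bit positions stand for process ids, and the function name is our own. Each loop iteration isolates and clears one set bit, so the loop costs O(x) for x set bits, independent of the number of processes.

```python
def set_bit_positions(bitmap: int):
    """Return the indices of the 1 bits of `bitmap`, lowest first."""
    positions = []
    while bitmap:
        lowest = bitmap - (bitmap & (bitmap - 1))  # only the rightmost 1
        positions.append(lowest.bit_length() - 1)  # its bit index
        bitmap &= bitmap - 1                       # clear the rightmost 1
    return positions

print(set_bit_positions(0b101101))  # [0, 2, 3, 5]
```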
It follows that the number of bits that are 1 in a given timestamp identifies not only the storage cost of the timestamps but also the time to compute these timestamps at run time. Effectively, this also identifies the overhead of the timestamps that enable the replay of the computation. Hence, in Chapter 6, we focus on identifying scenarios where the cost of storing these offsets is within the limits identified by the user.

We note that the above approach will work as long as the number of processes is less than the number of bits in a word (typically, 64 in today's systems). We expect that this will be more than sufficient for many systems in practice. If there are more than 64 processes, we expect that process j communicates with only a subset of these processes. And, if process j does not communicate with some process, there is no need to store an offset for it. Thus, this approach can be extended to the case where the number of processes is larger. However, the details are out of the scope of this thesis.

CHAPTER 5 SIMULATOR SETUP

In this chapter, we discuss the construction of a custom discrete event simulator, to serve as a validation baseline for state-of-the-art simulators available for research. We first design a baseline simulator built to natively support the RepCl infrastructure. We call this the Custom Discrete Event Simulator, or CDES. The goal of the RepCl was to serve as a plug-and-play structure that allows replays to be supported natively by virtue of the clock's algorithm. In other words, the clock should be self-contained: replay should only require RepCl-timestamped logs to function correctly, and any simulator should be able to incorporate the visualization just by implementing the clock in its infrastructure. Details of this simulator are further discussed in Section 5.1. However, building a simulator for the RepCl introduces a bias in the design.
To validate the results against more real-world scenarios, we considered using a state-of-the-art (SOTA) simulator, NS-3 [10]. NS-3 proved to be an effective simulator in which to implement the RepCl structure. However, NS-3 posed challenges in modeling node-local noisy clocks, which is discussed further in Section 5.2. For this reason, we revised NS-3 to implement structures that approximate what a node-local clock would look like. Since the NS-3 team has expressed the desire to enhance NS-3 with node-local noisy clocks, we decided to build a custom implementation of the same. We designed the node-local clock to permit any arbitrary node-level clock (e.g., HLC [11], Vector clocks [6][7], Logical Clocks [5], etc.).

This chapter is organized as follows. Section 5.1 describes the custom simulator we designed to validate the results obtained by any generic simulator, and to identify key differences in the results. During our research, we chose to implement the custom simulator (CDES) first, as the choice of the SOTA simulator was not apparent. Additionally, the SOTA simulator should produce the same results as the CDES, as the CDES was built solely for the RepCl. The CDES served as a ground-truth system, and any architecture we chose would be validated against the results of this simulator. Section 5.2 talks about NS-3, and the revisions that needed to be made to implement our clock infrastructure. We discuss the application implemented, and the complexities that the application had to handle to provide correct results.

5.1 CDES, A Custom Discrete Event Simulator
In this section, we detail the design of our own custom discrete event simulator (CDES). We modeled processes containing a physical clock pt and the RepCl. The design of the CDES is discussed below. Each process maintained a vector msg_queue that queued messages sent to that process.
The messages contained the RepCl of the sending process, along with the time the message was to be processed by the receiving process. This receiving time was configured by the sending process by reading the clock of the receiver and adding the message delay to that time. When the receiving process obtained this message, it would compare its physical time pt with the receiving time, and process the message. Each process had a skewing node-local physical clock. We implemented this by randomly advancing the clock of a process based on a seed, sometimes choosing not to advance the clock at all. We maintained the invariant that for no two processes i and j is |pt.i − pt.j| > E ∗ I, the same as in the NS-3 case. To test the clock, we used the same parameters for n, the number of processes, E, I, δ and α. The message delay is modeled as the average delay experienced in the simulated network, and can vary by some nonzero Δ in production. In the simulation, at every clock tick, a process delivers any messages it is expected to deliver at that clock tick. It also sends a message to other processes based on the message rate α. When a message is sent, the corresponding receive event is added to the receiver's queue based on the value of δ. We also compute the actual value of the maximum clock skew observed in the simulation to ensure that if E = 1 ms then the worst-case clock skew is indeed 1 ms. The simulation was initialized such that each process started with the same starting clock, and at each microsecond time step, each process made a decision to send a message. The process first generated a random number in the range [0, 100], and if this number was lower than α, the process elected to send a message to any other process in the simulation or perform a local event.
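The per-tick behavior described above can be abstracted as follows. This is a hedged sketch under our own assumptions: the constants ALPHA and DELTA, the queue layout, and the function tick are illustrative, and the real CDES additionally updates RepCl timestamps and the skewing node-local clocks.

```python
import random

ALPHA = 40   # send threshold in [0, 100], i.e., the message rate (assumption)
DELTA = 8    # constant message delay in ticks (assumption)

def tick(pid, now, processes, queues, rng):
    """One simulation step for process `pid` at time `now`."""
    # Deliver every queued message whose receive time has arrived.
    delivered = [m for m in queues[pid] if m["recv_at"] <= now]
    queues[pid] = [m for m in queues[pid] if m["recv_at"] > now]
    # Decide whether to send, based on the message rate alpha.
    if rng.randint(0, 100) < ALPHA:
        dest = rng.choice([p for p in processes if p != pid])
        queues[dest].append({"src": pid, "recv_at": now + DELTA})
    return delivered
```

A driver loop would call tick for every process at every time step, feeding the delivered messages into the receive-event handling of the clock.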
Each process in the simulation had a uniform chance to send a message with constant delay δ to any other process in the system. The total messages sent varied with α, within the range of [140, 10208] messages over 10,000 steps of the simulation. The parameters we varied are detailed in Table 5.1.

Parameter | Minimum Value | Maximum Value | Increments
N | 32 processes | 64 processes | 32 processes
E | 10 units | 1000 units | 50 units
I | 100 microseconds | 1000 microseconds | 50 microseconds
δ | 1 microsecond | 8 microseconds | 2x microseconds
α | 10 messages/s | 160 messages/s | 2x messages/s

Table 5.1 Parameter configurations for the NS-3 Simulations. We only selected the configurations where E ∗ I % 1000 == 0 to give us acceptable clock skew limits of [1 ms, 6 ms].

5.2 NS-3 Simulator
Network Simulator-3 (NS-3) [10] provides a generic discrete event simulator that works on top of different devices implementing applications in different topologies. What makes NS-3 an attractive option for testing clock infrastructures is its versatility in the types of nodes available, its ease of use in topology configuration and application design, and, most importantly, the configurability it offers in designing simulations. The NS-3 infrastructure describes a generic Simulator that allows different network configurations to be supported and tested by changing a few parameters of this Simulator class. These factors made NS-3 an attractive option to test and incorporate the RepCl. However, NS-3 posed a few challenges. NS-3, as of the time of writing this thesis, does not provide support for node-local noisy clocks. Clocks in NS-3 are synchronized with the top-level Simulator, and nodes do not contain their own clock implementations. There have been attempts at creating a node-local noisy clock, but they have not been incorporated into the NS-3 infrastructure. Another challenge stems from this: due to the absence of node-local noisy clocks, NS-3 does not handle clock drifts.
Due to the absence of clock drifts, the RepCl would not store any offsets, as all processes would tick in sync with the Simulator class. Hence, we devised an API to overcome these key challenges. We implemented a node-local noisy clock, which approximates a node reading its physical time, with an added value δ to approximate the noise produced by skewing physical clocks. The algorithm for the node-local noisy clock is described in Algorithm 5.1. Here, nt.i denotes the node-local time of process i.

Algorithm 5.1 Node-Local Noisy Clock: Get Operation
1. Input: SimulatorTime
2. nt.i = random(nt.i, SimulatorTime + (E ∗ I))
3. Return nt.i

As described in Algorithm 5.1, we receive a clock that maintains the relation that for no two processes i and j is |pt.i − pt.j| > E ∗ I, but produces clocks that skew with respect to each other. This helps us produce offsets between different processes in a dynamic fashion, and allows us to handle clock drifts.

Using this node-local clock, we design an application in NS-3 called the ReplaySimulatorApplication, with nodes implementing the node-local clocks and the RepCl. The RepCl uses the local clock to perform updates on itself. The ReplaySimulatorApplication picks a random candidate node for each node and sends a message at intervals defined by the message rate α. The channels implemented provide a maximum data rate of 500 Mbps, and a message delay defined by δ. We also provided the option to choose E and I for the purposes of testing the clock sizes. In a more real-world implementation, only I would be changeable by the user. All other parameters would be specified by the distributed system's operating constraints.
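Algorithm 5.1 can be modeled with a small Python class; the class name, the seeding, and the parameter skew_bound (playing the role of E ∗ I) are our own illustrative choices, not NS-3 API.

```python
import random

class NoisyClock:
    """Node-local noisy clock modeled after Algorithm 5.1: each read returns
    a value between the previous reading and SimulatorTime + E * I."""

    def __init__(self, skew_bound, seed=0):
        self.skew_bound = skew_bound        # E * I, in simulator time units
        self.nt = 0.0                       # last node-local time returned
        self.rng = random.Random(seed)

    def get(self, simulator_time):
        # random(nt.i, SimulatorTime + (E * I)) from Algorithm 5.1
        self.nt = self.rng.uniform(self.nt, simulator_time + self.skew_bound)
        return self.nt
```

Successive reads are monotone and never run ahead of the simulator time by more than the skew bound, which is how different nodes end up with clocks that drift with respect to each other.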
We simulated a distributed environment to test the clock with five parameters: the number of processes (n), the maximum allowed clock skew (E), the interval size (I), the message rate (α), and the message delay in microseconds (δ). We collected results for about 20 seconds for each run. Table 5.1 defines the variation statistics of each of these parameters.

A key change we made in the NS-3 simulator is the counter storage. We store only the sum of the counters in the counter array space. This is explained further in Section 7.3.1. The key idea is that there are very few events that actually store counters for all processes, and storing just the sum of the counters of all processes helps us condense the information needed, without loss of generality in most cases. We do risk missing a few orderings, but it is an acceptable tradeoff, as the cases where many counters are stored are rare.

CHAPTER 6 SIMULATION RESULTS

As demonstrated in Section 4.6, the overhead of RepCl depends upon the number of offsets/counters that need to be stored. And, this value depends upon the number of processes that communicate with a given process within the time E. In other words, the system parameters will determine the size of RepCl. In this section, we evaluate the overhead of RepCl via simulation. For the purposes of the results, we denote the offset array size as θ and the counter array size as σ. In the following sections, we outline the effects of changing different parameters both in the custom simulator and in NS-3. Note that the results for the CDES are reported in 32-bit word lengths, and the results for NS-3 are reported in bits, due to the different data collection techniques used for each.

6.1 Effect of Clock Skew (E)
In this section, we measure the trends in E while varying the other parameters to see how θ and σ are affected. We compare each parameter pair-wise with E to see the effect of the parameter on the clock skew trend with θ.
6.1.1 Analysis of Varying Interval Size (I) with the CDES
Here, we keep δ and α constant to see how θ and σ change with E.

• In the case of the θ vs E curve, we notice that the value of I has little bearing on θ. As expected, as the value of E increases, θ increases with it. This is true for all values of α, and we consistently store more offsets as α increases. For any given value of α, however, the value of I can be chosen to set the granularity of the user's choice, which allows more flexibility in the clock information. Regardless of the choice of I by the user, the offset sizes increase with roughly the same trend. These trends are illustrated in Figure 6.1.

• In the case of the σ vs E curve, we do not see much variation in σ, as most events reach a different epoch. On average, we do not see many events storing counters; roughly 0.78% of events store counters, and the values of such counters do not exceed 5 in most cases. Hence, we would need very little space to store these counters. These trends are illustrated in Figure 6.2. Since this observation is true for all simulations in this thesis, we do not discuss the analysis for σ in the subsequent sections.

Figure 6.1 Custom Simulator: θ vs E when varying I, δ = 8 μs. Panels (a)–(f): α = 20, 40, 160 msgs/s for n = 32 and n = 64.

Figure 6.2 Custom Simulator: σ vs E when varying I, δ = 8 μs, α = 160 msgs/s. Panels: (a) n = 32, (b) n = 64.

6.1.2 Analysis of Varying Interval Size (I) with NS-3

• In the case of the θ vs E curve in the NS-3 simulation, we observe a similar trend of θ increasing with E. This is in agreement with our findings from the custom simulator, and is true for all values of α. These trends are illustrated in Figure 6.3.
While we see a decrease in the number of bits stored as I decreases, the difference is significant only in some cases, notably at lower message rates. At higher message rates, the gap narrows. This is due to the amount of communication happening: most processes tend to store close to the acceptable limit of the number of offsets we want to store.

• In the case of the σ vs E curve in the NS-3 simulation, we store only the sum of the counters. Hence, we see some variations in counter size. As E increases, we store slightly higher counters, and this does not change with variation in I. However, the sizes of the counters vary only from 0.03 bytes in the lowest case of n = 32 to 0.1 bytes in the highest case. Hence, the number stored as the counter value is not too large. This is true even for the case of n = 64, and is depicted in Figure 6.4.

Figure 6.3 NS-3 Simulator: θ vs E when varying I, δ = 1 μs. Panels (a)–(f): α = 20, 40, 160 msgs/s for n = 32 and n = 64.

Figure 6.4 NS-3 Simulator: σ vs E when varying I, δ = 1 μs, α = 20 msgs/s. Panels: (a) n = 32, (b) n = 64.

Since the total size of RepCl depends upon the number of bits for each offset and the total number of offsets, we consider a specific example here. For Figure 4.3, the number of offsets is 2 and the size of each offset is 4 bits. Therefore, one word is sufficient to store the offsets. Likewise, one word is enough for the counters. Thus, we need a total of 4 words to store this timestamp. (Note that the counters can be stored in the same amount of memory, as we have a number of extraneous bits in this representation, specifically in the max epoch word, if we elect to store the sum of all these counters there.
By doing this, we would lose some information, but considering that the number of events that record meaningful counters is low, this may be an acceptable trade-off.) It is straightforward to observe that the size of RepCl grows linearly with the number of offsets in it. Additionally, computing the total size of RepCl requires the use of the floor function to identify the number of words necessary to store it. Since the floor operation loses some of the relevant data, we present the value of θ in this section.

6.1.3 Analysis of Varying Message Delay (δ) for the CDES

Here, we fix I and α and, for different δ values, we identify how θ changes with E. As E increases, we see higher values of θ, implying a higher number of offsets stored on average. We observe that higher values of δ produce a lower number of offsets in each case, barring some noise. This is expected, as an increase in δ implies that messages reach in a delayed fashion, leading processes to set other processes' offsets to ε due to non-receipt of messages. As E increases, a process hears from more processes (directly or indirectly) within time E. Hence, the number of offsets increases. This is illustrated in Figure 6.5.

Figure 6.5 Custom Simulator: θ vs E when varying δ, I = 8 μs, α = 40 msgs/s; panels (a) n = 32, (b) n = 64.

6.1.4 Analysis of Varying Message Delay (δ) for NS-3

As in the case of the custom simulator, we see higher values of θ as E increases. We do not see much of a difference as δ varies, however: the number of bits stored for each δ value is within ±1 bit. The increase in offset size is roughly linear as E increases, the same as observed in the analysis of varying I. The δ variations are not pronounced because the clocks implicitly synchronize when messages are exchanged.
A message sent from far back in the past effectively does not change the clock of the receiver. Conversely, if a process receives a clock from the future (in its local observation), it pushes its own clock forward to this future timestamp to guarantee the acceptable E limit. This is why different δ values show close to no variation, as illustrated in Figure 6.6. It is important to note, however, that when δ exceeds the E limit, no process stores any offsets.

Figure 6.6 NS-3 Simulator: θ vs E when varying δ, I = 20 μs, α = 160 msgs/s; panels (a) n = 32, (b) n = 64.

6.1.5 Analysis of Varying Message Rate (α) for the CDES

Here, we fix I and δ and, for different α values, we identify how θ changes with E. As expected, for lower values of α we consistently store fewer offsets, as communication between processes is sporadic. As E increases, θ increases linearly until the bound of n is reached. This is for the same reason mentioned earlier: as the bound lengths on epochs grow, even sporadic messages tend to store more offsets on other processes, causing the overall value of θ to increase. This is illustrated in Figure 6.7.

Figure 6.7 Custom Simulator: θ vs E when varying α, I = 4 μs, δ = 8 μs; panels (a) n = 32, (b) n = 64.

6.1.6 Analysis of Varying Message Rate (α) for NS-3

Our results from the custom simulator are confirmed by the experiments in NS-3, where lower values of α store fewer offsets due to less communication. In NS-3, higher values of α store many more offsets than our desired upper limit (about one word of offsets stored per clock); staying within that limit is guaranteed by lower values of α. This is illustrated in Figure 6.8.

6.2 Effect of Interval Size (I)

In this section, we observe the trends in I with respect to δ and α.
6.2.1 Analysis of Varying Message Delay (δ) for the CDES

Here, we fix E and α and check how θ changes with I.

Figure 6.8 NS-3 Simulator: θ vs E when varying α, I = 20 μs, δ = 4 ms; panels (a) n = 32, (b) n = 64.

From Figure 6.9, we observe that the value of I does not have a significant effect on θ (note that the Y axis of this figure varies only from 1.2 to 1.5). This means that the selection of I does not affect the number of offsets maintained by a process. However, it affects the size of each offset. Specifically, the maximum value of an offset is ε = E/I, and the number of bits required for each offset is log2(ε). Hence, a larger value of I is better for reducing the size of the RepCl. However, with larger I, the guarantees provided by RepCl are weaker. Specifically, Lemma 3 shows that some unforced reordering may occur when events e and f differ by time E + I. Users should therefore choose the value of I based on the desired guarantees of RepCl or the maximum desirable offset.

Figure 6.9 Custom Simulator: θ vs I when varying δ, E = 2 ms, α = 20 msgs/s; panels (a) n = 32, (b) n = 64.

6.2.2 Analysis of Varying Message Delay (δ) for NS-3

In the case of the NS-3 simulator, we observed that the size of offsets decreased linearly with an increase in I. This is attributed to more information being stored in counters as the length of the interval increases, and fewer offsets being stored. The likelihood that all processes are in the same epoch increases as I increases, leading to a lower number of offsets stored. The variation with δ is negligible, and seemingly random, which is confirmed by the custom simulator results.

Figure 6.10 NS-3 Simulator: θ vs I when varying δ, E = 3 ms, α = 20 msgs/s; panels (a) n = 32, (b) n = 64.

6.2.3 Analysis of Varying Message Rate (α) for the CDES

For every point in the I-θ trend, lower α values produce lower θ.
This is consistent with our observations so far: communication is infrequent, and processes tend not to hear from other processes, subsequently not storing their offsets. As I increases, θ remains the same, since δ remains the same. When δ and E are constant, a message either remains in the same epoch (mx) as it would under the previous choice of I, or a message that spanned different intervals under a smaller I now falls within the same interval. Overall, this does not change the total θ. This is illustrated in Figure 6.11.

6.2.4 Analysis of Varying Message Rate (α) for NS-3

As observed in the custom simulator, lower α values produce lower θ, due to infrequent communication. We additionally observe that θ remains constant with increasing I, consistent with our results from the CDES in the previous section. This is illustrated in Figure 6.12.

Figure 6.11 Custom Simulator: θ vs I when varying α, E = 2 ms, δ = 8 μs; panels (a) n = 32, (b) n = 64.

Figure 6.12 NS-3 Simulator: θ vs I when varying α, E = 3 ms, δ = 2 ms; panels (a) n = 32, (b) n = 64.

6.3 Effect of Message Delay (δ)

In this section, we observe the effect of δ on θ while fixing E and I.

6.3.1 Analysis of Varying Message Delay (δ) for the CDES

Here, we observe that the value of δ has minimal effect on θ. Specifically, as shown in Figure 6.13, the value of θ increases as the value of α increases; however, for a fixed value of α, θ remains the same. When E and I are fixed, the only way θ would decrease is if processes sent messages that were received after the E limit. Because we enforce the limit δ ≤ E, this limit is never exceeded, and θ hence remains the same. This is illustrated in Figure 6.13.
Figure 6.13 Custom Simulator: θ vs δ when varying α, E = 4 ms, I = 16 μs; panels (a) n = 32, (b) n = 64.

6.3.2 Analysis of Varying Message Delay (δ) for NS-3

We gather similar results in the case of NS-3 as we did with the CDES, which validates our observations. This is illustrated in Figure 6.14. We also notice that for higher values of α, θ increases drastically. Hence, it is important to note the feasibility of the RepCl: mainly, that it is advantageous to use it in a setting of low α (message rates). This is covered further in the following section.

Figure 6.14 NS-3 Simulator: θ vs δ when varying α, E = 3 ms, I = 20 μs; panels (a) n = 32, (b) n = 64.

6.4 Feasibility Regions

In this section, we review the simulations to define the notion of feasible regions. As discussed earlier, the goal of RepCl is to enable the replay of a distributed computation with a small overhead. Here, we consider the case where the user identifies the expected overhead of RepCl to identify scenarios under which RepCl can be used to provide a perfect replay that meets all the requirements from Section 4.6. Since the overhead of the counters remains virtually unchanged, we focus only on the overhead of the number of offsets, i.e., the value of θ.

For θ = 8, the feasibility regions are shown in Figure 6.15a. Here, the blue dots identify the data points where θ = 8 is feasible and the red dots represent the data points where θ = 8 is not feasible. The green line identifies the bounds within which θ = 8 is feasible. We find that the size of the feasible region remains fairly unchanged with the value of n. However, it shrinks when the value of E is increased. This is expected based on how θ changes with E. We note that the feasibility region only identifies the case where perfect replay meets all the requirements from Section 4.6. If the user needs to utilize RepCl in an infeasible region, the user can obtain partial replay.
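The feasibility classification behind Figure 6.15 can be sketched as follows. This is an illustrative reconstruction rather than part of RepCl itself: `measured_theta` is a hypothetical lookup standing in for the θ values measured in the CDES/NS-3 runs, and `theta_max` is the user's offset budget (8 in Figure 6.15).

```python
# Illustrative sketch of the feasibility test in Section 6.4.
# measured_theta is a hypothetical stand-in for the simulation results;
# theta_max is the user's permissible average offset count (e.g., 8).

def feasible(alpha, delta, theta_max, measured_theta):
    """A parameter point (alpha, delta) is feasible if the measured
    average number of offsets stays within the budget."""
    return measured_theta(alpha, delta) <= theta_max

def feasibility_region(alphas, deltas, theta_max, measured_theta):
    """Classify every grid point, as plotted in Figure 6.15
    (feasible points in blue, infeasible ones in red)."""
    return {(a, d): feasible(a, d, theta_max, measured_theta)
            for a in alphas for d in deltas}
```

A point on the green boundary of Figure 6.15 is then simply the largest α (for a given δ) that this predicate still accepts.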
To understand this, consider the case where the actual value of E is 4 ms but the user specifies it to be 2 ms while constructing RepCl. In this case, if e and f are within 2 ms, then RepCl will allow them to be replayed in any order. However, if f occurred 3 ms after e, then e will always be replayed before f. We anticipate that even in a system where the clock skew is 4 ms, the actual clock skew at a given moment is likely to be smaller than 4 ms. This implies that the forced order between e and f will be quite infrequent. Hence, we anticipate that RepCl will be applicable even in domains where the system parameters cause it to fall in an infeasible region; this is discussed further in Section 8.3.

Figure 6.15 Feasibility regions for α and δ settings, θ = 8: (a) E = 1 ms, n = 32; (b) E = 2 ms, n = 32; (c) E = 1 ms, n = 64; (d) E = 2 ms, n = 64.

CHAPTER 7 VISUALIZING TRACES WITH REPVIZ

We have now seen the various intricacies of the RepCl infrastructure, and we now describe how to apply it to visualization systems. The goal of a visualization is simple: to show the user a view of the computation both from an overall system perspective and a per-process view of how the computation looked on that specific node. Various implementations of visualizers (cf. Section 8.2) achieve this, but in all of them the underlying timestamping algorithm enforces an order between concurrent events. We therefore need a visualizer that allows the user to choose the order of concurrent events and view multiple execution traces simultaneously.

In this chapter, we introduce RepViz, the third infrastructure component of this work. RepViz is a visualizer that works on top of a log generated by the RepCl. It takes in a RepCl-timestamped log and generates a web-based visualization of the traces the algorithm generates.
The user has the option to replay events that are concurrent in any order, while the other events are ordered according to the RepCl. Once the user has selected a replay order, a web visualization is displayed for that trace. The web visualization is a work in progress, but a sample representation is depicted in Figure 7.1. In the following sections, we go over the different methods that went into implementing RepViz.

7.1 Implementation

The visualizer defines a top-level component, called Tracer. This component contains the following functions and members:

• SortEvents(): Sorts all events according to their RepCl timestamp.

• RunReplay(): The top-level function that provides an interactive view to the user to replay events.

• EventList: The list of events obtained from a RepCl-timestamped log.

A Tracer object contains a set of Event objects. Each Event object has the following properties:

• EventID: The message ID.

• EventType: The type of event, i.e., Send, Local, or Recv.

• EventTime: The RepCl timestamp of this event.

• Sender: The IP address of the sender.

• Receiver: The IP address of the receiver.

The Sender and Receiver fields are populated as (Sender, Receiver) for a Send/Local event, and as (Receiver, Sender) for a Recv event. Now, we go into detail on how each of the Tracer methods is implemented.

7.1.1 SortEvents

The Tracer first orders all the events according to the RepCl timestamp. Events are sorted based on their RepCl timestamps fed by the logs of the algorithm. The sorting algorithm sorts all timestamps by the happens-before (hb) relation discussed in Chapter 2. Once the ordering is set, the events are given to the matching function. The sorting rules are described in Algorithm 4.11. The algorithm compares two RepCl timestamps t1 and t2, and returns True if t1 < t2 and False otherwise. The other comparisons can be implicitly derived from the same algorithm.
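As a concrete illustration, the event model and SortEvents() described above can be sketched in Python. This is a minimal sketch, not the RepViz implementation: the field names follow the text, but the comparison rule only approximates Algorithm 4.11 (which is not reproduced here), ordering timestamps by their max epoch and then their counters.

```python
from dataclasses import dataclass
from functools import cmp_to_key

# Minimal sketch of the Tracer's event model and SortEvents().
# The lt() rule below is an approximation of Algorithm 4.11.

@dataclass
class RepClTime:
    node_id: str          # NodeId, e.g. "10.1.1.3"
    hlc: int              # max epoch known to the process
    offsets: list         # per-process offsets from the max epoch
    counters: int         # tie-breaking counter within an epoch

@dataclass
class Event:
    event_id: int         # EventID: the message ID
    event_type: str       # "SEND", "RECV", or "LOCAL"
    event_time: RepClTime
    sender: str
    receiver: str

def lt(t1: RepClTime, t2: RepClTime) -> bool:
    """Approximation of 'returns True if t1 < t2': compare epochs first,
    then counters; anything else is treated as concurrent."""
    if t1.hlc != t2.hlc:
        return t1.hlc < t2.hlc
    return t1.counters < t2.counters

def sort_events(events):
    """SortEvents(): order events by the hb relation induced by lt()."""
    def cmp(e1, e2):
        if lt(e1.event_time, e2.event_time):
            return -1
        if lt(e2.event_time, e1.event_time):
            return 1
        return 0              # concurrent: relative order left open
    return sorted(events, key=cmp_to_key(cmp))
```

Since Python's sort is stable, mutually concurrent events (where cmp returns 0) keep their log order, leaving their final order to the user during replay.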
7.1.2 RunReplay

Once the events are ordered, the Tracer runs the replay according to the event timeline generated by the sorter. Every concurrent event pool is given to the user as a choice to replay any one of the outstanding events waiting to be replayed. Once an event has been replayed, it is removed from the replay pool. This follows Algorithm 3.1.

7.2 User View

In the current iteration of RepViz, the prototype runs on an ASCII terminal. Here is a brief output of a sample trace snippet generated through an NS-3 simulation:

[(EventID=1, EventType=SEND, EventTime=[(NodeId=10.1.1.3, HLC=21, Offsets=[-15, -15, 0, -15, -15], Counters=0)], Sender=10.1.1.3, Receiver=10.1.1.4)]
[(EventID=2, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=42, Offsets=[0, -15, -15, -15, -15], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.2)]
[(EventID=3, EventType=SEND, EventTime=[(NodeId=10.1.1.4, HLC=44, Offsets=[-15, -15, -15, 0, -15], Counters=0)], Sender=10.1.1.4, Receiver=10.1.1.5)]
[(EventID=4, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=55, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
[(EventID=4, EventType=RECV, EventTime=[(NodeId=10.1.1.3, HLC=55, Offsets=[-15, 0, 0, -15, -15], Counters=1)], Sender=10.1.1.3, Receiver=10.1.1.2)]
[(EventID=5, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=57, Offsets=[0, -15, -15, -15, -15], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.2)]
[(EventID=6, EventType=SEND, EventTime=[(NodeId=10.1.1.3, HLC=59, Offsets=[-15, 4, 0, -15, -15], Counters=0)], Sender=10.1.1.3, Receiver=10.1.1.4)]
[(EventID=7, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=61, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]
[(EventID=7, EventType=RECV, EventTime=[(NodeId=10.1.1.1, HLC=61, Offsets=[0, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.1, Receiver=10.1.1.5)]
[(EventID=8, EventType=SEND, EventTime=[(NodeId=10.1.1.1, HLC=61, Offsets=[0, -15, -15, -15, 0],
Counters=1)], Sender=10.1.1.1, Receiver=10.1.1.2)]

Concurrent events detected!
0. [(EventID=9, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
1. [(EventID=10, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=62, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]
2. [(EventID=2, EventType=RECV, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[0, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.1)]
Please choose the event to replay: 0
[(EventID=9, EventType=SEND, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=0)], Sender=10.1.1.2, Receiver=10.1.1.3)]
Please choose the event to replay: 2
[(EventID=2, EventType=RECV, EventTime=[(NodeId=10.1.1.2, HLC=62, Offsets=[-15, 0, -15, -15, -15], Counters=1)], Sender=10.1.1.2, Receiver=10.1.1.1)]
Please choose the event to replay: 1
[(EventID=10, EventType=SEND, EventTime=[(NodeId=10.1.1.5, HLC=62, Offsets=[-15, -15, -15, -15, 0], Counters=0)], Sender=10.1.1.5, Receiver=10.1.1.1)]

The above log transforms into the visualization on a WebUI as shown in Figure 7.1. In Figure 7.1a, the user does not have a choice in the replay, as all events are ordered by the hb relation detailed in Section 4.5 in the absence of concurrency. The user simply presses the right arrow key to move forward in the replay. At some point in the replay, the user encounters concurrent events, depicted in Figure 7.1b. Here, the user elects to replay event 1 by providing an input of 1 through the keyboard. The event marked 1 is added to the replay. Next, the user has to choose between events 2 and 3, depicted in Figure 7.1c. The user chooses event 2, and the last event left to replay is event 3, which is replayed in Figure 7.1d. With this visualization, it is easy for a user to try different combinations of replay and analyse how parameters change with the chosen event order.
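The prompt-and-choose loop shown in the transcript above can be sketched as follows. This is an illustrative sketch, not the RepViz code: `happens_before` stands in for the RepCl comparison of Algorithm 4.11, and `choose` models the user's keyboard input when a concurrent pool is detected.

```python
# Sketch of RunReplay()'s replay loop: events with no pending predecessor
# form the current pool; if the pool has more than one member, the
# concurrent events are offered to the user, mirroring the
# "Please choose the event to replay" prompt above.

def run_replay(events, happens_before, choose):
    pending = list(events)
    replayed = []
    while pending:
        pool = [e for e in pending
                if not any(happens_before(o, e)
                           for o in pending if o is not e)]
        pick = pool[0] if len(pool) == 1 else choose(pool)
        pending.remove(pick)      # replayed events leave the pool
        replayed.append(pick)
    return replayed
```

Any function can play the role of `choose`, which is what makes it possible to explore every admissible interleaving of the same trace.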
The visualization can also generate exhaustive logs of all possible replays, should the user require it.

Figure 7.1 Sample Visualization: (a) no concurrency conflicts, the user presses the right arrow key; (b) concurrency conflicts detected, event 1 is replayed first; (c) event 1 replayed, event 2 next; (d) event 2 replayed, event 3 next. The user uses arrow keys to replay events that have no concurrency conflicts, and number keys to input which events to replay in which order.

CHAPTER 8 RELATED WORK AND DISCUSSION

8.1 Clocks in Distributed Systems

Logical clocks were proposed in 1978 by Leslie Lamport [3] to trace the ordering of events in a distributed system. Vector time was designed independently by multiple researchers [12][7][13], who proposed the idea of representing time in a distributed system as a set of n-dimensional non-negative integer vectors. According to [14], vector clocks are defined by three properties: isomorphism, strong consistency, and event counting. Isomorphism suggests that if two events x and y have timestamps vh and vk, respectively, then x → y ⇔ vh < vk. Here, → denotes a partial ordering between a set of events. Strong consistency implies that by examining the vector timestamps of two events, we can determine the causal relationship between the two events. Event counting suggests that if d is always 1 in rule R1, then the ith component of the vector clock at process pi, vt_i[i], denotes the number of events that have occurred at pi until that instant. There have been several prior implementations of vector clocks, including the Singhal-Kshemkalyani differential technique [15] and the Fowler-Zwaenepoel direct dependency technique [16].
While vector clocks are bounded by O(n) complexity in both time and memory, past implementations have tried to reduce this complexity and generate more efficient representations, with some success. The Singhal-Kshemkalyani differential technique relies on piggybacking only the entries changed since the last send, without transmitting the entire vector clock. This method relies on the assumption that even though the number of processes is large, only a few key processes in a system interact frequently by passing messages. A benefit of this method is that it cuts down the storage overhead at each process to O(n). However, it does not make a substantial contribution to reducing the time complexity incurred when updating the vector clock, as it relies on piggybacking to work.

The Fowler-Zwaenepoel direct-dependency technique cuts down storage complexity by reducing the message size during transmission, transmitting only a scalar value in each message. Here, a process only maintains information regarding direct dependencies on other processes. The downside of this method is its high computational overhead, as it has to trace dependencies and update the vector clock, especially in systems where a few key processes may have a large number of events.

Clock synchronization using the Network Time Protocol (NTP) uses the offset delay estimation method to ensure physical clocks are synchronized across the internet. Clock offsets and delays are calculated, and timestamps are issued between different machines within a system accordingly. The system then attempts to establish a causal relationship by using the corrected timestamps. This, however, can be computationally expensive and is open to error, as the delay estimation may not always be accurate and may result in a violation of the causal relationship between processes issued by different machines.
One existing limitation is the difficulty of reconciling vector clocks, which represent logical time, with physical clock synchronization. To overcome this challenge, hybrid logical clocks were introduced by Kulkarni et al. [4] to capture the causality relationship of a logical clock with the characteristics of a physical clock embedded into it. Another variant of the hybrid clock is the Hybrid Vector Clock [17], [1], which, unlike the Hybrid Logical Clock, can provide all possible/potential consistent snapshots for a given time. For this work, we use the Hybrid Vector Clock design presented by Yingchareonthawornchai et al. [1], as it provides desirable characteristics for building our visualization framework.

8.2 Visualizing Traces

Mattern [7] discusses how distributed systems use the concept of global state to communicate information, and the need to characterize this global state. The author observes that a process can only approximate the global view of the system, and that no process can have, at any given instant, a consistent view of the global state. To verify a distributed system, the author compares three key approaches: simulating a synchronous distributed system given an asynchronous system, simulating a common clock, or simulating a global state. The author highlights the need for a vector clock system to provide a consistent snapshot of the global state, as each process having a clock that stores only its own state is not enough to describe the global state of the computation.

PARAVER [18] uses the PVM message-passing library to analyze traces generated from a computation. PVM primarily uses parallel message passing, and PARAVER analyzes these parallel traces using data analytics and provides a graphical description of the analysis. This was one of the earliest visualization works on distributed systems, simulated only for parallel traces.
It used a parser to go through the logs of the PVM-generated traces and analyze CPU activity, communications, and user events. This, however, required adding functionality to PVM itself and was not generalized to any distributed system interface. It also did not provide generic support and incurred a larger overhead to profile system resources while the computation went on.

VAMPIR [12] provided analysis of MPI programs by generating timeline traces from profiled MPI applications. It used different visualization metrics to show whether processes were still active. It also provided views of system activities and aggregated statistics about the system itself. However, it was made specifically for MPI applications and added to the profiling interfaces of MPI.

D3S [19] allowed developers to specify predicates on distributed properties of the system. These predicates can vary depending on what consistency checks one requires of the distributed system. The authors modeled the tracing as a consistency checker and generated traces of predicate evaluation. The predicates are injected dynamically at compile time into the system and are evaluated based on the customization provided by the user. However, we believe this approach adds overhead to running the distributed computation due to complex predicate checking.

Zinsight [20] provides hierarchies of tasks and aggregated metrics to show timeline visualizations of events. It also allows users to change the granularity of the metric they want to see, with sequences of computations per process.

Trumper et al. [21] present a dynamic analysis tool that uses boundary tracing and post-processing to analyze system behavior through a distributed computation. These are task-based visualizations, where tasks are mapped to memory resources. However, this mapping may not hold where processes share the same memory, as in the case of OpenMP-based infrastructures.
Dapper [22] is Google's tracing software for distributed systems, providing low overhead, application-level transparency, and scalability. Dapper uses annotations and spans to generate traces through RPCs. However, the authors mention that Dapper cannot correctly point to causal history, as it uses annotations in non-standard control primitives that may mislead the causality calculations. Our approach overcomes this, as causality is enforced through a lattice of clocks rather than the events themselves.

Isaacs et al. [23] provide a comprehensive survey of distributed monitoring and tracing tools of the past decade, with detailed descriptions and categorizations based on task parallelism, causality information, and so on. Isaacs, Bremer et al. [24] design a trace visualization system relying purely on logical clocks and then transposing those clocks back to real-time clocks in the visualization. Processes are also clustered based on logical behavior. However, this incurs more overhead than our solution and may cause conflicts in enforcing causality, due to the usage of a standard logical clock.

Verdi [25] lets developers choose the fault system to diagnose and verify the implementation of the system. It is a formal verification system that provides the developer with an idealized fault model; once this is verified, it applies the correctness to a more realistic fault model.

ShiViz [26] uses vector clocks to generate distributed system traces using happens-before relationships. By using vector clocks, it provides a verifiable and accurate notion of causality. However, since it uses traditional vector clocks, it incurs a higher complexity than our proposed model.

8.3 Discussion

In Chapter 6, we identified feasible regions for the given permissible overhead of RepCl. Thus, the natural question is: what can a user do if the given system parameters fall into the infeasible region?
Here, observe that E provides one way to reduce the overhead if we accept some imperfect replay. To explain this, consider a system whose actual clock skew is E_a. If the user implements RepCl with E < E_a, then the resulting replay will still satisfy requirements 1 and 2 (cf. Section 4.6). Requirement 3 will be satisfied with E2 = E - I.

Figure 8.1 Effect of using E instead of E_a in RepCl.

Looking at this situation closely, we observe that the clock skew between two processes follows the structure shown in Figure 8.1. Specifically, at a given instant, the clocks of two processes j and k differ by some amount that is less than E_a. However, the actual clock difference at a fixed point in time (which is not visible to either j or k) is often less than E_a. Hence, if e and f occurred at the same global time, the probability that the respective clocks differ by less than E depends upon the area of the shaded part (cf. Figure 8.1). In this case, e and f would still be replayed in arbitrary order. Only if the clocks fall in the non-shaded area will the replay force an order between e and f. In other words, even if the system parameters fall in the infeasible region, it is possible to use a RepCl that provides a valid replay; it just will not be able to reproduce all possible replays.

CHAPTER 9 CONCLUSION AND FUTURE WORK

In this thesis, we focused on the problem of replay clocks in systems where clocks are synchronized to be within E. The purpose of these clocks is to reproduce a distributed computation with all its certainties and uncertainties. By certainty, we mean that if event e must have happened before f, then the replay must ensure that e is replayed before f. Specifically, this requires that if e happened before f (as defined in [3]), or f occurred E1 ≈ E time after e, then e must occur before f.
And, by uncertainty, we mean that if e and f could occur in any order, then the replay should not force an order between them. Specifically, if e || f (as defined in [3]) and e and f occurred within time E2 ≈ E, then the replay permits them to be replayed in any order. We presented RepCl to solve the replay problem with E1 = E + I and E2 = E - I, where I is a parameter to RepCl. We analyzed RepCl for various system parameters (clock skew E, message rate α, message delay δ). We find that for various system parameters, the size of RepCl and the overhead to create and/or compare timestamps are small.

For the purpose of replay, RepCl provides several advantages over existing approaches. For example, unlike logical clocks, it does not force unneeded event orderings. It has a significantly lower overhead than vector clocks. Also, it does not generate the illegitimate replays that can occur with the use of vector clocks. The overhead of RepCl at a process j depends upon the number of processes that communicate with j (directly or indirectly) within an E window. This differs from vector clocks, where the overhead is always proportional to the number of processes in the system.

With the design of the RepCl, we ensured that the clock size is not a leading factor in slowing down computation. The RepCl is a non-invasive method to ensure that causality is maintained in the presence of skewing clocks, which is particularly useful in various applications. To facilitate ease of development with this clock, we provide an API and a sample implementation of the API in NS-3, a widely used distributed network simulator. With the help of NS-3, we illustrate the various invariants our clock provides, such as how its size scales while varying various parameters. We also identify feasibility regions that would provide perfect replay through the clock.
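The guarantees E1 = E + I and E2 = E - I recalled above can be sketched for two causally unrelated events, assuming we know their true physical occurrence times (something no process can observe directly; the sketch is illustrative only):

```python
# Sketch of the replay guarantees for two concurrent events e and f,
# given their (hypothetically known) physical times t_e and t_f.

def replay_order(t_e, t_f, E, I):
    """'forced'  : f is more than E1 = E + I after e, so e replays first;
       'free'    : the events are within E2 = E - I, either order is allowed;
       'depends' : the band in between, where interval alignment decides."""
    gap = t_f - t_e
    if gap > E + I:
        return "forced"
    if abs(gap) < E - I:
        return "free"
    return "depends"
```

The width of the middle band, 2I per side, is exactly the price paid for coarsening clock values into intervals of size I.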
For systems outside this feasible region, we additionally provide techniques to approximate a replay that is acceptable.

We have utilized RepCl to enable users to visualize a distributed computation using our tool, RepViz. The goal of the visualization is to allow the user to identify an event f where a failure occurred. Then, they can use a replay of events just preceding f to determine whether the error goes away. If it does, that would imply a synchronization error. Likewise, a user can replay some portion of the computation; since the replay of events may occur in a different order, this helps identify potential synchronization errors.

RepCl is designed mainly for offline analysis, where the event data is stored during execution and analyzed at a later time. However, RepCl can be used for run-time monitoring/analysis as well: if the data related to the timestamps is sent to a monitor, that monitor could analyze it for potential properties of interest, provided the analysis can be done quickly. A key challenge in this context is whether the run-time monitors can keep up with the execution of the system.

There are several future directions for RepCl. If the size of RepCl needed for perfect replay is too large, the user can reduce the size of RepCl by choosing a lower value of E. In this case, the resulting replay will force some ordering between concurrent events. One future work is to identify the effect of reducing E in this manner.

Another potential extension is the ability to evaluate different veins of execution for different properties. Currently, it is difficult to compare and contrast different traces of execution for a specific invariant. With the help of RepViz, users can identify key characteristics of how data moves between processes, and whether there is an efficient way to coordinate movement.
Another possible characteristic that could be measured is resource utilization and how it is affected by different veins of execution. Specifically, we aim to answer the question: are there stark differences when one event is replayed before another, based on the amount of work needed to perform that event? This would give better methodologies for evaluating data-movement operations. Furthermore, we intend to extend this clock structure beyond 64 processes. A potential avenue is to implement it on a hierarchical structure. If a network is structured as a network of switches, with each switch connected to a cluster of nodes, we can implement a replay clock at each level independently. All we would need is a mechanism to merge the clocks coming from the cluster into the switch and have the switch relay clock information from its cluster to the other clusters on the network.

BIBLIOGRAPHY

[1] S. Yingchareonthawornchai, D. N. Nguyen, S. S. Kulkarni, and M. Demirbas, "Analysis of bounds on hybrid vector clocks," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 9, pp. 1947–1960, 2018.

[2] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan, "Observer effect and measurement bias in performance analysis," Computer Science Technical Reports CU-CS-1042-08, University of Colorado, Boulder, 2008.

[3] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, July 1978.

[4] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in Principles of Distributed Systems: 18th International Conference, OPODIS 2014, Cortina d'Ampezzo, Italy, December 16–19, 2014, Proceedings 18, pp. 17–32, Springer, 2014.

[5] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978.

[6] C. J.
Fidge, "Timestamps in message-passing systems that preserve the partial ordering," in Proceedings of the 11th Australian Computer Science Conference (ACSC) (K. Raymond, ed.), pp. 56–66, 1988.

[7] F. Mattern et al., Virtual Time and Global States of Distributed Systems. Univ., Department of Computer Science, 1988.

[8] D. L. Mills, "Internet time synchronization: the network time protocol," IEEE Transactions on Communications, vol. 39, no. 10, pp. 1482–1493, 1991.

[9] H. Schildt, "C++ complete reference," 1998.

[10] G. F. Riley and T. R. Henderson, "The ns-3 network simulator," in Modeling and Tools for Network Simulation, pp. 15–34, Springer, 2010.

[11] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in International Conference on Principles of Distributed Systems, pp. 17–32, Springer, 2014.

[12] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," 1996.

[13] F. B. Schmuck, "The use of efficient broadcast protocols in asynchronous distributed systems," tech. rep., Cornell University, 1988.

[14] A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, 2011.

[15] M. Singhal and A. Kshemkalyani, "An efficient implementation of vector clocks," Information Processing Letters, vol. 43, no. 1, pp. 47–52, 1992.

[16] J. Fowler and W. Zwaenepoel, "Causal distributed breakpoints," in Proceedings of the Tenth International Conference on Distributed Computing Systems, 1990.

[17] M. Demirbas and S. Kulkarni, "Beyond TrueTime: Using AugmentedTime for improving Spanner," Aug, vol. 23, pp. 1–5, 2013.

[18] V. Pillet, J. Labarta, T. Cortes, and S. Girona, "Paraver: A tool to visualize and analyze parallel code," in Proceedings of WoTUG-18: Transputer and Occam Developments, vol. 44, pp. 17–31, 1995.

[19] X. Liu, Z. Guo, X. Wang, F. Chen, X. Lian, J.
Tang, M. Wu, M. F. Kaashoek, and Z. Zhang, "D3S: Debugging deployed distributed systems," in NSDI, 2008.

[20] W. De Pauw and S. Heisig, "Zinsight: A visual and analytic environment for exploring large event traces," in Proceedings of the 5th International Symposium on Software Visualization, pp. 143–152, 2010.

[21] J. Trümper, J. Bohnet, and J. Döllner, "Understanding complex multithreaded software systems by using trace visualization," in Proceedings of the 5th International Symposium on Software Visualization, pp. 133–142, 2010.

[22] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," 2010.

[23] K. E. Isaacs, A. Giménez, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, B. Hamann, and P.-T. Bremer, "State of the art of performance visualization," EuroVis (STARs), 2014.

[24] K. E. Isaacs, P.-T. Bremer, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, and B. Hamann, "Combing the communication hairball: Visualizing parallel execution traces using logical time," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2349–2358, 2014.

[25] J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock, X. Wang, M. D. Ernst, and T. Anderson, "Verdi: A framework for implementing and formally verifying distributed systems," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 357–368, 2015.

[26] I. Beschastnikh, P. Wang, Y. Brun, and M. D. Ernst, "Debugging distributed systems," Communications of the ACM, vol. 59, no. 8, pp. 32–37, 2016.