ENGINEERING SCALABLE DIGITAL MODELS TO STUDY MAJOR TRANSITIONS IN EVOLUTION

By

Matthew Andres Moreno

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy
Ecology, Evolutionary Biology and Behavior - Dual Major

2022

ABSTRACT

Evolutionary transitions occur when previously-independent replicating entities unite to form more complex individuals. Such major transitions in individuality have profoundly shaped complexity, novelty, and adaptation over the course of natural history. Regard for their causes and consequences drives many fundamental questions in biology. Likewise, evolutionary transitions have been highlighted as a hallmark of true open-ended evolution in artificial life. As such, experiments with digital multicellularity promise to help realize computational systems with properties that more closely resemble those of biological systems, ultimately providing insights about the origins of complex life in the natural world and contributing to bio-inspired distributed algorithm design. Major challenges exist, however, in applying high-performance computing to the dynamic, large-scale digital artificial life simulations required for such work. This dissertation presents two new tools that facilitate such simulations at scale: the Conduit library for best-effort communication and the hstrat ("hereditary stratigraphy") library, which debuts novel decentralized algorithms to estimate phylogenetic distance between evolving agents.

Most current high-performance computing work emphasizes logical determinism: extra effort is expended to guarantee reliable communication between processing elements. When necessary, computation halts in order to await expected messages. Determinism does enable hardware-independent results and perfect reproducibility; however, adopting a best-effort communication model can substantially reduce synchronization overhead and allow dynamic (albeit potentially lossy) scaling of communication load to fully utilize available resources. We present a set of experiments that test the best-effort communication model implemented by the Conduit library on commercially available high-performance computing hardware. We find that best-effort communication enables significantly better computational performance under high thread and process counts and can achieve significantly better solution quality within a fixed time constraint.

In a similar vein, phylogenetic analysis in digital evolution work has traditionally used a perfect tracking model where each birth event is recorded in a centralized data structure. This approach, however, is difficult to scale robustly and efficiently to distributed computing environments where agents may migrate between a dynamic set of disjoint processing elements. To provide for phylogenetic analyses in these environments, we propose an approach to infer phylogenies via heritable genetic annotations. We introduce hereditary stratigraphy, an algorithm that enables tunable trade-offs between annotation memory footprint and accuracy of phylogenetic inference. Simulating inference over known lineages, we recover up to 85% of the information contained in the true phylogeny using only a 64-bit annotation.

We harness these tools in DISHTINY, a distributed digital evolution system designed to study digital organisms as they undergo major evolutionary transitions in individuality.
This system allows digital cells to form and replicate kin groups by selectively adjoining or expelling daughter cells. The capability to recognize kin-group membership enables preferential communication and cooperation between cells. We report group-level traits characteristic of fraternal transitions, including reproductive division of labor, resource sharing within kin groups, resource investment in offspring groups, asymmetrical behaviors mediated by messaging, morphological patterning, and adaptive apoptosis. In one detailed case study, we track the co-evolution of novelty, complexity, and adaptation over the evolutionary history of an experiment. We characterize ten qualitatively distinct multicellular morphologies, several of which exhibit asymmetrical growth and distinct life stages. Our case study suggests a loose relationship can exist among novelty, complexity, and adaptation.

The constructive potential inherent in major evolutionary transitions holds great promise for progress toward replicating the capability and robustness of natural organisms. Coupled with shrewd software engineering and innovative model design informed by evolutionary theory, contemporary hardware systems could plausibly already suffice to realize paradigm-shifting advances in open-ended evolution and, ultimately, scientific understanding of major transitions themselves. This work establishes important new tools and methodologies to support continuing progress in this direction.

Copyright by
MATTHEW ANDRES MORENO
2022

Time, funding, freedom, peace, encouragement, presumed competence, education, role models, advisorship, and colleagueship — for reparation of profound inadequacies in equitable and universal affordance of such privilege.

ACKNOWLEDGEMENTS

To my colleagues and collaborators in the DEVOLAB and BEACON, I benefited greatly from your insight and your camaraderie. Thank you. Notable mentions here include Dr. Acacia Ackles, Dr. Wolfgang Banzhaf, Cliff Bohm, Dr. Emily Dolson, Austin Ferguson, Jose Hernandez, Dr. Alex Lalejini, Dr. Josh Nahum, Dr. Anselmo Pontes, Kate Skocelas, and Dr. Anya Vostinar.

Thank you to my mentees for your valuable work. Sara Boyd and Tait Wecht made huge improvements to the Empirical library's web UI toolkit. Katherine Perry, Nathan Rizik, and Santiago Rodriguez Papa helped build software foundations for the experiments reported in this dissertation. I am grateful for the opportunity to have worked with each of you. On frustrating days with my own work, I am always glad to think of the good you're out to do. In particular, Santiago Rodriguez Papa merits special recognition for jumping in the trenches to help bring this dissertation over the finish line. Over the last four years, I have been endlessly entertained by your encyclopedic knowledge of amusing grotesqueries in society and technology. I have also been endlessly impressed by your determination and cleverness in engineering better ways to do almost everything. Thank you.

My own mentors invested time and personal support in my development. Thank you to Dr. Rex Cole, Dr. America Chambers, Dr. John Fowler, Dr. Simon Garnier, Dr. Jason Graham, Mary Peterson, and Dr. Adam Smith. Dr. America Chambers' encouraging and wise advisorship on my undergraduate thesis cemented my research interests and laid the foundation for my graduate career. I am also grateful for my time training under the devoted mathematics and computer science faculty at the University of Puget Sound. Thank you to Dr. Marisa Silver.
You are singularly responsible for getting me out the other end of middle school in one piece, with some serviceable writing ability to boot.

Without the benevolent grace of omnipotent administrative support staff, I could not have survived Michigan State University with a paycheck, health insurance, and graduation requirements. Thank you to Barbara Bloemers, Deanne Hubbell, Connie James, and Melissa Williams.

This dissertation has benefited from the advice of my committee members: Dr. Wolfgang Banzhaf, Dr. Emily Dolson, Dr. Charles Ofria, and Dr. Bill Punch. Thank you especially for refining the focus of this work (and preventing global deforestation from production of a print copy).

What hasn't been said about Dr. Charles Ofria across dozens of advisorial acknowledgments? I will add this: thank you most of all for planting yourself wholly in my corner. From the very beginning, you made clear that I had your unconditional and total support. I quickly grew to trust and rely on it. I am glad to have been able to share my challenges with you, both technical and personal. Thank you.

Thank you to my friends, family, and loved ones for your support.

This research was supported in part by NSF grants DEB-1655715 and DBI-0939454 as well as by Michigan State University through the computational resources provided by the Institute for Cyber-Enabled Research. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1424871. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

TABLE OF CONTENTS

Chapter 1 Introduction

Part I Designing Computational Infrastructure to Enable Scalable Digital Multicellularity Experiments

Chapter 2 Design and Scalability Analysis of Conduit: a Best-effort Communication Software Framework
Chapter 3 Methods to Enable Decentralized Phylogenetic Tracking in a Distributed Digital Evolution System

Part II Evolving Complexity, Novelty, and Adaptation in Digital Multicells

Chapter 4 Exploring Evolved Multicellular Life Histories in an Open-Ended Digital Evolution System
Chapter 5 A Case Study of Novelty, Complexity, and Adaptation in a Multicellular System
Chapter 6 Conclusion

BIBLIOGRAPHY

Appendix A Design and Scalability Analysis of Conduit: a Best-effort Communication Software Framework
Appendix B Methods to Enable Decentralized Phylogenetic Tracking in a Distributed Digital Evolution System
Appendix C Exploring Evolved Multicellular Life Histories in an Open-Ended Digital Evolution System
Appendix D Case Study of Novelty, Complexity, and Adaptation in a Multicellular System

Chapter 1
Introduction

Portions of this chapter are adapted from (Moreno and Ofria, 2019), (Moreno and Ofria, 2020), and (Moreno, 2020).

1.1 Major Evolutionary Transitions and Open-Ended Evolution

Emergence of new replicating entities from the union of simpler entities constitutes some of the most profound events in natural evolutionary history (Smith and Szathmary, 1997). In an evolutionary transition of individuality, a new, more complex replicating entity is derived from the combination of cooperating replicating entities that have irrevocably entwined their long-term fates (West et al., 2015). Eusocial insect colonies and multicellular organisms exemplify this phenomenon (Smith and Szathmary, 1997). Such transitions in individuality are essential to the evolution of the most complex forms of life. As such, these transitions have been highlighted as key research targets with respect to the question of open-ended evolution (Banzhaf et al., 2016; Ray and Thearling, 1996).

In particular, this dissertation focuses on fraternal transitions in individuality — events where closely-related kin come together or stay together to form a higher-level organism (Queller, 1997). Potential evolvability properties of fraternal collectives make them an attractive evolutionary substrate. Multicellular bodies configured through generative development (i.e., with indirect genetic representation) can promote scalable properties (Lipson et al., 2007) such as modularity, regularity, and hierarchy (Clune et al., 2011; Hornby, 2005). Developmental processes may also promote canalization (Stanley and Miikkulainen, 2003), for example through exploratory processes and compensatory adjustments (Gerhart and Kirschner, 2007).

Scientific understanding of fraternal transitions in individuality benefits from experimental work probing the origins of multicellularity. In the biological domain, Ratcliff et al. have demonstrated evolution of multicellularity in yeast, deriving fraternal clusters of cells that cling together in order to maximize their settling rate (Ratcliff et al., 2012). The contributions of Goldsby and collaborators are particularly notable among computational artificial life work on the origins of multicellularity. Goldsby's work extends the Avida model system (Ofria et al., 2009), breaking the toroidal grid into isolated pockets where colonies are grown up from a single progenitor cell. Direct selection for collective, colony-level characteristics drives evolution of cooperative cellular traits characteristic of a transition to colony-level individuality. When a colony meets selection criteria, a propagule from that colony is inoculated into a freshly-cleared population slot. Cells explicitly self-designate eligibility to parent a propagule. This clear distinction between somatic and gametogenic modes of reproduction has proven particularly useful in experiments studying the origin of soma (Goldsby et al., 2014) and multicellular entrenchment (Goldsby et al., 2020). Other work by Goldsby et al. has investigated the evolution of division of labor (Goldsby et al., 2012, 2010) and the evolution of morphological development (Goldsby et al., 2017).
This dissertation builds on Goldsby's work by relaxing simulation constraints to enable broad genetic determination of multicellular life history and allowing for unconstrained cellular interactions between multicellular bodies. This approach enables new perspectives in digital evolution work, especially with respect to biotic interactions.

1.2 Digital Evolution Models

Digital evolution techniques complement traditional wet-lab evolution experiments by enabling researchers to address questions that would be otherwise limited by:

• reproduction rate (which determines the number of generations that can be observed in a set amount of time),
• incomplete observations (every event in a digital system can be tracked),
• physically-impossible experimental manipulations (any event in a digital system can be arbitrarily altered), or
• resource- and labor-intensity (digital experiments can be automated).

Despite their versatility and rapid generational turnover, digital artificial life experiments generally operate at comparable or modest scales compared to laboratory biological evolution experiments. Although digital evolution techniques can feasibly simulate populations numbering in the millions, such experiments require simple agents with limited interactions. With more complex agents controlled by genetic programs, neural networks, or the like, feasible population sizes can dwindle down to thousands or even hundreds of agents. When considering major transitions to multicellularity — where individual organisms are composed of many agents — population sizes may drop to tens of organisms, far below desirable for many evolution experiments.

1.3 Putting Scale in Perspective

One example of a digital evolution platform is Avida, a popular software system for evolutionary experiments with self-replicating computer programs. In this system, a population of ten thousand digital organisms can undergo approximately twenty thousand generations (or about two hundred million individual replication cycles) per day (Ofria et al., 2009). Each flask in the Lenski Long-Term Evolution Experiment hosts a similar number of replication cycles; with an effective population size of 30 million E. coli that undergo a bit more than 6.6 doublings per day, the bacteria experience about 180 million replication events per day (Good et al., 2017). Likewise, in Ratcliff's work studying the evolution of multicellularity in S. cerevisiae, about six doublings per day occur among a population numbering on the order of a billion cells (Ratcliff et al., 2012). These numbers translate to approximately six billion cellular replication cycles elapsed per day in this system.

Although artificial life practitioners traditionally describe instances of their simulations as "worlds," with serial processing power their scale aligns (in naive terms) more with a single flask. Of course, such a comparison neglects profound disparities between Avidians and bacteria or yeast in terms of complexity. Natural organisms have vastly more information content in their genomes and their cellular state, as well as more (and more diverse) interactions with the environment and with other cells. Recent work with SignalGP has sought to address some of these shortcomings by developing digital evolution substrates suited to dynamic environmental and agent-agent interactions (Lalejini and Ofria, 2018) that more effectively incorporate state information (Lalejini et al., 2020, 2021; Moreno, 2019).
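For concreteness, the rough parity described above can be checked with back-of-envelope arithmetic, taking population size times daily generational turnover as a proxy for daily replication events (a coarse approximation using the round figures quoted above; the LTEE estimate in particular is approximate):

\[
\underbrace{10^{4}\ \text{organisms} \times 2\times10^{4}\ \tfrac{\text{generations}}{\text{day}}}_{\text{Avida}} = 2\times10^{8}\ \tfrac{\text{replications}}{\text{day}},
\qquad
\underbrace{3\times10^{7}\ \text{cells} \times 6.64\ \tfrac{\text{doublings}}{\text{day}}}_{\text{LTEE flask}} \approx 2\times10^{8},
\qquad
\underbrace{10^{9} \times 6}_{\text{yeast}} = 6\times10^{9}.
\]

By this coarse measure, a serial Avida instance and a single LTEE flask sustain daily replication throughput of the same order of magnitude, with the yeast system roughly thirtyfold beyond both.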
However, more sophisticated and interactive evolving agents will necessarily consume more CPU time on a per-replication-cycle basis — further shrinking the magnitude of experiments tractable with serial processing.

1.4 Thesis Statement

Scalable digital evolution systems leveraging best-effort communication will enable us to study key phenomena associated with open-ended evolution: the origins of novel traits and behaviors, complex organisms and ecologies, and major evolutionary transitions in individuality.

1.5 A Path of Expanding Computational Scale

The idea that orders-of-magnitude increases in compute power will open up qualitatively different possibilities with respect to open-ended evolution is both promising and well founded. Spectacular advances achieved with artificial neural networks over the last decade illuminate a possible path toward this outcome. As with digital evolution, artificial neural networks (ANNs) were traditionally understood as a versatile but auxiliary methodology — both techniques have been described as "the second best way to do almost anything" (Eiben and Smith, 2015; Miaoulis and Plemenos, 2008). However, the utility and ubiquity of ANNs has since increased dramatically. The development of AlexNet is widely considered pivotal to this transformation. AlexNet united methodological innovations from the field (such as big datasets, dropout, and ReLU) with GPU computing that enabled training of orders-of-magnitude-larger networks. In fact, some aspects of their deep learning architecture were expressly modified to accommodate multi-GPU training (Krizhevsky et al., 2012). By adapting existing methodology to exploit commercially available hardware, AlexNet spurred greater availability of compute resources to the research domain and, eventually, the introduction of custom hardware to expressly support deep learning (Jouppi et al., 2017).

Notably within the domain of artificial life, David Ackley has envisioned an ambitious design for modular distributed hardware at a theoretically unlimited scale (Ackley and Cannon, 2011). Progress toward realizing artificial life systems with such indefinite scalability seems likely to unfold as incremental achievements that spur additional interest and resources in a positive feedback loop with the development of methodology, software, and eventually specialized hardware to take advantage of those resources. In addition to developing hardware-agnostic theory and methodology, we believe that pushing the envelope of open-ended evolution will analogously require designing systems that leverage existing commercially-available parallel and distributed compute resources at circumstantially-feasible scales.

1.6 The Future is Parallel

Throughout much of the 20th century, serial processing enjoyed regular advances in computational capacity due to quickening clock cycles, burgeoning RAM caches, and increasingly clever packing together of instructions during execution. Since then, however, performance of serial processing has bumped up against apparent fundamental limits to the current technological foundations of computing (Sutter et al., 2005). Instead, advances in 21st century computing power have arrived largely via multiprocessing (Hennessy and Patterson, 2011, p. 55) and specialized hardware acceleration (e.g., GPU, FPGA, etc.) (Che et al., 2008).
Contemporary high-performance computing clusters link multiprocessors and accelerators with fast interconnects to enable coordinated work on a single problem (Hennessy and Patterson, 2011, p. 436). High-end clusters already make hundreds of thousands or millions of cores available. More loosely-affiliated banks of servers can also muster significant computational power. For example, Sentient Technologies notably employed a distributed network of over a million CPUs to run evolutionary algorithms (Miikkulainen et al., 2019). The availability of orders-of-magnitude greater parallel computing resources in ten and twenty years' time seems probable, whether through incremental advances with traditional silicon-based technology (Dongarra et al., 2014; Gropp and Snir, 2013) or via emerging, unconventional technologies such as bio-computing (Benenson, 2009) and molecular electronics (Xiang et al., 2016). Such emerging technologies could greatly expand the collections of computing devices that are feasible, albeit at the potential cost of component speed (Bonnet et al., 2013; Ellenbogen and Love, 2000) and perhaps also component reliability. Making effective use of massively parallel processing power may require fundamental shifts in existing programming practices.

1.7 Traditional Approaches to Digital Evolution at Scale Favor Isolation

Digital evolution practitioners have a rich history of leveraging distributed hardware. It is common practice to distribute multiple self-isolated instantiations of evolutionary runs across multiple hardware units. In scientific contexts, this practice yields replicate datasets that provide statistical power to answer research questions (Dolson and Ofria, 2017). In applied contexts, this practice yields many converged populations that can be scavenged for the best solutions overall (Hornby et al., 2006). Another established practice is to use "island models" where individuals are transplanted between populations residing on different pieces of distributed hardware. Koza and collaborators' genetic programming work with a 1,000-CPU Beowulf cluster typifies this approach (Bennett III et al., 1999).

In recent years, Sentient Technologies spearheaded evolutionary computation projects on an unprecedented computational scale, comprising over a million CPUs and capable of a peak performance of 9 petaflops (Miikkulainen et al., 2019). According to its proponents, the scale and scalability of this "DarkCycle" system was a key aspect of its conceptualization (Gilbert, 2015). Much of the assembled infrastructure was pieced together from heterogeneous providers and employed on a time-available basis (Blondeau et al., 2009). Unlike typical island models where selection occurs entirely independently on each CPU, this scheme transferred evaluation criteria between computational instances in addition to individual genomes (Hodjat and Shahrzad, 2013). Sentient Technologies also notably exploited a large pool of hardware accelerators (e.g., 100 GPUs) in work evolving neural network architectures, using them to perform each candidate architecture's costly model training and evaluation process (Miikkulainen et al., 2019).

Existing parallel and distributed digital evolution systems typically minimize interaction between simulation components on disjoint hardware. Such independence facilitates simple and efficient implementation.
This approach typically involves independent evaluation of sub-populations (i.e., island models) or individuals (i.e., primary-subordinate or controller-responder parallelism (Cantú-Paz, 2001)). Cases where evaluation of a single individual is parallelized often involve data-parallel evaluation over a set of independent test cases, which are subsequently consolidated into a single fitness profile (Harding and Banzhaf, 2007b; Langdon and Banzhaf, 2019).

However, several notable parallel and distributed digital evolution systems have incorporated rich interactions between parallelized simulation components. Harding applied GPU acceleration to cellular automata models of artificial development systems, which involve intensive interaction between spatially-distributed instantiations of a genetic program (Harding and Banzhaf, 2007a). Work on Network Tierra by Tom Ray featured arbitrary communication between digital organisms residing on different machines (Ray, 1995). More recently, in a continuation of much earlier work, Christian Heinemann's ongoing ALIEN project has leveraged GPU acceleration to perform physics-based simulation of soft body agents within a 2D arena (Heinemann, 2008).

1.8 Open-Ended Evolution at Scale Should Prioritize Interaction

We argue that open-ended artificial life systems should prioritize dynamic interactions between simulation elements situated across physically distributed hardware components.

Unlike most existing applications of distributed computing in digital evolution, open-ended evolution research demands dynamic interactions among distributed simulation elements. Many important natural phenomena, including ecologies, co-evolutionary dynamics, and social behavior, arise from interactions among individuals. Likewise, at the scale of an individual organism, developmental processes and emergent phenotypic functionality necessitate dynamic interactions. A best-effort communication model could enable maximization of available bandwidth (Byna et al., 2010) while avoiding scaling issues typically associated with communication-intensive distributed computing (Cardwell and Song, 2019). Under such a model, processes compute simulation updates unimpeded and incorporate communication from collaborating processes as it happens to become available in real time. As stochastic algorithms performing computational search with a broad set of acceptable outcomes, many digital evolution simulations are well suited to such a best-effort approach.

1.9 Digital Multicellularity Suits Distributed Computing

Multicellularity poses an attractive model to harness distributed computing power for digital evolution. The basic notion is to achieve simulation dynamics that outstrip the capabilities of individual hardware components via an interacting network of discrete cellular components simple enough to reside on individual pieces of hardware. Indeed, early thinking around composing digital organisms of differentiated components revolved around the possibility of multithreading and multiprocessing. However, this work eschewed a spatial model for cellular interaction in favor of a logical approach where "cellular" threads traversed logical space within a replicating program (Ofria et al., 1999; Ray and Hart, 2000). Only later did Goldsby's multicellularity experiments introduce a spatial model for digital multicellularity, in which cells composing each digital "multicell" occupied tiles in a unique two-dimensional subgrid (Goldsby et al., 2014).
The clonal colony of cells constituting each multicell exists within an isolated spatial domain provisioned by the simulation. Two distinct modes of reproduction occur in these experiments: (1) cells replicate within a multicell and (2) multicells reproduce by sending a single cell to found a new organism — the target multicell is sterilized then re-inoculated with the cell supplied by the parent. Although Goldsby did not pursue hardware acceleration of cell components within a multicell, such a spatial approach could facilitate parallelization. Assuming local interactions, cells in a spatial model communicate directly with relatively few other simulation elements (i.e., their neighbors). Such a limitation suits a distributed computing approach.

In fact, at truly vast scales where physical distance between hardware components limits viable communication, simulation topology that maps into three-dimensional space will become highly advantageous. (This argument is a foundational tenet of Ackley's "indefinite" scalability concept (Ackley and Cannon, 2011).) Ackley's recent work on emergent digital protocells exemplifies algorithm engineering grounded in spatial considerations with respect to potential underlying distributed physical hardware (Ackley, 2018, 2019).

The approach presented in this dissertation extends Goldsby's spatial model of digital multicellularity by developing mechanics to enable arbitrary interactions between multicells (e.g., competition, parental care for offspring, etc.) within a unified spatial realm. (The DISHTINY model incorporates other notable changes, as well, such as an event-driven genetic programming substrate and directionally-symmetric agent evaluation.)

1.10 Contributions

Deepening our scientific understanding of major evolutionary transitions in individuality provides crucial insight into how the remarkable diversity and complexity of biological life came to be and may facilitate replication of lifelike capabilities in silico. Digital evolution enables unique experimental approaches to investigate evolutionary questions, but computational limitations restrict the scope of systems that can be modeled. Such practicalities are particularly cumbersome to digital models of multicellularity. This dissertation develops and tests approaches to improve scalability of artificial life simulations and applies them to construct a scalable simulation system for digital multicellularity. We then use this system to study the relationships between major transitions, complexity, novelty, and adaptation. Contributions of this dissertation include:

• de novo production of complex multicellular organisms without employing a segregating topology to force such a transition,
• demonstrating metrics that can efficiently quantify complexity and adaptation in a system with implicit selection dynamics,
• characterizing the evolution of complexity, novelty, and adaptation of digital multicells in an open-ended system,
• implementing and evaluating techniques for general-purpose best-effort high-performance computing,
• developing and implementing new methodologies for scalable simulations of evolving digital multicells that allow for arbitrary interactions between multicells in a unified spatial realm, and
• providing a new technique for genome annotation to facilitate phylogenetic analyses in distributed digital evolution experiments.
The work described here aims to spur reciprocal innovations:

• distributed computing will expand the scope of experiments possible in artificial life systems by allowing us to evolve complex multicellular digital organisms, and
• the unique objectives and latitude of artificial life will foster novel algorithms and distributed computing techniques.

1.11 Outline

The remainder of this dissertation is divided up as follows:

Part I describes computational infrastructure developed to enable scalable digital multicellularity experiments.

• Chapter 2 presents the Conduit library for best-effort high-performance computing, experimentally demonstrating the scalability benefits of the best-effort approach, and
• Chapter 3 proposes and tests the "hereditary stratigraphy" approach to record phylogenetic information in decentralized artificial life experiments.

Although not delved into here, additional algorithm and software development work took place on regulation-enabled tag lookup and efficient event-driven virtual CPUs.

Part II reports experiments performed using the DISHTINY digital multicellularity framework.

• Chapter 4 surveys multicellular life histories evolved within the framework, and
• Chapter 5 studies the coevolution of complexity, novelty, and adaptation in a case study lineage.

Finally, Chapter 6 provides concluding remarks and describes directions in which this research should continue.

Part I
Designing Computational Infrastructure to Enable Scalable Digital Multicellularity Experiments

Chapter 2
Design and Scalability Analysis of Conduit: a Best-effort Communication Software Framework

Authors: Matthew Andres Moreno, Santiago Rodriguez Papa, and Charles Ofria

Portions of this chapter have appeared as (Moreno et al., 2021b) in the ACM Workshop on Parallel and Distributed Evolutionary Inspired Methods (WS-PDEIM) at the 2021 Genetic and Evolutionary Computation Conference (GECCO 2021) and as (Moreno et al., 2020) in the 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020) at the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020).

This chapter develops the Conduit C++ library for best-effort communication in parallel and distributed high-performance computing and tests it through a series of on-hardware experiments. We find that the best-effort approach significantly increases performance at high CPU count. Because real-time volatility affects the outcome of computation under the best-effort model, we additionally designed and measured a suite of quality of service metrics. Scaling experiments show that median quality of service generally remains stable as CPU count increases.

2.1 Introduction

The parallel and distributed processing capacity of high-performance computing (HPC) clusters continues to grow rapidly and enable profound scientific and industrial innovations (Gagliardi et al., 2019). These advances in hardware capacity and economy afford great opportunity, but also pose a serious challenge: developing approaches to effectively harness it. As HPC systems scale, it becomes increasingly difficult to write software that makes efficient use of available hardware and also provides reproducible results (or even near-perfectly reproducible results — i.e., up to effects from floating point non-transitivity) consistent with models of computation as being performed by a reliable digital machine (Heroux, 2014).
The bulk synchronous parallel (BSP) model, which is prevalent among HPC applications (Dongarra et al., 2014), illustrates the challenge. This model segments fragments of computation into sequential global supersteps, with fragments at superstep i depending only on data from strictly preceding fragments < i, often just i − 1. Computational fragments are assigned across a pool of available processing components. The BSP model assumes perfectly reliable messaging: all dispatched messages between computational fragments are faithfully delivered. In practice, realizing this assumption introduces overhead costs: secondary acknowledgment messages to confirm delivery and mechanisms to dispatch potential resends as the need arises. Global synchronization occurs between supersteps, with computational fragments held until their preceding superstep has completed (Valiant, 1990). This ensures that computational fragments will have at hand every single expected input, including those required from fragments located on other processing elements, before proceeding. So, supersteps only turn over once the entire pool of processing components has completed its work for that superstep. Put another way, all processing components stall until the most laggardly component catches up. In a game of double dutch with several jumpers, this would be like slowing the tempo to whoever is most slow-footed each particular turn of the rope.

Heterogeneous computational fragments, with some easy to process and others much slower, would result in poor efficiency under a naive approach where each processing element handled just one fragment. Some processing elements with easy tasks would finish early then idle while more difficult tasks carry on. To counteract such load imbalances, programmers can allow for "parallel slack" by ensuring computational fragments greatly outnumber processing elements or even performing dynamic load balancing at runtime (Valiant, 1990).

Unfortunately, hardware factors on the underlying processing elements ensure that inherent global superstep jitter will persist: memory access time varies due to cache effects, message delivery time varies due to network conditions, extra processing arises due to error detection and recovery, delays occur due to unfavorable process scheduling by the operating system, etc. (Dongarra et al., 2014). Power management concerns on future machines will likely introduce even more variability (Gropp and Snir, 2013). Worse yet, as we work with more and more processes, the expected magnitude of the worst-sampled jitter grows and grows — and in lockstep with it, our expected superstep duration. In the double dutch analogy, with enough jumpers, at almost every turn of the rope someone will need to stop and tie their shoe. The global synchronization operations underpinning the BSP model further hinder its scalability: irrespective of time to complete computational fragments within a superstep, the cost of performing a global synchronization operation increases with processor count (Dongarra et al., 2014).
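To make this baseline concrete, the listing below sketches a fully synchronous BSP superstep loop in MPI-flavored C++. This is an illustrative sketch rather than code from Conduit or from the benchmarks in this chapter; the two helper functions are hypothetical stand-ins for application work.

#include <mpi.h>

// Hypothetical stand-ins for application work; a real code would update
// local simulation state and exchange boundary data with neighbor processes.
void compute_local_fragments(int /*superstep*/) { /* ... */ }
void exchange_with_neighbors(MPI_Comm /*comm*/) { /* ... */ }

// A fully synchronous BSP superstep loop. No process may begin superstep
// i + 1 until every process has finished superstep i, so the collective
// advances at the pace of its most laggardly member.
void run_bsp(int num_supersteps, MPI_Comm comm) {
  for (int step = 0; step < num_supersteps; ++step) {
    compute_local_fragments(step);  // superstep i uses only data from < i
    exchange_with_neighbors(comm);  // reliable delivery: acks and resends
    MPI_Barrier(comm);              // global sync point between supersteps
  }
}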
Efforts to recover scalability by relaxing superstep synchronization fall under two banners. The first approach, termed "Relaxed Bulk-Synchronous Programming" (rBSP), hides latency by performing collective operations asynchronously, essentially allowing useful computation to be performed at the same time as synchronization primitives for a single superstep percolate through the collective (Heroux, 2014). So, the time cost required to perform that synchronization can be discounted, up to the time taken up by computational work at one superstep. Likewise, individual processes experiencing heavier workloads or performance degradation due to hardware factors can fall behind by up to a single superstep without slowing the entire collective. However, this approach cannot mask synchronization costs or cumulative performance degradation exceeding a single superstep's duration.

The second approach, termed relaxed barrier synchronization, forgoes global synchronization entirely (Kim et al., 1998). Instead, computational fragments at superstep i only wait on expected inputs from the subset of superstep i − 1 fragments that they directly interact with. Imagine a double-dutch routine where each jumper exchanges patty cakes with both neighboring jumpers at every turn of the rope. Relaxed barrier synchronization would dispense entirely with the rope. Instead, players would be free to proceed to their next round of patty cakes as soon as they had successfully patty-caked both neighbors. With n players, player 0 could conceivably advance n rounds ahead of player n − 1 (each player would be one round ahead of their right neighbor). Assuming fragment interactions form a graph structure that persists across supersteps, in the general case an individual fragment can fall behind by at most a number of supersteps equal to the graph diameter before causing the entire collective to slow (Gamell et al., 2015). Even though this approach can shield the collective from most one-off performance degradations of a single fragment (especially in large-diameter cases), persistently laggard hardware or extreme one-off degradations will ultimately still hobble efficiency. Dynamic task scheduling and migration aim to address this shortcoming, redistributing work in order to "catch up" delinquent fragments (Acun et al., 2014). With our double-dutch analogy, we could think of this as something like a team coach temporarily benching a jumper who skinned their knee and instructing the other jumpers to pick up their roles in the routine.

In addition to concerns over efficiency, resiliency poses another inexorable problem for massive HPC systems. At small scales, it can suffice to assume that failures occur negligibly, with any that do transpire likely to cause an (acceptably rare) global interruption or failure. At large scales, however, software crashes and hardware failures become the rule rather than the exception (Dongarra et al., 2014) — running a simulation to completion could even require so many retries as to be practically infeasible. A typical contemporary approach to improve resiliency is checkpointing: the system periodically records global state then, when a failure arises, progress is rolled back to the most recent global known-good state and runtime restarts (Hursey et al., 2007). Global checkpoint-based recovery is expensive, especially at scale, due to overhead associated with regularly recording global state, losing progress since the most recent checkpoint, and actually performing a global teardown and restart procedure. In fact, at large enough scales global recovery durations could conceivably exceed mean time between failures, making any forward simulation progress all but impossible (Dongarra et al., 2014).
The local failure, local recovery (LFLR) paradigm eschews global recovery by maintaining persistent state on a process-wise basis and providing a recovery function to initialize a step-in replacement process (Heroux, 2014; Teranishi and Heroux, 2014). In practice, such an approach can require keeping running logs of all messaging traffic in order to replay them for the benefit of any potential step-in replacement (Chakravorty and Kale, 2004). Returning once more to the double dutch analogy, LFLR would transpire as something like a handful of teammates pulling a stricken teammate aside to catch them up after an amnesia attack (rather than starting the entire team's routine back at the top of the current track). The intervening jumpers would have to remind the stricken teammate of a previously recorded position then discreetly re-feign some of their moves that the stricken teammate had cued off of between that recorded position and the amnesia episode.

The possibility of multiple simultaneous failures (perhaps, for example, of dozens of processes resident on a single node) poses an even more difficult, although not insurmountable, challenge for LFLR that would likely necessitate even greater overhead. One approach involves pairing up with a remote "buddy" process. The "buddy" hangs on to the focal process's snapshots and is carbon-copied on all of that process's messages in order to ensure an independently survivable log. Unfortunately, this could potentially require forwarding all messaging traffic between simulation elements coresident on the focal process to its buddy, dragging inter-node communication into some otherwise trivial simulation operations (Chakravorty and Kalé, 2007). Efforts to ensure resiliency beyond single-node failures currently appear unnecessary (Ni, 2016, p. 12). Even though LFLR saves the cost of global spin-down and spin-up, all processes will potentially have to wait for work lost since the last checkpoint to be recomputed, although in some cases this could be helped along by tapping idle hardware to take over delinquent work from the failed process and help catch it up (Dongarra et al., 2014).

Still more insidious to the reliable digital machine model, though, are soft errors — events where corruption of data in memory occurs, usually due to environmental interference (i.e., "cosmic rays") (Karnik and Hazucha, 2004). Further miniaturization and voltage reduction, which are assumed as a likely vehicle for continuing advances in hardware efficiency and performance, could conceivably worsen susceptibility to such errors (Dongarra et al., 2014; Kajmakovic et al., 2020). What makes soft errors so dangerous is their potential undetectability. Unlike typical hardware or software failures, which result in an explicit, observable outcome (i.e., an error code, an exception, or even just a crash), soft errors can transpire silently and lead to incorrect computational results without leaving anyone the wiser. Luckily, soft errors occur rarely enough to be largely neglected in most single-processor applications (except in the most safety-critical settings); however, at scale soft errors occur at a non-trivial rate (Scoles, 2018; Sridharan et al., 2015). Redundancy (be it duplicated hardware components or error correction codes) can reduce the rate of uncorrected (or at least undetected) soft errors, although at a non-trivial cost (Sridharan et al., 2015; Vankeirsbilck et al., 2015).
In some application domains with symmetries or conservation principles, the rate of soft errors (or, at least, silent soft errors) could also be reduced through so-called "skeptical" assertions at runtime (Dongarra et al., 2014), although this too comes at a cost.

Even if soft errors can be effectively eradicated — or at least suppressed to a point of inconsequentiality — the nondeterministic mechanics of fault recovery and dynamic task scheduling could conceivably make guaranteeing bitwise reproducibility at exascale effectively impossible, or at least an unreasonable engineering choice (Dongarra et al., 2014). However, the assumption of the reliable digital machine model remains near-universal within parallel and distributed algorithm design (Chakradhar and Raghunathan, 2010). Be it just costly or simply a practical impossibility, the worsening burden of synchronization, fault recovery, and error correction begs the question of whether it is viable to maintain, or even to strive to maintain, the reliable digital machine model at scale. Indeed, software and hardware that relax guarantees of correctness and determinism — a so-called "best-effort model" — have been shown to improve speed (Chakrapani et al., 2008), energy efficiency (Bocquet et al., 2018; Chakrapani et al., 2008), and scalability (Meng et al., 2009). Discussion around "approximate computing" overlaps significantly with "best-effort computing," although focusing more heavily on using algorithm design to shirk non-essential computation (i.e., reducing floating point precision, inexact memoization, etc.) (Mittal, 2016).

As technology advances, computing is becoming more distributed and we are colliding with physical limits for speed and reliability. Massively distributed systems are becoming inevitable, and indeed if we are to truly achieve "indefinite scalability" (Ackley and Cannon, 2011) we must shift from guaranteed accuracy to best-effort methods that operate asynchronously and degrade gracefully under hardware failure.

The suitability of the best-effort model varies from application to application. Some domains are clear cut in favor of the reliable digital machine model — for example, due to regulatory issues (Dongarra et al., 2014). However, a subset of HPC applications can tolerate — or even harness — occasionally flawed or even fundamentally nondeterministic computation (Chakradhar and Raghunathan, 2010). Various approximation algorithms or heuristics fall into this category, with notable work being done on best-effort stochastic gradient descent for artificial neural network applications (Dean et al., 2012; Niu et al., 2011; Noel and Osindero, 2014; Rhodes et al., 2019; Zhao et al., 2019). Best-effort, real-time computing approaches have also been used in some artificial life models (Ray, 1995). Likewise, algorithms relying on pseudo-stochastic methods that tend to exploit noise (rather than destabilize due to it) also make good candidates (Chakradhar and Raghunathan, 2010; Chakrapani et al., 2008). Real-time control systems that cannot afford to pause or retry, by necessity, fall into the best-effort category (Rahmati et al., 2011; Rhodes et al., 2019). For this dissertation we will, of course, focus on this latter case of systems well-suited to best-effort methods, as evolving systems already require noise to fuel variation.

This work distills best-effort communication from the larger issue of best-effort computing, paying it special attention and generally pretermitting the broader issue.
Specifically, we investigate the implications of relaxing synchronization and message delivery requirements. Under this model, the runtime strives to minimize message latency and loss, but guarantees elimination of neither. Instead, processes continue their compute work unimpeded and incorporate communication from collaborating processes as it happens to become available. We still assume that messages, if and when they are delivered, arrive with their contents intact.

We see best-effort communication as a particularly fruitful target for investigation. Firstly, synchronization constitutes the root cause of many contemporary scaling bottlenecks, well below the mark of thousands or millions of cores where runtime failures and soft errors become critical considerations. Secondly, future HPC hardware is expected to provide more heterogeneous, more variable (i.e., due to power management), and generally lower (relative to compute) communication bandwidth (Acun et al., 2014; Gropp and Snir, 2013); a best-effort approach suits these challenges. A best-effort communication model presents the possibility of runtime adaptation to effectively utilize available resources given the particular ratio of compute and communication capability at any one moment in any one rack.

Complex biological organisms exhibit characteristic best-effort properties: trillions of cells interact asynchronously while overcoming all but the most extreme failures in a noisy world. As such, bio-inspired algorithms present strong potential to benefit from best-effort communication strategies. For example, evolutionary algorithms commonly use guided stochastic methods (i.e., selection and mutation operators) resulting in a search process that does not guarantee optimality, but typically produces a diverse range of high-quality results. Indeed, island model genetic algorithms are easy to parallelize and have been shown to perform well with asynchronous migration (Izzo et al., 2009). Likewise, artificial life simulations commonly rely on a bottom-up approach and seek to model life-as-it-could-be evolving in a noisy environment akin to the natural world, yet distinct from it (Bonabeau and Theraulaz, 1994). Although perfect reproducibility and observability have uniquely enabled digital evolution experiments to ask and answer otherwise intractable questions (Bundy et al., 2021; Covert et al., 2013; Dolson et al., 2020; Dolson and Ofria, 2017; Fortuna et al., 2019; Goldsby et al., 2014; Grabowski et al., 2013; Lenski et al., 2003; Pontes et al., 2020; Zaman et al., 2011), the reliable digital machine model is not strictly necessary for all such work. Issues of distributed and parallel computing are of special interest within the artificial life subdomain of open-ended evolution (OEE) (Ackley and Small, 2014), which studies long-term dynamics of evolutionary systems in order to understand factors that affect potential to generate ongoing novelty (Taylor et al., 2016). Recent evidence suggests that the generative potential of at least some model systems is — at least in part — meaningfully constrained by available compute resources (Channon, 2019).

Much exciting work on best-effort computing has incorporated bespoke experimental hardware (Ackley and Williams, 2011; Chakrapani et al., 2008; Chippa et al., 2014; Cho et al., 2012; Rhodes et al., 2019). However, here, we focus on exploring best-effort communication among parallel and distributed elements within existing, commercially-available hardware.
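To illustrate what this model asks of the messaging layer, consider how one best-effort exchange step can be written directly against MPI's nonblocking primitives. The sketch below is our own illustrative rendition (handling a single channel between a pair of processes), not Conduit's implementation: the process publishes its state without waiting on delivery confirmation, then adopts whatever fresher neighbor state happens to have arrived, falling back on the last known value otherwise.

#include <mpi.h>

// One best-effort exchange step: publish our current state, then return
// the freshest neighbor state that happens to have arrived, never blocking.
double exchange_best_effort(
  double my_state, double last_known, int neighbor, MPI_Comm comm
) {
  // buffer and request persist across calls so the outgoing message
  // outlives this function's stack frame
  static double send_buf;
  static MPI_Request send_req = MPI_REQUEST_NULL;

  int done = 0;
  if (send_req != MPI_REQUEST_NULL)
    MPI_Test(&send_req, &done, MPI_STATUS_IGNORE);  // retire finished send
  if (send_req == MPI_REQUEST_NULL) {  // dispatch only when the line is
    send_buf = my_state;               // clear; otherwise skip this update
    MPI_Isend(&send_buf, 1, MPI_DOUBLE, neighbor, 0, comm, &send_req);
  }

  // drain any backlog of arrived messages, keeping only the most recent
  int flag = 1;
  while (flag) {
    MPI_Iprobe(neighbor, 0, comm, &flag, MPI_STATUS_IGNORE);
    if (flag)
      MPI_Recv(&last_known, 1, MPI_DOUBLE, neighbor, 0, comm,
               MPI_STATUS_IGNORE);
  }
  return last_known;  // possibly stale: the best-effort bargain
}

Even this minimal rendition must hand-manage buffer lifetimes and request retirement, the kind of bookkeeping a dedicated library interface could hide.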
Existing software libraries, though, do not explicitly expose a convenient best-effort communication interface for such work. As such, best-effort approaches remain rare in production software, and efforts to study best-effort communication must make use of a combination of limited existing support and the development of new software tools.

The Message Passing Interface (MPI) standard (Gropp et al., 1996) represents the mainstay for high-performance computing applications. This standard exposes communication primitives directly to the end user. MPI's nonblocking communication primitives, in particular, are sufficient to program distributed computations with relaxed synchronization requirements. Although the explicit, imperative nature of the MPI protocols enables precise control over execution, it also poses significant expense in terms of programmability. This cost manifests in terms of reduced programmer productivity and software quality, while increasing domain knowledge requirements and the effort required to tune for performance due to program brittleness (Gu and Becchi, 2019; Tang et al., 2014).

In response to programmability concerns, many frameworks have arisen to offer useful parallel and distributed programming abstractions. Task-based frameworks such as Charm++ (Kale and Krishnan, 1993), Legion (Bauer et al., 2012), Cilk (Blumofe et al., 1996), and Threading Building Blocks (TBB) (Reinders, 2007) describe the dependency relationships among computational tasks and associated data, relying on an associated runtime to automatically schedule and manage execution. These frameworks assume a deterministic relationship between tasks. In a similar vein, programming languages and extensions like Unified Parallel C (UPC) (El-Ghazawi and Smith, 2006) and Chapel (Chamberlain et al., 2007) rely on programmers to direct execution, but equip them with powerful abstractions, such as global shared memory. However, Chapel's memory model explicitly forbids data races and UPC ultimately relies on a barrier model for data transfer.

To bridge these shortcomings, we employ a new software framework, the Conduit C++ Library for Best-Effort High Performance Computing (Moreno et al., 2021b). The Conduit library provides tools to perform best-effort communication through a flexible, intuitive interface, with uniform inter-operation of serial, parallel, and distributed modalities. Although Conduit currently implements distributed functionality via MPI intrinsics, in future work we will explore lower-level protocols like InfiniBand Unreliable Datagrams (Kashyap, 2006; Koop et al., 2007).

Here, we present a set of on-hardware experiments to empirically characterize Conduit's best-effort communication model. In order to survey across workload profiles, we tested performance under both a communication-intensive graph coloring solver and a compute-intensive artificial life simulation. First, we determine whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. We considered two measures of performance: computational steps executed per unit time and solution quality achieved within a fixed-duration run window. We compare the best-effort and perfect-communication strategies across processor counts, expecting to see the marginal benefit from best-effort communication increase at higher processor counts. We focus on weak scaling, growing overall problem size proportional to processor count.
Put another way, we hold problem size per processor constant (as opposed to strong scaling, where the problem size is held fixed while processor count increases). This approach prevents interference from shifts in processes' workload profiles in observation of the effects of scaling up processor count. To survey across hardware configurations, we tested scaling CPU count via threading on a single node and scaling CPU count via multiprocessing with each process assigned to a distinct node. In addition to a fully best-effort mode and a perfect communication mode, we also tested two intermediate, partially synchronized modes: one where the processor pool completed a global barrier (i.e., they aligned at a synchronization point) at predetermined, rigidly scheduled timepoints and another where global barriers occurred on a rolling basis spaced out by fixed-length delays from the end of the last synchronization. (Our motivation for these intermediate synchronization modes was interest in the effect of clearing any potentially-unbounded accumulation of message backlogs on laggard processes.)

Second, we sought to more closely characterize variability in message dispatch, transmission, and delivery under the best-effort model. Unlike under perfect communication, real-time volatility affects the outcome of computation under the best-effort model. Because real-time processing speed degradations and message latency or loss alter inputs to simulation elements, characterizing the distribution of these phenomena across processing components and over time is critical to understanding the actual computation being performed. For example, consistently faster execution or lower messaging latency for some subset of processing elements could violate uniformity or symmetry assumptions within a simulation. It is even possible to imagine reciprocal interactions between real-time best-effort dynamics and simulation state. In the case of a positive feedback loop, the magnitude of effects might become extreme. For example, in artificial life scenarios, agents may evolve strategies that selectively increase messaging traffic so as to encumber neighboring processing elements or even cause important messages to be dropped.

We monitor five aspects of real-time behavior, which we refer to as quality of service metrics (Karakus and Durresi, 2017):

• wall-time simulation update rate ("simstep period"),
• simulation-time message latency,
• wall-time message latency,
• steadiness of message inflow ("delivery clumpiness"), and
• delivery failure rate.

In an initial set of experiments, we use the graph coloring problem to test this suite of quality of service metrics across runtime conditions expected to strongly influence them. We compare

• increasing compute workload per simulation update step,
• within-node versus between-node process placement, and
• multithreading versus multiprocessing.

We perform these experiments using a graph coloring solver configured to maximize communication relative to computation (i.e., just one simulation element per CPU) in order to maximize sensitivity of quality of service to the runtime manipulations.

Finally, we extend our understanding of performance scaling from the preceding experiments by analyzing how each quality of service metric fares as problem size and processor count grow together, a "weak scaling" experiment.
This analysis would detect a scenario where raw performance remains stable under weak scaling, but quality of service (and, therefore, potentially quality of computation) degrades.

2.2 Methods

We performed two benchmarks to compare the performance of Conduit's best-effort approach to a traditional synchronous model. We tested our benchmarks across both a multithread, shared-memory context and a distributed, multinode context. In each hardware context, we assessed performance in two algorithmic contexts: a communication-intensive distributed graph coloring problem (Section 2.2.2) and a compute-intensive digital evolution simulation (Section 2.2.1). The latter benchmark — presented in Section 2.2.1 — grew out of the original work developing the Conduit library to support large-scale experimental systems to study open-ended evolution. The former benchmark — presented in Section 2.2.2 — complements the first by providing a clear definition of solution quality. Metrics to define solution quality in the open-ended digital evolution context remain a topic of active research.

2.2.1 Digital Evolution Benchmark

The digital evolution benchmark runs the DISHTINY (DIStributed Hierarchical Transitions in Individuality) artificial life framework. This system is designed to study major transitions in evolution, events where lower-level organisms unite to form a self-replicating entity. The evolution of multicellularity and eusociality exemplify such transitions. Previous work with DISHTINY has explored methods for selecting traits characteristic of multicellularity such as reproductive division of labor, resource sharing within kin groups, resource investment in offspring, and adaptive apoptosis (Moreno and Ofria, 2019).

DISHTINY simulates a fixed-size toroidal grid populated by digital cells. Cells can sense attributes of their immediate neighbors, can communicate with those neighbors through arbitrary message passing, and can interact with neighboring cells cooperatively through resource sharing or competitively through antagonistic competition to spawn daughter cells into limited space. Cell behavior is controlled by SignalGP event-driven linear genetic programs (Lalejini and Ofria, 2018). Full details of the DISHTINY simulation are available in Moreno and Ofria (2022).

We use Conduit-based messaging channels to manage all interactions between neighboring cells. Conduit models messaging channels as independent objects. However, support is provided for behind-the-scenes consolidation of communication along these channels between pairs of processes. Pooling joins together exactly one message per messaging channel to create a fixed-size consolidated message. Aggregation joins together arbitrarily many messages per channel to create a variable-size consolidated message. (A sketch contrasting the two strategies follows the list below.)

During a computational update, each cell advances its internal state and pushes information about its current state to neighbor cells. Several independent messaging layers handle disparate aspects of cell-cell interaction, including

• Cell spawn messages, which contain arbitrary-length genomes (seeded at 100 12-byte instructions with a hard cap of 1000 instructions). These are handled every 16 updates and use Conduit's built-in aggregation support for inter-process transfer.
• Resource transfer messages, consisting of a 4-byte float value. These are handled every update and use Conduit's built-in pooling support for inter-process transfer.
• Cell-cell communication messages, consisting of arbitrarily many 20-byte packets dispatched by genetic program execution. These are handled every 16 updates and use Conduit's built-in aggregation support for inter-process transfer.
• Environmental state messages, consisting of a 216-byte struct of data. These are handled every 8 updates and use Conduit's built-in pooling support for inter-process transfer.
• Multicellular kin-group size detection messages, consisting of a 16-byte bitstring. These are handled every update and use Conduit's built-in pooling support for inter-process transfer.
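The sketch below illustrates the distinction between pooling and aggregation described above. It is a minimal conceptual illustration under assumed type names, not the Conduit implementation.

```cpp
// Conceptual sketch of the two consolidation strategies (hypothetical
// types; illustrative only, not the Conduit implementation).
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

struct ResourceMsg { float amount; };  // fixed 4-byte payload
struct CommPacket { char data[20]; };  // 20-byte packet payload

// Pooling: exactly one message per channel, yielding one fixed-size
// consolidated message per pair of communicating processes.
template <std::size_t NumChannels>
struct PooledMessage {
  std::array<ResourceMsg, NumChannels> slots;  // slot i <-> channel i
};

// Aggregation: arbitrarily many messages per channel, yielding a
// variable-size consolidated message tagged with channel ids.
struct AggregatedMessage {
  std::vector<std::pair<int, CommPacket>> packets;  // (channel id, payload)
};
```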
Implementing all cell-cell interaction via Conduit-based messaging channels allows the simulation to be parallelized, potentially, down to the granularity of individual cells. These messaging channels allow cells to communicate using the same interface whether they are placed within the same thread, across different threads, or across different processes. However, in practice, for this benchmarking we assign 3600 cells to each thread or process. Because all cell-cell interactions occur via Conduit-based messaging channels, logically-neighboring cells can interact fully whether or not they are located on the same thread or process (albeit with potential irregularities due to best-effort limitations). An alternate approach to evolving large populations might be an island model, where Conduit-based messaging channels would be used solely to exchange genomes between otherwise independent populations (Bennett III et al., 1999). However, we chose to instead parallelize DISHTINY as a unified spatial realm in order to enable parent-offspring interaction and leave the door open for future work with multicells that exceed the scope of an individual thread or process.

Mode  Description
0     Barrier sync every update
1     Rolling barrier sync
2     Fixed barrier sync
3     No barrier sync
4     No inter-cpu communication

Table 2.1: Asynchronicity modes used for benchmarking experiments, arranged from most to least synchronized.

2.2.2 Graph Coloring Benchmark

The graph coloring benchmark employs a graph coloring algorithm designed for distributed WLAN channel selection (Leith et al., 2012). In this algorithm, nodes begin by randomly choosing a color. Each computational update, nodes test for any neighbor with the same color. If and only if a conflicting neighbor is detected, nodes randomly select another color. The probability of selecting each possible color is stored in an array associated with each node. Before selecting a new color, the stored probability of selecting the current (conflicting) color is decreased by a multiplicative factor b. We used b = 0.1, as suggested by Leith et al. Likewise, the stored probability of selecting all others is increased by a multiplicative factor. Regardless of whether their color changed, nodes always transmit their current color to their neighbors.

Our benchmarks focus on weak scalability, using a fixed problem size of 2 048 graph nodes per thread or process. These nodes were arranged in a two-dimensional grid topology where each node had three possible colors and four neighbors. We implement the algorithm with a single Conduit communication layer carrying graph color as an unsigned integer. We used Conduit's built-in pooling feature to consolidate color information into a single MPI message between pairs of communicating processes each update. We performed five replicates, each with a five second simulation runtime.
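For concreteness, the following sketch gives one plausible reading of the color update rule described above. Only the multiplicative penalty factor b = 0.1 on the conflicting color is specified by the text; the renormalization scheme (which increases all other colors' probabilities by a common multiplicative factor) is an assumption on our part.

```cpp
// Sketch of the per-node update rule for distributed graph coloring
// (after Leith et al., 2012). The renormalization of non-conflicting
// colors is an assumption; the text specifies only the factor b.
#include <random>
#include <vector>

constexpr double kB = 0.1;  // multiplicative penalty factor b

struct GraphNode {
  int color;
  std::vector<double> probs;  // selection probability per color; sums to 1

  void Update(const std::vector<int>& neighbor_colors, std::mt19937& rng) {
    // Test for any neighbor sharing our current color.
    bool conflict = false;
    for (const int c : neighbor_colors) {
      if (c == color) { conflict = true; break; }
    }
    if (!conflict) return;  // keep current color

    // Penalize the conflicting color by multiplicative factor b, then
    // renormalize; renormalization increases every other color's
    // probability by a common multiplicative factor.
    probs[color] *= kB;
    double total = 0.0;
    for (const double p : probs) total += p;
    for (double& p : probs) p /= total;

    // Randomly select another color per the stored probabilities.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    color = dist(rng);
  }
};
```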
Solution error was measured as the number of graph color conflicts remaining at the end of the benchmark.

2.2.3 Asynchronicity Modes

For both benchmarks, we compared performance across a spectrum of synchronization settings, which we term "asynchronicity modes" (Table 2.1). Asynchronicity mode 0 represents traditional fully-synchronous methodology. Under this treatment, full barrier synchronization was performed between each computational update. Asynchronicity mode 3 represents fully asynchronous methodology. Under this treatment, individual threads or processes performed computational updates freely, incorporating input from other threads or processes on a fully best-effort basis.

During early development of the library, we discovered episodes where unprocessed messages built up faster than they could be processed — even if they were being skipped over to only get the latest message. In some instances, this strongly degraded quality of service or even caused runtime instability. We opted for MPI communication primitives that could consume many backlogged messages per call and increased buffer size to address these issues, but remained interested in the possibility of partial synchronization to clear potential message backlogs. So, we included two partially-synchronized treatments: asynchronicity modes 1 and 2. In asynchronicity mode 1, threads and processes alternated between performing computational updates for a fixed-time duration and executing a global barrier synchronization. For the graph coloring benchmark, work was performed in 10ms chunks. For the digital evolution benchmark, which is more computationally intensive, work was performed in 100ms chunks. In asynchronicity mode 2, threads and processes executed global barrier synchronizations at predetermined time points. In both experiments, global barrier synchronization occurred on second-hand ticks of the UTC clock.

Finally, asynchronicity mode 4 disables all inter-thread and inter-process communication, including barrier synchronization. We included this mode to isolate the impact on performance of communication between threads and processes from other factors potentially affecting performance, such as cache crowding. In this run mode for the graph coloring benchmark, all calls to send messages between processes or threads were skipped (except after the benchmark concluded, when assessing solution quality). Because of its larger footprint, incorporating logic into the digital evolution simulation to disable all inter-thread and inter-process messaging was impractical. Instead, we launched multiple instances of the simulation as fully-independent processes and measured the performance of each.
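A minimal sketch of how these synchronization policies might wrap a simulation's update loop appears below. The Update, Done, and UTC-tick helpers are hypothetical placeholder scaffolding, not code from our benchmarks.

```cpp
// Sketch of the synchronization policies in Table 2.1 wrapped around a
// simulation loop (hypothetical scaffolding; illustrative only).
#include <mpi.h>
#include <chrono>

namespace {

bool Done() {  // placeholder end-of-run test
  static int updates_elapsed = 0;
  return ++updates_elapsed > 1000;
}

void Update() { /* one compute phase + best-effort communication phase */ }

bool UtcSecondBoundaryPassed() {  // true once per tick of the second hand
  using namespace std::chrono;
  static auto last =
      duration_cast<seconds>(system_clock::now().time_since_epoch());
  const auto now =
      duration_cast<seconds>(system_clock::now().time_since_epoch());
  if (now == last) return false;
  last = now;
  return true;
}

}  // namespace

enum class AsyncMode {
  kBarrierEveryUpdate,  // mode 0
  kRollingBarrier,      // mode 1
  kFixedBarrier,        // mode 2
  kNoBarrier            // mode 3 (mode 4 additionally disables messaging)
};

void RunLoop(const AsyncMode mode, const std::chrono::milliseconds chunk) {
  using clock_t = std::chrono::steady_clock;
  auto next_sync = clock_t::now() + chunk;
  while (!Done()) {
    Update();
    switch (mode) {
      case AsyncMode::kBarrierEveryUpdate:  // lockstep with all processes
        MPI_Barrier(MPI_COMM_WORLD);
        break;
      case AsyncMode::kRollingBarrier:
        // barrier after each fixed-duration work chunk, with the delay
        // measured from the end of the last synchronization
        if (clock_t::now() >= next_sync) {
          MPI_Barrier(MPI_COMM_WORLD);
          next_sync = clock_t::now() + chunk;
        }
        break;
      case AsyncMode::kFixedBarrier:  // barrier at predetermined timepoints
        if (UtcSecondBoundaryPassed()) MPI_Barrier(MPI_COMM_WORLD);
        break;
      case AsyncMode::kNoBarrier:  // fully best-effort; never wait
        break;
    }
  }
}
```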
2.2.4 Quality of Service Metrics

The best-effort communication model eschews effort to insulate computation from real-time message delivery dynamics. Because these dynamics are difficult to predict a priori and can bias computation, thorough empirical runtime measurements are necessary to understand the results of such computation. To this end, we developed a suite of quality of service metrics. Figure 2.1 provides space-time diagrams illustrating the metrics presented in this section. For the purposes of these metrics, we assume that simulations proceed in an iterative fashion with alternating compute and communication phases. For short, we refer to a single compute-communication cycle as a "simstep."

We derive formulas for metrics in terms of independent observations preceding and succeeding a "snapshot" window, during which the simulation and any associated best-effort communication proceeds unimpeded. Snapshot observations are taken at one minute intervals over the course of each of our replicate experiments. The following section, 2.2.5, details the experimental apparatus used to generate the quality of service metrics reported in this work.

Figure 2.1: Quality of service metrics: (a) clumpiness, (b) delivery failure rate, (c) latency, and (d) simstep period. Each illustration is a space-time diagram, with A and B representing independent processes. The vertical axis depicts the passage of time, from top to bottom. Solid black arrows represent message delivery. The left panel of each metric's diagram depicts a scenario with a lower ("better") value for that metric compared to the right panel, which depicts a higher ("worse") value for that metric.

Simstep Period

We calculate the amount of wall-time elapsed per simulation update cycle ("simstep period") during a snapshot window as

\[ \text{simstep period} = \frac{\text{walltime}_{\text{after}} - \text{walltime}_{\text{before}}}{\text{update count}_{\text{after}} - \text{update count}_{\text{before}}}. \]

Figure 2.1d compares a scenario with a low simstep period to a scenario with a higher simstep period.

Simstep Latency

This metric reports the number of simulation iterations that elapse between message dispatch and message delivery. Figure 2.1c compares a scenario with low latency to a scenario with higher latency.

To insulate against imperfect clock synchronization between processes, we estimate one-way latency from a round-trip measure. As part of our instrumentation, each simulation element maintains an independent zero-initialized "touch counter" associated with every neighbor simulation element it communicates with. Dispatched messages originating from each simulation element are bundled with the value of the touch counter associated with the target element. When a message is received back at the originating element from the target element, the touch counter is set to 1 + the bundled touch count. In this manner, the touch counter increments by two for each successful round trip completed. (Because simulation elements are arranged as a toroidal mesh, all interaction between simulation elements is reciprocal.) We therefore calculate one-way latency during a snapshot window as

\[ \text{simstep latency} = \frac{\text{update count}_{\text{after}} - \text{update count}_{\text{before}}}{\max\left(\text{touch count}_{\text{after}} - \text{touch count}_{\text{before}},\, 1\right)}. \]

Note that if no touches elapsed during the snapshot window, we make a best-case assumption that one might elapse immediately after the end of the snapshot window (i.e., we count at least one elapsed touch).

Wall-time Latency

Wall-time latency is closely related to simstep latency, except that it reports time in terms of elapsed wall time instead of simulation updates. To calculate wall-time latency, we apply a conversion to simstep latency based on simstep period,

\[ \text{wall-time latency} = \text{simstep latency} \times \text{simstep period}. \]

This metric directly reflects the real-time performance of message transmission. Although it follows directly from the interaction between simstep period and simstep latency, it complements simstep latency's convenient interpretation in terms of potential simulation mechanics (e.g., simulation elements tending to see data from two updates ago versus from ten).
In addition to simstep latency, Figure 2.1c is also representative of wall-time latency — the difference being interpretation of the y axis in terms of wall-time instead of elapsed simulation updates.

Delivery Failure Rate

Delivery failure rate measures the fraction of messages sent that are dropped. The only condition where messages are dropped is when a send buffer fills. (Under the existing MPI-based implementation, messages that queue on the send buffer are guaranteed for delivery.) So, we can calculate

\[ \text{delivery failure rate} = 1 - \frac{\text{successful send count}_{\text{after}} - \text{successful send count}_{\text{before}}}{\text{attempted send count}_{\text{after}} - \text{attempted send count}_{\text{before}}}. \]

Delivery Clumpiness

Delivery clumpiness seeks to quantify the extent to which message arrival is consolidated to a subset of message pull attempts. That is, it captures the extent to which independently dispatched messages arrive in bundles rather than as an even stream. If messages all arrive in independent pull attempts, then clumpiness will be zero. At the point where the pigeonhole principle applies (num arriving messages ≥ num pull attempts), clumpiness will also be zero so long as every pull attempt is laden. If all messages arrive during a single pull attempt, then clumpiness will approach 1.

We formulate clumpiness as the complement of steadiness. (Reporting clumpiness provides a lower-is-better interpretation consistent with the rest of the quality of service metrics.) Steadiness, in turn, stems from three component statistics,

\[ \text{num laden pulls elapsed} = \text{laden pull count}_{\text{after}} - \text{laden pull count}_{\text{before}}, \]
\[ \text{num messages received} = \text{message count}_{\text{after}} - \text{message count}_{\text{before}}, \]
\[ \text{num pulls attempted} = \text{pull attempt count}_{\text{after}} - \text{pull attempt count}_{\text{before}}. \]

Here, we refer to pull attempts that successfully retrieve a message as "laden." We combine num messages received and num pulls attempted to derive

\[ \text{num opportunities for laden pulls} = \min\left(\text{num messages received},\, \text{num pulls attempted}\right). \]

Then, to calculate steadiness,

\[ \text{steadiness} = \frac{\text{num laden pulls elapsed}}{\text{num opportunities for laden pulls}}. \]

Finally, we find delivery clumpiness as 1 − steadiness. Figure 2.1a compares a scenario with low clumpiness to a scenario with higher clumpiness.
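Taken together, these definitions reduce to simple arithmetic over before/after counter observations. The sketch below restates them in code; the Snapshot struct and its field names are hypothetical stand-ins for our instrumentation counters.

```cpp
// Sketch of the snapshot-differenced quality of service calculations from
// Section 2.2.4 (hypothetical Snapshot struct; illustrative only).
#include <algorithm>

struct Snapshot {  // counters sampled before and after a snapshot window
  double walltime;  // seconds
  long update_count, touch_count;
  long laden_pull_count, message_count, pull_attempt_count;
  long successful_send_count, attempted_send_count;
};

double SimstepPeriod(const Snapshot& a, const Snapshot& b) {
  return (b.walltime - a.walltime) / (b.update_count - a.update_count);
}

double SimstepLatency(const Snapshot& a, const Snapshot& b) {
  // count at least one elapsed touch if none were observed (best case)
  const long touches = std::max(b.touch_count - a.touch_count, 1L);
  return double(b.update_count - a.update_count) / touches;
}

double WalltimeLatency(const Snapshot& a, const Snapshot& b) {
  return SimstepLatency(a, b) * SimstepPeriod(a, b);
}

double DeliveryFailureRate(const Snapshot& a, const Snapshot& b) {
  const double successes =
      b.successful_send_count - a.successful_send_count;
  const double attempts = b.attempted_send_count - a.attempted_send_count;
  return 1.0 - successes / attempts;
}

double DeliveryClumpiness(const Snapshot& a, const Snapshot& b) {
  const double laden = b.laden_pull_count - a.laden_pull_count;
  const double received = b.message_count - a.message_count;
  const double attempts = b.pull_attempt_count - a.pull_attempt_count;
  const double opportunities = std::min(received, attempts);
  return 1.0 - laden / opportunities;  // complement of steadiness
}
```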
2.2.5 Quality of Service Experiments

Quality of service experiments executed the graph coloring algorithm described in Section 2.2.2. In order to maximize communication intensity, only one graph vertex was assigned per CPU. Ten experimental replicates were performed for each condition surveyed. Slightly over five minutes of runtime was afforded to each replicate, with snapshot observations taken at one minute intervals. The first snapshot observation was taken one minute after the beginning of runtime. Snapshot observations lasted one second, with the graph coloring algorithm running fully unhampered during the entire snapshot. This was accomplished by collecting and recording data via a separate thread. That thread collected and recorded a first tranche of snapshot data, spin waited for one second, and then recorded a second tranche.

Because the underlying system runs in real-time while being observed, state changes can occur during data collection (somewhat akin to photographic motion blur). Therefore, some intuitive invariants — like strictly non-negative delivery failure rates — do not hold in some cases. However, the magnitude of such violations is generally minor. Further, because data collection procedures were consistent across treatments, statistical comparisons between treatments remain sound, even if direct interpretation of reported metrics should be taken with a grain of salt.

Snapshots were performed independently for each process at each timepoint. So, for example, for two processes over the five minute window of a single replicate, ten snapshots were collected. For statistical tests comparing treatments, snapshots were aggregated by replicate by both mean and median. For each quality of service statistic we estimate the mean — which captures effects of extreme-magnitude outliers — and the median — which better represents typicality — across these window samples.

Statistical comparisons across treatment conditions are performed via regression. We use ordinary least squares regression to analyze means (Geladi and Kowalski, 1986) and quantile regression to analyze medians (Koenker and Hallock, 2001). For comparisons between dichotomous, categorical treatment conditions, one condition is coded as 0 and the other as 1. In the case of ordinary least squares regression, this boils down to an independent t-test. Although quantile regression on categorical predictors is not precisely equivalent to a direct test on medians between two groups (i.e., Mood's median test), there is precedent for this approach (Konstantopoulos et al., 2019; Petscher and Logan, 2014).

Most statistics reported here can be calculated just as well in terms of incoming or outgoing messages. That is, most statistics can be generated via data from instrumentation attached to message "inlets" or data from instrumentation attached to message "outlets" with no obvious reason to prefer one over the other. As "inlet-" and "outlet-"derived statistics are nearly identical in all cases, we simply report the mean over these two measurements.

2.2.6 Code, Data, and Reproducibility

Benchmarking Experiments

Benchmarking experiments were performed on Michigan State University's High Performance Computing Center, a cluster of hundreds of heterogeneous x86 nodes linked with InfiniBand interconnects. For multithread experiments, benchmarks for each thread count were collected from the same node. For multiprocess experiments, each process was assigned to a distinct node in order to ensure results were representative of performance in a distributed context. All multiprocess benchmarks were recorded from the same collection of nodes. Hostnames are recorded for each benchmark data point. For an exact accounting of hardware architectures used, these hostnames can be cross-referenced with a table included with the data that summarizes the cluster's node configurations.

Code for the distributed graph coloring benchmark is available at https://github.com/mmore500/conduit under demos/channel_selection. Code for the digital evolution simulation benchmark is available at https://github.com/mmore500/dishtiny. Exact versions of software used are recorded with each benchmark data point. Data is available via the Open Science Framework at https://osf.io/7jkgp/ and https://osf.io/72k5n (Foster and Deardorff, 2017). A live, in-browser notebook for all reported statistics and data visualizations is available via Binder at https://mybinder.org/v2/gh/mmore500/conduit/binder?filepath=binder%2Fdate%3D2021%2Bproject%3D72k5n (Project Jupyter et al., 2018).
Quality of Service Experiments

Quality of service experiments were carried out on Michigan State University's High Performance Computing Center lac cluster, consisting of 28-core Intel(R) Xeon(R) CPU E5-2680 v4 2.40GHz nodes. All statistical comparisons are performed between observations from the same job allocation (except in the case where intranode and internode configurations were compared; those experiments were performed on separate allocations using comparable nodes on the same cluster).

The benchmarking experiments described in Section 2.2.6 used a send/receive buffer size of 2. However, due to the high communication intensity of the graph coloring problem with just one simulation element per CPU, quality of service experiments required a larger buffer size of 64 to maintain runtime stability. In early work developing the Conduit library, we discovered that real-time messaging channels can enter a destabilizing positive feedback spiral when incoming messages take longer to handle (e.g., skip past or read) than to send. Under such conditions, when a process exchanging messages with a partner process experiences a delay, it sends fewer messages to that partner process. Due to fewer incoming messages, the partner process can update more rapidly, increasing the incoming message load on the delayed process. This effect can snowball, degrading a partnership intended for even, two-way message exchange into an effectively unilateral producer-consumer relationship where (potentially unbounded) work piles up on the consumer. To interrupt such a scenario, we use the bulk message pull call MPI_Testsome to ensure fast message consumption under backlogged conditions, so that receiver workload remains closer to constant under high-traffic situations (instead of having to pull messages down one-by-one); a sketch of this pattern appears at the end of this section. Larger receive buffer size, as configured for the quality of service experiments, increases the effectiveness of the bulk message consumption countermeasure.

Code for the distributed graph coloring benchmark is available at https://github.com/mmore500/conduit under demos/channel_selection. Exact versions of software used are recorded with each benchmark data point. Data is available via the Open Science Framework at https://osf.io/72k5n/ (Foster and Deardorff, 2017). A live, in-browser notebook for all reported statistics and data visualizations is available via Binder at https://mybinder.org/v2/gh/mmore500/conduit/binder?filepath=binder%2Fdate%3D2021%2Bproject%3D72k5n (Project Jupyter et al., 2018).
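The following sketch illustrates bulk message consumption with MPI_Testsome over a ring of posted receives. It is a minimal illustration of the countermeasure described above, not the Conduit implementation; the ReceiveRing type is hypothetical.

```cpp
// Sketch of bulk message consumption with MPI_Testsome, which clears many
// backlogged messages per call (hypothetical type; illustrative only).
#include <mpi.h>
#include <vector>

constexpr int kBufferSize = 64;  // receive buffer size used in QoS runs

struct ReceiveRing {
  std::vector<int> payloads = std::vector<int>(kBufferSize);
  std::vector<MPI_Request> requests =
      std::vector<MPI_Request>(kBufferSize);

  void PostAll(const int source) {
    for (int i = 0; i < kBufferSize; ++i)
      MPI_Irecv(&payloads[i], 1, MPI_INT, source, /*tag=*/0,
                MPI_COMM_WORLD, &requests[i]);
  }

  // Consume every completed receive in a single call, then repost;
  // returns the number of backlogged messages cleared.
  int PullBulk(const int source) {
    std::vector<int> indices(kBufferSize);
    int outcount = 0;
    MPI_Testsome(kBufferSize, requests.data(), &outcount,
                 indices.data(), MPI_STATUSES_IGNORE);
    for (int j = 0; j < outcount; ++j) {
      const int i = indices[j];
      // handle or skip past payloads[i] here, then repost the slot
      MPI_Irecv(&payloads[i], 1, MPI_INT, source, /*tag=*/0,
                MPI_COMM_WORLD, &requests[i]);
    }
    return outcount;
  }
};
```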
2.3 Results and Discussion

Sections 2.3.1 and 2.3.2 compare execution performance under the best-effort communication model versus the perfect communication model. In particular, both sections investigate how the impact of best-effort communication on performance relates to CPU count scale. Section 2.3.1 covers multithreading and Section 2.3.2 covers multiprocessing.

The next sections investigate how system configuration affects quality of service. Specifically, these sections cover the impact of

• increasing compute workload per simulation update step (Section 2.3.3),
• within-node versus between-node process placement (Section 2.3.4), and
• multithreading versus multiprocessing (Section 2.3.5).

Section 2.3.6 tests how quality of service changes with CPU count. This analysis fleshes out the performance-centric picture of best-effort scalability established in Sections 2.3.1 and 2.3.2. Section 2.3.7 tests how inclusion of an apparently faulty node (i.e., one that provided exceptionally poor quality of service) affects global quality of service. This experiment provides insight into the robustness of best-effort approaches to single-point failure.

2.3.1 Performance: Multithread Benchmarks

We first tested how performance on the graph coloring and digital evolution benchmarks fared when increasing thread count on a single hardware node.

Figure 2.2a presents per-CPU algorithm update rate for the graph coloring benchmark at 1, 4, 16, and 64 threads. Update rate performance decreased with increasing multithreading across all asynchronicity modes. This performance degradation was rather severe — per-CPU update rate decreased by 61% between 1 and 4 threads and by about another 75% between 4 and 64 threads. Surprisingly, this issue appears largely unrelated to inter-thread communication, as it was also observed in asynchronicity mode 4, where all inter-thread communication is disabled. Perhaps per-CPU update rate degradation under threading was induced by strain on a limited system resource like memory cache or access to the system clock (which was used to control run timing). This unexpectedly severe phenomenon merits further investigation in future work with this benchmark.

Nevertheless, we were able to observe significantly better performance of best-effort asynchronicity modes 1, 2, and 3 at high thread counts. At 64 threads, these run modes significantly outperformed the fully-synchronized mode 0 (p < 0.05, non-overlapping 95% confidence intervals). Likewise, as shown in Figure 2.2b, best-effort asynchronicity modes were able to deliver significantly better graph coloring solutions within the allotted compute time than the fully-synchronized mode 0 (p < 0.05, non-overlapping 95% confidence intervals).

Figure 2.2c shows per-CPU algorithm update rate for the digital evolution benchmark at 1, 4, 16, and 64 threads. Similarly to the graph coloring benchmark, update rate performance decreased with increasing multithreading across all asynchronicity modes — including mode 4, which eschews inter-thread communication. Even without communication between threads, with 64 threads each thread performed updates at only 61% the rate of a lone thread. At 64 threads, best-effort asynchronicity modes 1, 2, and 3 exhibit about 43% the update-rate performance of a lone thread. Although best-effort inter-thread communication only exhibits half the update-rate performance of completely decoupled execution at 64 threads, this update-rate performance is roughly 2.1× that of the fully-synchronous mode 0. Indeed, best-effort modes significantly outperform the fully-synchronous mode on the digital evolution benchmark at both 16 and 64 threads (p < 0.05, non-overlapping 95% confidence intervals).

Figure 2.2: Multithread benchmark results: (a) graph coloring per-thread update rate (higher is better), (b) graph coloring solution conflicts (lower is better), and (c) digital evolution per-thread update rate (higher is better). Bars represent bootstrapped 95% confidence intervals.
2.3.2 Performance: Multiprocess Benchmarks

Next, we tested how performance on the graph coloring and digital evolution benchmarks fared when scaling with fully independent processes located on different hardware nodes.

Figure 2.3a shows per-CPU algorithm update rate for the graph coloring benchmark at 1, 4, 16, and 64 processes. Unlike the multithreaded benchmark, multiprocess graph coloring exhibits consistent update-rate performance across process counts under asynchronicity mode 4, where inter-CPU communication is entirely disabled. This matches the unsurprising expectation that, with comparable hardware, a single process should exhibit the same mean performance as any number of completely decoupled processes. At 64 processes, best-effort asynchronicity mode 3 with the graph coloring benchmark exhibits about 63% the update-rate performance of single-process execution. This represents a 7.8× speedup compared to fully-synchronous mode 0. Indeed, best-effort mode 3 enables significantly better per-CPU update rates at 4, 16, and 64 processes (p < 0.05, non-overlapping 95% confidence intervals).

Likewise, as shown in Figure 2.3b, best-effort asynchronicity mode 3 yields significantly better graph-coloring results within the allotted time at 4, 16, and 64 processes (p < 0.05, non-overlapping 95% confidence intervals). Interestingly, partial-synchronization modes 1 and 2 exhibited highly inconsistent solution quality at the 16 and 64 process count benchmarks. Fixed-timepoint barrier sync (mode 2) had particularly poor performance at 64 processes (note the log-scale axis). We suspect this was caused by a race condition where workers would anchor sync points to different fixed timepoints based on slightly different startup times (i.e., process 0 syncs at seconds 0, 1, 2... while process 1 syncs at seconds 1, 2, 3...).

Figure 2.3c presents per-CPU algorithm update rate for the digital evolution benchmark at 1, 4, 16, and 64 processes. Relative performance fares well at high process counts under this relatively computation-heavy workload. With 64 processes, fully best-effort simulation retains about 92% the update rate performance of single-process simulation. This represents a 2.1× speedup compared to the fully-synchronous run mode 0. Best-effort mode 3 significantly outperforms the per-CPU update rate of fully-synchronous mode 0 at process counts 16 and 64 (p < 0.05, non-overlapping 95% confidence intervals).

Figure 2.3: Multiprocess benchmark results: (a) graph coloring per-process update rate (higher is better), (b) graph coloring solution conflicts (lower is better), and (c) digital evolution per-process update rate (higher is better). Bars represent bootstrapped 95% confidence intervals.

2.3.3 Quality of Service: Computation vs. Communication

Having shown performance benefits of best-effort communication on the graph coloring and digital evolution benchmarks in Sections 2.3.1 and 2.3.2, we next seek to more fully characterize the best-effort approach using a holistic suite of proposed quality of service metrics. This section evaluates how a simulation's ratio of communication intensity to computational work affects these quality of service metrics. The graph coloring benchmark serves as our experimental model.

For this experiment, arbitrary compute work (detached from the underlying algorithm) was added to the simulation update process. We used a call to the std::mt19937 random number engine as a unit of compute work. In microbenchmarks, we found that one work unit consumed about 35ns of walltime and 21ns of compute time. We performed 5 treatments, adding 0, 64, 4 096, 262 144, or 16 777 216 units of compute work to the update process. For each treatment, measurements were made on a pair of processes split across different nodes.
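As a concrete sketch, the added-work manipulation can be expressed as below. The helper name is hypothetical; the volatile qualifier simply guards against the compiler optimizing the loop away.

```cpp
// Sketch of the added-compute-work manipulation: one unit of work is one
// call to a std::mt19937 engine (~35ns walltime per call, per our
// microbenchmarks). The helper name is hypothetical.
#include <cstdint>
#include <random>

void DoAddedWork(std::mt19937& rng, const std::uint64_t num_work_units) {
  for (std::uint64_t i = 0; i < num_work_units; ++i) {
    volatile std::uint32_t discard = rng();  // prevent loop elision
    (void)discard;
  }
}

// Treatments added 0, 64, 4 096, 262 144, or 16 777 216 work units to
// each simulation update.
```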
Simstep Period

Unsurprisingly, we found a direct relationship between per-update computational workload and the walltime required per computational update. Supplementary Figures A.24 and A.26 depict the distribution of walltime per computational update across snapshots. Once added compute work supersedes the light compute work already associated with the graph coloring algorithm update step (at around 64 work units), simstep period scales in direct proportion with compute work. Indeed, we found a significant positive relationship between both mean and median simstep period and added compute work (Supplementary Figures A.32 and A.34). At 0 units of added compute work, mean and median simstep period were both 14.7 µs. At 16 777 216 units of added compute work, mean simstep period was 611ms and median simstep period was 507ms. Supplementary Tables A.17 and A.18 detail numerical results of these regressions.

Simstep Latency

Unsurprisingly, again, we observed a negative relationship between the number of simulation steps elapsed during message transit and added computational work. Put simply, longer update steps provide more time for messages to transit. Supplementary Figures A.25 and A.21 show the distribution of simstep latency across compute workloads. With no added compute work, messages take between 20 and 100 simulation steps to transit (mean: 48.0 updates; median: 42.5 updates). At maximum compute work per update, messages arrive at a median 1.00 update latency. Regression analysis confirms a significant negative relationship between both mean and median log simstep latency and log added compute work (Supplementary Figures A.33 and A.29). Supplementary Tables A.17 and A.18 detail numerical results of these regressions.

Walltime Latency

Effects of compute work on walltime latency highlight an important caveat in interpretation of this metric. At 0, 64, and 4 096 work units, walltime latency measures ≈ 1 ms (means: 708 µs, 788 µs, 902 µs; medians: 622 µs, 640 µs, 738 µs). However, once simstep period grows to ≈ 10 ms at 262 144 work units (an order of magnitude in excess of the walltime latency observed at low compute loads), walltime latency increases with added compute work. At 16 777 216 compute work units, a 1.00s median walltime latency is observed.

Because our computational model assumes on-demand message delivery with a communication phase occurring only once per simulation update, message transmission speed is fundamentally limited by simulation update period. If a message is dispatched while its recipient is busy doing computational work, the soonest it can be received will be when that recipient completes the computational phase of its update. In order to measure transmission time fully independent of delays due to on-demand delivery, additional instrumentation would be necessary.
However, when this latency is greater than a few simsteps, this measure is reasonably representative of message transmission time. Supplementary Figures A.20 and A.22 show the distribution of walltime latency across computational workloads. Supplementary Figures A.28 and A.30 summarize regression between walltime latency and added compute work. Supplementary Tables A.17 and A.18 detail numerical results of those regressions.

Delivery Clumpiness

We observed a significant negative relationship between computation workload and delivery clumpiness. At low computational intensity, we observed clumpiness greater than 0.95, meaning that fewer than 5% of pull requests were laden with fresh messages (at 0 compute work mean: 0.96, median: 0.96). However, at high computational intensity clumpiness reached 0, indicating that messages arrived as a steady stream (at 16 777 216 compute work mean: 0.00, median: 0.00). Presumably, the reduction in clumpiness is due to increased real-time separation between dispatched messages. Supplementary Figure A.23 shows the effect of computational workload on the distribution of observed clumpiness values. We found a significant negative relationship between both mean and median clumpiness and computational intensity. Supplementary Figure A.31 visualizes these regressions and Supplementary Tables A.17 and A.18 provide numerical details.

Delivery Failure Rate

We did not observe any delivery failures across all replicates and all compute workloads. So, compute workload had no observable effect on delivery reliability. Supplementary Figure A.27 shows the distribution of delivery failure rates across computation workloads and Supplementary Figure A.35 shows regressions of delivery failure rate against computational workload. See Supplementary Tables A.17 and A.18 for numerical details.

2.3.4 Quality of Service: Intranode vs. Internode

This section tests the effect of process assignment on best-effort quality of service, comparing multi-node and single-node assignments. The graph coloring benchmark again serves as our experimental model. For this experiment, processes were either assigned to the same node or were assigned to different nodes. In both cases, we used two processes.

Simstep Period

Simstep period was significantly slower under internode conditions than under intranode conditions. When processes shared the same node, simstep period was around 9 µs (mean: 9.06 µs; median: 9.08 µs). Under internode conditions, simstep period was around 14 µs (mean: 14.5 µs; median: 14.4 µs). Supplementary Figures A.40 and A.42 depict the distribution of walltime per computational update across intranode and internode conditions.

This result is presumably attributable to an increased walltime cost for calls to the MPI implementation backing internode communication compared to the MPI implementation backing intranode communication. Although this effect is clearly detectable, its magnitude is modest given the minimal computational intensity of the simulation update step — internode dispatch is only ≈ 56% more expensive than intranode dispatch.

Both mean and median simstep period increased significantly under internode conditions. (Supplementary Figures A.48 and A.50 visualize these regressions and Supplementary Tables A.19 and A.20 detail numerical results.)

Simstep Latency

Significantly more simulation updates transpired during message transmission under internode conditions compared to intranode conditions.
Supplementary Figures A.41 and A.37 compare the distributions of simstep latency across these conditions. Simstep latency was around 1 update for intranode communication (mean: 1.00 updates; median: 0.75 updates) and around 40 updates for internode communication (mean: 41.6 updates; median: 37.4 updates).

Regression analysis confirms the significant effect of process placement on simstep latency (Supplementary Figures A.49 and A.45). Supplementary Tables A.19 and A.20 detail numerical results of these regressions.

Walltime Latency

Significantly more walltime elapsed during message transmission under internode conditions compared to intranode conditions. Walltime latency was less than 10 µs for intranode communication (mean: 7.70 µs; median: 6.94 µs). Internode communication had approximately 50× greater walltime latency, at around 500 µs (mean: 600 µs; median: 551 µs). Supplementary Figures A.36 and A.38 show the distributions of walltime latency for intra- and inter-node communication. Regression analysis confirmed a significant increase in walltime latency under inter-node communication (Supplementary Figures A.44, A.46; Supplementary Tables A.19 and A.20).

Delivery Clumpiness

Delivery clumpiness was minimal under intranode communication and very high under internode communication. Under intranode conditions, we observed a mean clumpiness value of 0.014 and a median of 0.002. Under internode conditions, we observed mean and median clumpiness values of 0.96. Supplementary Figures A.39 and A.39 show the distributions of clumpiness for intra- and inter-node communication. Regression analysis confirmed a significant increase in clumpiness under inter-node communication (Supplementary Figures A.47, A.47; Supplementary Tables A.19 and A.20).

Delivery Failure Rate

Somewhat counterintuitively, a significantly higher proportion of deliveries failed for intranode communication than for internode communication. We observed a delivery failure rate of around 0.3 for intranode communication (mean: 0.33; median: 0.30) and no delivery failures for internode communication (mean: 0.00; median: 0.00). In some intranode snapshot windows, we observed a delivery failure rate as high as 0.8. Supplementary Figures A.39 and A.39 show the distributions of delivery failure rate for intra- and inter-node communication.

Because of Conduit's current MPI-based implementation, messages only drop when the underlying send buffer fills; queued messages are guaranteed for delivery. The slower simstep period under internode allocation could improve stability of the send buffer due to more time, on average, between send attempts. Underlying buffering or consolidation by the MPI backend for internode communication might also play a role by allowing data to be moved out of the userspace send buffer more promptly. Regression analysis confirmed a significant increase in delivery failure under intra-node communication (Supplementary Figures A.47, A.47; Supplementary Tables A.19 and A.20).

2.3.5 Quality of Service: Multithreading vs. Multiprocessing

This section compares best-effort quality of service under multithreading and multiprocessing schemes. We hold hardware configuration constant by restricting multiprocessing to cores of a single hardware node, as is the case for multithreading. However, inter-process communication occurred via MPI calls while inter-thread communication occurred via shared memory access mediated by a C++ std::mutex. The graph coloring benchmark again serves as our experimental model.
Both treatments used a single pair of CPUs.

Simstep Period

Multithreading enabled faster simulation update turnover than multiprocessing. Under multithreading, simstep period was around 5 µs (mean: 4.60 µs; median: 4.64 µs). Simstep period for multiprocessing was around 9 µs (mean: 9.00 µs; median: 9.04 µs). Supplementary Figures A.56 and A.58 depict the distribution of walltime per computational update for both multiprocessing and multithreading. This result falls in line with expectations that interaction via shared memory incurs lower overhead than via MPI calls.

Regression analysis showed that both mean and median simstep period were significantly slower under multiprocessing compared to multithreading. (Supplementary Figures A.64 and A.66 visualize these regressions and Supplementary Tables A.21 and A.22 detail numerical results.)

Walltime Latency

No significant difference in walltime latency was detected between multiprocessing and multithreading. In the median case, walltime latency was approximately 5 µs for multithreading and 8 µs for multiprocessing. However, a pair of extreme outliers among snapshot windows — with walltime latencies of approximately 12ms — drove multithreading walltime latency much higher in the mean case (451 µs). In the median case, multiprocessing walltime latency was 8.56 µs.

Cache invalidation or mutex contention provide possible explanations for the observed episodes of extreme multithreading latency, although a magnitude on the order of milliseconds for such effects is surprising. Multithreading appears to provide marginally lower latency service in the median case, but at the cost of vulnerability to extreme high-latency disruptions.

Supplementary Figures A.52 and A.54 show the distributions of walltime latency for multithread and multiprocess runs. Regression analysis did not detect any significant difference in walltime latency between multithreading and multiprocessing (Supplementary Figures A.60, A.62; Supplementary Tables A.21 and A.22).

Simstep Latency

No significant difference in simstep latency was detected between multiprocessing and multithreading. In the median case, multiprocessing offered marginally lower simstep latency than multithreading. Median simstep latency was 0.84 updates under multiprocessing and 1.10 updates under multithreading. However, just as for walltime latency, extreme magnitude outliers (≈ 2 000 simsteps) boosted mean simstep latency for multithreading. Mean simstep latency was 0.94 updates under multiprocessing and 78.0 updates under multithreading. Supplementary Figures A.57 and A.53 compare the distributions of simstep latency across these conditions.

Direct measurements of simstep period and walltime latency suggest that a faster simstep period, rather than slower walltime latency, explains the marginally higher simstep latency under multithreading. Regression analysis detected no significant effect of threading versus processing on simstep latency in both the mean and median cases (Supplementary Figures A.65 and A.61). Supplementary Tables A.21 and A.22 detail numerical results of these regressions.

Delivery Clumpiness

Multithreading exhibited higher median clumpiness and greater variance in clumpiness than multiprocessing. Under multithreading, clumpiness was nearly 1 within some snapshot windows and less than 0.1 within others. Under multiprocessing, clumpiness was consistently less than 0.1.
Supplementary Figures A.55 and A.55 show the distributions of clumpiness under both multiprocessing and multithreading. Multithreading median clumpiness was 0.54; multiprocessing median clumpiness was 0.03. Multithreading and multiprocessing mean clumpiness values were 0.56 and 0.03, respectively. Regression analysis confirmed significantly greater clumpiness under multithreading compared to multiprocessing (Supplementary Figures A.63, A.63; Supplementary Tables A.21 and A.22).

Delivery Failure Rate

We observed a higher proportion of deliveries fail for multiprocessing than for multithreading. (This is as expected; the multithread implementation directly wrote updates to a piece of shared memory, so there was no send buffer to backlog and induce message drops.) Multiprocessing exhibited both mean and median delivery failure rates of 0.38. In individual multiprocessing snapshot windows, we observed delivery failure rates ranging from less than 0.1 to as high as 0.7. We observed no multithreaded delivery failures. Supplementary Figures A.55 and A.55 show the distributions of delivery failure rate for multithreading and multiprocessing. Regression analysis confirmed a significant increase in delivery failure under multiprocessing (Supplementary Figures A.63, A.63; Supplementary Tables A.21 and A.22).

2.3.6 Quality of Service: Weak Scaling

Sections 2.3.2 and 2.3.1 showed how best-effort communication could improve application performance, particularly when scaling up processor count. Multiprocess performance scales well under the best-effort approach, with overlapping performance estimate intervals for 16 and 64 processor counts on both surveyed benchmark problems.

This section aims to flesh out a more holistic picture of the effects of increasing processor count on best-effort computation by considering a comprehensive suite of quality of service metrics. Our particular interest is in which, if any, aspects of quality of service degrade under larger processing pools. To address these questions, we performed weak scaling experiments on 16, 64, and 256 processes using the graph coloring benchmark.

To broaden the survey, we tested scaling with different numbers of processors allocated per node and different numbers of simulation elements assigned per processor. For the first variable, we tested scaling on allocations with each processor hosted on an independent node and on allocations where each node hosted an average of four processors. This allowed us to examine how quality of service fared in homogeneous network conditions, where all communication between processes was inter-node, compared to heterogeneous conditions, where some inter-process communication was inter-node and some was intra-node. For the second variable, we tested with 2 048 simulation elements ("simels") per processor (consistent with the benchmarking experiments performed in Sections 2.3.2 and 2.3.1) and with just one simulation element per processor. This allowed us to vary the amount of computational work performed per process.

Simstep Period

Supplementary Figures A.5 and A.7 survey the distributions of simstep periods observed within snapshot windows. Across process counts, simstep period registers around 80 µs with one simel and around 200 µs with 2 048 simels. However, on heterogeneous allocations (4 CPUs per node) this metric is more variable, spanning up to an order of magnitude. Outlier observations range up to around 10ms with 2 048 simels and up to slightly less than 100ms inlet / 4s outlet with 1 simel.
We performed an ordinary least squares (OLS) regression to test how mean simstep period changed with processor count. In all cases except one simel per CPU with four CPUs per node, mean simstep period increased significantly with processor count from 16 to 64 to 256 CPUs. However, from 64 to 256 processors, mean simstep period increased significantly only with one simel per CPU and one CPU per node. Between 64 and 256 processes, mean simstep period actually decreased significantly for runs with 2 048 simels per CPU. Figure 2.4 and Supplementary Figure A.13 visualize reported OLS regressions. Supplementary Tables A.5 and A.7 provide numerical details on reported OLS regressions.

Median simstep period exhibited the same relationships with processor count, tested with quantile regression. Supplementary Figures A.18 and A.19 visualize the corresponding quantile regressions. Supplementary Tables A.13 and A.15 report numerical details on those quantile regressions.

Except for the extreme case of one simel per CPU and one CPU per node, simstep period quality of service is stable in scaling from 64 to 256 processes.

Walltime Latency

Walltime latency sits at around 500 µs for one-simel runs and around 2ms for 2 048-simel runs. However, variability is greater for heterogeneous (four CPUs per node) allocations. Extreme outliers of up to almost 100ms inlet / 2s outlet occur in four CPUs per node, one-simel runs. In 256 process, 2 048-simel, one CPU per node runs, outliers of more than 10s occur. Supplementary Figures A.1 and A.3 show the distribution of walltime latencies observed across run conditions.

We performed OLS regressions to test how mean walltime latency changed with processor count. Over 16, 64, and 256 processes, mean walltime latency increased significantly with processor count only with 2 048 simels per CPU. Between 64 and 256 processes, mean walltime latency increased significantly with processor count only for one CPU per node with 2 048 simels per CPU. Supplementary Figures A.9 and A.11 show these regressions. Supplementary Tables A.1 and A.3 provide numerical details.

Next, we performed quantile regressions to test how processor count affected median walltime latency. Over 16, 64, and 256 processes, median walltime latency increased significantly only with 4 CPUs per node and 2 048 simels per CPU. Over 64 and 256 processes, there was no significant relationship between processor count and median walltime latency under any condition. Figure 2.5 and Supplementary Figure A.16 show regression results. Supplementary Tables A.9 and A.11 provide numerical details.

Simstep Latency

Simstep latency sits around 7 updates for runs with one simel per CPU and around 1.2 updates for runs with 2 048 simels per CPU. For runs with one simel per CPU, outlier snapshot windows reach up to 50 updates under homogeneous allocations and up to almost 100 updates under heterogeneous allocations. The 2 048 simels per CPU, one CPU per node, 256 process condition exhibited outliers of up to almost 8 000 updates simstep latency. Supplementary Figures A.6 and A.2 show the distribution of simstep latencies observed across run conditions.

Over 16, 64, and 256 processes, mean simstep latency increased with process count only under 1 CPU per node, 2 048 simels per CPU conditions. The same was true over just 64 to 256 processes. Supplementary Figures A.12 and A.10 show the OLS regressions performed, with Supplementary Tables A.6 and A.2 providing numerical details.
For median simstep latency, however, there was no condition where latency increased significantly with process count. Figure 2.6 and Supplementary Figure A.15 show the quantile regressions performed, with Supplementary Tables A.14 and A.10 providing numerical details.

Delivery Clumpiness

For one-simel-per-CPU runs, median delivery clumpiness registered between 0.8 and 0.6. On 2 048-simel-per-CPU runs, median delivery clumpiness was lower, at around 0.4. Supplementary Figure A.4 shows the distribution of delivery clumpiness values observed across run conditions.

Using OLS regression, we found no evidence of mean clumpiness worsening with increased process count. In fact, over 16, 64, and 256 processes, clumpiness significantly decreased with process count in all conditions except four CPUs per node with 2 048 simels per CPU. Figure 2.7 and Supplementary Table A.4 detail regressions performed to test the relationship between mean clumpiness and process count. Median delivery clumpiness exhibited the same relationships with processor count, tested with quantile regression. Supplementary Figure A.17 and Supplementary Table A.12 detail regressions between median clumpiness and process count.

Delivery Failure Rate

Typical delivery failure rate was near zero, except with one simel per CPU and four CPUs per node, where median delivery failure rate was approximately 0.1. However, outlier delivery failure rates of up to 0.7 were observed with 1 CPU per node, 2 048 simels per CPU, and 256 processes. Outlier delivery failure rates of up to 0.2 were observed with 4 CPUs per node, 2 048 simels per CPU, and 256 processes. Supplementary Figure A.8 shows the distribution of delivery failure rates observed across run conditions.

Mean delivery failure rate increased significantly between 64 and 256 processes with 1 CPU per node and 2 048 simels per CPU as well as with 4 CPUs per node and 1 simel per CPU. However, median delivery failure rate only increased significantly with processor count with 4 CPUs per node and 1 simel per CPU. Supplementary Figure A.14 and Supplementary Table A.8 detail the OLS regression testing mean delivery failure rate against processor count. Figure 2.8 and Supplementary Table A.16 detail the quantile regression testing median delivery failure rate against processor count.

2.3.7 Quality of Service: Faulty Hardware

The extreme magnitude of outliers for metrics reported in Section 2.3.6 prompted further investigation of the conditions under which these outliers arose. Closer inspection revealed that the most extreme outliers were all associated with snapshots on a single node: lac-417. So, we acquired two separate 256 process allocations on the lac cluster: one including lac-417 and one excluding lac-417.

Supplementary Figures A.72, A.74, A.73, A.69, A.68, A.70, A.71, and A.75 compare the distributions of quality of service metrics between allocations with and without lac-417. Extreme outliers are present exclusively in the lac-417 allocation for walltime latency, simstep latency, and delivery failure rate. Otherwise, the metrics' distributions across snapshots are very similar between allocations.

Supplementary Figures A.80, A.82, A.81, A.77, A.76, A.78, A.79, A.79, and A.83 chart OLS and quantile regressions of quality of service metrics on job composition. Mean walltime latency, simstep latency, and delivery failure rate are all significantly greater with lac-417. Surprisingly, mean simstep period is significantly longer without lac-417.
However, there is no significant difference in median value for any quality of service metric between allocations including or excluding lac-417. This stability of metric medians within allocations containing lac-417 — which have significantly different means due to outlier values induced by the presence of lac-417 — demonstrates how the best-effort system maintains overall quality of service stability despite defective or degraded components. Supplementary Tables A.23 and A.24 provide numerical details on the regressions reported above.

Figure 2.4: Ordinary least squares regressions of Simstep Period Inlet (ns) against log processor count for the weak scaling experiment (Section 2.3.6). Lower is better. Panels: (a) complete ordinary least squares regression plot (observations are means per replicate); (b) estimated regression coefficient for complete regression (zero corresponds to no effect); (c) piecewise ordinary least squares regression plot; (d) estimated regression coefficient for rightmost partial regression. Top row shows complete regression and bottom row shows piecewise regression. Ordinary least squares regression estimates the relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.
[Figure 2.5 (plots omitted): Quantile regressions of Latency Walltime Inlet (ns) against log processor count for the weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression; bottom row shows piecewise regression. Panels: (a) complete quantile regression plot, observations are medians per replicate; (b) estimated regression coefficient for the complete regression, zero corresponds to no effect; (c) piecewise quantile regression plot; (d) estimated regression coefficient for the rightmost partial regression. Quantile regression estimates the relationship between the independent variable and the median of the response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.]

[Figure 2.6 (plots omitted): Quantile regressions of Latency Simsteps Inlet against log processor count for the weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression; bottom row shows piecewise regression. Panels as in Figure 2.5. Quantile regression estimates the relationship between the independent variable and the median of the response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.]

[Figure 2.7 (plots omitted): Ordinary least squares regressions of Delivery Clumpiness against log processor count for the weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression; bottom row shows piecewise regression. Panels: (a) complete OLS regression plot, observations are means per replicate; (b) estimated regression coefficient for the complete regression, zero corresponds to no effect; (c) piecewise OLS regression plot; (d) estimated regression coefficient for the rightmost partial regression. OLS regression estimates the relationship between the independent variable and the mean of the response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.]
[Figure 2.8 (plots omitted): Quantile regressions of Delivery Failure Rate against log processor count for the weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression; bottom row shows piecewise regression. Panels: (a) complete quantile regression plot, observations are medians per replicate; (b) estimated regression coefficient for the complete regression, zero corresponds to no effect; (c) piecewise quantile regression plot; (d) estimated regression coefficient for the rightmost partial regression. Quantile regression estimates the relationship between the independent variable and the median of the response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.]

2.4 Conclusion

The fundamental motivation for best-effort communication is efficient scalability. Our results confirm that best-effort communication can fulfill this goal. We found that the best-effort approach significantly increases performance at high CPU count. This finding was consistent across the communication-intensive graph coloring benchmark and the computation-intensive digital evolution benchmark. The computation-heavy digital evolution benchmark yielded the strongest scaling efficiency, achieving at 64 processes 92% of the update rate of single-process execution. We observed the greatest relative speedup under distributed communication-heavy workloads: about 7.8× on the graph coloring benchmark. In the case of the graph coloring benchmark, we also found that best-effort communication can help achieve tangibly better solution quality within a fixed time constraint.

Because real-time volatility affects the outcome of computation under the best-effort model, raw execution speed does not suffice to fully understand the consequences of the best-effort communication model. In order to characterize real-time dynamics under the best-effort model, we designed and measured a suite of quality of service metrics: simstep period, simstep latency, wall-time latency, delivery failure rate, and delivery clumpiness. We performed several experiments to validate and characterize these metrics.
Comparing quality of service between multithreading and multiprocessing, we found that multithreading had lower runtime overhead cost but that multiprocessing reduced delivery erraticity, curbing especially extreme poor quality of service outlier events. We found better quality of service, especially with respect to latency, for processes occupying the same node. Finally, varying the ratio of computational work to communication, we found that lower communication intensity was associated with less volatile quality of service.

In order for best-effort communication to succeed in facilitating scale-up, median quality of service must stabilize with increasing CPU count. Put another way, best-effort communication cannot succeed at scale if communication quality tends toward complete degradation. In Section 2.3.6, we used weak scaling experiments to test the effect of scale-up on quality of service at 16, 64, and 256 processes. Under a lower communication-intensity task parameterization, we found that all median quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensity, we found in one case that median simstep period degraded from around 80 µs to around 85 µs. In another case, median message delivery failure rate increased from around 7% to around 9%. Such minor (and, in most cases, nil) degradation in median quality of service despite maximal communication intensity bodes well for the viability of best-effort communication at scale.

Resilience is a second major motivating factor for best-effort computing. In another promising result, we found that the presence of an apparently faulty compute node did not degrade median performance or quality of service. Despite extreme quality of service degradation measured among that node and its clique, collective performance and quality of service remained steady. In effect, the best-effort approach successfully decoupled global performance from the worst performer. Such so-called "straggler effects" plague traditional approaches to large-scale high-performance computing (Aktaş and Soljanin, 2019), so avoiding them is a major boon.

Development of the Conduit library stemmed from a practical need for an abstract, prepackaged best-effort communication interface to support our digital evolution research. Because real-time effects are fundamentally application-dependent and arise without any explicit in-program specification (and therefore may be unanticipated), it is important to be able to perform quality of service profiling case-by-case in applications of best-effort communication. The instrumentation used in these experiments is written as wrappers around the library's Inlet and Outlet classes that may be enabled via a compile-time configuration switch. This makes data generation for quality of service analysis trivial to perform in any system built with the Conduit library. We hope that making this library and its quality of service metrics available to the community can reduce domain expertise and programmability barriers to taking advantage of the best-effort communication model to efficiently leverage burgeoning parallel and distributed computing power.

In future work, it may be of interest to design systems that monitor and proactively react to real-time quality of service conditions, for example, by imposing a variable cost for cell-cell messaging on agents based on traffic levels or by increasing per-update resource generation for agents on slow-running nodes.
We are eager to investigate how Conduit's best-effort communication model scales to much larger process counts, on the order of thousands of cores.

Chapter 3
Methods to Enable Decentralized Phylogenetic Tracking in a Distributed Digital Evolution System

Authors: Matthew Andres Moreno, Emily Dolson, and Charles Ofria

This chapter is adapted from (Moreno et al., 2022b), which underwent peer review and appeared in the proceedings of the 2022 Conference on Artificial Life (ALIFE 2022). This chapter presents a novel algorithm ("hereditary stratigraphy") to facilitate reconstruction-based phylogenetic studies in digital evolution systems. This approach enables efficient, accurate phylogenetic reconstruction with tunable, explicit trade-offs between annotation memory footprint and reconstruction accuracy. We can estimate, for example, the MRCA generation of two genomes within 10% relative error with 95% confidence up to a depth of a trillion generations with genome annotations smaller than a kilobyte. Simulated inference over known lineages recovers up to 85.70% of the information contained in the original tree using 64-bit annotations.

3.1 Introduction

In traditional serially-processed digital evolution experiments, phylogenetic trees can be tracked perfectly as they progress (Bohm et al., 2017; Lalejini et al., 2019; Wang et al., 2018) rather than reconstructed afterward, as must be done in most biological studies of evolution. Such direct phylogenetic tracking enables experimental possibilities unique to digital evolution, such as perfect reconstruction of the sequence of phylogenetic states that led to a particular evolutionary outcome (Dolson et al., 2020; Lenski et al., 2003).

In a shared-memory context, it is not difficult to maintain a complete phylogeny by ensuring that offspring retain a permanent reference to their parent (or vice versa). As simulations progress, however, memory usage would balloon if all simulated organisms were stored permanently. Garbage collecting extinct lineages and saving older history to disk greatly ameliorates this issue (Bohm et al., 2017; Dolson et al., 2019). If sufficient memory or disk space can be afforded to log all reproduction events, recording a perfect phylogeny in a distributed context is also not especially difficult. Processes could maintain records of each reproduction event, storing the parent organism (and its associated process) with all generated offspring (and their destination processes). As long as organisms are uniquely identified globally, these "dangling ends" could be joined in postprocessing to weave a continuous global phylogeny. Of course, for the huge population sizes made possible by distributed systems, such stitching may become a demanding task in and of itself. Additionally, even small amounts of lost or corrupted data could fundamentally degrade tracking by disjoining large tree subsections.

However, if memory and disk space are limited, distributed phylogeny tracking becomes a more burdensome challenge. A naive approach might employ a server model to maintain a central store of phylogenetic data. Processes would dispatch notifications of birth and death events to the server, which would curate (and garbage collect) phylogenetic history much the same as current serial phylogenetic tracking implementations. Unfortunately, this server model approach would present scalability challenges: burden on the server process would worsen in direct proportion to processor count.
This approach would also be similarly brittle to any lost or corrupted data. A more scalable approach might record birth and death events only on the process(es) where they unfold. However, lineages that went extinct locally could not be safely garbage collected until the extinction of their offspring's lineages on other processes could be confirmed. Garbage collection would thus require extinction notifications to wind back across the processes each lineage had traversed. Again, this approach would be brittle to loss or corruption of data.

In a distributed context, and especially a distributed, best-effort context, phylogenetic reconstruction (as opposed to tracking) could prove simpler to implement, more efficient at runtime, and more robust to data loss, while providing sufficient information to address experimental questions of interest. However, phylogenetic reconstruction from genomes under a traditional model of divergence through gradual accumulation of random mutations poses its own difficulties, including
• accounting for heterogeneity in evolutionary rates (i.e., the rate at which mutations accumulate due to divergent mutation rates or selection pressures) between lineages (Lack and Van Den Bussche, 2010),
• performing sequence alignment (Casci, 2008),
• mutational saturation (Hagstrom et al., 2004),
• appropriately selecting and applying complex reconstruction algorithms (Kapli et al., 2020), and
• computational intensity (Sarkar et al., 2010).
The computational flexibility of digital artificial life experiments provides a unique opportunity to overcome these challenges: designing heritable genome annotations specifically to ensure simple, efficient, and effective phylogenetic reconstruction. For maximum applicability of such a solution, these annotations should be phenotypically neutral heritable instrumentation (Stanley and Miikkulainen, 2002) that can be applied to any digital genome.

In this paper, we present "hereditary stratigraphy," a novel heritable genome annotation system to facilitate post-hoc phylogenetic inference on asexual populations. This system allows explicit control over trade-offs between space complexity and accuracy of phylogenetic inference. Instead of modeling genome components diverging through a neutral mutational process, we keep a record of historical checkpoints that allow comparison of two lineages to identify the range of time in which they diverged. Careful management of these checkpoints allows for a variety of trade-off options, including:
• linear space complexity and fixed-magnitude inference error,
• constant space complexity and inference error linearly proportional to phylogenetic depth, and
• logarithmic space complexity and inference error linearly proportional to time elapsed since the MRCA (which we suspect will be the most broadly useful trade-off).
In Methods, we motivate and explain the hereditary stratigraphy approach. Then, in Results and Discussion, we simulate post-hoc inference on known phylogenies to assess the quality of phylogenetic reconstruction enabled by the hereditary stratigraphy method.

3.2 Methods

This section will introduce intuition for the strategy of our hereditary stratigraph approach, define the vocabulary we developed to describe aspects of this approach, overview configurable aspects of the approach, present mathematical exposition of the properties of space complexity and inference quality under particular configurations, and then recap digital experiments that demonstrate this approach in an applied setting.
3.2.1 Hereditary Strata and the Hereditary Stratigraphic Column

Our algorithm, particularly the vocabulary we developed to describe it, draws loose inspiration from the concept of geological stratigraphy: inference of natural history through analysis of successive layers of geological material (Steno, 1916). As an introductory intuition, suppose a body of rock built up through regular, discrete events depositing geological material. In such a scenario, we could easily infer the age of the body of rock by counting up the number of layers present. Next, imagine making a copy of the rock body in its partially-formed state and then moving it far away. As time runs forward on these two rock bodies, independent layering processes will cause consistent disparity in the layers forming on each, forward from their point of separation. To deduce the historical relationship of these rock bodies, we could simply align and compare their layers. Layers from their base up through the first disparity would correspond to shared ancestry; further disparate layers would correspond to diverged ancestry. Figure 3.1 depicts the process of comparing columns for phylogenetic inference.

[Figure 3.1 (diagram omitted): Inferring the generation of the most-recent common ancestor (MRCA) of two hereditary stratigraphic columns "A" and "B". Columns are aligned at corresponding generations. Then, the first generation with disparate "fingerprints" is determined. This provides a hard upper bound on the generation of the MRCA: these strata must have been deposited along separate lines of descent. Searching backward for the first commonality preceding that disparity provides a soft lower bound on the generation of the MRCA: these strata evidence common ancestry but might collide by chance. Some strata may have been eliminated from the columns, as shown, in order to save space at the cost of increasing uncertainty of MRCA generation estimates.]

Shifting now from intuition to implementation, a fixed-length, randomly-generated binary tag provides a suitable "fingerprint" mechanism mirroring our metaphorical "rock layers." We call this "fingerprint" tag a differentia. The width of this tag controls the probability of spurious collisions between independently generated instances. At 64 bits wide, the tag effectively functions as a UID: collisions between randomly generated tags are so unlikely (p < 5.42 × 10^-20) that they can essentially be ignored. At the other end of the spectrum, collision probability would be 1/256 for a single byte and 1/2 for a single bit. In the case of narrow differentia, to set a lower bound on the MRCA generation, one must backtrack through common strata from the last commonality until the probability of that many successive spurious collisions satisfies the desired confidence level (e.g., 95% confidence). Even then, there remains a possibility of the true MRCA falling before the estimated lower bound.
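The arithmetic behind these collision probabilities and confidence thresholds can be illustrated directly. The following is a sketch of the reasoning above, not the hstrat library API.

    import math

    def collision_probability(width_bits: int) -> float:
        # chance two independently generated differentia match spuriously
        return 2.0 ** -width_bits

    def strata_needed_for_confidence(width_bits: int,
                                     confidence: float = 0.95) -> int:
        # smallest run of consecutive matching strata for which the chance
        # that all matches are spurious collisions drops below 1 - confidence
        p = collision_probability(width_bits)
        return math.ceil(math.log(1.0 - confidence) / math.log(p))

    print(collision_probability(64))         # ~5.42e-20; effectively a UID
    print(collision_probability(8))          # 1/256 for single-byte differentia
    print(strata_needed_for_confidence(1))   # 5 consecutive 1-bit matches
    print(strata_needed_for_confidence(64))  # a single 64-bit match suffices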
Note, however, that no matter the width of the differentia, the generation of the first discrepancy provides a hard upper bound on the generation of the MRCA.

In accordance with our geological analogy, we refer to the packet of data accumulated each generation as a stratum. This packet contains the differentia and, although not employed in this work, could hold other arbitrary user-defined data (e.g., simulation timestamp, phenotype characteristics, etc.). Again in accordance with the geological analogy, we refer to the chronological stack of strata that accumulate over successive generations as a hereditary stratigraphic column.

3.2.2 Stratum Retention Policy

As currently stated, strata in each column will accumulate proportionally to the length of evolutionary history simulated. In an evolutionary run with thousands or millions of generations, this approach would soon become intractable, particularly when columns are serialized and transmitted between distributed computing elements. To solve this problem, we can trade off precision for compactness by strategically deleting strata from columns as time progresses. Figure 3.2 overviews how stratum deposit and stratum elimination might progress over two generations under the hereditary stratigraphic column scheme. Different patterns of deletion will lead to different trade-offs, both in terms of the scaling relationship of column size to generations elapsed and in terms of the arrangement of inference precision over evolutionary history (i.e., focusing precision on more recent evolutionary history versus spreading it evenly over the entire history).

[Figure 3.2 (diagram omitted): Cartoon illustration of the stratum deposit process. This process marks the elapse of a generation when a hereditary stratigraphic column is inherited by an offspring. First, a new stratum is appended to the end of the column with a randomly-generated "fingerprint." This "fingerprint" distinguishes strata that were generated along disparate lines of descent (e.g., 0xd01a for 3rd Generation A and 0xe74a for 3rd Generation B). Then, the column's configured stratum retention policy is applied to "prune" the column by eliminating strata from specific generations. Although this cartoon depicts an empty space for eliminated strata, the underlying data structure behind a column can condense to reduce space complexity.]

We refer to the rule set used to selectively eliminate strata over time as the "stratum retention policy." We explore several different retention policy designs here, and implement our software to allow for free, modular interchange of retention policies. Our software allows specification of a policy as either a "predicate" or a "generator." The predicate method requires a function that takes the generation of a stratum and the current number of strata deposited and returns whether that stratum should be retained at that point in time. The generator method requires a function that takes the current number of strata deposited and yields the set of generations that should be deleted at that point in time. Although the predicate form of a policy is useful for analyzing and proving properties of policies, the generator form is generally more efficient in practice. We provide equivalent predicate and generator implementations for each stratum retention policy discussed here.
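To make these two specification forms concrete, here is a minimal sketch for the fixed resolution policy introduced in Section 3.2.3 below. The function names are illustrative stand-ins, not the hstrat library's actual API.

    r = 10  # retain a stratum every rth generation

    def fixed_resolution_predicate(stratum_generation: int,
                                   num_strata_deposited: int) -> bool:
        # predicate form: should the stratum deposited at `stratum_generation`
        # still be retained once `num_strata_deposited` strata exist?
        return (stratum_generation % r == 0
                or stratum_generation == num_strata_deposited - 1)

    def fixed_resolution_generator(num_strata_deposited: int):
        # generator form: which generations should be deleted upon the
        # `num_strata_deposited`th deposit? more efficient in practice,
        # since it only visits the stratum that newly becomes prunable
        prior = num_strata_deposited - 2  # previously the most recent deposit
        if prior >= 0 and prior % r != 0:
            yield prior

    # the two forms agree: pruning with the generator leaves exactly the
    # strata the predicate approves
    retained = set()
    for n in range(1, 101):
        retained.add(n - 1)  # deposit the stratum for generation n-1
        retained -= set(fixed_resolution_generator(n))
    assert retained == {
        g for g in range(100) if fixed_resolution_predicate(g, 100)
    }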
Strata elimination causes a stratum's position within the column data structure to no longer correspond to the generation in which it was deposited. Therefore, it may seem necessary to store the generation of deposit within the stratum data structure. However, for all deterministic retention policies, a perfect mapping exists backward from column index to recover the generation deposited, without storing it. We provide this formula for each stratum retention policy surveyed here. Finally, for each policy we provide a formula to calculate the exact number of strata retained under any parameterization after n generations.

The next subsections introduce several stratum retention policies, explain the intuition behind their implementation, and elaborate their space complexity and resolution properties. For each policy, patterns of stratum retention are illustrated in Figure 3.3. The formulas for the number of strata retained after n generations, the formulas to calculate stratum deposit generation from column index, and the retention predicate specifications of each policy are available in Supplementary Section B.4. The generator specification of each policy is available in Supplementary Section B.5. For tapered depth-proportional resolution and recency-proportional resolution, the accuracy of MRCA estimation can also be explored via an interactive in-browser web applet at https://hopth.ru/bi.

3.2.3 Fixed Resolution Stratum Retention Policy

The fixed resolution retention policy imposes a fixed absolute upper bound r on the spacing between retained strata. The strategy is simple: permanently retain a stratum every rth generation. (For arbitrary reasons of implementation convenience, we also require each stratum to be retained during at least the generation it is deposited.) See the top panel of Figure 3.3.

This retention policy suffers from linear growth in a column's memory footprint with respect to the number of generations elapsed: every rth generation, a new stratum is permanently retained. For this reason, it is likely not useful in practice except potentially in scenarios where the number of generations is small and fixed in advance. We include it here largely for illustrative purposes, as a gentle introduction to retention policies.

3.2.4 Depth-Proportional Resolution Stratum Retention Policy

The depth-proportional resolution policy ensures that spacing between retained strata will be less than or equal to a proportion 1/r of the total number of strata deposited, n. Achieving this limit on uncertainty requires retaining sufficient strata so that no more than n/r generations elapse between any two adjacent retained strata. This policy accumulates retained strata at a fixed interval until twice as many as r are at hand. Then, every other retained stratum is purged and the cycle repeats with a new, twice-as-wide interval between retained strata; a sketch of this cycle follows.
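The following is a minimal sketch of the accumulate-then-purge cycle just described. It is an illustration of the idea, not the hstrat implementation, and it simplifies by always retaining the most recent deposit.

    def depth_proportional_retained(n: int, r: int) -> list[int]:
        # generations retained after n strata have been deposited (n >= 1):
        # keep strata at a fixed interval, doubling the interval (purging
        # every other retained stratum) whenever more than 2r would be held
        interval = 1
        while (n - 1) // interval + 1 > 2 * r:
            interval *= 2
        retained = [g for g in range(n) if g % interval == 0]
        if retained[-1] != n - 1:
            retained.append(n - 1)  # always keep the most recent deposit
        return retained

    # spot check: spacing between retained strata stays within the n/r bound
    for n in range(1, 2049):
        kept = depth_proportional_retained(n, r=8)
        gaps = [b - a for a, b in zip(kept, kept[1:])]
        assert all(gap <= max(n // 8, 1) for gap in gaps)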
[Figure 3.3 (visualizations omitted): Comparison of stratum retention policies. Policy visualizations show retained strata in black, with time progressing along the y-axis from top to bottom. New strata are introduced along the diagonal and then "drip" downward as a vertical line until eliminated. The set of retained strata present within a column at a particular generation g can be read as the intersections of retained vertical lines with a horizontal line with intercept g. Visualizations are provided for two parameterizations of each policy: one where the maximum uncertainty of MRCA generation estimates would be 512 generations and one where it would be 128 generations. Summarized properties, where n is generations elapsed and m is generations since the MRCA: fixed resolution, O(n) space complexity with O(1) MRCA uncertainty; depth-proportional and tapered depth-proportional resolution, O(1) space complexity with O(n) MRCA uncertainty; recency-proportional resolution, O(log(n)) space complexity with O(m) MRCA uncertainty.]

See the second from top panel of Figure 3.3. When comparing stratigraphic columns from different generations, the resolution guarantee holds in terms of the number of generations experienced by the older of the two columns. Because this retention policy is deterministic, for two columns with the same policy, every stratum held by the older column is also guaranteed to be present in the younger column (unless it has not yet been deposited on the younger column). Therefore, the strata that would enable the desired resolution when comparing two columns of the same age are guaranteed to be available, even when one column has elapsed more generations. Because the number of strata retained under this policy is bounded as 2r + 1, space complexity scales as O(1) with respect to the number of strata deposited. It follows that MRCA generation estimate uncertainty scales as O(n) with respect to the number of strata deposited.

3.2.5 Tapered Depth-Proportional Resolution Stratum Retention Policy

This policy refines the depth-proportional resolution policy to provide a more stable column memory footprint over time. The naive depth-proportional resolution policy builds up strata until twice as many are present as needed, then purges half of them all at once. The tapered depth-proportional resolution policy functions identically, except that it removes unnecessary strata gradually, from back to front, as new strata are deposited, instead of eliminating them simultaneously. See the third from top panel of Figure 3.3.
The column footprint stability of this variation makes it easier to parameterize our experiments to ensure comparable end-state column footprints for fair comparison between retention policies, in addition to making this policy likely better suited to most use cases. By design, this policy has the same space complexity and MRCA estimation uncertainty scaling relationships with number of generations elapsed as the naive depth-proportional resolution policy.

3.2.6 MRCA-Recency-Proportional Resolution Stratum Retention Policy

The MRCA-recency-proportional resolution policy ensures that the distance between the retained strata surrounding any generation point will be less than or equal to a user-specified proportion 1/r of the number of generations elapsed since that generation. This policy can be constructed recursively. So, to begin, consider choosing the generation g of the first stratum after the root ancestor that we will retain when n generations have elapsed. A simple geometric analysis reveals that providing the guaranteed resolution for the worst-case generation within the window between generation 0 and generation g (i.e., generation g − 1) requires g ≤ ⌊n/(r + 1)⌋.

We now have an upper bound on the generation of the first stratum we must retain. However, we must also guarantee that a stratum at this generation is actually available for us to retain (i.e., that it has not been purged out of the column at a previous time point). We do this by picking the generation that is the highest power of 2 less than or equal to our bound. If we repeat this procedure as we recurse, we are guaranteed that this generation's stratum will have been preserved across all previous time points. Why does this work? Consider a sequence where all elements are spaced out by strictly nonincreasing powers of 2, and consider the first element of the list: all multiples of this first element will be included in the list. So, when we ratchet up g to 2g as n increases, we are guaranteed that 2g has been retained. This principle generalizes recursively down the list. It is similar in spirit to the strictly-doubling interval sizes used in the depth-proportional resolution stratum retention policies described above. Truncating to the nearest power of 2 less than or equal to our bound means that our recursive step size is, at worst, halved. So, because step size remains a constant fraction of the remaining generations n (at worst n/(2(r + 1))), the number of steps made (and the number of strata retained) scales as O(log(n)) with respect to the number of strata deposited.

Num Gens Elapsed    r = 1    r = 4    r = 10    r = 100
1.0 × 10^3             18       26       41        80
1.0 × 10^6             32       50       85       184
1.0 × 10^9             51       79      134       293
1.0 × 10^12            64      102      177       396

Table 3.1: Number of strata retained after one thousand, one million, one billion, and one trillion generations under the recency-proportional resolution stratum retention policy. Four policy parameterizations are shown: the first where MRCA generation can be determined between two extant columns with a guaranteed relative error of 100%, the second 25%, the third 10%, and the fourth 1%. A column's memory footprint will be a constant factor of these retained counts based on the fingerprint differentia width chosen. For example, if single-byte differentia were used, the column's memory footprint in bits would be 8× the number of strata retained.
Table 3.1 provides exact figures for the number of strata retained under different parameterizations of the recency-proportional retention policy between one thousand and one trillion generations. As for MRCA generation estimate uncertainty, in the worst case it scales as O(n) with respect to the greater number of strata deposited. However, with respect to estimating the generation of the MRCA for lineages diverged any fixed number of generations ago, uncertainty scales as O(1).

How does space complexity scale with respect to the policy's specified resolution r? Through extrapolation from OEIS sequences A063787 and A056791 via guess and check (Oeis, 2021a,b), we posited the exact number of strata retained after n generations as

    \mathrm{HammingWeight}(n) + \sum_{1}^{r} \left( \left\lfloor \log_2 \left( \lfloor n/r \rfloor \right) \right\rfloor + 1 \right).

This expression has been unit tested extensively to ensure perfect reliability. Approximating and applying logarithmic properties, this policy's space complexity can be calculated within a constant factor as

    \log(n) + \log\left( \frac{n^r}{r!} \right).

To analyze the relationship between space complexity and resolution r, we examine the ratio of space complexities induced when scaling resolution r up by a constant factor f > 1. Evaluating this ratio as r → ∞, we find that space complexity scales directly proportional to f,

    \lim_{r \to \infty} \frac{\log(n) + \log\left( \frac{n^{fr}}{(fr)!} \right)}{\log(n) + \log\left( \frac{n^{r}}{r!} \right)} = f.

Evaluating this ratio as n → ∞, we find that this scaling relationship is never worse than directly proportional for any r,

    \lim_{n \to \infty} \frac{\log(n) + \log\left( \frac{n^{fr}}{(fr)!} \right)}{\log(n) + \log\left( \frac{n^{r}}{r!} \right)} = \frac{fr + 1}{r + 1} = f \cdot \frac{r + 1/f}{r + 1} \le f.

3.2.7 Computational Experiments

In order to assess the practical performance of the hereditary stratigraph approach in an applied setting, we simulated the process of stratigraph propagation over known "ground truth" phylogenies extracted from pre-existing digital evolution simulations (Hernandez et al., 2022). These simulations propagated populations of between 100 and 165 bitstrings for between 500 and 5,000 synchronous generations under the NK fitness landscape model (Kauffman and Weinberger, 1989). In order to ensure coverage of a variety of phylogenetic conditions, we sampled a variety of selection schemes that impose profoundly different ecological regimens (Dolson and Ofria, 2018):
• EcoEA Selection (Goings et al., 2012),
• Lexicase Selection (Helmuth et al., 2014),
• Random Selection, and
• Sharing Selection (Goldberg et al., 1987).
Supplementary Table B.8 provides full details on the conditions each ground truth phylogeny was drawn from. The phylogenies themselves are available with our supplementary material.

For each ground truth phylogeny, we tested combinations of three configuration parameters:
• target end-state memory footprints for extant columns (64, 512, and 4096 bits),
• differentia width (1, 8, and 64 bits), and
• stratum retention policy (tapered depth-proportional resolution and recency-proportional resolution).
Stratum retention policies were parameterized so that the maximum number of strata possible were present at the end of the experiment without exceeding the target memory footprint. If the target memory footprint is exceeded by the sparsest possible parameterization of a retention policy, then that sparsest possible parameterization was used. Supplementary Tables B.1 to B.5 provide the calculated parameterizations and memory footprints of extant columns.
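This parameterization rule can be sketched as follows, assuming a hypothetical helper num_strata_retained(resolution, n_generations) and assuming retained count grows monotonically with the resolution parameter; neither assumption is part of the published interface.

    def densest_fitting_resolution(
        num_strata_retained,       # callable: (resolution, n_generations) -> int
        n_generations: int,
        differentia_bits: int,
        target_footprint_bits: int,
        max_resolution: int = 10_000,
    ) -> int:
        # choose the densest resolution whose end-state column fits the
        # target footprint; fall back to the sparsest if even it overflows
        budget = target_footprint_bits // differentia_bits  # strata that fit
        best = 1  # sparsest possible parameterization
        for resolution in range(1, max_resolution + 1):
            if num_strata_retained(resolution, n_generations) <= budget:
                best = resolution
            else:
                break  # by assumed monotonicity, denser only grows
        return best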
In order to assess the viability of phylogenetic inference using hereditary stratigraphic columns from extant organisms, we used the end-state stratigraphs to reconstruct an estimate of the actual ground truth phylogenetic histories. The first step in reconstructing a phylogenetic tree for the history of an extant population at the end of an experiment is to construct a distance matrix by calculating all pairwise phylogenetic distances between extant columns. We defined the phylogenetic distance between two extant columns as the sum of each extant organism's generational distance back to the generation of their MRCA, estimated as the mean of the upper and lower 95% confidence bounds. Figure 3.4 provides a cartoon summary of the process of calculating phylogenetic distance between two extant columns. We then used the unweighted pair group method with arithmetic mean (UPGMA) reconstruction tool provided by the BioPython package to generate estimated phylogenetic trees (Cock et al., 2009; Sokal, 1958). After generating the reconstructed tree topology, we performed a second pass to adjust branch lengths so that each internal tree node sat at the mean of its estimated 95% confidence generation bounds.
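The distance-matrix-plus-UPGMA step just described can be sketched with BioPython's distance-based tree constructors. The extant population, generations, and MRCA confidence bounds below are hypothetical placeholders; only the BioPython calls reflect a real API.

    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    def phylo_distance(gen_a, gen_b, mrca_lo, mrca_hi):
        # sum of each organism's generational distance back to the MRCA,
        # with MRCA generation estimated as the mean of its confidence bounds
        mrca_est = (mrca_lo + mrca_hi) / 2
        return (gen_a - mrca_est) + (gen_b - mrca_est)

    names = ["x", "y", "z"]               # extant organisms
    gen = {"x": 500, "y": 500, "z": 500}  # generations elapsed per organism
    mrca_bounds = {  # mock 95% confidence bounds on MRCA generation
        ("x", "y"): (400, 450),
        ("x", "z"): (100, 150),
        ("y", "z"): (100, 150),
    }

    # build the lower-triangular distance matrix BioPython expects
    matrix = []
    for i, a in enumerate(names):
        row = [
            phylo_distance(gen[a], gen[b], *mrca_bounds[tuple(sorted((a, b)))])
            for b in names[:i]
        ]
        matrix.append(row + [0])  # zero on the diagonal

    tree = DistanceTreeConstructor().upgma(DistanceMatrix(names, matrix))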
[Figure 3.4 (diagram omitted): Cartoon illustration of column inheritance along a phylogenetic tree and the process used to infer phylogenetic history from extant columns. This scenario supposes a stratum retention policy where only strata from even generations are retained. The common ancestor of the focal clade, at generation 2, is shown at top. Generation 3 columns inherit that ancestor's strata and each append a new stratum. Generation 4 columns append another new stratum, then eliminate their generation 3 strata. Finally, another generation elapses to yield generation 5 strata. Suppose that only generation 5 strata are extant; greyed-out columns are not directly observable. Phylogenetic history is deduced by pairwise comparison between extant columns. For each pair, a phylogenetic distance can be computed as the sum of generations elapsed for each extant column after the estimated generation of their MRCA. These pairwise distances can then be fed into a phylogeny reconstruction algorithm.]

3.2.8 Software and Data

As part of this work, we published the hstrat Python library, with a stable public-facing API intended to enable incorporation in other projects and with extensive documentation and unit testing, on GitHub at https://github.com/mmore500/hstrat and on PyPI. In the near future, we intend to complete and publish a corresponding C++ library. Supporting software materials can be found on GitHub at https://github.com/mmore500/hereditary-stratigraph-concept. Supporting computational notebooks are available for in-browser use via BinderHub at https://hopth.ru/bk (Ragan-Kelley and Willing, 2018). Our work benefited from many pieces of open source scientific software (Bostock et al., 2011; Hunter, 2007; Meurer et al., 2017; Paradis et al., 2004; Smith, 2020b,c; Sukumaran and Holder, 2010; Ushey et al., 2022; Virtanen et al., 2020; Waskom, 2021; Wickham et al., 2022). The ground truth phylogenies used in this work, as well as supplementary figures, tables, and text, are available via the Open Science Framework at https://osf.io/4sm72/ (Foster and Deardorff, 2017; Moreno et al., 2022a). Phylogenetic data associated with this project is stored in the Alife Community Data Standards format (Lalejini et al., 2019).

3.3 Results and Discussion

In this section, we analyze the quality of reconstructions of known phylogenetic trees using hereditary stratigraphy. Figure 3.5 compares an example reconstruction from columns using tapered depth-proportional stratum retention, an example reconstruction using recency-proportional stratum retention, and the underlying ground truth phylogeny. Interactive in-browser visualizations comparing all reconstructed phylogenies to their corresponding ground truth are available at https://hopth.ru/bi.

[Figure 3.5 (trees omitted): Example phylogeny reconstructions of a ground-truth lexicase selection phylogeny from inference on extant hereditary stratigraphic columns. Panels: (a) ground truth phylogeny; (b) 1-bit fingerprint differentia, tapered depth-proportional resolution stratum retention predicate, 64-bit target column footprint; (c) 1-bit fingerprint differentia, MRCA-recency-proportional resolution stratum retention predicate, 64-bit target column footprint. Shaded error bars on reconstructions indicate 95% confidence intervals for the true generation of tree nodes. Arbitrary color is added to enhance distinguishability.]

3.3.1 Reconstruction Accuracy

Measuring tree similarity is a challenging problem, with many conflicting approaches that each provide different information (Smith, 2020a). Ideally, we would use a metric of reconstruction accuracy that 1) is commonly used, so that sufficient context exists to understand what constitutes a good value, 2) behaves consistently across different types of trees, and 3) behaves reasonably for the types of trees common in artificial life data. Unfortunately, these objectives are somewhat in conflict. The primary source of this problem is multifurcations: nodes from which more than two lineages branch at once. In reconstructed phylogenies in biology, multifurcations are generally assumed to be the result of insufficient information; it is thought that the real phylogeny had multiple bifurcations that occurred so close together that the reconstruction algorithm is unable to separate them. In artificial life phylogenies, however, we have the opposite problem. When we perfectly track a phylogeny, it is common for us to know that a multifurcation did in fact occur. However, it is challenging for our reconstructions to properly identify multifurcations, because doing so requires perfectly lining up multiple divergence times. Many of the most popular tree distance metrics interpret the difference between a multifurcation and a set of bifurcations as a dramatic change in topology. For some use cases, this change in topology may indeed be meaningful, although research on the extent of this problem is limited.
Nevertheless, we suspect that for the majority of use cases, the tiny branch lengths between the internal nodes will make this source of error relatively minor.

To overcome this obstacle, we measured our reconstruction accuracy using multiple metrics. We primarily focus on Mutual Clustering Information (as implemented in the R TreeDist package) (Smith, 2020c), which is a direct measure of the quantity of information in the ground truth phylogeny that was successfully captured in the reconstruction. It is relatively unaffected by failure to perfectly reproduce multifurcations. For the purposes of easy comparison to the literature, we also measured the Clustering Information Distance (Smith, 2020c).

Across ground truth phylogenies, we were able to reconstruct the phylogenetic topology with between 47.75% and 85.70% of the information contained in the original tree using a 64-bit column memory footprint, between 47.75% and 80.36% using a 512-bit column memory footprint, and between 51.13% and 83.53% using a 4096-bit column memory footprint. While the Clustering Information Distance reached its maximum possible score (1.0) for the heavily-multifurcated EcoEA phylogeny, it agreed with the Mutual Clustering Information score for less multifurcated phylogenies, such as fitness sharing. Using the recency-proportional resolution retention policy and a 4096-bit column memory footprint, we were able to reconstruct a fitness sharing phylogeny with a Clustering Information Distance of only 0.2923471 from the ground truth. For context, that result is comparable to the distance between phylogenies reconstructed from two closely-related proteins in H3N2 flu (0.25) (Jones et al., 2021). To build further intuition, we strongly encourage readers to refer to our interactive web reconstruction.

Figure 3.6 summarizes error reconstructing the fitness sharing selection phylogeny in terms of the Mutual Clustering Information metric (Smith, 2022).

[Figure 3.6 (plot omitted): Proportion of information present in the ground-truth fitness sharing phylogeny that was captured by our reconstruction, across various retention policies. Higher is better (1 is perfect). RPR is the recency-proportional resolution policy and TDPR is the tapered depth-proportional resolution policy.]

The phylogenies reconstructed from the EcoEA condition performed comparably, with lexicase and random selection faring somewhat worse (Moreno et al., 2022a). In the case of random selection, we suspect that this reduced performance is the result of having many nodes that originated very close together at the end of the experiment. As expected, we did observe overall more accurate reconstructions from columns that were allowed to occupy larger memory footprints.
3.3.2 Differentia Size

Among the surveyed ground truth phylogenies and target column footprints, we consistently found that smaller differentia yielded equally or more accurate phylogenetic reconstructions. The stronger performance of narrow differentia was particularly apparent in low-memory-footprint scenarios, where overall phylogenetic inference power was weaker. Overall, single-bit differentia outperformed 64-bit differentia under 20 conditions, were indistinguishable under 6 conditions, and were worse under 4 conditions. We used Clustering Information Distance to perform these comparisons. Full results are available in Supplementary Section B.2. Although narrower differentia have less distinguishing power on their own, their smaller size allows more to be packed into the memory footprint to cover more generations, which seems to help reconstruction power. We must note that narrower differentia can pack more thoroughly into the footprint caps we imposed on column size, so their extant columns tended to have slightly more overall bits. However, this imbalance was small enough (in most cases < 10%) that we believe it is unlikely to fully account for the stronger performance of narrow-differentia configurations.

3.3.3 Retention Policy

Across the surveyed ground truth phylogenies and target column memory footprints, we found that the recency-proportional resolution stratum retention policy generally yielded better phylogenetic reconstructions: reconstruction quality was better in 28 conditions, equivalent in 13 conditions, and worse in 4 conditions. Again, this effect was most apparent in the small-stratum-count scenarios where overall inference power was weaker. We used Clustering Information Distance to perform these comparisons. Full results are available in Supplementary Section B.3. The stronger performance of recency-proportional resolution is likely due to its denser retention of recent strata, which helps to resolve the more numerous (and therefore typically more tightly spaced) phylogenetic events in the near past (Zhaxybayeva and Gogarten, 2004). Recency-proportional resolution tended to fit fewer strata within the prescribed memory footprints (except in cases where it could not fit within the footprint at all), so its stronger performance cannot be attributed to more retained bits in the end-state extant columns.

3.4 Conclusion

To our knowledge, this work provides a novel design for digital genome components that enables phylogenetic inference on asexual populations. This provides a viable alternative to perfect phylogenetic tracking, which is complex and possibly cumbersome in distributed computing scenarios, especially with fallible nodes. Our approach enables flexible, explicit trade-offs between space complexity and inference accuracy. Hereditary stratigraphic columns are efficient: our approach can estimate, for example, the MRCA generation of two genomes within 10% error with 95% confidence up to a depth of a trillion generations with genome annotations smaller than a kilobyte. They are also powerful: we were able to achieve tree reconstructions recovering up to 85.70% of the information contained in the original tree with only a 64-bit memory footprint. This and other methodology to enable decentralized observation and analysis of evolving systems will be essential for artificial life experiments that use distributed and best-effort computing approaches.
Such systems will be crucial to enabling advances in the field of artificial life, particularly with respect to the question of open-ended evolution (Ackley and Cannon, 2011; Moreno et al., 2021a,b). More work is called for to further enable experimental analyses in distributed, best-effort systems while preserving those systems' efficiency and scalability. As parallel and distributed computing becomes increasingly ubiquitous and begins to more widely pervade artificial life systems, hereditary stratigraphy should serve as a useful technique in this toolbox.

Important work extending and analyzing hereditary stratigraphy remains to be done. Analyses should be performed to expound the MRCA resolution guarantees of stratum retention policies when using narrow (i.e., single-bit) differentia. Constant-space-complexity stratum retention policies that preferentially retain a denser sampling of more-recent strata should be developed and analyzed. Extensions to sexual populations should be explored, including the possibility of annotating and tracking individual genome components instead of whole-genome individuals. An alternate approach might be to define a preferential inheritance rule so that, at each generation slot within a column, a single differentia sweeps over an entire interbreeding population. Optimization of tree reconstruction from extant hereditary stratigraphs remains an open question, too, particularly with regard to properly handling multifurcations. It would be particularly valuable to develop methodology to annotate inner nodes of trees reconstructed from hereditary stratigraphs with confidence levels.

The problem of designing genomes to maximize phylogenetic reconstructability raises unique questions about phylogenetic estimation. Such a backward problem, optimizing genomes to make analyses trivial as opposed to the usual process of optimizing analyses to genomes, puts questions about the genetic information analyses operate on in a new light. In particular, it would be interesting to derive upper bounds on phylogenetic inference accuracy given genome size and generations elapsed.

Part II
Evolving Complexity, Novelty, and Adaptation in Digital Multicells

Chapter 4
Exploring Evolved Multicellular Life Histories in an Open-Ended Digital Evolution System

Authors: Matthew Andres Moreno and Charles Ofria

This chapter is adapted from (Moreno and Ofria, 2022), which appeared in the Frontiers in Ecology and Evolution Models in Ecology and Evolution special issue, Digital Evolution: Insights for Biologists. This chapter introduces the DISHTINY framework, which enables evolution experiments with digital multicells. Indeed, in evolutionary experiments, we repeatedly observed group-level traits that are characteristic of a fraternal transition. These included reproductive division of labor, resource sharing within kin groups, resource investment in offspring groups, asymmetrical behaviors mediated by messaging, morphological patterning, and adaptive apoptosis. We report eight case studies from replicates where transitions occurred and explore the diverse range of adaptive evolved multicellular strategies.

4.1 Introduction

An evolutionary transition in individuality is an event where independently replicating entities unite to replicate as a single, higher-level individual (Smith and Szathmary, 1997). These transitions are understood as essential to natural history's remarkable record of complexification and diversification (Smith and Szathmary, 1997).
Likewise, artificial life researchers have highlighted transitions in individuality as a mechanism that is missing in digital systems, but necessary for achieving the evolution of complexity and diversity that we witness in nature (Banzhaf et al., 2016; Taylor et al., 2016).

Fraternal evolutionary transitions in individuality are transitions in which the higher-level replicating entity is derived from the combination of cooperating kin that have entwined their long-term fates (West et al., 2015). Multicellular organisms and eusocial insect colonies exemplify this phenomenon (Smith and Szathmary, 1997), given that both are sustained and propagated through the cooperation of lower-level kin. This work focuses on fraternal transitions. Although not our focus here, egalitarian transitions — events in which non-kin unite, such as the genesis of mitochondria by symbiosis of free-living prokaryotes and eukaryotes (Smith and Szathmary, 1997) — also constitute essential episodes in natural history.

In nature, major fraternal transitions occur sporadically, with few extant transitional forms, making them challenging to study. For instance, on the order of 25 independent origins of eukaryotic multicellularity are known (Grosberg and Strathmann, 2007), with most transitions having occurred hundreds of millions of years ago (Libby and Ratcliff, 2014). Recent work in experimental evolution (Gulli et al., 2019; Koschwanez et al., 2013; Ratcliff et al., 2015; Ratcliff and Travisano, 2014), mechanistic modeling (Hanschen et al., 2015; Staps et al., 2019), and digital evolution (Goldsby et al., 2012, 2014) complements traditional post hoc approaches focused on characterizing the record of natural history. These systems each instantiate the evolutionary transition process, allowing targeted manipulations to test hypotheses about the requisites, mechanisms, and evolutionary consequences of fraternal transitions.

Digital evolution, computational model systems designed to instantiate evolution in abstract algorithmic substrates rather than directly emulating any specific biological system (Dolson and Ofria, 2021; Wilke and Adami, 2002), occupies a sort of middle ground between wet work and mechanistic modeling. This approach offers a unique conjunction of experimental capabilities that complements work in both of those disciplines. Like modeling, digital evolution affords rapid generational turnover, complete observability (every event in a digital system can be tracked), and complete manipulability (every event in a digital system can be arbitrarily altered). However, as with in vivo experimental evolution, digital evolution systems can exhibit rich evolutionary dynamics stemming from complex, rugged fitness landscapes (LaBar and Adami, 2017) and sophisticated agent behaviors (Grabowski et al., 2013).

Our work here follows closely in the intellectual vein of Goldsby's deme-based digital evolution experiments (Goldsby et al., 2012, 2014). In her studies, high-level organisms exist as a group of cells within a segregated, fixed-size subspace. High-level organisms must compete for a limited number of subspace slots. Individual cells that comprise an organism are controlled by heritable computer programs that allow them to self-replicate, interact with their environment, and communicate with neighboring cells. Goldsby's work defines two modes of cellular reproduction: tissue accretion and offspring generation. In this way, somatic and gametogenic modes of reproduction are explicitly differentiated.
Under tissue accretion, a cell copies itself into a neighboring position within the group's subspace. Under offspring generation, a population slot is cleared to make space for a daughter organism, which is then seeded with a single daughter cell from the parent organism. Goldsby's model abstracts away developmental cost to focus on resource competition between groups. Cells grow freely within an organism, but fecundity depends on the collective profile of computational tasks (usually mathematical functions) performed within the organism. When an organism accumulates sufficient resource, a randomly chosen subspace is cleared and a single cell from the replicating organism is used as a propagule to seed the new organism. This setup mirrors the dynamics of biological multicellularity, in which cell proliferation may either grow an existing multicellular body or found a new multicellular organism.

Here, we take several steps to develop a computational environment that removes the enforcement and rigid regulation of multiple organismal levels. Specifically, we remove the explicitly segregated subspaces and we let multicells interact with each other more freely. We demonstrate the emergence of multicellularity where each organism manages its own spatial distribution and reproductive process. This spatially unified approach enables more nuanced interactions among organisms, albeit at the cost of substantially more complicated analyses. Instead of a single explicit interface to mediate interactions among high-level organisms, such interactions must emerge via many cell-cell interfaces. Novelty can occur in terms of interactions among competitors, among organism-level kin, or even within the building blocks that make up hierarchical individuality. Experimentally studying fraternal transitions in a digital system where key processes (reproductive, developmental, homeostatic, and social) occur implicitly within a unified framework can provide unique insights into nature. For example, pervasive, arbitrary interactions between multicells introduce the possibility of strong influence of biotic selection.

However, in our system, multicells do not emerge from an entirely impartial substrate. We do explicitly provide some framework to facilitate fraternal transitions in individuality by allowing cells to readily designate distinct hereditary groups. Offspring cells may either remain part of their parent's hereditary group or found a new group. Cells can recognize group members, thus allowing targeted communication and resource sharing with kin. We reward cells for performing tasks designed to require passive collaboration among hereditary group members. As such, cells that form hereditary groups to maximize advantage on those tasks stand to increase their inclusive fitness.

In previous work introducing the DISHTINY (DIStributed Hierarchical Transitions in IndividualitY) framework, we evolved parameters for manually designed cell-level strategies to explore fraternal transitions in individuality (Moreno and Ofria, 2019). In this work, we extend DISHTINY to incorporate a more dynamic, event-driven genetic programming representation called SignalGP, which was designed to facilitate dynamic interactions among agents and between agents and their environment (Lalejini and Ofria, 2018). As expected, with the addition of cell controllers capable of nearly arbitrary computation, we see a far more diverse set of behaviors and strategies arise.
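The hereditary group bookkeeping described above reduces to a simple rule at each birth event. The following Python sketch, with hypothetical names and under the simplifying assumption of a single (non-hierarchical) grouping level, illustrates the choice offspring face between growing the parent's group and founding a new one.

```python
import itertools

# Hypothetical bookkeeping for DISHTINY-style hereditary groups.
group_id_counter = itertools.count()

class Cell:
    def __init__(self, group_id=None):
        # A cell with no assigned ID founds a brand-new hereditary group.
        self.group_id = next(group_id_counter) if group_id is None else group_id

def reproduce(parent, found_new_group):
    # Daughter either inherits the parent's hereditary group ID (growing
    # the group) or receives a fresh ID (founding a new group).
    if found_new_group:
        return Cell()                      # expelled propagule
    return Cell(group_id=parent.group_id)  # adsorbed into parent's group

founder = Cell()
daughter = reproduce(founder, found_new_group=False)
propagule = reproduce(founder, found_new_group=True)
assert daughter.group_id == founder.group_id
assert propagule.group_id != founder.group_id
```

In the full system, this decision is made by the cell's evolved genetic program rather than by an external flag.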
Here, we perform case studies to characterize notable multicellular phenotypes that evolved via this more dynamic genetic programming underpinning. Each case study strain was chosen by screening the entire set of replicate evolutionary runs for signs of the trait under investigation and then manually selecting the most promising strain(s) for further investigation. Case studies presented therefore represent an anecdotal sampling, rather than an exhaustive summary, with respect to each trait of interest. Our goal is to explore a breadth of possible evolutionary outcomes under the DISHTINY framework. We see this as a precursory step toward hypothesis-driven work contributing to open questions about fraternal transitions in individuality.

4.2 Materials and Methods

We performed simulations in which cells evolved open-ended behaviors to make decisions about resource sharing, reproductive timing, and apoptosis. We will first describe the environment and hereditary grouping system cells evolved under and then describe the behavior-control system cells used.

(a) Overview of a single SignalGP instance. SignalGP program modules contain ordered sets of instructions that activate and execute independently in response to tagged signals. These modules are depicted as rectangular lists with bitstring tags protruding from the SignalGP instance. Signals can originate from any of three sources: (1) internally, from execution of "Signal" instructions within a program's modules, (2) from the outside environment, or (3) from other agents executing "Message" instructions. Graphic provided courtesy of Alexander Lalejini.

(b) How individual SignalGP instances are organized into DISHTINY cells. DISHTINY cells are depicted as gray squares. Each DISHTINY cell is controlled by independent execution of the cell's genetic program on four distinct SignalGP instances, depicted as colored circles. Each of the four independent instances manages cell behavior with respect to a single cardinal direction: sensing environmental state, receiving intercellular messages, and determining cell actions. The special role of each instance is depicted as a reciprocal arrow to the corresponding instance in the neighboring cell. (All four instances sense non-directional environmental cues, and non-directional actions may be taken by any instance.) These four instances can communicate with one another via intracellular messaging, indicated by smaller reciprocal arrows among instances within a cell.

Figure 4.1: Schematic illustrations of how an individual SignalGP instance functions and how SignalGP instances control DISHTINY cells. Execution of cells' genetic programs on SignalGP instances controls cell behavior in our model.
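As a rough illustration of the tag-based activation that Figure 4.1a describes, the sketch below dispatches a tagged signal to the program module whose bitstring tag matches it most closely. Simple bitwise similarity is used here for concreteness; the actual matching metric, thresholds, and data layout of SignalGP differ, so treat every detail as an assumption.

```python
def bit_similarity(tag_a, tag_b):
    # Fraction of matching bits between two equal-length bitstring tags.
    return sum(a == b for a, b in zip(tag_a, tag_b)) / len(tag_a)

def dispatch(signal_tag, modules):
    # Activate the module whose tag best matches the incoming signal's tag.
    return max(modules, key=lambda m: bit_similarity(signal_tag, m["tag"]))

modules = [
    {"tag": "0111", "body": "...instructions..."},
    {"tag": "1010", "body": "...instructions..."},
]
print(dispatch("0110", modules)["tag"])  # -> "0111"
```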
4.2.1 Cells and Hereditary Groups

Cells occupy individual tiles on a 60-by-60 toroidal grid. Over discrete time steps ("updates"), cells can collect a resource. Collected resource decays at a rate of 0.1% per update. This decay incentivizes quick use of collected resource, but is gradual enough not to prevent even the most naive cells from eventually accumulating enough resource to reproduce. Once sufficient resource accrues, cells may pay one unit of resource to place a daughter cell on an adjoining tile of the toroidal grid (i.e., reproduce), replacing any existing cell already there. Daughter cells inherit their parent's genetic program, except for any novel mutations that arise. Mutations included whole-function duplication and deletion, bit flips on tags for instructions and functions, instruction and argument substitutions, and slip mutation of instruction sequences. We used standard SignalGP mutation parameters from (Lalejini and Ofria, 2018), but only applied mutations to 1% of daughter cells at birth. Daughter cells may also inherit hereditary group ID, introduced and discussed below.

Cells accrue resource via a cooperative resource-collection process. The simulation distributes large amounts of resource within certain spatial bounds in discrete, intermittent events. Working in a group allows cells to more fully collect available resource during these events. Cooperating in medium-sized groups (on the order of 100 cells) accelerates per-cell resource collection rate. Unicellular, too-small, or too-large groups collect resource at a lesser per-cell rate. As an arbitrary side effect of the simulation algorithm employed to instantiate the cooperative resource distribution process, groups with a roughly circular layout collect resource faster than irregularly-shaped groups. Cooperative resource collection unfolds as an entirely passive process on the part of the cells, influenced only by a group's spatial layout. Full details on the simulation algorithm that determines cooperative resource collection rates appear in Supplementary Section C.2.

Cells may grow a cooperative resource-collecting group through cell proliferation. We refer to these cooperative, resource-collecting groups as "hereditary groups." As cells reproduce, they can choose to adsorb daughter cells onto the parent's hereditary group or expel those offspring to found a new hereditary group. These decisions affect the spatial layout of these hereditary groups and, in turn, individual cells' resource-collection rates.

To promote group turnover, we counteract established hereditary groups' advantage with a simple aging scheme. As hereditary groups age over elapsed updates and somatic generations, their constituent cells lose the ability to regenerate somatic tissue and then, soon after, to collect resource. A complete description of the group aging mechanisms used appears in Supplementary Section C.3.

Because new hereditary group IDs arise first in a single cell and disseminate exclusively among direct descendants of that progenitor cell, hereditary groups are reproductively bottlenecked. This clonal (or "staying together") multicellular life history stands in contrast with an aggregative (or "coming together") life cycle, where chimeric groups arise via fusion of potentially loosely-related lineages (Staps et al., 2019). Such clonal development is known to strengthen between-organism selection effects (Grosberg and Strathmann, 2007).
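The per-update resource bookkeeping just described can be summarized in a few lines. The following Python sketch is a simplification under assumed values: only the 0.1% decay rate and the one-unit reproduction cost come from the text above, while the inflow rate is an arbitrary placeholder.

```python
DECAY_RATE = 0.001       # collected resource decays 0.1% per update
REPRODUCTION_COST = 1.0  # one unit of resource buys a reproduction attempt

def update_stockpile(stockpile, inflow):
    # Collect this update's resource, then apply stockpile decay.
    return (stockpile + inflow) * (1.0 - DECAY_RATE)

stockpile, births = 0.0, 0
for update in range(5000):
    stockpile = update_stockpile(stockpile, inflow=0.002)  # arbitrary inflow
    if stockpile >= REPRODUCTION_COST:
        # Pay the cost to place a daughter cell on an adjoining tile.
        stockpile -= REPRODUCTION_COST
        births += 1
print(births)
```

Because decay is proportional, the stockpile saturates near inflow/decay; reproduction therefore remains possible only when inflow is sufficient relative to decay, which is why even naive cells eventually reproduce under the configured rates.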
In this work, we screen for fraternal transitions in individuality with respect to these hereditary groups by evaluating three characteristic traits of higher-level organisms: resource sharing, reproductive division of labor, and apoptosis. We can further screen for the evolution of complex multicellularity by assessing cell-cell messaging, regulatory patterning, and functional differentiation between cells within hereditary groups (Knoll, 2011).

4.2.2 Hierarchical Nesting of Hereditary Groups

The succession of fraternal transitions in natural history — for example, to multicellularity and then to eusociality (Smith and Szathmary, 1997) — underscores the constructive power of evolution to harness emergent structures as building blocks for further novelty. Such substructure can also provide scaffolding for differentiation and division of labor within an organism (Wilson, 1984). To explore these dynamics, in some experimental conditions we incorporated a hierarchical extension to the hereditary grouping scheme described above.

Hierarchical levels are introduced into the system by providing a mechanism for groups of hereditary groups to form. We accomplish this through two separate, but overlaid, instantiations of the hereditary grouping scheme. We refer to each independent hereditary grouping system as a "level." The hierarchical extension allows two levels of hereditary grouping, identified here as L0 and L1. L0 instantiates a smaller, inner grouping embedded inside a L1 grouping. Without the hierarchical extension, only L0 is present.1 We refer to the highest hereditary grouping level present in a simulation as the "apex" level. Under the hierarchical extension, each cell contained a pair of separate hereditary group IDs — the first for L0 and the second for L1. During reproduction, daughter cells could either

1. inherit both the L0 and L1 hereditary group IDs,
2. inherit the L0 hereditary group ID but not the L1 hereditary group ID, or
3. inherit neither hereditary group ID.

In order to enforce hierarchical nesting of hereditary group IDs, daughter cells could not inherit just the L1 hereditary group ID.

1 We chose to number these levels using the computer science convention of zero-based indexing (as opposed to the everyday practice of counting up from one) to maintain consistency with source code and data sets associated with this work.

Hierarchical hereditary group IDs are strictly nested: all cells are members of exactly one L0 hereditary group and one L1 hereditary group. No cell can be a member of two L0 hereditary groups or two L1 hereditary groups. Likewise, no L0 hereditary group can appear within more than one L1 hereditary group. As a concrete illustration of this scheme, Figure 4.6a depicts hierarchically-nested hereditary groupings assumed by an evolved strain.

4.2.3 Cell-Level Organisms

Our experiments use cell-level digital organisms controlled by genetic programs subject to mutations and selective pressures that stem from local competition for limited space. We employ the SignalGP event-driven genetic programming representation. As sketched in Figure 4.1a, this representation is specially designed to express function-like modules of code in response to internal signals or external stimuli. This process can be considered somewhat akin to gene expression. In our experiments, virtual CPUs can execute responses to up to 24 signals at once, with any further signals usurping the longest-running modules.
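This signal-handling cap lends itself to a simple scheduling rule. The sketch below illustrates one plausible reading of the usurpation behavior just described (a new signal evicting the longest-running active module once all 24 slots are busy); the data layout and names are hypothetical, not SignalGP internals.

```python
MAX_CONCURRENT = 24  # cap on simultaneously executing signal responses

def receive_signal(running, new_module):
    # `running` is a list of [age_in_steps, module] entries for responses
    # currently executing on the virtual CPU.
    if len(running) >= MAX_CONCURRENT:
        # Any further signal usurps the longest-running active module.
        running.remove(max(running, key=lambda entry: entry[0]))
    running.append([0, new_module])

running = [[age, f"module_{age}"] for age in range(24)]
receive_signal(running, "fresh_module")
assert len(running) == MAX_CONCURRENT
assert [23, "module_23"] not in running  # oldest response was evicted
```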
The event-driven framework facilitates the evolution of dynamic interactions between digital organisms and their environment (including other organisms) (Lalejini and Ofria, 2018). Special module components allow evolving programs to sense and interact with their environment through mechanisms including resource sharing, hereditary group sensing, apoptosis, cell reproduction, and arbitrary cell-cell messaging. Modules can also include general-purpose computational elements like conditionals and loops, which allows cells to evolve sophisticated behaviors conditioned on current (and even previous) local conditions. A simple "regulatory" system provides special CPU instructions that dynamically adjust which modules are activated by particular signals.

In our simulation, the directionality of some inputs and outputs must be accounted for (e.g., specifying which neighbor to share resource with). To accomplish this, we provide each cell an independent SignalGP hardware instance to manage inputs and outputs with respect to each specific cell neighbor. So there are four virtual hardware sets per cell, one for each cardinal direction.2 Figure 4.1b overviews the configuration of the four SignalGP instances that constitute a single cell. Supplementary Sections C.4, C.5, C.6, and C.7 provide full details of the digital evolution substrate underpinning this work.

2 This approach differs from existing work evolving digital organisms in grid-based problem domains, where directionality is managed by a within-cell "facing" state that determines the source direction for inputs and the target direction for outputs (Biswas et al., 2014; Goldsby et al., 2014, 2018; Grabowski et al., 2010; Lalejini and Ofria, 2018); see Supplemental Section C.4 for further detail.

4.2.4 Surveyed Evolutionary Conditions

To broaden our exploration of possible evolved multicellular behaviors in this system, we surveyed several evolutionary conditions. In one manipulation, we explored the effect of enabling hierarchical structure within hereditary groups, such that parent cells can choose to keep offspring in their same sub-group, in just the same full group, or expel them entirely to start a new group. Cells can sense and react to the level of hereditary ID commonality shared with each neighbor. This manipulation presents an opportunity for hierarchical individuality or for a mechanism to mediate differentiation within a multicell, but does not enforce it. In a second manipulation, we explored the importance of explicitly selecting for medium-sized groups (as had been needed to maximize resource collection) by removing this incentive. Instead, the system distributed resource at a uniform per-cell rate.

We combined these two manipulations to yield four surveyed conditions:

1. "Flat-Even": one hereditary group level (flat) with uniform resource inflow (even). In-browser simulation: https://hopth.ru/i.
2. "Flat-Wave": one hereditary group level (flat) with group-mediated resource collection (wave). In-browser simulation: https://hopth.ru/j.
3. "Nested-Even": two hierarchically-nested hereditary group levels (nested) with uniform resource inflow (even). In-browser simulation: https://hopth.ru/k.
4. "Nested-Wave": two hierarchically-nested hereditary group levels (nested) with group-mediated resource collection (wave). In-browser simulation: https://hopth.ru/l.

Supplementary Section C.8 provides full details for each of the four surveyed evolutionary conditions.
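These four conditions form a full 2 × 2 factorial design over group structure and resource distribution, as the short Python sketch below makes explicit (the variable names are illustrative only).

```python
from itertools import product

# The 2x2 factorial design: hereditary group structure x resource inflow.
structures = ["Flat", "Nested"]  # one vs. two nested hereditary group levels
inflows = ["Even", "Wave"]       # uniform vs. group-mediated resource inflow

conditions = [f"{s}-{i}" for s, i in product(structures, inflows)]
print(conditions)  # ['Flat-Even', 'Flat-Wave', 'Nested-Even', 'Nested-Wave']
```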
For each condition, we simulated 40 replicate populations for up to 1,048,576 (2^20) updates. During this time, on the order of 4,000 cellular generations and 500 apex-level group generations elapsed in these runs. (Full details appear in Supplementary Table C.2.) Due to variability in simulation speed, four replicates only completed 262,144 updates. All analyses involving inter-replicate comparisons were therefore performed at this earlier time point.

4.3 Results

To characterize the general selective pressures induced by the surveyed environmental conditions, we assessed the prevalence of characteristic multicellular traits among evolved genotypes across replicates. In the case of an evolutionary transition in individuality, we would expect cells to modulate their own reproductive behavior to prioritize group interests above individual cell interests. In DISHTINY, cell reproduction inherently destroys an immediate neighbor cell. As such, we would expect somatic growth to occur primarily at group peripheries in a higher-level individual. Supplementary Figure C.1 compares cellular reproduction rates between the interior and exterior of apex-level hereditary groups. For all treatments, phenotypes with depressed interior cellular reproduction rates dominated across replicates (non-overlapping 95% CI). By update 262,144 (about 1,000 cellular generations; see Supplementary Table C.2), all four treatment conditions appear to select for some level of reproductive cooperation among cells.

Across replicate evolutionary runs in all four treatments, we also found that resource was transferred among registered kin at a significantly higher mean rate than to unrelated neighbors (non-overlapping 95% CI). Genetic programs controlling cells can sense whether any particular neighbor shares a common hereditary group ID. Thus, selective activation of resource sharing behavior toward hereditary group members might have evolved, which would provide one possible explanation for this observation.3 However, cells are also capable of conditioning behavior on whether a particular neighbor is direct kin (i.e., a parent or child). To test whether this resource sharing was solely an artifact of sharing between direct cellular kin, we also assessed mean sharing to registered kin that were not immediate cellular relatives. Mean sharing between such cells also exceeded sharing among unrelated neighbors (non-overlapping 95% CI). Thus, all four treatments appear to select for functional cooperation among wider kin groups. Supplementary Section C.12 presents these results in detail.

4.3.1 Qualitative Life Histories

Although cooperative cell-level phenotypes were common among evolved hereditary groups, across replicates functional and reproductive cooperation arose via diverse qualitative life histories. To provide a general sense of the types of life histories we observed in this system, Figure 4.2 shows time lapses of representative multicellular groups evolved in different replicates. Figure 4.2a depicts an example of a naive life history in which — beyond the cellular progenitor of a propagule group — the parent and propagule groups exhibit no special cooperative relationship. In Figure 4.2b, propagules repeatedly bud off of parent groups to yield a larger network of persistent parent-child cooperators. In Figure 4.2c, propagules are generated at the extremities of parent groups and then rapidly replace most or all of the parent group.
Finally, in Figure 4.2d, propagules are generated at the interior of a parent group and replace it from the inside out.

To better understand the multicellular strategies that evolved in this system, we investigated the mechanisms and adaptiveness of notable phenotypes that evolved in several individual evolutionary replicates. In the following sections, we present these investigations as a series of case studies.

4.3.2 Case Study: Burst Lifecycle

We wondered how the strain exhibiting the "burst" lifecycle in Figure 4.2d determined when and where to originate its propagules. To assess whether gene regulation instructions played a role in this process, we prepared two knockout strains. In the first, gene regulation instructions were replaced with no-operation (Nop) instructions (so that gene regulation state would remain baseline). In the second, the reproduction instructions to spawn a propagule were replaced with Nop instructions.

3 Alternately, to the same end, resource sharing behavior could instead be suppressed in the opposite case, when a neighbor holds a different hereditary group ID.

(a) Naive (frames at updates 0, 128, 256, 384, 512; animation: https://hopth.ru/x, in-browser simulation: https://hopth.ru/1). The offspring group is birthed at the exterior of the parent group. Parent and offspring groups then compete with each other for space just the same as they do with other groups.

(b) Adjoin (frames at updates 0, 32, 64, 128, 512; animation: https://hopth.ru/y, in-browser simulation: https://hopth.ru/2). The offspring group begins as a single cell at the exterior of the parent group. Parent and offspring groups then exclusively expend reproductive effort to compete with other groups. This results in a stable interface between the parent and offspring groups as the offspring group grows over time.

(c) Sweep (frames at updates 0, 72, 144, 216, 288; animation: https://hopth.ru/z, in-browser simulation: https://hopth.ru/3). The offspring group begins as a single cell at the exterior of the parent group. The offspring group then grows rapidly into the parent group, resulting in a near-complete transfer of simulation space into the offspring group. Multiple offspring groups may simultaneously grow over the parent, as is the case here.

(d) Burst (frames at updates 0, 96, 192, 288, 384; animation: https://hopth.ru/0, in-browser simulation: https://hopth.ru/4). The offspring group begins as a single cell at the interior of the parent group. Over time, the offspring group grows over the parent group from the inside out. Multiple offspring groups may develop simultaneously, as is the case here.

Figure 4.2: Time lapse examples of qualitative life histories evolved under the Nested-Wave treatment. From left to right within each row, frames depict the progression of simulation state within a subset of the simulation grid. L1 hereditary groups are differentiated by grayscale tone and separated by solid black borders. L0 hereditary groups are separated by dashed gray borders. In each example, the focal parent L1 group is colored purple and the focal offspring group orange.

(a) Regulation visualizations (wild type, propagule knockout, and regulation knockout). (b) Interior propagule rate by genotype (wild type, propagule knockout, regulation knockout).

Figure 4.3: Analysis of a wild type strain with a "burst" lifecycle, evolved under the "Nested-Wave" treatment, exhibiting interior propagule generation. Subfigure 4.3a compares gene regulation between analyzed strains. Group layouts are overlaid via borders between cells.
Black borders divide L1 groups and white borders divide L0 groups. Borders between L1 groups are underlined in red for greater visibility. Within these group layouts, regulation state for each cell's four directional SignalGP instances is color coded using a PCA mapping from regulatory state to three-dimensional RGB coordinates. (The PCA mapping is calculated uniquely for each L1 hereditary group.) Within a L1 hereditary group, color similarity among tile quarters indicates that the corresponding SignalGP instances exhibit similar regulatory state. However, the particular hue of a SignalGP instance has no significance. In the case of identical regulatory state (here, due to the absence of genetic regulation in a knockout strain), this color coding appears gray. Wild type interior propagules are annotated with red arrows. Subfigure 4.3b compares the mean number of interior propagules observed per L1 hereditary group. Error bars indicate 95% confidence. View an animation of wild type gene regulation at https://hopth.ru/t. View the wild type strain in a live in-browser simulation at https://hopth.ru/g.

Figure 4.3a depicts the gene regulation phenotypes of these strains. Figure 4.3b compares interior propagule generation between the strains, confirming the direct mechanistic role of gene regulation in promoting interior propagule generation (non-overlapping 95% CI). In head-to-head match-ups, the wild type strain outcompetes both the regulation-knockout (20/20; p < 0.001; two-tailed Binomial test) and the propagule-knockout strains (20/20; p < 0.001; two-tailed Binomial test). The deficiency of the propagule-knockout strain confirms the adaptive role of interior propagule generation. Likewise, the deficiency of the regulation-knockout strain affirms the adaptive role of gene regulation in the focal wild type strain.

4.3.3 Case Study: Cell-cell Messaging

We discovered adaptive cell-cell messaging in two evolved strains. Here, we discuss a strain evolved under the Flat-Wave treatment where cell-cell messaging disrupts directional and spatial uniformity of resource sharing. Supplementary Section C.13 overviews an evolved strain where cell-cell messaging appears to intensify expression of a contextual tit-for-tat policy between hereditary groups.

Figure 4.4 depicts the cell-cell messaging, resource sharing, and resource stockpile phenotypes of the wild type strain side-by-side with corresponding phenotypes of a cell-cell messaging knockout strain. In the wild type strain, cell-cell messaging emanates from an irregular collection of cells — in some regions grid-like and in others more sparse — broadcasting to all neighboring cells. Resource sharing appears more widespread in the knockout strain than in the wild type. However, messaging's suppressive effect on resource sharing is neither spatially nor directionally homogeneous. Relative to the knockout strain, cell-cell messaging increases variance in the cardinal directionality of net resource sharing (WT: mean 0.28, S.D. 0.07, n = 54; KO: mean 0.17, S.D. 0.07, n = 69; p < 0.001, bootstrap test). Cell-cell messaging also increases variance of resource sharing density with respect to spatial quadrants demarcated by the hereditary group's spatial centroid (WT: mean 0.23, S.D. 0.07, n = 52; KO: mean 0.16, S.D. 0.08, n = 68; p < 0.001, bootstrap test).
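The bootstrap tests invoked above can be realized in several ways; the dissertation does not spell out its exact procedure, so the self-contained Python sketch below shows one common variant (a percentile bootstrap on the difference in group means) with entirely fabricated data and hypothetical names.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_mean_diff_test(a, b, n_boot=10_000, seed=1):
    # Percentile bootstrap on the difference in group means; the two-tailed
    # p-value is (roughly) the share of resampled differences that fail to
    # preserve the observed direction, doubled.
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    crossings = 0
    for _ in range(n_boot):
        boot_a = [rng.choice(a) for _ in a]
        boot_b = [rng.choice(b) for _ in b]
        if (mean(boot_a) - mean(boot_b)) * observed <= 0:
            crossings += 1
    return min(1.0, 2 * crossings / n_boot)

# Hypothetical per-group variance measurements, for illustration only.
wt = [0.28, 0.31, 0.22, 0.35, 0.27, 0.30]
ko = [0.17, 0.15, 0.20, 0.12, 0.18, 0.16]
print(bootstrap_mean_diff_test(wt, ko))  # small p-value -> significant
```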
We used competition experiments to confirm the fitness advantage both of cell-cell messaging (20/20; p < 0.001; two-tailed Binomial test) and (using a separate knockout strain) resource sharing (20/20; p < 0.001; two-tailed Binomial test). The fitness advantage of irregularized sharing might stem from a corresponding increase in the fraction of cells with enough resource stockpiled to reproduce (WT: mean 0.18, S.D. 0.11, n = 54; KO: mean 0.06, S.D. 0.08, n = 69; p < 0.001, bootstrap test).

Figure 4.4: Visualization of phenotypic traits of a wild type strain evolved under the "Flat-Wave" treatment and the corresponding intercell messaging knockout strain (columns: messaging, resource sharing, resource stockpile; rows: wild type, messaging knockout). For these visualizations, group layouts are overlaid via borders between cells. Black borders divide L0 hereditary groups. In the messaging visualization, color coding represents the volume of incoming messages: white represents no incoming messages, and the magenta-to-blue gradient runs from one incoming message to the maximum observed incoming message traffic. As expected, unlike the wild type strain, the messaging knockout strain exhibits no messaging activity. In the resource sharing visualization, color coding represents the amount of incoming resource: white represents no incoming resource, and the magenta-to-blue gradient runs from the minimum to the maximum observed incoming resource. The wild type strain exhibits much sparser resource sharing than the messaging knockout strain. In the resource stockpile visualization, white represents zero-resource stockpiles, blue represents stockpiles with just under enough resource to reproduce, green represents stockpiles with enough resource to reproduce, and yellow represents more than enough resource to reproduce. The wild type groups contain more cells with rich resource stockpiles (green and yellow) than the messaging knockout strain. View an animation of the wild type strain at https://hopth.ru/p. View the wild type strain in a live in-browser simulation at https://hopth.ru/e.

Figure 4.5: Visualization of phenotypic traits of a wild type strain evolved under the "Nested-Wave" treatment and the corresponding relative stockpile sensing knockout strain (columns: resource stockpile, resource sharing; rows: wild type, relative stockpile sensing knockout). For these visualizations, group layouts are overlaid via borders between cells. Black borders divide L1 hereditary groups and dashed gray borders divide L0 hereditary groups. In the resource stockpile visualization, white represents zero-resource stockpiles, blue represents stockpiles with just under enough resource to reproduce, green represents stockpiles with enough resource to reproduce, and yellow represents more than enough resource to reproduce. The wild type groups contain more cells with rich resource stockpiles (green and yellow) than the knockout strain. In the resource-sharing visualization, white represents no incoming resource and the magenta-to-blue gradient runs from the minimum to the maximum observed amount of incoming shared resource. The wild type strain exhibits less resource sharing than the knockout strain. View an animation of the wild type strain at https://hopth.ru/s. View the wild type strain in a live in-browser simulation at https://hopth.ru/h.
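The directional-variance statistics reported above summarize how unevenly a group's net sharing is distributed across the four cardinal directions. The exact formula is not given in this chapter, so the Python sketch below should be read as one plausible reading (variance of each direction's fraction of total net sharing), with hypothetical names and made-up data.

```python
from statistics import pvariance

def directional_share_variance(net_shared_by_direction):
    # net_shared_by_direction maps each cardinal direction to the net
    # resource shared along it. Uniform sharing gives variance 0;
    # lopsided sharing gives higher values.
    total = sum(net_shared_by_direction.values())
    if total == 0:
        return 0.0
    fractions = [amount / total for amount in net_shared_by_direction.values()]
    return pvariance(fractions)

print(directional_share_variance({"N": 3.0, "E": 1.0, "S": 0.5, "W": 0.1}))
```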
4.3.4 Case Study: Gradient-conditioned Cell Behavior

To further assess how multicellular groups process and employ spatial and directional information, we investigated whether successful multicellular strategies evolved where cells condition their behavior based on the resource concentration gradient within a multicellular group. We discovered a strain that employs a dynamic strategy where cells condition their own resource-sharing behavior based on the relative abundance of their own resource stockpiles compared to their neighbors'. This strain appears to use this information to selectively suppress resource sharing. This strain's wild type outcompeted a variant where cells' capacity to assess the relative richness of neighboring resource stockpiles was knocked out (20/20; p < 0.001; two-tailed Binomial test). Figure 4.5 contrasts the sparser wild type resource-sharing phenotype with that of the knockout strain. This result raises the question of whether more sophisticated morphological patterning might evolve within the experimental system. Next, in Section 4.3.5, we examine a strain that exhibited striking genetically driven morphological patterning of hereditary groups.

4.3.5 Case Study: Morphology

Figure 4.6a shows one of the more striking examples of genetically encoded hereditary group patterning we observed. In this strain, which arose in a Nested-Even treatment replicate, L0 hereditary groups arrange as elongated, one-cell-wide strands. Knocking out intracell messaging disrupts the stringy arrangement of L0 hereditary groups, shown in Figure 4.6b. Figure 4.6c compares the distribution of cells' L0 same-hereditary-group neighbor counts for L1 groups of nine or more cells. Compared to the knockout variant, many fewer wild-type cells have three or four L0 same-hereditary-group neighbors, consistent with the one-cell-wide strands (non-overlapping 95% CI). However, we also observed that wild-type L0 hereditary groups were overall smaller than those of the knockout strain (WT: mean 2.1, S.D. 1.5; messaging knockout: mean 4.3, S.D. 5.1; p < 0.001; bootstrap test). So, we set out to determine whether smaller L0 group size alone was sufficient to explain these observed differences in neighbor count. We compared a dimensionless shape factor describing group stringiness (perimeter divided by the square root of area) between the wild type and messaging knockout strains. Between L0 group size four (the smallest size at which stringiness can emerge on a grid) and L0 group size six (the largest size for which we had sufficient replicate wild type observations), the wild type exhibited significantly greater stringiness (Figure 4.6d; size 4: p < 0.01, bootstrap test; size 5: p < 0.01, bootstrap test; size 6: non-overlapping 95% CI). This confirms that more sophisticated patterning, beyond just smaller L0 group size, is at play to create the observed one-cell-wide L0 strand morphology.

(a) Wild type. (b) Messaging knockout. (c) Distribution of L0 same-hereditary-group neighbor counts. (d) L0 hereditary group stringiness measure versus group size.
Figure 4.6: Comparison of a wild type strain evolved under the "Nested-Even" treatment with stringy L0 hereditary groups and the corresponding intracellular-messaging knockout strain. Subfigures 4.6a and 4.6b visualize hereditary group layouts; color hue denotes and black borders divide L1 hereditary groups, while color saturation denotes and white borders divide L0 hereditary groups. Smaller, thinner, and more elongated L0 groups can be seen in the wild type strain than in the knockout strain. Subfigures 4.6c and 4.6d quantify the morphological effect of the intracellular-messaging knockout. In the formula for shape factor given in Subfigure 4.6d, P refers to group perimeter and A refers to group area. Error bars indicate 95% confidence. View an animation of the wild type strain at https://hopth.ru/q. View the wild type strain in a live in-browser simulation at https://hopth.ru/f.

Competition experiments failed to show a fitness effect of this strain's morphological patterning. The wild type strain won competitions about as often as the knockout strain (6/20). Thus, it seems this trait either emerged by drift, hitchhiked along as the genetic background of a selective sweep, or was advantageous against a divergent competitor earlier in evolutionary history.

4.3.6 Case Studies: Apoptosis

Figure 4.7: Comparison of wild type strains and corresponding apoptosis knockout strains (columns: Strain A, Strain B; rows: wild type, apoptosis knockout). In all visualizations, color hue denotes and black borders divide apex-level hereditary groups. In Strain A visualizations, color saturation denotes and white borders divide L0 hereditary groups. (Strain B evolved under a flat treatment.) Black tiles are dead. These dead tiles, all due to apoptosis, can be seen in both strains' wild types. Dead tiles appear to be clustered contiguously or near-contiguously at group peripheries in both strains, with more dead tiles apparent in Strain A than Strain B. View an animation of wild type strain A at https://hopth.ru/m. View an animation of wild type strain B at https://hopth.ru/n. View wild type strain A in a live in-browser simulation at https://hopth.ru/b. View wild type strain B in a live in-browser simulation at https://hopth.ru/c.

Finally, we assessed whether cell self-sacrifice played a role in multicellular strategies evolved across our survey. Screening replicate evolutionary runs by apoptosis rate flagged two strains with several orders of magnitude greater activity. In strain A, evolved under the Nested-Even treatment, apoptosis accounts for 2% of cell mortality. In strain B, evolved under a flat treatment, 15% of mortality is due to apoptosis. To test the adaptive role of apoptosis in these strains, we performed competition experiments against apoptosis knockout strains, in which all apoptosis instructions were replaced with Nop instructions. Figure 4.7 compares the wild type hereditary group structures of these strains to their corresponding knockouts. Apoptosis contributed significantly to fitness in both strains (strain A: 18/20, p < 0.001, two-tailed Binomial test; strain B: 20/20, p < 0.001, two-tailed Binomial test). The success of strategies incorporating cell suicide is characteristic of evolutionary conditions favoring altruism, such as kin selection or a transition from cell-level to collective individuality.
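The head-to-head competition results reported throughout this chapter are evaluated with two-tailed Binomial tests against the null hypothesis of even odds. Assuming SciPy is available, one way to reproduce the arithmetic for a 20/20 sweep is:

```python
from scipy.stats import binomtest

# 20 wins out of 20 head-to-head competitions, tested against the null
# hypothesis of even odds (p = 0.5) between competitors.
result = binomtest(k=20, n=20, p=0.5, alternative="two-sided")
print(result.pvalue)  # ~1.9e-06
```

Even a perfect 20/20 sweep yields p = 2 × 0.5^20 ≈ 1.9 × 10^-6, comfortably below the p < 0.001 threshold reported.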
To discern whether spatial or temporal targeting of apoptosis contributed to fitness, we competed wild type strains against apoptosis-knockout strains in which we externally triggered cell apoptosis with spatially and temporally uniform probability. In one set of competition experiments, the knockout strain's apoptosis probability was based on the observed apoptosis rate of the wild type strain's monoculture. In a second set of competition experiments, the knockout strain's apoptosis probability was based on the observed apoptosis rate of the population in the evolutionary run from which the wild type strain was harvested. In both sets of experiments, on both strains, wild type strains outcompeted knockout strains with uniform apoptosis probabilities (strain A monoculture rate: 18/20, p < 0.001, two-tailed Binomial test; strain A population rate: 19/20, p < 0.001, two-tailed Binomial test; strain B monoculture rate: 20/20, p < 0.001, two-tailed Binomial test; strain B population rate: 20/20, p < 0.001, two-tailed Binomial test).

4.4 Discussion

In this work, we selected for fraternal transitions in individuality among digital organisms controlled by genetic programs. Because — unlike previous work (Goldsby et al., 2012, 2014) — we provided no experimentally prescribed mechanism for collective reproduction, we observed the emergence of several distinct life histories. Evolved strategies exhibited intercellular communication, coordination, and differentiation. These included endowment of offspring propagule groups, asymmetrical intra-group resource sharing, asymmetrical inter-group relationships, morphological patterning, gene-regulation-mediated life cycles, and adaptive apoptosis.

Across treatments, we observed resource sharing and reproductive cooperation among registered kin groups. These outcomes arose even in treatments where registered kin groups lacked functional significance (i.e., resource was distributed evenly), suggesting that reliable kin recognition alone might be sufficient for aspects of fraternal collectivism to evolve in systems where population members compete antagonistically for limited space or resources and spatial mixing is low. Beyond their functional consequences, physical mechanisms such as cell attachment may merit consideration simply as kin recognition tools.

In future work, we are eager to undertake experiments investigating open questions pertaining to major evolutionary transitions, such as the role of pre-existing phenotypic plasticity (Clune et al., 2007; Lalejini and Ofria, 2016), pre-existing environmental interactions, pre-existing reproductive division of labor, and how transitions relate to increases in organizational (Goldsby et al., 2012), structural, and functional (Goldsby et al., 2014) complexity. Expanding the scope of our existing work to directly study evolutionary dynamics and evolutionary histories will be crucial to such efforts. In particular, we plan to investigate mechanisms to evolve greater collective sophistication among agents. The modular design of SignalGP lends itself to the possibility of exploring sexual recombination. We are interested in exploring extensions that allow cell groups to develop neural and vascular networks (Moreno and Ofria, 2020).
We hypothesize that selective pressures related to intra-group coordination and inter-group conflict might spur developmental and structural infrastructure that could be co-opted to evolve agents proficient at unrelated tasks like navigation, game-playing, or reinforcement learning.

Unfortunately, however, experiments with multicellularity are specially constrained by a fundamental limitation of digital evolution research: processing power (Moreno, 2020). This limitation, which commonly manifests as smaller population sizes than natural populations (Liard et al., 2018), only compounds when the unit of selection shifts to computationally expensive groups of dozens or hundreds of component individuals. Ongoing work with DISHTINY is testing approaches to harness increasingly abundant parallel processing power for digital evolution simulation (Moreno et al., 2021b). The spatial, distributed nature of our approach potentially affords a route to achieve large-scale digital multicellularity experiments consisting of millions, instead of thousands, of cells via high-performance parallel computing.

We hope that such technical efforts will also benefit other computational work exploring a broader range of conceptual models of multicellularity. For instance, this work assumes incessant, pervasive biotic interaction via competition for space. However, many natural systems exhibit more intermittent, sparse encounters between multicells, and such selective interactions have been hypothesized as key to the evolution of complexity and diversity (Soros and Stanley, 2014). Also crucial to explore, and unaccounted for in this work, are the dynamics of cell migration in development (Horwitz and Webb, 2003) and the motility of multicells (Arnellos and Keijzer, 2019). It seems certain that the varied conditions and mechanistic richness of biological reality can only be fully explored through a plurality of conceptual models and model systems.

Chapter 5
A Case Study of Novelty, Complexity, and Adaptation in a Multicellular System

Authors: Matthew Andres Moreno, Santiago Rodriguez Papa, and Charles Ofria

This chapter is adapted from (Moreno et al., 2021a), which underwent peer review and appeared in the proceedings of the Fourth Workshop on Open-Ended Evolution (OEE4) at the 2021 Conference on Artificial Life (ALIFE 2021). This chapter reports trajectories of novelty, complexity, and adaptation in a case study from the DISHTINY simulation system. This case study lineage produced ten qualitatively distinct multicellular morphologies, several of which exhibit asymmetrical growth and distinct life stages. We find that a loose — sometimes divergent — relationship can exist among novelty, complexity, and adaptation.

5.1 Introduction

The challenge, and promise, of open-ended evolution has animated decades of inquiry and discussion within the artificial life community (Packard et al., 2019). The difficulty of devising models that produce continuing open-ended evolution suggests profound philosophical or scientific blind spots in our understanding of the natural processes that gave rise to contemporary organisms and ecosystems. Already, pursuit of open-ended evolution has yielded paradigm-shifting insights. For example, novelty search demonstrated how processes promoting non-adaptive diversification can ultimately yield adaptive outcomes that were previously unattainable (Lehman and Stanley, 2011).
Such work lends insight into fundamental questions in evolutionary biology, such as the relevance — or irrelevance — of natural selection with respect to increases in complexity (Lehman, 2012; Lynch, 2007) and the origins of evolvability (Kirschner and Gerhart, 1998; Lehman and Stanley, 2013). Evolutionary algorithms devised in support of open-ended evolution models also promise to deliver tangible broader impacts for society. Possibilities include the generative design of engineering solutions, consumer products, art, video games, and AI systems (Kenneth O. Stanley, 2017; Nguyen et al., 2015).

Preceding decades have witnessed advances toward defining — quantitatively and philosophically — the concept of open-ended evolution (Bedau et al., 1998; Dolson et al., 2019; Lehman and Stanley, 2012) as well as investigating causal phenomena that promote open-ended dynamics such as ecological dynamics, selection, and evolvability (Dolson, 2019; Huizinga et al., 2018; Soros and Stanley, 2014). The concept of open-endedness is fundamentally characterized by intertwined generation of novelty, functional complexity, and adaptation (Taylor et al., 2016). How and how closely these phenomena relate to one another remains an open question. Here, we aim to complement ongoing work to develop a firmer theoretical understanding of the relationship between novelty, complexity, and adaptation by exploring the evolution of these phenomena through a case study using the DISHTINY digital multicellularity framework. We apply a suite of qualitative and quantitative measures to assess how these qualities can change over evolutionary time and in relation to one another.

5.2 Methods

5.2.1 Simulation

The DISHTINY simulation environment tracks cells occupying tiles on a toroidal grid (size 120 × 120 by default). Cells collect a uniform inflow of continuous-valued resource. This resource can be spent in increments of 1.0 to attempt asexual reproduction into any of a cell's four adjacent cells. A cell can only be replaced if it commands less than 1.0 resource. If a cell rebuffs a reproduction attempt, its resource stockpile decrements by 1.0, down to a minimum of 0.0.

In order to facilitate the formation of coherent multicellular groups, the DISHTINY framework provides a mechanism for cells to form groups and detect group membership. Groups arise through cellular reproduction. When a cell proliferates, it may choose to initiate its offspring as a member of its kin group, thereby growing it, or induce the offspring to found a new kin group. This process is similar to the growth of biological multicellular tissues, where cell offspring can be retained as members of the tissue or permanently expelled.

We incentivize group formation by providing an additional resource inflow bonus based on group size. Per-cell resource collection rate increases linearly with group size up to a cap of 12 members. Past 12 members, the decay rate of cells' resource stockpiles begins increasing exponentially. In short, groups that are too small forgo the bonus and groups that are too large incur a penalty. These mechanisms select for medium-sized groups; the harsh penalization of oversized groups, in particular, prevents any single group from consuming the entire population. In order to ensure group turnover, we force groups to fragment into unicells after 8,192 (2^13) updates.
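The group-size incentives just described pair a linear collection bonus with an exponential over-size penalty. The sketch below illustrates that shape; only the cap of 12 members comes from the text above, while the base rates and the exact exponential form are placeholder assumptions.

```python
BONUS_CAP = 12     # group size at which the linear collection bonus saturates
BASE_DECAY = 0.001  # placeholder baseline stockpile decay rate

def per_cell_collection_rate(group_size, base_rate=1.0):
    # Collection bonus grows linearly with group size, capped at 12 members.
    return base_rate * min(group_size, BONUS_CAP)

def stockpile_decay_rate(group_size):
    # Past the cap, stockpile decay increases exponentially with excess size
    # (the doubling-per-excess-member form here is an assumption).
    excess = max(0, group_size - BONUS_CAP)
    return BASE_DECAY * 2 ** excess

for size in (1, 6, 12, 16):
    print(size, per_cell_collection_rate(size), stockpile_decay_rate(size))
```

Under any parameterization of this shape, intermediate group sizes maximize net per-cell resource income, which is the selective pressure toward medium-sized groups described above.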
In Chapter 4, we established that this framework can select for traits characteristic of multicellularity, such as cooperation, coordination, and reproductive division of labor. We also found that more case studies of interest arose when two nested levels of group membership were tracked, as opposed to a single, un-nested level of group membership. With nested group membership, group growth still occurs by cellular reproduction. Cells are given the choice to retain offspring within both groups, to expel offspring from both groups, or to expel offspring from the innermost group only. Section 4.2.2 provides greater detail on group membership and hierarchical group membership in DISHTINY. In this work, we allow for nested kin groups.

Figure 5.1: Overview of genome execution. Tagged events and messages (shown as bells and envelopes, respectively) activate module execution on virtual cores. Simulation state can also be read directly using sensor instructions to access input registers. Special instructions write to output registers, allowing interaction with the simulation, and generate tagged messages, allowing interaction with other virtual CPUs.

Figure 5.2: Overview of the DISHTINY system. Cells occupy slots on a toroidal grid (Subfigure a). As cells reproduce, they may grow their existing kin group (shown here by color) or splinter off to found new ones. Each cell, shown here bounded within black squares, is controlled by four virtual CPUs, referred to as "cardinals" and shown here within triangles (Subfigure b). Cardinals within a cell can interact via message passing (blue conduits). Cardinals can interact with the corresponding cardinal in their neighboring cell through message passing or simulation intrinsics (i.e., resource sharing, offspring spawning, etc.), represented here by purple conduits. These inter-cell interactions may span physical hardware threads or processes. All virtual CPUs within a cell independently execute the same linear genetic program (Subfigure c). Tagged subsections of this linear genetic program ("modules") activate in response to stimuli.

In addition to controlling reproduction behavior, evolving genomes can also share resources with adjacent cells, perform apoptosis (recovering a small amount of resource that may be shared with neighboring cells), and pass arbitrary messages to neighboring cells. Cell behaviors are controlled by event-driven genetic programs in which linear GP modules are activated in response to cues from the environment or neighboring agents; signals are handled in quasi-parallel on up to 32 virtual cores (Figure 5.1) (Lalejini and Ofria, 2018). Each cell contains four independent virtual CPUs, all of which execute the same genetic program (Figure 5.2a). Each CPU manages interactions with a single neighboring cell.
We refer to a CPU managing interactions with a particular neighbor as a "cardinal" (as in "cardinal direction"). These CPUs may communicate via intra-cellular message passing. Full details on the instruction set and event library used, as well as simulation logic and parameter settings, appear in supplementary material; Supplementary Section D.5 provides full detail on simulation components and parameters.

5.2.2 Evolution

We performed evolution in three-hour windows for compatibility with our compute cluster's scheduling system. We refer to these windows as "stints." We randomly generated one-hundred-instruction genomes at the outset of the initial stint, stint 0. At the end of each three-hour window, the system harvested and stored genomes in a population file. We then seeded subsequent stints with the previous stint's population. No simulation state besides genome content was preserved between stints. In addition to simplifying implementation concerns, re-seeding each stint ensured that strains retained the capability to grow from a well-mixed inoculum. This facilitated later competition experiments between strains.

In order to ensure heterogeneity of biotic environmental factors experienced by evolving cells, we imposed a diversity maintenance scheme. In this scheme, descendants of a single stint 0 progenitor cell that proliferated to constitute more than half of the population were penalized with resource loss. The severity of the penalty increased with increasing prevalence beyond half of the population. Thus, we ensured that descendants of at least two distinct stint 0 progenitors remained over the course of the simulation. We arbitrarily chose a strain for primary study — we refer to this strain as the "focal" strain and others as "background" strains. In our case study, there was only one background strain in addition to this focal strain.

In our screen for case studies, we evolved 40 independent populations for 101 stints. We selected population 16005 from among these 40 to profile as a case study due to its distinct asymmetrical group morphology. At the conclusion of each stint, we selected the most abundant genome within the population as a representative specimen. We performed a suite of follow-up analyses on each representative specimen to characterize aspects of complexity, detailed in the following subsections. To ensure that specimens were consistently sampled from descendants of the same stint 0 progenitor, we only considered genomes with the lowest available stint 0 progenitor ID.

5.2.3 Phenotype-neutral Nopout

After harvesting representative specimens from each stint, we filtered out genome instructions that had no impact on the simulation. To accomplish this, we performed sequential single-site "nopouts," in which individual genome instructions were disabled by replacing them with a Nop instruction.1 We reverted nopouts that altered a strain's phenotype and kept those that did not. To determine whether phenotypic alteration occurred, we seeded an independent, mutation-disabled simulation with the strain in question and ran it side-by-side with an independent, mutation-disabled simulation of the wildtype strain. If any divergence in resource concentration was detected between the two strains within a 2,048-update window, the single-site nopout was reverted. We continued this process until no single-site nopouts were possible without altering the genome's phenotype.
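The sequential nopout procedure is a greedy loop over genome sites. The following Python sketch mirrors the algorithm described above, with the phenotype-divergence check abstracted into a callback; the toy divergence test at the bottom is purely illustrative and stands in for the side-by-side simulation comparison.

```python
def phenotype_neutral_nopout(genome, phenotypes_diverge):
    # Greedily disable instructions one site at a time, keeping each nopout
    # only if the resulting strain's phenotype is indistinguishable from
    # wild type under the supplied divergence check.
    candidate = list(genome)
    for site, instruction in enumerate(genome):
        if instruction == "Nop":
            continue
        trial = list(candidate)
        trial[site] = "Nop"
        if not phenotypes_diverge(trial, genome):
            candidate = trial  # the nopout was phenotype-neutral; keep it
    return candidate

# Hypothetical stand-in: pretend only "SpawnPropagule" sites are expressed.
def toy_diverges(variant, wildtype):
    return variant.count("SpawnPropagule") != wildtype.count("SpawnPropagule")

genome = ["Inc", "SpawnPropagule", "Nop", "Dec", "SpawnPropagule"]
print(phenotype_neutral_nopout(genome, toy_diverges))
# -> ['Nop', 'SpawnPropagule', 'Nop', 'Nop', 'SpawnPropagule']
```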
To speed up evaluation, we performed these step-by-step, side-by-side comparisons on a smaller toroidal grid of just 100 tiles. This process left us with a “Phenotype-neutral Nopout” variant of the wildtype genome in which all remaining instructions contributed to the phenotype. However, in further analyses we discovered that 21 phenotype-neutral nopouts from our case study were not actually neutral: competition experiments revealed they were significantly less fit than the wildtype strain. This might be due to insufficient spatial or temporal scope to observe expression of particular genome sites in our test for phenotypic divergence.

5.2.4 Estimating Critical Fitness Complexity

Next, we sought to detect genome instructions that contributed to strain fitness. For each remaining op instruction in the Phenotype-neutral Nopout variant, we took the wildtype strain and applied a nopout at the corresponding site. We then competed this variant against the wildtype strain. Evaluating only the op instructions remaining in the Phenotype-neutral Nopout variant allowed us to decrease the number of fitness competitions we had to perform.

We initialized fitness competitions by seeding a population half-and-half with the two strains. We ran these competitions for 10 minutes (about 4,200 updates) on a 60 × 60 toroidal grid, after which we assessed the relative abundances of descendants of both seeded strains. To determine whether fitness differed significantly between the wildtype and a variant strain, we compared the relative abundance of the strains observed at the end of competitions against outcomes from 20 control wildtype-vs-wildtype competitions. We fit a t-distribution to the abundance outcomes observed under the control wildtype-vs-wildtype competitions and deemed outcomes that fell outside the central 98% probability density of that distribution a significant difference in fitness. This allowed us to screen for fitness effects of single-site nopouts while performing only a single competition per site.

This process left us with a “Fitness-noncritical Nopout” variant of the wildtype genome in which all remaining instructions contributed significantly to fitness. We called the number of remaining instructions the strain’s “critical fitness complexity.” We adjusted this figure downward for the expected 1% rate of false-positive fitness differences among tested genome sites. This metric mirrors the MODES complexity metric described in (Dolson et al., 2019) and the approximation of sequence complexity advanced in (Adami et al., 2000).

5.2.5 Estimating State Interface Complexity

In addition to estimating the number of genome sites that contribute to fitness, we measured the number of different environmental cues and the number of different output mechanisms that cells adaptively incorporated into behavior. One possible way to take this measure would be to disable event cues, sensor instructions, and output registers one by one and test for changes in fitness. However, this approach would fail to distinguish context-dependent input/output from merely contingent input/output. For example, a cell might happen to depend on a sensor being set at a certain frequency but not on the actual underlying simulation information the sensor represents.
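For illustration, here is a minimal sketch of the screening statistic from Section 5.2.4 (which the interface complexity measures below reuse in t-test form). The control outcomes are randomly generated placeholders, not real assay data:

```python
import numpy as np
from scipy import stats

# Placeholder stand-ins for end-of-competition relative abundances.
rng = np.random.default_rng(1)
control = rng.normal(0.5, 0.05, size=20)  # 20 wildtype-vs-wildtype controls
variant_outcome = 0.31                    # one variant-vs-wildtype competition

# Fit a t-distribution to the control outcomes; deem the variant
# significantly fitness-differing if it falls outside the central 98%
# probability density (i.e., beyond the 1st or 99th percentile).
df, loc, scale = stats.t.fit(control)
lo, hi = stats.t.ppf([0.01, 0.99], df, loc=loc, scale=scale)
significant = not (lo <= variant_outcome <= hi)
```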
To isolate context-dependent input/output state interactions, we tested the fitness effect of swapping particular input/output states between CPUs rather than completely disabling them. That is, for example, CPU b would be forced to perform the output generated by CPU a, or CPU b would be shown the input meant for CPU a. We performed this manipulation on half the population in a fitness competition for each individual component of the simulation’s introspective state (44 sensor states relating to the status of a CPU’s own cell), extrospective state (61 sensor states relating to the status of a neighboring cell), and writable state (18 output states, 10 of which control cell behavior and 8 of which act as global memory for the CPU). (A full description of each piece of introspective, extrospective, and writable state is listed in supplementary material.) We deemed a state fitness-critical if this manipulation resulted in decreased fitness at significance p < 0.01 using a t-test parameterized by 20 control wildtype-vs-wildtype competitions. We describe the number of states that cells interact with to contribute to fitness as “State Interface Complexity.”

5.2.6 Estimating Messaging Interface Complexity

In addition to estimating the number of input/output states cells use to interact with the environment, we also estimated the number of distinct intra-cellular messages cardinals within a cell use to coordinate and the number of distinct inter-cellular messages cells use to coordinate. As with state interface complexity, distinguishing context-dependent behavior from contingent behavior is critical to attaining a meaningful measurement. For example, a cardinal might happen to depend on always receiving an inter-cellular message from a neighbor or an intra-cellular message from another cardinal; even if that message carries no meaningful information, blocking it would decrease fitness. So, instead of simply discarding messages to test for a fitness effect, we re-routed messages back to the sending cardinal instead of their intended recipient. We deemed a message fitness-critical if this manipulation resulted in decreased fitness at significance p < 0.01 using a t-test parameterized by 20 control wildtype-vs-wildtype competitions. We refer to the number of distinct messages that cells send to contribute to fitness as “Messaging Interface Complexity.”

We refer to the sum of State Interface Complexity, Intra-messaging Interface Complexity, and Inter-messaging Interface Complexity as “Cardinal Interface Complexity.”

5.2.7 Estimating Adaptation

In order to assess ongoing changes in fitness, we performed fitness competitions between the representative focal strain specimen sampled at each stint and the focal strain population from the preceding stint. (Recall from Section 5.2.2 that, due to the diversity maintenance procedure, two completely independent strains coexisted over the course of the experiment: the “focal” strain selected for analysis and a “background” strain.) Using the population from the preceding stint as the competitive baseline (rather than the representative specimen) ensured more focused, consistent measurement of the fitness properties of the specimen at the current stint (e.g., preventing skewed results from a sampled “dud” at the preceding stint). We performed 20 independent replicates of each competition. Competing strains were well-mixed within the full-sized toroidal grid at the outset of each competition, which lasted for 10 minutes of wall time.
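As a quick check of the win-count thresholds described next, the tail probability under the two-tailed binomial null for 20 replicate competitions can be computed directly:

```python
from scipy.stats import binom

# Under the null, each of the 20 replicate competitions is a fair coin
# flip. Two-tailed probability of an outcome at least as extreme as
# 18 wins (or, symmetrically, 2 or fewer wins) out of 20:
p_two_tailed = 2 * binom.sf(17, 20, 0.5)  # sf(17) = P(X >= 18)
print(round(p_two_tailed, 5))  # ~0.0004, below the p < 0.005 cutoff
```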
These 10-minute competitions sufficed to simulate about 8,000 updates at stint 0 and about 2,000 updates at stint 100 (Supplementary Figures D.10, D.11, and D.12). We determined that a gain of fitness had occurred if the current stint specimen constituted a population majority at the conclusion of more than 17 of those competitions, corresponding to a significance level of p < 0.005 under the two-tailed binomial null hypothesis. Likewise, we deemed winning fewer than 3 competitions a significant fitness loss.

5.2.8 Implementation

We employed multithreading to speed up execution. We split the simulation into four 60 × 60 subgrids. Each subgrid executed asynchronously, using the Conduit C++ library presented in Chapter 2 to orchestrate best-effort, real-time interactions between simulation elements on different threads. This approach is inspired by Ackley’s notion of indefinite scalability (Ackley and Small, 2014). In other work benchmarking the system, we have demonstrated that this approach improves scalability: the simulation scales to 4 threads with 80% efficiency, up to 64 threads with 40% efficiency, and up to 64 nodes with 80% efficiency (Chapter 2).

Over the 101 three-hour evolutionary stints performed to evolve the case study, 7,565,309 simulation updates elapsed. This translates to 74,904 updates per stint, or about 6.9 updates per second. However, the update-processing rate was not uniform across stints: the simulation slowed about 77% as stints progressed. Supplementary Figure D.9 shows elapsed updates for each stint. During stint 0, 176,816 updates elapsed (about 16.3 updates per second). During stint 100, only 41,920 updates elapsed (about 3.8 updates per second).

Although working asynchronously, threads processed similar numbers of updates during each stint. The mean standard deviation of update-processing rate between threads was 2%. The mean difference in update-processing rate between the fastest and slowest threads was 5%. The maximum values of these statistics observed during a stint were 9% and 20%, respectively, both at stint 44. Supplementary Figure D.9b shows the distribution of elapsed updates across threads for each stint evolved during the case study.

Software is available under an MIT License at https://github.com/mmore500/dishtiny. All data is available via the Open Science Framework at https://osf.io/prq49. Supplementary material is available via the Open Science Framework at https://osf.io/gekc8.

5.3 Results

5.3.1 Evolutionary History

Due to the parallel nature of the experimental framework, we did not perform perfect phylogeny tracking; Chapter 3 discusses the challenges of parallelizing perfect phylogeny tracking in depth. However, we did track the total number of ancestors seeded into stint 0 with extant descendants. At the end of stints 0 and 1, three distinct original phylogenetic roots were present in the population. From stint 2 onward, only two distinct original phylogenetic roots were present.

We performed follow-up analyses on specimens sampled from the lowest original phylogenetic root ID present in the population. (This approach was designed to choose an arbitrary strain as focal; barring extinction, that same strain will then be identified as focal consistently across subsequent stints. Phylogenetic root ID had no functional consequences; it is simply an arbitrary basis for focal strain selection.) For the first two stints, the focal strain was root ID 2,378. During stint 2, original phylogenetic root 2,378 went extinct, so all further follow-up analyses were sampled from descendants of ancestor 12,634.

We also tracked the number of genomes reconstituted at the outset of each stint with extant descendants at the end of that stint. This count grows from approximately 10 around stint 15 to upwards of 30 around stint 40 (Supplementary Figure D.4a).
Among descendants of the lowest original phylogenetic root, the number of independent lineages spanning a stint also increases, from around 5 to around 15 (Supplementary Figure D.4b). This decrease in stint-by-stint phylogenetic consolidation correlates with the waning number of simulation updates performed per stint (Supplementary Figures D.4c and D.4d). More complete phylogenetic data will be necessary in future experiments to address questions about the possibility of long-term stable coexistence beyond the two strains supported under the explicit diversity maintenance scheme.

On the specimen from stint 100 used in the final case study, an evolutionary history of 20,212 cell generations had elapsed. Of these cellular reproductions, 11,713 (58%) had full kin group commonality, 7,174 (35%) had partial kin group commonality, and 1,325 (7%) had no kin group commonality. On this specimen, 1,672 mutation events had elapsed. During these events, 7,240 insertion-deletion alterations and 26,153 point mutations had occurred. This strain experienced a selection pressure of 18% over its evolutionary history, meaning that only 82% of the mutations expected given the number of elapsed cellular reproductions were actually present.

In order to characterize the evolutionary history of the experiment in greater detail, we performed a parsimony-based phylogenetic reconstruction on the sampled representative specimens from each stint, shown in Figure 5.3.

Figure 5.3: Phylogeny of sampled focal strain representatives across stints, reconstructed using a parsimony algorithm (Cock et al., 2009). Each leaf node corresponds to a sampled representative. Representatives from stints 0 and 1, which share no common ancestry with representatives from other stints, are excluded. Numbers refer to the stint each representative was sampled from. Color coding and parentheticals of stint labels correspond to qualitative morph codes described in Table 5.1.

We used genomes’ fixed-length blocks of 35 64-bit tags that mediate environmental interactions as the basis for this reconstruction. These tag blocks underwent bitwise mutation over the course of the experiment. (In future experiments, we plan to incorporate new methodology for “hereditary stratigraph” genome annotations expressly designed to facilitate phylogenetic reconstruction; see Chapter 3.) Supplementary Figure D.5 shows Hamming distance between all pairs of tag blocks. We additionally tried several other tree inference methods, discussed in supplementary material; however, these yielded lower-quality reconstructions.
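For illustration, the sketch below shows how such tag blocks can drive tree building via pairwise Hamming distances and agglomerative clustering; this simplified stand-in uses random bits and is not the parsimony procedure actually employed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Stand-in tag blocks: one row per specimen, 35 tags x 64 bits unpacked
# into a flat binary vector (real blocks would come from sampled genomes).
rng = np.random.default_rng(0)
tag_blocks = rng.integers(0, 2, size=(10, 35 * 64))

# Pairwise Hamming distances between specimens' tag blocks.
dists = pdist(tag_blocks, metric="hamming")

# Agglomerative clustering yields a rough tree whose merge heights
# reflect accumulated bitwise mutations along lineages.
tree = linkage(dists, method="average")
```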
Although the phylogeny of stint representatives includes many instances that do not constitute a strict lineage (i.e., each stint’s representative descending directly from the preceding stint’s representative), we did not observe evidence of long-term coexistence of clades over more than ten stints.

5.3.2 Qualitative Morphological Categorizations

We performed a qualitative survey of the evolved life histories along the evolutionary timeline by analyzing video recordings of monocultures of representative specimens from each stint. Table 5.1 summarizes the ten morphological categories we grouped specimens into. In brief, specimens from early stints largely grew as unicells or small multicellular groups (morphs a, b). Then, the specimen from stint 14 grew as larger, symmetrical groups (morph d). At stint 15, a distinct, asymmetrical horizontal-bar morphology evolved (morph e). At stint 45, a delayed secondary spurt of group growth in the vertical direction arose (morph g). This morphology was sampled frequently until stint 60, when morph e began to be sampled primarily again. However, morph g was observed as late as stint 90. Phylogenetic analysis (Figure 5.3) indicates that observations of morph e at stint 53 and onward are instances of secondary loss rather than retention of trait e by a separate lineage coexisting with the lineage expressing morph g. Three separate reversion events from morph g to morph e appear likely. Interestingly, morph g individuals at stints 89 and 90 appear to represent subsequent trait re-gain after reversion from morph g to morph e.

Table 5.1 provides more detailed descriptions of each qualitative morph category as well as video and a still image example of each. Supplementary Table D.1 provides morph categorization for each stint as well as links to view the stint’s specimen in a video or in-browser web simulation.

5.3.3 Fitness

Of the 100 competition assays performed, 57 indicated significant fitness gain, 23 were neutral, and 20 indicated significant fitness loss (shown in the upper right of Figure 5.4, at the intersection of the “Biotic Background, Without” column and “Assay Subject, Specimen” row). We were surprised by the frequency of deleterious outcomes, leading us to perform a second set of experiments to investigate whether these outcomes could be explained as sampling of “dud” representatives.

Morph a: Individual cells, no multicellular kin groups. Resource use is low; most cells simply hoard resource until their stockpile is beyond sufficient to reproduce. Only a handful of cells intermittently expend resource. Video: https://hopth.ru/21/b=prq49+s=16005+t=0+v=video+w=specimen

Morph b: Mostly individual cells, with some two-, three-, and four-cell groups evenly spread out. Resource usage occurs in short spurts in one or two adjacent cells. Video: https://hopth.ru/21/b=prq49+s=16005+t=1+v=video+w=specimen

Morph c: Large multicellular groups dominate, consisting of hundreds of cells. Group growth is unchecked and continues until cells’ resource stockpiles are entirely depleted by the excess group size penalty. Video: https://hopth.ru/21/b=prq49+s=16005+t=2+v=video+w=specimen

Morph d: Clear groups of 10 to 15 cells in size form. Cell proliferation appears somewhat more active at the periphery of groups compared to the interior.
Video: https://hopth.ru/21/b=prq49+s=16005+t=14+v=video+w=specimen

Morph e: Groups are visibly elongated along the horizontal axis. After initial development, some gradual, irregular growth occurs along the vertical axis. Video: https://hopth.ru/21/b=prq49+s=16005+t=15+v=video+w=specimen

Morph f: Groups are horizontally elongated similarly to morphology e, but have a greater consistent vertical thickness of three or four cells. Video: https://hopth.ru/21/b=prq49+s=16005+t=39+v=video+w=specimen

Morph g: Initial group growth is almost entirely horizontal, with groups usually taking up only one row of cells. However, after an apparent timing cue, groups perform a brief bout of aggressive vertical growth. Video: https://hopth.ru/21/b=prq49+s=16005+t=45+v=video+w=specimen

Morph h: Groups grow horizontally and then proliferate vertically on a timing cue, like morph g. However, after that timing cue, cell proliferation is incessant with almost no resource retention. Video: https://hopth.ru/21/b=prq49+s=16005+t=59+v=video+w=specimen

Morph i: Irregular groups of mostly fewer than ten cells. Incessant proliferation with almost no resource retention leads to rapid group turnover. Video: https://hopth.ru/21/b=prq49+s=16005+t=74+v=video+w=specimen

Morph j: Groups grow horizontally and then proliferate vertically on a timing cue, like morph g. However, several viable horizontal-bar offspring groups form before force-fragmentation. Video: https://hopth.ru/21/b=prq49+s=16005+t=100+v=video+w=specimen

Table 5.1: Qualitative morph phenotype categorizations. Color coding of morph IDs has no significance beyond guiding the eye in scatter plots where points are labeled by morph. Snapshots visualize the spatial layout of kin groups on the toroidal grid at a fixed point in time; each cell corresponds to a small square tile. Color hue denotes and black borders divide outermost kin groups, while color saturation denotes and white borders divide innermost kin groups.

Figure 5.4: Distributions of adaptation assay outcomes over all stints. For each adaptation assay, three outcomes were possible: significant fitness gain, significant fitness loss, or no significant fitness change (“neutral”). Significance cutoff p < 0.005 was used. A fitness loss (color coded red) corresponds to winning 2 or fewer competitions out of 20 against the preceding stint’s focal strain population. A fitness gain (color coded green) corresponds to winning 18 or more competitions out of 20. Neutral fitness outcomes are color coded yellow. Outcome counts are accumulated over experiments from stint 1 through stint 100. The upper row shows results for the sampled focal strain genome; the lower row shows results for the entire focal strain population. See Figure 5.5 for explanation of competition biotic backgrounds. See Figure 5.6 for joint distributions of fitness outcomes across biotic backgrounds.

In these competition assays, we competed the entire focal strain population against the focal strain population from the preceding stint.
However, we observed a similar result: 50 assays indicated significant fitness gain, 34 were neutral, and 16 indicated significant fitness loss (shown in the lower right of Figure 5.4, at the intersection of the “Biotic Background, Without” column and “Assay Subject, Population” row).

Next, we investigated whether the presence of the background strain as a “biotic background” influenced fitness. We repeated the two experiments described above (specimen and population competition assays), but inserted the background strain as half of the initial well-mixed population. In one assay setup, we used the background strain population from the current stint; we refer to this as “contemporary biotic background.” In another, which we call “prefatory biotic background,” we used the background strain population from the previous stint. We refer to the original competition assays absent the background strain as “without biotic background.” Figure 5.5 summarizes these competition assay designs.

After incorporating the background strain into our measure of fitness, we detected fewer whole-population deleterious outcomes: only six under contemporary biotic background conditions and only three under prefatory biotic background conditions (Figure 5.4).

Figure 5.5: Detail of adaptation assay design. Top panel shows progress of the original evolutionary experiment over one stint. A diversity maintenance procedure was used to ensure long-term coexistence of at least two strains over the course of the experiment by penalizing any strain that occupied more than half of thread-local population space. A “focal strain” was arbitrarily chosen for study; we refer to the other strain as the “background strain.” Adaptation assays in the lower panels measure fitness change over the course of that stint against the population from the preceding stint. The middle panel shows measurement of adaptation of the representative specimen sampled for analysis at each stint. The bottom panel shows measurement of adaptation of the entire focal strain population at each stint. Competitors were mixed in even proportion into the environment.
Bar heights represent the initial relative proportions of assay participants at the beginning of each competition. Adaptation was measured as change in population composition over a 10-minute competition window; we call this measurement of population composition change a “prevalence assay.” Competition experiments were performed absent the background strain, with the background strain population from the preceding stint, or with the background strain population from the current stint, shown separately in each panel.

Figure 5.6: Joint distribution of adaptation assay outcomes across biotic backgrounds. Subfigure (a) shows the joint distribution of adaptation assay outcomes on the representative specimen from the focal strain, with diversity maintenance during competition; Subfigure (b) shows the same with diversity maintenance disabled during competition; Subfigure (c) shows the joint distribution for the focal strain population, with diversity maintenance during competition. For each adaptation assay, three outcomes were possible: significant fitness gain, significant fitness loss, or no significant fitness change (“neutral”). Significance cutoff p < 0.005 was used. A fitness loss (color coded red) corresponds to winning 2 or fewer competitions out of 20 against the preceding stint’s focal strain population. A fitness gain (color coded green) corresponds to winning 18 or more competitions out of 20. Neutral fitness outcomes are color coded yellow. Outcome counts are accumulated over experiments from stint 1 through stint 100; counts in each subfigure therefore sum to 100.
Column position in the facet grid indicates outcome with contemporary biotic background, row position indicates outcome with prefatory biotic background, and bar color and x position indicate outcome without biotic background. See Figure 5.5 for explanation of competition biotic backgrounds. See Figure 5.7 for detail on the joint distribution of outcomes with and without diversity maintenance, which were mostly identical.

To determine whether the presence of the background strain caused the overall reduction in whole-population deleterious outcomes, we performed control competitions under biotic background conditions, but with the focal strain population substituted for the background strain population (Supplementary Figure D.18). Under these conditions, nine of the stints where whole-population deleterious outcomes had been detected came up neutral and one, surprisingly, tested significantly adaptive (Supplementary Figure D.17). Dose-dependent fitness effects and/or reduced experimental sensitivity of the biotic background assay appear to play at least a partial role in explaining the reduction in detected whole-population deleterious outcomes. However, 10 stints still tested significantly deleterious with the control focal strain biotic background in addition to without biotic background.

Four stints do provide strong, direct evidence of a selective effect by the background strain: four whole-population outcomes that were deleterious without biotic background were significantly advantageous in the presence of both the prefatory and contemporary background strain populations (Figure 5.6c). All four of these stints exhibited whole-population deleterious outcomes under the control focal strain biotic background, indicating that the observed fitness sign change was specifically due to the presence of the background strain (Supplementary Figure D.17). Additionally, we detected two outcomes that were deleterious without biotic background but significantly adaptive under the prefatory biotic background and neutral under the contemporary biotic background (Figure 5.6c). Control focal strain biotic background experiments again suggest that the background strain, specifically, is responsible for this effect (Supplementary Figure D.17).

We also found one whole-population outcome that was significantly advantageous without biotic background and in the presence of the prefatory background strain population but significantly deleterious in the presence of the contemporary background strain, possibly suggesting an “arms race”-like evolutionary innovation on the part of the background strain over that stint (Figure 5.6c).

Nonetheless, we still saw three whole-population outcomes that were significantly deleterious under all three conditions (Figure 5.6c). These outcomes were also deleterious under the control focal strain biotic background experiments (Supplementary Figure D.17). Muller’s ratchet (Andersson and Hughes, 1996) or maladaptation due to environmental change (Brady et al., 2019) may provide possible explanations, but a definitive answer will require further study.

We also performed fitness assays on individual sampled specimens with both biotic backgrounds. Out of 100 stints tested, we observed 20 significantly deleterious outcomes without biotic background, 23 under prefatory biotic background, and 12 under contemporary biotic background (Figure 5.4). Unlike the whole-population deleterious outcomes discussed above, some deleterious outcomes among sampled specimens are not surprising. Evolving populations naturally contain standing variation in fitness (Martin and Roques, 2016), so occasional sampling of less-fit individuals should be expected.
Reciprocally, we observed 57 significantly adaptive outcomes without biotic background, 44 with prefatory biotic background, and 48 with contemporary biotic background (Figure 5.4).

Figure 5.7: Joint distribution of competition experiments performed under biotic background conditions with diversity maintenance enabled and disabled. Subfigure (a) shows prefatory biotic background outcomes with and without diversity maintenance; Subfigure (b) shows contemporary biotic background outcomes with and without diversity maintenance. Color coding denotes outcome without diversity maintenance and x position denotes outcome with diversity maintenance. Note that both plots show distributions for adaptation assays on representative specimens; competition experiments without diversity maintenance were not performed for population-level adaptation. See Figure 5.5 for explanation of competition biotic backgrounds.

Greater sensitivity of the “without biotic background” adaptation assay could account for the counterintuitive detection of more adaptive outcomes under abiotic conditions (i.e., in the absence of the background strain).

As before with the population-level adaptation assays, we detected four specimen outcomes that were deleterious without biotic background but significantly advantageous under both tested background strain populations (Figure 5.6a). Additionally, and again as before, we detected two outcomes that were deleterious without biotic background but significantly adaptive under the prefatory biotic background and neutral under the contemporary biotic background (Figure 5.6a). Control focal strain biotic background experiments confirm that the background strain, specifically, is responsible for these effects (Supplementary Figure D.17).

We found no specimen outcomes that were advantageous under the prefatory biotic background but deleterious under the contemporary background. However, we found three stints with the opposite dynamic: specimen outcomes deleterious under prefatory biotic background but advantageous under contemporary biotic background (Figure 5.6a), further suggesting coincident, interacting evolutionary innovations along the focal and background strain lineages.

To better characterize the mechanism behind fitness effects caused by the background strain, we performed additional specimen adaptation assays under biotic background conditions with diversity maintenance disabled. This analysis allowed us to test whether action of the diversity maintenance mechanism, rather than direct interactions between the focal and background strains, caused the observed fitness effects. Figure 5.7 compares adaptation assay outcomes with and without diversity maintenance under both the prefatory and contemporary biotic background conditions. Outcomes were generally similar; we observed only one sign-change difference: one specimen outcome was beneficial under prefatory biotic background conditions without diversity maintenance but deleterious with diversity maintenance.
Further, as shown in Figure 5.6b, without diversity maintenance we still observed four outcomes that were advantageous only under biotic conditions and instead tested deleterious under abiotic conditions. So, biotic selective effects cannot be explained as an artifact of activation of the diversity maintenance scheme. (We also conducted specimen adaptation assays with diversity maintenance disabled under the control focal strain biotic background; in these experiments, we again found no evidence of impact from the diversity maintenance scheme on results; Supplementary Figure D.17.)

Significant increases in fitness occur throughout the evolutionary history of the case study, but not at every stint. Figure 5.8 summarizes the outcomes of all adaptation assays stint by stint across evolutionary history. Neutral outcomes appear to occur more frequently at later stints. This may be indicative of slower evolutionary innovation, but may also stem in part from simulation of fewer generations during evolutionary stints (Supplementary Figure D.9) and during competition experiments (Supplementary Figure D.12) due to slower execution of later genomes.

Figure 5.9 shows the magnitudes of calculated fitness differentials for all adaptation assays. Fitness differentials during the first 40 stints are generally of higher magnitude than later fitness differentials, although a strong fitness differential occurs at stint 93. Although the emergence of morphology d was associated with significant increases in fitness in some specimen assays and morphologies e and g were associated with significant increases in fitness across all specimen assays (Figure 5.8), the magnitude of these fitness differentials appears ordinary compared to fitness differentials at other stints (Figure 5.9). Supplementary Figure D.13 shows mean end-competition prevalence across assays, telling a similar story.

In addition to competition assays, we also measured the growth rate of specimen strains by tracking doubling time (in updates) when seeded into quarter-full toroidal grids (Figure 5.10). Morph b exhibited a fast growth rate early on that was never matched by later morphs. This measure appears to be a poor overall proxy for fitness, highlighting the importance of biotic aspects of the simulation environment, which are not present in the empty space the assayed cells double into.

5.3.4 Fitness Complexity

Figure 5.11 plots critical fitness complexity of specimens drawn from across the case study’s evolutionary history. Critical fitness complexity reaches more than 20 under morph b, jumps to more than 40 under morph d, and drops to slightly more than 30 for morph e. It peaks at 48 sites around stint 39, then levels out and decreases. This decrease may be due in part to declining sensitivity of competition experiments: slower simulation meant that fewer updates executed within the fixed-duration jobs (Supplementary Figure D.10).

Figure 5.8: Summary of adaptation assay outcomes for the sampled representative specimen (top) and population-level adaptation (bottom).
Each cell is color coded by outcome: significant fitness gain (p < 0.005), neutral, or significant fitness loss (p < 0.005). Color coding and parentheticals of stint labels correspond to qualitative morph codes described in Table 5.1. See Figure 5.5 for explanation of competition biotic backgrounds.

Figure 5.9: Median calculated fitness differential outcomes of competition experiments. Zero fitness differential corresponds to a neutral result, color mapped to white. Blue indicates positive fitness differential (fitness gain) compared to the previous stint and red indicates negative fitness differential (fitness loss). Color coding and parentheticals of stint labels correspond to qualitative morph codes described in Table 5.1. Note that color intensity is plotted on a symlog scale due to the distribution of fitness differentials over multiple orders of magnitude. Upper panels show results for the sampled focal strain genome; lower panels show results for the entire focal strain population. See Figure 5.5 for explanation of competition biotic backgrounds.
Figure 5.10: Growth rate estimated from doubling time experiments, measuring time for a monoculture to grow from 0.25 maximum population size to 0.5 maximum population size.

Figure 5.11: Critical fitness complexity: the number of single-site nopouts that significantly decrease fitness, adjusted for expected false positives. Color coding and letters correspond to qualitative morph codes described in Table 5.1. Dotted vertical line denotes emergence of morph e. Dashed vertical line denotes emergence of morph g.

Phylogenetic analysis (Figure 5.3) suggests independent origins of the critical fitness complexity in morph d and morph e: the morph d specimen from stint 14 is more closely related to the morph b specimen from stint 13 than to the morph e specimen from stint 15. Likewise, specimens of the lower-complexity morphs i and b that appear past stint 70 appear to have independent evolutionary origins.

5.3.5 Interface Complexity

Figure 5.12 summarizes cardinal interface complexity, as well as its constituent components, for specimens drawn from across the case study’s evolutionary history.

Notably, cardinal interface complexity more than doubles, from 6 interactions to 17 interactions, coincident with the emergence of morph e (Figure 5.12a). This is due to simultaneous increases in extrospective state sensing (2 to 9 states; Figure 5.12f), introspective state sensing (1 to 4 states; Figure 5.12e), and writable state usage (1 to 2 states; Figure 5.12b). The emergence of morph g coincided with an increase in writable state interface complexity from 1 to 3, as shown in Figure 5.12b. However, morph g was not associated with changes in other aspects of cardinal interface complexity. The greatest observed cardinal interface complexity was 22 interactions, at stints 54 and 67.

5.3.6 Genome Size

Figure 5.13 shows evolutionary trajectories of three genome size metrics in sampled focal strain specimens. Instruction count and module count increased from 100 and 5 to around 800 and 30, respectively, between stints 0 and 40. Within this period, at stint 24, instruction count jumped from around 600 to more than 800 and module count jumped by about 5. This was coincident with detection in our adaptation assays of population-level sign-change mediation of adaptation by the background strain (Figure 5.8). In sampled specimen fitness assays at stint 24, we detected significant increases in fitness in the presence of the background strain but no significant change in fitness in its absence.

Between stints 40 and 90, module count gradually increased to around 40 while instruction count remained stable. Then, at stint 93, instruction count jumped to around 1,500 and module count jumped to around 60. This was coincident with the strong fitness differentials observed at stint 93 (Figure 5.9).

To better understand the functional effects of changes in genome size, we additionally measured the number of instructions that affected agent phenotype, shown as “phenotype complexity” in Figure 5.13c. This measure can be considered akin to a count of “active” sites. Phenotype complexity varied greatly stint to stint. The median value increased from nearly 0 to around 200 sites between stints 0 and 40.
Figure 5.12: Interface complexity estimates. (a) Cardinal interface complexity: the total number of distinct fitness-contributing interactions between a virtual CPU controlling cell behavior and its surroundings (the sum of panels b through f). (b) Writable state interface complexity: the number of output states that contribute to fitness (see Supplementary Figure D.3 for detail). (c) Intermessage interface complexity: the number of distinct inter-cell messages that contribute to fitness. (d) Intramessage interface complexity: the number of distinct intra-cell messages that contribute to fitness. (e) Introspective interface complexity: the number of states of a CPU’s own cell viewed in ways that contribute to fitness (see Supplementary Figure D.2 for detail). (f) Extrospective interface complexity: the number of states viewed in neighboring cells that contribute to fitness (see Supplementary Figure D.1 for detail). Color coding and letters correspond to qualitative morph codes described in Table 5.1. Dotted vertical line denotes emergence of morph e. Dashed vertical line denotes emergence of morph g.

Figure 5.13: Genome size of sampled focal strain specimens: (a) instruction count, (b) module count, and (c) phenotype complexity. Instruction count is the total number of instructions present in the genome. Module count is the number of tagged linear GP modules available for activation by signals from the environment, from other agents, or from within an agent. Phenotype complexity is the number of genome sites that contribute to phenotype, measured as the number of sites remaining after phenotype-neutral nopout (Section 5.2.3); it gives a sense of the number of “active” instructions that influence agents’ behavior. Color coding and letters correspond to qualitative morph codes described in Table 5.1. Dotted vertical line denotes emergence of morph e. Dashed vertical line denotes emergence of morph g.

Between stints 40 and 90, we observed phenotype complexity values ranging from less than 100 to more than 500. Morph g specimens appear to show particularly great variance in phenotype complexity. The first observed morph g specimen, at stint 45, exhibited relatively low phenotype complexity of around 100 active sites.
The highest phenotype complexity values, of around 700, were measured from three specimens of morphs e and g in the last ten stints.

5.4 Discussion

Throughout the case study lineage, we describe an evolutionary sequence of ten qualitatively distinct multicellular morphologies (Table 5.1). The emergence of some, but not all, of these morphologies coincided with an increase in fitness compared to the preceding population. Outcomes from the first observed morphology c specimen are significantly deleterious in all contexts. Likewise, morphology f, while advantageous in the absence of the background strain, appeared neutral in its presence (Figure 5.8). However, the geneses of morphologies e and g are associated with significant fitness gain in all contexts (Figure 5.8). This latter set of novelties might be described as “innovations,” which Hochberg et al. define as qualitative novelty associated with an increase in fitness (Hochberg et al., 2017). Interestingly, the magnitudes of the fitness differentials associated with the emergence of morphologies e and g do not appear to fall outside the bounds of other stint-to-stint fitness differentials (Figure 5.9).

The relationship between innovation and complexity also appears loose. The emergence of morphology d was accompanied by a spike in critical fitness complexity (from 25 sites at stint 13 to 43 sites at stint 14). However, the emergence of morphology i coincided with a loss of critical fitness complexity (from more than 30 sites to fewer than 10 sites). The specimen of morph i at stint 77, which phylogenetic analysis suggests may have independent trait origin from the specimen at stint 75, exhibited significant fitness gain across all contexts despite this sharp loss of complexity. Phylogenetic analysis suggests that morphology e was not a direct descendant of morphology d; so, the emergence of morphology e appears to have coincided with a more modest increase in fitness complexity, from 25 sites to 31 sites. Similarly, the emergence of morphology g, with 42 critical sites at stint 45, coincided with a relatively modest increase from 39 critical sites at stint 44.

We also see evidence that increases in complexity do not imply qualitative novelty in morphology. In Figure 5.11, we can observe notable increases in critical fitness complexity that did not coincide with apparent morphological innovation. For example, fitness complexity jumped from 11 sites at stint 11 to 27 sites at stint 12 while morphology b was retained. In addition, a more gradual increase in fitness complexity was observed from 27 sites at stint 16 to 46 sites at stint 36, all under consistent morphology e.

Finally, we also observed disjointedness between alternate measures of functional complexity. Notably, critical fitness complexity increased by 18 sites with the emergence of morph d, but interface complexity increased only marginally. The subsequently observed morph e had nearly triple the interface complexity of morph d (17 interactions vs. 6 interactions) but 12 sites lower critical fitness complexity. In addition, the gradual increase in critical fitness complexity between stints 15 and 36 under morphology e is not accompanied by a clear change in interface complexity (Figures 5.12a and 5.11). These apparent inconsistencies between metrics for functional complexity evidence the multidimensionality of this idea and underscore well-known difficulties in attempts to describe and quantify it (Böttcher, 2018).
5.5 Conclusion

Complexity and novelty are not inevitable outcomes of Darwinian evolution (Stanley, 2017). Instead, how and why some lineages within some model systems evolve complexity and novelty merits explanation. Efforts to develop substrates and conditions sufficient to observe the evolution of complexity and novelty play a crucial role in validating the sufficiency of theory. Additionally, the subsequent availability of complexity and novelty potential within experimental substrates enables work to test and refine theory. The artificial life research community has a rich track record to these ends.

The case study reported here tracks a lineage over two phenotypic innovations and several-fold increases in complexity. DISHTINY relaxes common simulation constraints (Goldsby et al., 2012, 2014), enabling broad genetic determination of multicellular life history and allowing for unconstrained cellular interactions between multicellular bodies. As such, this case study opens new windows into the evolutionary origins of complexity and novelty, especially with respect to biotic interactions.

Our case study exhibits loose coupling between novelty, complexity, and adaptation. We observe instances where novelty coincides with adaptation and instances where it does not. We observe instances where increases in complexity coincide with adaptation and instances where decreases in complexity coincide with adaptation. We observe instances where innovation coincides with spikes in complexity and instances where it does not. We even observe contradiction between metrics that measure different aspects of functional complexity. For example, the specimen sampled at stint 15 had nearly triple the interface complexity of the specimen sampled at stint 14 but lower critical fitness complexity. Loose coupling between the conceptual threads of novelty, complexity, and adaptation in this case study highlights the importance of considering these factors independently when developing open-ended evolution theory: direct coupling among them cannot be assumed.

Our observation of significant selective effects by the background strain suggests it may serve a crucial role in understanding the focal strain. Future work should characterize trajectories of adaptation, novelty, and complexity in this background strain. Additionally, the success of the biotic background in fleshing out our adaptation assays suggests that complexity measures could be improved through similar incorporation of the biotic background. It would be particularly interesting to measure the contribution of the background strain to complexity as the difference between complexity statistics with and without the biotic background. To more systematically test the role of biotic selection in facilitating the evolution of complexity, future experiments might test for differences in the rate of high-complexity evolutionary outcomes between evolution experiments with and without long-term coexistence between lineages (i.e., diversity maintenance mechanism enabled versus disabled).

This case study highlights the potential usefulness of toolbox-based approaches to analyzing open-ended evolution systems, in which an array of analyses are performed to distinguish disparate dimensions of open-endedness (Dolson et al., 2019). Our findings emphasize, in particular, the critical role of biotic context in such analyses. In future work, we are interested in further extending this toolbox.
One priority will be estimating epistatic contributions to fitness without resorting to all-pairs knockouts or other even more extensive assays. Such methodology will be crucial for systems where fitness is implicit and expensive to measure.

Chapter 6

Conclusion

Portions of this chapter are adapted from Moreno and Ofria (2020).

The complexity, novelty, and diversity found in the natural world continually inspire scientific curiosity about their origins, just as the ingenuity of natural adaptations spurs engineers to try to replicate their designs. In this dissertation, I have pushed forward parallel and distributed high-performance computing techniques and leveraged them to perform digital evolution experiments studying complexity, novelty, diversity, and adaptation in evolved multicells.

This chapter details the contributions of this dissertation, describes avenues for future research, and then provides some closing reflections.

6.1 Contribution

Part I of this dissertation developed algorithm engineering for computational scale-up of digital evolution experiments. In addition to proposed algorithms and reported experimental results, each chapter’s accompanying open source software library will directly enable real-world applications within the broader community. Although the methods and software in this part are tailored to digital evolution, we anticipate potential for other applications within distributed computing.

Chapter 2 implemented and tested a best-effort communication framework (Conduit) on commercially available high-performance computing hardware. Conduit’s median performance on several quality-of-service metrics remains stable in scaling experiments up to 256 processes. In separate experiments, I demonstrated how the best-effort approach can provide better-quality solutions to a graph coloring problem within a fixed time limit. At 64 processes, best-effort communication yielded a 2× faster update rate on a compute-intensive problem and a 7× faster update rate on a communication-intensive problem.

Chapter 3 presented the “hereditary stratigraphy” algorithm for phylogenetic analyses in decentralized, best-effort artificial life experiments. This approach supports tunable trade-offs between inference precision and annotation memory footprint. We derive several alternate asymptotic trade-offs and report strategies to attain each. Simulated reconstructions of phylogenies taken from real experiments demonstrate end-to-end viability of the approach, with up to 85% of original phylogenetic information recovered under reconstruction from 64-bit annotations.

Part II of this dissertation introduced DISHTINY, a new framework for experiments evolving digital multicells. Application of engineering techniques from Part I to DISHTINY yields efficient scalability, with scale-up from one to 64 processes incurring only 8% performance degradation. Experiments in these chapters survey evolved multicellular life histories within the system, using case studies to characterize complexity, novelty, adaptation, morphology, and mechanisms.

Chapter 4 described four qualitative life histories that arose across 40 DISHTINY evolutionary replicates. Phenotypic traits characteristic of multicellularity corroborate the occurrence of fraternal transitions in individuality across replicates.
Part II of this dissertation introduced DISHTINY, a new framework for experiments evolving digital multicells. Application of engineering techniques from Part I to DISHTINY yields efficient scalability, with scale-up from one to 64 processes incurring only 8% performance degradation. Experiments in this part survey evolved multicellular life histories within the system, using case studies to characterize complexity, novelty, adaptation, morphology, and mechanisms.

Chapter 4 described four qualitative life histories that arose across 40 DISHTINY evolutionary replicates. Phenotypic traits characteristic of multicellularity corroborate the occurrence of fraternal transitions in individuality across replicates. Observed traits include reproductive division of labor, resource sharing within kin groups, resource investment in offspring groups, asymmetrical behaviors mediated by messaging, morphological patterning, and adaptive apoptosis. These findings validate simulation design, confirming sufficiency of agent implementation and selective pressures to produce diverse multicellular traits. This work also builds baseline intuition for DISHTINY life histories, providing a foundation for further work with the system.

Chapter 5 tracks the evolution of novelty, complexity, and adaptation along a case study lineage. Ten qualitatively distinct multicellular morphologies occurred along this lineage, several of which exhibited asymmetrical growth or distinct life stages. This chapter develops and applies a suite of adaptation and complexity measures. These include competition experiments under various background conditions, a doubling time assay, knockout competitions to count active genome sites, knockout competitions to count adaptive genome sites, and decontextualization competitions to count distinct adaptive environmental interactions. Measures of novelty, complexity, and adaptation trace loosely coupled, sometimes divergent, trajectories along the case study lineage. This result reinforces the paradigm shift away from reductively distilling these phenomena into common symptoms of an implicit, underlying evolutionary "progress." Additionally, adaptation assays indicate significant biotic selection effects, raising questions about the role of co-evolution in this strain's evolutionary history.

6.2 Future Work

Important work remains across the breadth of topics explored in this dissertation. This section briefly highlights several pertinent open questions and unsolved problems.

Decentralized methods for diversity maintenance remain an open question. Although diversity maintenance can readily be performed on a per-process basis using a finite resource model, how to generalize this approach to a distributed context remains unclear. (The current per-process approach may prove insufficient if smaller per-process cell counts or larger group sizes reduce the number of multicells occupying a single process too far.) Perhaps, in addition to enabling post-hoc analyses, hereditary stratigraph annotations could guide phylogeny-aware interventions to maintain diversity during simulation.

Sexual recombination plays a central role in natural history (Smith and Szathmary, 1997) and genetic programming (Poli et al., 2008). Incorporating sexual recombination into DISHTINY could enable digital evolution experiments probing the intersection between fraternal transitions in individuality, the evolution of sex, and the evolution of complexity. However, work has yet to be performed on sexual recombination with event-driven genetic programming encodings. It will be of particular interest to determine whether such encodings' distinct, tagged modules provide an effective basis for semantic crossover, as sketched below. Further, in contrast to many natural systems, genetic programming work overwhelmingly employs monoploid (rather than polyploid) genomic structure. This approach avoids the difficulty of integrating co-execution of two separate programs into a single phenotype profile. Tagged modules, however, could support co-expression among multiple alleles of the same gene, potentially enabling more effective recombination and more salient digital evolution models for research on the evolution of sex.
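As a sketch of what tag-mediated alignment might look like, the hedged example below represents each parent program as a list of (tag, body) modules, pairs modules across parents by best tag match, and swaps aligned module bodies rather than splicing at arbitrary positions. The representation and match threshold are hypothetical illustrations, not SignalGP's actual encoding.

```python
import random

def tag_similarity(tag_a, tag_b, width=16):
    """Fraction of matching bits between two width-bit integer tags."""
    mismatches = bin((tag_a ^ tag_b) & ((1 << width) - 1)).count("1")
    return 1.0 - mismatches / width

def tag_aligned_crossover(parent_a, parent_b, threshold=0.75):
    """Recombine module lists [(tag, body), ...] by best tag match."""
    child = []
    for tag_a, body_a in parent_a:
        # Find the homolog in the other parent with the most similar tag.
        best_tag, best_body = max(
            parent_b, key=lambda module: tag_similarity(tag_a, module[0])
        )
        if tag_similarity(tag_a, best_tag) >= threshold and random.random() < 0.5:
            # Adopt the homologous body; keep the child's own tag so that
            # existing tag-based references to this module stay intact.
            child.append((tag_a, best_body))
        else:
            child.append((tag_a, body_a))
    return child
```

Because swaps occur only between modules with closely matched tags, recombination tends to exchange functionally analogous code, which is the intuition behind expecting tagged modules to support semantic crossover.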
Sexual recombination also constitutes important unexplored territory for distributed phylogenetic inference on digital evolution agents. As presented in Chapter 3, hereditary stratigraphy assumes asexual lineages. One possible strategy for generalizing this methodology to sexual lineages would be applying annotations to individual genome sites to track independent gene trees. Another possibility would be to apply a gene drive mechanism to annotations so that a single consensus differentia coalesces within each stratum. This would distinguish genetically isolated subpopulations, providing a basis for species tree reconstruction.
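A minimal sketch of the gene drive idea follows, under the assumption that both parents share a deterministic retention policy and therefore identical sets of retained generations; this is a hypothetical mechanism for illustration, not an implemented hstrat feature. At each stratum the child inherits the smaller of its parents' differentiae, so the minimum value "drives" toward fixation within any interbreeding subpopulation.

```python
def gene_drive_recombine(parent_a, parent_b):
    """Recombine two annotations, given as lists of (generation, differentia).

    Taking the elementwise minimum differentia means the smallest value at
    each stratum sweeps through an interbreeding population, converging on a
    single consensus differentia per stratum within that population.
    """
    # A shared deterministic retention policy implies aligned generations.
    assert [gen for gen, _ in parent_a] == [gen for gen, _ in parent_b]
    return [
        (gen, min(diff_a, diff_b))
        for (gen, diff_a), (_, diff_b) in zip(parent_a, parent_b)
    ]
```

Once consensus differentiae have fixed, mismatches between subpopulations' consensus values mark genetic isolation, supplying the signal for the species tree reconstruction described above.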
Direct efforts to evolve emergent multicellular functionality should also be considered. Multicellular motility could be selected for by scaling resource collection rate with distance from the site where a group originated. More sophisticated inter- and intra-group interactions could be selected for by introducing discrete tokens with resource value that differs among cells or among groups (perhaps determined via a hash of cell or group ID and token ID). Such efforts could extend to selecting for multicells that solve simple pattern detection tasks. This goal would require careful consideration of how to "wire" input/output controls into multicell collectives and how to make problem instances available on demand to multicells in a distributed setting, but could have powerful applications. The ability to show multicells outperforming individual unicells, or even collections of unicells, on such problems would be an exciting result. Past work exploring the introduction of neuron-like cell-cell interconnects into the DISHTINY model could serve as a stepping stone toward these objectives (Moreno and Ofria, 2020).

6.3 Closing Remarks

Above all, this dissertation pursues larger-scale, more dynamic digital evolution models. This requires reconciliation of orthogonal, perhaps even somewhat conflicting, aims: engineering for efficient scalability and relaxation of programmed-in model constraints on multicells. However, the artificial life ethos of "life as it could be" furnishes a uniquely pliant testbed for approaches to distributed computing that radically depart from established practice (Forbes, 2000). Best-effort approaches explored first in this context could prove useful in broader realms of high-performance computing, particularly hard real-time and machine learning applications (Rhodes et al., 2019).

We are excited to see the dawning adoption of high-performance computing hardware advance the fecundity of open-ended artificial life models. In conjunction with progress in theory, paleontology, and laboratory-based experiments, such work will play an instrumental role in fleshing out our account of natural history. Indeed, many fundamental questions remain to be addressed, particularly notable among them the likely multifaceted and interconnected mechanisms shaping biological complexity. Artificial life systems, in particular, will be increasingly well-positioned to untangle the origins of biological complexity in relation to fitness, genetic drift over elapsed evolutionary time, mutational load, genetic recombination (sex and horizontal gene transfer), ecology, historical contingency, and key innovations. Such insight can make practical, real-world impact: understanding evolution helps us predict and influence it (e.g., managing natural ecosystems, mitigating antimicrobial resistance) as well as harness it for automated design through evolutionary algorithms.

That the small sampling of experiments reported here yielded a wide variety of evolved behaviors and individual life histories lends credence to the notion that natural history's breadth is not surprising so much as it is inevitable. Further research teasing apart the constructive potential inherent in major evolutionary transitions promises better capability to shape them and to produce computing systems that reflect the capability and robustness of natural organisms.

BIBLIOGRAPHY

Ackley, D. and Small, T. (2014). Indefinitely scalable computing = artificial life engineering. In ALIFE 14: The Fourteenth International Conference on the Synthesis and Simulation of Living Systems, pages 606–613. MIT Press.

Ackley, D. H. (2018). Digital protocells with dynamic size, position, and topology. In ALIFE 2018: The 2018 Conference on Artificial Life, pages 83–90. MIT Press.

Ackley, D. H. (2019). Building a survivable protocell for a corrosive digital environment. In ALIFE 2019: The 2019 Conference on Artificial Life, pages 111–118. MIT Press.

Ackley, D. H. and Cannon, D. C. (2011). Pursue robust indefinite scalability. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS'13, page 8, USA. USENIX Association.

Ackley, D. H. and Williams, L. R. (2011). Homeostatic architectures for robust spatial computing. In 2011 Fifth IEEE Conference on Self-Adaptive and Self-Organizing Systems Workshops, pages 91–96. IEEE.

Acun, B., Gupta, A., Jain, N., Langer, A., Menon, H., Mikida, E., Ni, X., Robson, M., Sun, Y., Totoni, E., et al. (2014). Parallel programming with migratable objects: Charm++ in practice. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 647–658. IEEE.

Adami, C., Ofria, C., and Collier, T. C. (2000). Evolution of biological complexity. Proceedings of the National Academy of Sciences, 97(9):4463–4468.

Aktaş, M. F. and Soljanin, E. (2019). Straggler mitigation at scale. IEEE/ACM Transactions on Networking, 27(6):2266–2279.

Andersson, D. I. and Hughes, D. (1996). Muller's ratchet decreases fitness of a dna-based microbe. Proceedings of the National Academy of Sciences, 93(2):906–907.

Arnellos, A. and Keijzer, F. (2019). Bodily complexity: Integrated multicellular organizations for contraction-based motility. Frontiers in Physiology, 10.

Baig, U. I., Bhadbhade, B. J., and Watve, M. G. (2014). Evolution of aging and death: what insights bacteria can provide. The Quarterly Review of Biology, 89(3):209–233.

Banzhaf, W., Baumgaertner, B., Beslon, G., Doursat, R., Foster, J. A., McMullin, B., De Melo, V. V., Miconi, T., Spector, L., Stepney, S., and White, R. (2016). Defining and simulating open-ended novelty: Requirements, guidelines, and challenges. Theory in Biosciences, 135(3):131–161.

Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. (2012). Legion: Expressing locality and independence with logical regions. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE.

Bedau, M. A., Snyder, E., and Packard, N. H. (1998). A classification of long-term evolutionary dynamics. In Artificial Life VI: Proceedings of the Sixth International Conference on Artificial Life, pages 228–237. MIT Press.
Benenson, Y. (2009). Biocomputers: from test tubes to live cells. Molecular BioSystems, 5(7):675–685.

Bennett III, F. H., Koza, J. R., Shipman, J., and Stiffelman, O. (1999). Building a parallel computer system for $18,000 that performs a half peta-flop per day. In Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2, pages 1484–1490.

Biswas, R., Bryson, D., Ofria, C., and Wagner, A. (2014). Causes vs benefits in the evolution of prey grouping. In ALIFE 14: The Fourteenth International Conference on the Synthesis and Simulation of Living Systems, pages 641–648. MIT Press.

Blondeau, A., Cheyer, A., Hodjat, B., and Harrigan, P. (2009). Distributed network for performing complex algorithms. US Patent App. 12/267,287.

Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. (1996). Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69.

Bocquet, M., Hirtzlin, T., Klein, J.-O., Nowak, E., Vianello, E., Portal, J.-M., and Querlioz, D. (2018). In-memory and error-immune differential rram implementation of binarized deep neural networks. In 2018 IEEE International Electron Devices Meeting (IEDM), pages 20–6. IEEE.

Bohm, C., G., N. C., and Hintze, A. (2017). MABE (Modular Agent Based Evolver): A framework for digital evolution research. In ECAL 2017, the Fourteenth European Conference on Artificial Life, pages 76–83.

Bonabeau, E. W. and Theraulaz, G. (1994). Why do we need artificial life? Artificial Life, 1(3):303–325.

Bonnet, J., Yin, P., Ortiz, M. E., Subsoontorn, P., and Endy, D. (2013). Amplifying genetic logic gates. Science, 340(6132):599–603.

Bostock, M., Ogievetsky, V., and Heer, J. (2011). D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309.

Böttcher, T. (2018). From molecules to life: quantifying the complexity of chemical and biological systems in the universe. Journal of Molecular Evolution, 86(1):1–10.

Brady, S. P., Bolnick, D. I., Angert, A. L., Gonzalez, A., Barrett, R. D., Crispo, E., Derry, A. M., Eckert, C. G., Fraser, D. J., Fussmann, G. F., et al. (2019). Causes of maladaptation. Evolutionary Applications, 12(7):1229–1242.

Bundy, J., Ofria, C., and Lenski, R. E. (2021). How the footprint of history shapes the evolution of digital organisms. bioRxiv.

Byna, S., Meng, J., Raghunathan, A., Chakradhar, S., and Cadambi, S. (2010). Best-effort semantic document search on gpus. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 86–93.

Cantú-Paz, E. (2001). Master-slave parallel genetic algorithms. In Efficient and Accurate Parallel Genetic Algorithms, pages 33–48. Springer.

Cardwell, D. and Song, F. (2019). An extended roofline model with communication-awareness for distributed-memory hpc systems. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pages 26–35.

Casci, T. (2008). Lining up is hard to do. Nature Reviews Genetics, 9(8):573–573.

Chakradhar, S. T. and Raghunathan, A. (2010). Best-effort computing: Re-thinking parallel software and hardware. In Design Automation Conference, pages 865–870. IEEE.

Chakrapani, L. N., Korkmaz, P., Akgul, B. E., and Palem, K. V. (2008). Probabilistic system-on-a-chip architectures. ACM Transactions on Design Automation of Electronic Systems (TODAES), 12(3):1–28.
Chakravorty, S. and Kale, L. V. (2004). A fault tolerant protocol for massively parallel systems. In 18th International Parallel and Distributed Processing Symposium, page 212. IEEE.

Chakravorty, S. and Kalé, L. V. (2007). A fault tolerance protocol with fast fault recovery. In 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1–10. IEEE.

Chamberlain, B. L., Callahan, D., and Zima, H. P. (2007). Parallel programmability and the chapel language. The International Journal of High Performance Computing Applications, 21(3):291–312.

Channon, A. (2019). Maximum individual complexity is indefinitely scalable in geb. Artificial Life, 25(2):134–144.

Che, S., Li, J., Sheaffer, J. W., Skadron, K., and Lach, J. (2008). Accelerating compute-intensive applications with gpus and fpgas. In 2008 Symposium on Application Specific Processors, pages 101–107. IEEE.

Cheney, N., MacCurdy, R., Clune, J., and Lipson, H. (2014). Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding. ACM SIGEVOlution, 7(1):11–23.

Chippa, V. K., Mohapatra, D., Roy, K., Chakradhar, S. T., and Raghunathan, A. (2014). Scalable effort hardware design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(9):2004–2016.

Cho, H., Leem, L., and Mitra, S. (2012). Ersa: Error resilient system architecture for probabilistic applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(4):546–558.

Clune, J., Ofria, C., and Pennock, R. T. (2007). Investigating the emergence of phenotypic plasticity in evolving digital organisms. In Costa, F. A. e., Rocha, L. M., Costa, E., Harvey, I., and Coutinho, A., editors, Proceedings of the 9th European Conference on Advances in Artificial Life, pages 74–83, Berlin, Heidelberg. Springer-Verlag.

Clune, J., Stanley, K. O., Pennock, R. T., and Ofria, C. (2011). On the performance of indirect encoding across the continuum of regularity. IEEE Transactions on Evolutionary Computation, 15(3):346–367.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423.

Covert, A. W., Lenski, R. E., Wilke, C. O., and Ofria, C. (2013). Experiments on the role of deleterious mutations as stepping stones in adaptive evolution. Proceedings of the National Academy of Sciences, 110(34):E3171–E3178.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q., and Ng, A. (2012). Large scale distributed deep networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.

Dolson, E., Banzhaf, W., and Ofria, C. (2018). Applying ecological principles to genetic programming. In Genetic Programming Theory and Practice XV, pages 73–88. Springer.

Dolson, E., Lalejini, A., Jorgensen, S., and Ofria, C. (2020). Interpreting the tape of life: Ancestry-based analyses provide insights and intuition about evolutionary dynamics. Artificial Life, 26(1):58–79.

Dolson, E. and Ofria, C. (2017). Spatial resource heterogeneity creates local hotspots of evolutionary potential. In ECAL 2017, the Fourteenth European Conference on Artificial Life, pages 122–129.
Dolson, E. and Ofria, C. (2018). Ecological theory provides insights about evolutionary computation. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 105–106.

Dolson, E. and Ofria, C. (2021). Digital evolution for ecology research: A review. Frontiers in Ecology and Evolution, 9.

Dolson, E. L. (2019). On the Constructive Power of Ecology in Open-Ended Evolving Systems. PhD thesis, Michigan State University.

Dolson, E. L., Vostinar, A. E., Wiser, M. J., and Ofria, C. (2019). The modes toolbox: Measurements of open-ended dynamics in evolving systems. Artificial Life, 25(1):50–73.

Dongarra, J., Hittinger, J., Bell, J., Chacon, L., Falgout, R., Heroux, M., Hovland, P., Ng, E., Webster, C., and Wild, S. (2014). Applied mathematics research for exascale computing. Technical report, Lawrence Livermore National Laboratory (LLNL).

Downing, K. L. (2015). Intelligence emerging: adaptivity and search in evolving neural systems. MIT Press.

Eiben, A. and Smith, J. (2015). Introduction to Evolutionary Computing. Springer, Berlin.

El-Ghazawi, T. and Smith, L. (2006). Upc: Unified parallel c. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, page 27–es, New York, NY, USA. Association for Computing Machinery.

Ellenbogen, J. C. and Love, J. C. (2000). Architectures for molecular electronic computers. i. logic structures and an adder designed from molecular electronic diodes. Proceedings of the IEEE, 88(3):386–426.

Forbes, N. (2000). Life as it could be: Alife attempts to simulate evolution. IEEE Intelligent Systems and their Applications, 15(6):2–7.

Fortuna, M. A., Barbour, M. A., Zaman, L., Hall, A. R., Buckling, A., and Bascompte, J. (2019). Coevolutionary dynamics shape the structure of bacteria-phage infection networks. Evolution, 73(5):1001–1011.

Foster, E. D. and Deardorff, A. (2017). Open science framework (osf). Journal of the Medical Library Association, 105(2):203.

Gagliardi, F., Moreto, M., Olivieri, M., and Valero, M. (2019). The international race towards exascale in europe. CCF Transactions on High Performance Computing, pages 1–11.

Gamell, M., Teranishi, K., Heroux, M. A., Mayo, J., Kolla, H., Chen, J., and Parashar, M. (2015). Local recovery and failure masking for stencil-based applications at extreme scales. In SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE.

Geladi, P. and Kowalski, B. R. (1986). Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185:1–17.

Gerhart, J. and Kirschner, M. (2007). The theory of facilitated variation. Proceedings of the National Academy of Sciences, 104(suppl 1):8582–8589.

Gilbert, D. (2015). Artificial intelligence is here to help you pick the right shoes. International Business Times.

Goings, S., Goldsby, H., Cheng, B. H., and Ofria, C. (2012). An ecology-based evolutionary algorithm to evolve solutions to complex problems. In ALIFE 2012: The Thirteenth International Conference on the Synthesis and Simulation of Living Systems, pages 171–177. MIT Press.

Goldberg, D. E., Richardson, J., et al. (1987). Genetic algorithms with sharing for multimodal function optimization. In Genetic Algorithms and their Applications: Proceedings of the Second International Conference on Genetic Algorithms, pages 41–49. Hillsdale, NJ: Lawrence Erlbaum.

Goldsby, H., Kerr, B., and Ofria, C. (2020). Major transitions in digital evolution. In Evolution in Action: Past, Present and Future, pages 333–347. Springer.
Goldsby, H. J., Dornhaus, A., Kerr, B., and Ofria, C. (2012). Task-switching costs promote the evolution of division of labor and shifts in individuality. Proceedings of the National Academy of Sciences, 109(34):13686–13691.

Goldsby, H. J., Knoester, D. B., and Ofria, C. (2010). Evolution of division of labor in genetically homogenous groups. In Pelikan, M. and Branke, J., editors, Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pages 135–142, New York, NY. ACM.

Goldsby, H. J., Knoester, D. B., Ofria, C., and Kerr, B. (2014). The evolutionary origin of somatic cells under the dirty work hypothesis. PLOS Biology, 12(5):1–11.

Goldsby, H. J., Young, R. L., Hofmann, H. A., and Hintze, A. (2017). Increasing the complexity of solutions produced by an evolutionary developmental system. In Pelikan, M. and Branke, J., editors, Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 57–58, New York, NY. ACM.

Goldsby, H. J., Young, R. L., Schossau, J., Hofmann, H. A., and Hintze, A. (2018). Serendipitous scaffolding to improve a genetic algorithm's speed and quality. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '18, pages 959–966, New York, NY, USA. Association for Computing Machinery.

Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E., and Desai, M. M. (2017). The dynamics of molecular evolution over 60,000 generations. Nature, 551(7678):45–50.

Grabowski, L. M., Bryson, D. M., Dyer, F. C., Ofria, C., and Pennock, R. T. (2010). Early evolution of memory usage in digital organisms. In Fellermann, H., Dörr, M., Hanczyc, M. M., Laursen, L. L., Maurer, S. E., Merkle, D., Monnard, P., Støy, K., and Rasmussen, S., editors, Artificial Life XII: Proceedings of the Twelfth International Conference on the Synthesis and Simulation of Living Systems, pages 224–231, Odense, Denmark. MIT Press.

Grabowski, L. M., Bryson, D. M., Dyer, F. C., Pennock, R. T., and Ofria, C. (2013). A case study of the de novo evolution of a complex odometric behavior in digital organisms. PLOS ONE, 8(4):1–10.

Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-performance, portable implementation of the mpi message passing interface standard. Parallel Computing, 22(6):789–828.

Gropp, W. and Snir, M. (2013). Programming for exascale computers. Computing in Science & Engineering, 15(6):27–35.

Grosberg, R. K. and Strathmann, R. R. (2007). The evolution of multicellularity: A minor major transition? Annual Review of Ecology, Evolution, and Systematics, 38(1):621–654.

Gu, R. and Becchi, M. (2019). A comparative study of parallel programming frameworks for distributed gpu applications. In Proceedings of the 16th ACM International Conference on Computing Frontiers, pages 268–273.

Gulli, J. G., Herron, M. D., and Ratcliff, W. C. (2019). Evolution of altruistic cooperation among nascent multicellular organisms. Evolution, 73(5):1012–1024.

Hagstrom, G. I., Hang, D. H., Ofria, C., and Torng, E. (2004). Using avida to test the effects of natural selection on phylogenetic reconstruction methods. Artificial Life, 10(2):157–166.

Hanschen, E. R., Shelton, D. E., and Michod, R. E. (2015). Evolutionary transitions in individuality and recent models of multicellularity. In Ruiz-Trillo, I. and Nedelcu, A. M., editors, Evolutionary Transitions to Multicellular Life, pages 165–188. Springer, Dordrecht, Netherlands.

Harding, S. and Banzhaf, W. (2007a). Fast genetic programming and artificial developmental systems on gpus. In 21st International Symposium on High Performance Computing Systems and Applications (HPCS'07), pages 1–7. IEEE.
Harding, S. and Banzhaf, W. (2007b). Fast genetic programming on gpus. In European Conference on Genetic Programming, pages 90–101. Springer.

Heinemann, C. (2008). Artificial life environment. Informatik-Spektrum, 31(1):55–61.

Helmuth, T., Spector, L., and Matheson, J. (2014). Solving uncompromising problems with lexicase selection. IEEE Transactions on Evolutionary Computation, 19(5):630–643.

Hennessy, J. L. and Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

Hernandez, J. G., Lalejini, A., and Dolson, E. (2022). What Can Phylogenetic Metrics Tell us About Useful Diversity in Evolutionary Algorithms? In Banzhaf, W., Trujillo, L., Winkler, S., and Worzel, B., editors, Genetic Programming Theory and Practice XVIII, pages 63–82. Springer Singapore, Singapore.

Heroux, M. A. (2014). Toward resilient algorithms and applications. arXiv preprint arXiv:1402.3809.

Hochberg, M. E., Marquet, P. A., Boyd, R., and Wagner, A. (2017). Innovation: an emerging focus from cells to societies. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1735):20160414.

Hodjat, B. and Shahrzad, H. (2013). Distributed evolutionary algorithm for asset management and trading. US Patent 8,527,433.

Hornby, G., Globus, A., Linden, D., and Lohn, J. (2006). Automated antenna design with evolutionary algorithms. In Space 2006, page 7242. American Institute of Aeronautics and Astronautics.

Hornby, G. S. (2005). Measuring, enabling and comparing modularity, regularity and hierarchy in evolutionary design. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pages 1729–1736.

Horwitz, R. and Webb, D. (2003). Cell migration. Current Biology, 13(19):R756–R759.

Huizinga, J., Stanley, K. O., and Clune, J. (2018). The emergence of canalization and evolvability in an open-ended, interactive evolutionary system. Artificial Life, 24(3):157–181.

Hunter, J. D. (2007). Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95.

Hursey, J., Squyres, J. M., Mattox, T. I., and Lumsdaine, A. (2007). The design and implementation of checkpoint/restart process fault tolerance for open mpi. In 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1–8. IEEE.

Izzo, D., Rucinski, M., and Ampatzis, C. (2009). Parallel global optimisation meta-heuristics using an asynchronous island-model. In 2009 IEEE Congress on Evolutionary Computation, pages 2301–2308. IEEE.

Jones, J. E., Le Sage, V., Padovani, G. H., Calderon, M., Wright, E. S., and Lakdawala, S. S. (2021). Parallel evolution between genomic segments of seasonal human influenza viruses reveals rna-rna relationships. eLife, 10:e66525.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. (2017). In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12.

Kajmakovic, A., Diwold, K., Kajtazovic, N., and Zupanc, R. (2020). Challenges in mitigating soft errors in safety-critical systems with cots microprocessors. In PESARO 2020, The Tenth International Conference on Performance, Safety and Robustness in Complex Systems and Applications, pages 13–18. IARIA.

Kale, L. V. and Krishnan, S. (1993). Charm++: a portable concurrent object oriented system based on c++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, pages 91–108.
Kapli, P., Yang, Z., and Telford, M. J. (2020). Phylogenetic tree building in the genomic age. Nature Reviews Genetics, 21(7):428–444.

Karakus, M. and Durresi, A. (2017). Quality of service (qos) in software defined networking (sdn): A survey. Journal of Network and Computer Applications, 80:200–218.

Karnik, T. and Hazucha, P. (2004). Characterization of soft errors caused by single event upsets in cmos processes. IEEE Transactions on Dependable and Secure Computing, 1(2):128–143.

Kashyap, V. (2006). Ip over infiniband (ipoib) architecture. The Internet Society, 22.

Kauffman, S. A. and Weinberger, E. D. (1989). The nk model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology, 141(2):211–245.

Kim, J.-S., Ha, S., and Jhon, C. S. (1998). Relaxed barrier synchronization for the bsp model of computation on message-passing architectures. Information Processing Letters, 66(5):247–253.

Kirschner, M. and Gerhart, J. (1998). Evolvability. Proceedings of the National Academy of Sciences, 95(15):8420–8427.

Knoll, A. H. (2011). The multiple origins of complex multicellularity. Annual Review of Earth and Planetary Sciences, 39(1):217–239.

Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143–156.

Konstantopoulos, S., Li, W., Miller, S., and van der Ploeg, A. (2019). Using quantile regression to estimate intervention effects beyond the mean. Educational and Psychological Measurement, 79(5):883–910.

Koop, M. J., Sur, S., Gao, Q., and Panda, D. K. (2007). High performance mpi design using unreliable datagram for ultra-scale infiniband clusters. In Proceedings of the 21st Annual International Conference on Supercomputing, pages 180–189.

Koschwanez, J. H., Foster, K. R., and Murray, A. W. (2013). Improved use of a public good selects for the evolution of undifferentiated multicellularity. eLife, 2:e00367.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.

LaBar, T. and Adami, C. (2017). Evolution of drift robustness in small populations. Nature Communications, 8(1):1–12.

Lack, J. B. and Van Den Bussche, R. A. (2010). Identifying the confounding factors in resolving phylogenetic relationships in vespertilionidae. Journal of Mammalogy, 91(6):1435–1448.

Lalejini, A., Dolson, E., Bohm, C., Ferguson, A. J., Parsons, D. P., Rainford, P. F., Richmond, P., and Ofria, C. (2019). Data standards for artificial life software. In ALIFE 2019: The 2019 Conference on Artificial Life, pages 507–514. MIT Press.

Lalejini, A., Moreno, M. A., and Ofria, C. (2020). Case study of adaptive gene regulation in dishtiny. Preprint via Open Science Framework at https://osf.io/kqvmn.

Lalejini, A., Moreno, M. A., and Ofria, C. (2021). Tag-based regulation of modules in genetic programming improves context-dependent problem solving. Genetic Programming and Evolvable Machines, 22(3):325–355.

Lalejini, A. and Ofria, C. (2016). The evolutionary origins of phenotypic plasticity. In Gershenson, C., Froese, T., Siqueiros, J. M., Aguilar, W., Izquierdo, E., and Sayama, H., editors, Proceedings of the Artificial Life Conference 2016, pages 372–379, Cambridge, MA. MIT Press.
Lalejini, A. and Ofria, C. (2018). Evolving event-driven programs with signalgp. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1135–1142.

Langdon, W. B. and Banzhaf, W. (2019). Continuous long-term evolution of genetic programming. In ALIFE 2019: The 2019 Conference on Artificial Life, pages 388–395. MIT Press.

Lehman, J. (2012). Evolution through the Search for Novelty. PhD thesis, University of Central Florida.

Lehman, J. and Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223.

Lehman, J. and Stanley, K. O. (2012). Beyond open-endedness: Quantifying impressiveness. In ALIFE 2012: The Thirteenth International Conference on the Synthesis and Simulation of Living Systems, pages 75–82. MIT Press.

Lehman, J. and Stanley, K. O. (2013). Evolvability is inevitable: Increasing evolvability without the pressure to adapt. PloS One, 8(4):e62186.

Leith, D. J., Clifford, P., Badarla, V., and Malone, D. (2012). Wlan channel selection without communication. Computer Networks, 56(4):1424–1441.

Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. (2003). The evolutionary origin of complex features. Nature, 423(6936):139–144.

Liard, V., Parsons, D., Rouzaud-Cornabas, J., and Beslon, G. (2018). The complexity ratchet: Stronger than selection, weaker than robustness. In ALIFE 2018: The 2018 Conference on Artificial Life, pages 250–257, Tokyo, Japan.

Libby, E. and Ratcliff, W. C. (2014). Ratcheting the evolution of multicellularity. Science, 346(6208):426–427.

Lipson, H. et al. (2007). Principles of modularity, regularity, and hierarchy for scalable systems. Journal of Biological Physics and Chemistry, 7(4):125–128.

Lynch, M. (2007). The frailty of adaptive hypotheses for the origins of organismal complexity. Proceedings of the National Academy of Sciences, 104(suppl 1):8597–8604.

Martin, G. and Roques, L. (2016). The nonstationary dynamics of fitness distributions: asexual model with epistasis and standing variation. Genetics, 204(4):1541–1558.

Meng, J., Chakradhar, S., and Raghunathan, A. (2009). Best-effort parallel execution framework for recognition and mining applications. In 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1–12. IEEE.

Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B., Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., et al. (2017). Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103.

Miaoulis, G. and Plemenos, D. (2008). Intelligent Scene Modelling Information Systems, volume 181. Springer.

Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., et al. (2019). Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier.

Mittal, S. (2016). A survey of techniques for approximate computing. ACM Computing Surveys (CSUR), 48(4):1–33.

Moreno, M. A. (2019). Evaluating function dispatch methods in signalgp. Preprint via Open Science Framework at https://osf.io/rmkcv.

Moreno, M. A. (2020). Profiling foundations for scalable digital evolution methods. Preprint via Open Science Framework at https://osf.io/tcjfy.

Moreno, M. A., Dolson, E., and Ofria, C. (2022a). Hereditary stratigraph concept supplement. Available at https://osf.io/4sm72.
Moreno, M. A., Dolson, E., and Ofria, C. (2022b). Hereditary Stratigraphy: Genome Annotations to Enable Phylogenetic Inference over Distributed Populations. In ALIFE 2022: The 2022 Conference on Artificial Life, page 64. MIT Press.

Moreno, M. A. and Ofria, C. (2019). Toward open-ended fraternal transitions in individuality. Artificial Life, 25(2):117–133.

Moreno, M. A. and Ofria, C. (2020). Practical steps toward indefinite scalability: In pursuit of robust computational substrates for open-ended evolution. Preprint via Open Science Framework at https://doi.org/10.17605/OSF.IO/53VGH.

Moreno, M. A. and Ofria, C. (2022). Exploring evolved multicellular life histories in a open-ended digital evolution system. Frontiers in Ecology and Evolution, 10.

Moreno, M. A., Papa, S. R., and Ofria, C. (2020). Conduit: A c++ library for best-effort high performance computing. In Proceedings of the 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems at the 2020 International Conference on High Performance Computing & Simulation, HPCS 2020.

Moreno, M. A., Papa, S. R., and Ofria, C. (2021a). Case study of novelty, complexity, and adaptation in a multicellular system. In OEE4: The Fourth Workshop on Open-Ended Evolution.

Moreno, M. A., Papa, S. R., and Ofria, C. (2021b). Conduit: A c++ library for best-effort high performance computing. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '21, pages 1795–1800, New York, NY, USA. Association for Computing Machinery.

Nguyen, A. M., Yosinski, J., and Clune, J. (2015). Innovation engines: Automated creativity and improved stochastic optimization via deep learning. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO '15, pages 959–966, New York, NY, USA. Association for Computing Machinery.

Ni, X. (2016). Mitigation of failures in high performance computing via runtime techniques. PhD thesis, University of Illinois.

Niu, F., Recht, B., Re, C., and Wright, S. J. (2011). Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 693–701.

Noel, C. and Osindero, S. (2014). Dogwild!: distributed hogwild for cpu & gpu. In NIPS Workshop on Distributed Machine Learning and Matrix Computations, pages 693–701.

OEIS (2021a). Sequence a056791. The on-line encyclopedia of integer sequences. Available at https://oeis.org/A056791.

OEIS (2021b). Sequence a063787. The on-line encyclopedia of integer sequences. Available at https://oeis.org/A063787.

Ofria, C., Adami, C., Collier, T. C., and Hsu, G. K. (1999). Evolution of differentiated expression patterns in digital organisms. In European Conference on Artificial Life, pages 129–138. Springer.

Ofria, C., Bryson, D. M., and Wilke, C. O. (2009). Avida, pages 3–35. Springer London, London.

Ofria, C., Dolson, E., Lalejini, A., Fenton, J., Moreno, M. A., Jorgensen, S., Miller, R., Stredwick, J., Zaman, L., Schossau, J., Gillespie, L., G, N. C., and Vostinar, A. (2019). Empirical c++ scientific software library for research, education, & public engagement.

Packard, N., Bedau, M. A., Channon, A., Ikegami, T., Rasmussen, S., Stanley, K. O., and Taylor, T. (2019). An overview of open-ended evolution: Editorial introduction to the open-ended evolution ii special issue. Artificial Life, 25(2):93–103.
Paradis, E., Claude, J., and Strimmer, K. (2004). Ape: analyses of phylogenetics and evolution in r language. Bioinformatics, 20(2):289–290.

Petscher, Y. and Logan, J. A. (2014). Quantile regression in the study of developmental sciences. Child Development, 85(3):861–881.

Poli, R., Langdon, W. B., and McPhee, N. F. (2008). A Field Guide to Genetic Programming. Lulu Enterprises, UK Ltd.

Pontes, A. C., Mobley, R. B., Ofria, C., Adami, C., and Dyer, F. C. (2020). The evolutionary origin of associative learning. The American Naturalist, 195(1):E1–E19.

Project Jupyter, Bussonnier, M., Forde, J., Freeman, J., Granger, B., Head, T., Holdgraf, C., Kelley, K., Nalvarte, G., Osheroff, A., Pacer, M., Panda, Y., Perez, F., Ragan-Kelley, B., and Willing, C. (2018). Binder 2.0 - Reproducible, interactive, sharable environments for science at scale. In Akici, F., Lippa, D., Niederhut, D., and Pacer, M., editors, Proceedings of the 17th Python in Science Conference, pages 113–120.

Queller, D. C. (1997). Cooperators since life began. The Quarterly Review of Biology, 72(2):184–188.

Ragan-Kelley, B. and Willing, C. (2018). Binder 2.0 - reproducible, interactive, sharable environments for science at scale. In Proceedings of the 17th Python in Science Conference (F. Akici, D. Lippa, D. Niederhut, and M. Pacer, eds.), pages 113–120.

Rahmati, D., Murali, S., Benini, L., Angiolini, F., De Micheli, G., and Sarbazi-Azad, H. (2011). Computing accurate performance bounds for best effort networks-on-chip. IEEE Transactions on Computers, 62(3):452–467.

Ratcliff, W. C., Denison, R. F., Borrello, M., and Travisano, M. (2012). Experimental evolution of multicellularity. Proceedings of the National Academy of Sciences, 109(5):1595–1600.

Ratcliff, W. C., Fankhauser, J. D., Rogers, D. W., Greig, D., and Travisano, M. (2015). Origins of multicellular evolvability in snowflake yeast. Nature Communications, 6:6102.

Ratcliff, W. C. and Travisano, M. (2014). Experimental evolution of multicellular complexity in saccharomyces cerevisiae. BioScience, 64(5):383–393.

Ray, T. (1995). A proposal to create a network-wide biodiversity reserve for digital organisms. Technical Report TR-H-133, ATR.

Ray, T. S. and Hart, J. F. (2000). Evolution of differentiation in multithreaded digital organisms. Artificial Life, 7:132–140.

Ray, T. S. and Thearling, K. (1996). Evolving parallel computation. Complex Systems, 10(3):229–237.

Reinders, J. (2007). Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc.

Rhodes, O., Peres, L., Rowley, A., Gait, A., Plana, L., Brenninkmeijer, C., and Furber, S. (2019). Real-time cortical simulation on neuromorphic hardware. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2164):1–21.

Sarkar, S., Majumder, T., Kalyanaraman, A., and Pande, P. P. (2010). Hardware accelerators for biocomputing: A survey. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 3789–3792. IEEE.

Scoles, S. (2018). Cosmic ray showers crash supercomputers. Here's what to do about it. Wired.

Smith, J. M. and Szathmary, E. (1997). The major transitions in evolution. Oxford University Press.

Smith, M. R. (2020a). Information theoretic generalized robinson-foulds metrics for comparing phylogenetic trees. Bioinformatics, 36(20):5007–5013.

Smith, M. R. (2020b). ms609/treedistdata: v1.0.0.

Smith, M. R. (2020c). TreeDist: Distances between Phylogenetic Trees. R package version 2.5.0.
Smith, M. R. (2022). Robust analysis of phylogenetic tree space. Systematic Biology.

Sokal, R. R. (1958). A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull., 38:1409–1438.

Soros, L. and Stanley, K. (2014). Identifying necessary conditions for open-ended evolution through the artificial life world of chromaria. In ALIFE 14: The Fourteenth International Conference on the Synthesis and Simulation of Living Systems, pages 793–800. MIT Press.

Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., and Gurumurthi, S. (2015). Memory errors in modern systems: The good, the bad, and the ugly. ACM SIGARCH Computer Architecture News, 43(1):297–310.

Stanley, K. O., Lehman, J., and Soros, L. (2017). Open-endedness: The last grand challenge you've never heard of. O'Reilly Radar / AI & ML.

Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127.

Stanley, K. O. and Miikkulainen, R. (2003). A taxonomy for artificial embryogeny. Artificial Life, 9(2):93–130.

Staps, M., van Gestel, J., and Tarnita, C. E. (2019). Emergence of diverse life cycles and life histories at the origin of multicellularity. Nature Ecology & Evolution, 3(8):1197–1205.

Steno, N. (1916). The prodromus of Nicolaus Steno's dissertation concerning a solid body enclosed by process of nature within a solid, volume 11. University of Michigan Press.

Sukumaran, J. and Holder, M. T. (2010). Dendropy: a python library for phylogenetic computing. Bioinformatics, 26(12):1569–1571.

Sutter, H. et al. (2005). The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202–210.

Tang, C., Bouteiller, A., Herault, T., Venkata, M. G., and Bosilca, G. (2014). From mpi to openshmem: Porting lammps. In Workshop on OpenSHMEM and Related Technologies, pages 121–137. Springer.

Taylor, T., Bedau, M., Channon, A., Ackley, D., Banzhaf, W., Beslon, G., Dolson, E., Froese, T., Hickinbotham, S., Ikegami, T., et al. (2016). Open-ended evolution: Perspectives from the oee workshop in york. Artificial Life, 22(3):408–423.

Teranishi, K. and Heroux, M. A. (2014). Toward local failure local recovery resilience model using mpi-ulfm. In Proceedings of the 21st European MPI Users' Group Meeting, pages 51–56.

Ushey, K., Allaire, J., and Tang, Y. (2022). reticulate: Interface to 'Python'. Available at https://rstudio.github.io/reticulate/.

Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8):103–111.

Vankeirsbilck, J., Hallez, H., and Boydens, J. (2015). Soft error protection in safety critical embedded applications: An overview. In 2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), pages 605–610. IEEE.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.

Wang, R., Clune, J., and Stanley, K. O. (2018). Vine: an open source interactive data visualization tool for neuroevolution. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1562–1564.
Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

West, S. A., Fisher, R. M., Gardner, A., and Kiers, E. T. (2015). Major evolutionary transitions in individuality. Proceedings of the National Academy of Sciences, 112(33):10112–10119.

Wickham, H., François, R., Henry, L., and Müller, K. (2022). dplyr: A Grammar of Data Manipulation. Available at https://dplyr.tidyverse.org.

Wilke, C. O. and Adami, C. (2002). The biology of digital organisms. Trends in Ecology & Evolution, 17(11):528–532.

Wilson, E. O. (1984). The relation between caste ratios and division of labor in the ant genus pheidole (hymenoptera: Formicidae). Behavioral Ecology and Sociobiology, 16(1):89–98.

Xiang, D., Wang, X., Jia, C., Lee, T., and Guo, X. (2016). Molecular-scale electronics: from concept to function. Chemical Reviews, 116(7):4318–4440.

Zaman, L., Devangam, S., and Ofria, C. (2011). Rapid host-parasite coevolution drives the production and maintenance of diversity in digital organisms. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 219–226.

Zhao, X., Papagelis, M., An, A., Chen, B. X., Liu, J., and Hu, Y. (2019). Elastic bulk synchronous parallel model for distributed deep learning. In 2019 IEEE International Conference on Data Mining (ICDM), pages 1504–1509. IEEE.

Zhaxybayeva, O. and Gogarten, J. P. (2004). Cladogenesis, coalescence and the evolution of the three domains of life. TRENDS in Genetics, 20(4):182–187.

Appendix A

Design and Scalability Analysis of Conduit: a Best-effort Communication Software Framework

A.1 Weak Scaling

This section provides full results from the weak scaling experiments discussed in Section 2.3.6.

Figure A.1: Distribution of Latency Walltime Inlet (ns) for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. Panel (a) shows the distribution for each snapshot without outliers; panel (b) shows it with outliers.

Figure A.2: Distribution of Latency Simsteps Outlet for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. Panel (a) shows the distribution for each snapshot without outliers; panel (b) shows it with outliers.
1e9 Cpus Per Node = 1 Cpus Per Node = 4 1e6 Cpus Per Node = 1 Cpus Per Node = 4 1.2 2.0 Latency Walltime Outlet (ns) Latency Walltime Outlet (ns) Num Simels Per Cpu = 1 Num Simels Per Cpu = 1 1.0 1.5 0.8 1.0 0.6 0.4 0.5 0.2 0.0 0.0 1e6 1e10 1.4 Latency Walltime Outlet (ns) Num Simels Per Cpu = 2048 Latency Walltime Outlet (ns) Num Simels Per Cpu = 2048 1.2 4 1.0 3 0.8 0.6 2 0.4 0.2 1 0.0 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Latency Walltime Outlet (ns) for (b) Distribution of Latency Walltime Outlet (ns) for each snapshot, without outliers. each snapshot, with outliers. Figure A.3: Distribution of Latency Walltime Outlet (ns) for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. 131 Cpus Per Node = 1 Cpus Per Node = 4 Cpus Per Node = 1 Cpus Per Node = 4 1.0 0.9 Num Simels Per Cpu = 1 Num Simels Per Cpu = 1 0.8 0.8 Delivery Clumpiness Delivery Clumpiness 0.7 0.6 0.6 0.4 0.5 0.4 0.2 0.3 0.0 0.8 0.7 Num Simels Per Cpu = 2048 Num Simels Per Cpu = 2048 0.7 0.6 Delivery Clumpiness Delivery Clumpiness 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Delivery Clumpiness for each (b) Distribution of Delivery Clumpiness for each snapshot, without outliers. snapshot, with outliers. Figure A.4: Distribution of Delivery Clumpiness for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. Cpus Per Node = 1 Cpus Per Node = 4 1e7 Cpus Per Node = 1 Cpus Per Node = 4 200000 Simstep Period Inlet (ns) Simstep Period Inlet (ns) 8 Num Simels Per Cpu = 1 Num Simels Per Cpu = 1 150000 6 4 100000 2 50000 0 1e6 1e7 1.0 Num Simels Per Cpu = 2048 Num Simels Per Cpu = 2048 2.5 Simstep Period Inlet (ns) Simstep Period Inlet (ns) 0.8 2.0 0.6 1.5 0.4 1.0 0.2 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Simstep Period Inlet (ns) for each (b) Distribution of Simstep Period Inlet (ns) for each snapshot, without outliers. snapshot, with outliers. Figure A.5: Distribution of Simstep Period Inlet (ns) for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. 132 Cpus Per Node = 1 Cpus Per Node = 4 Cpus Per Node = 1 Cpus Per Node = 4 17.5 15.0 80 Latency Simsteps Inlet Num Simels Per Cpu = 1 Latency Simsteps Inlet Num Simels Per Cpu = 1 12.5 60 10.0 7.5 40 5.0 20 2.5 0.0 0 8000 2.0 7000 Num Simels Per Cpu = 2048 Num Simels Per Cpu = 2048 1.8 Latency Simsteps Inlet Latency Simsteps Inlet 6000 1.6 5000 1.4 4000 1.2 3000 1.0 2000 0.8 1000 0.6 0 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Latency Simsteps Inlet for each (b) Distribution of Latency Simsteps Inlet for each snapshot, without outliers. snapshot, with outliers. Figure A.6: Distribution of Latency Simsteps Inlet for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. 
Cpus Per Node = 1 Cpus Per Node = 4 1e9 Cpus Per Node = 1 Cpus Per Node = 4 4 200000 Simstep Period Outlet (ns) Num Simels Per Cpu = 1 Simstep Period Outlet (ns) Num Simels Per Cpu = 1 3 150000 2 100000 1 50000 0 1e6 1e7 1.0 Num Simels Per Cpu = 2048 Num Simels Per Cpu = 2048 2.5 Simstep Period Outlet (ns) Simstep Period Outlet (ns) 0.8 2.0 0.6 1.5 0.4 1.0 0.2 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Simstep Period Outlet (ns) for (b) Distribution of Simstep Period Outlet (ns) for each snapshot, without outliers. each snapshot, with outliers. Figure A.7: Distribution of Simstep Period Outlet (ns) for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. 133 Cpus Per Node = 1 Cpus Per Node = 4 Cpus Per Node = 1 Cpus Per Node = 4 0.5 0.5 Num Simels Per Cpu = 1 Num Simels Per Cpu = 1 0.4 0.4 Delivery Failure Rate Delivery Failure Rate 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.7 Num Simels Per Cpu = 2048 Num Simels Per Cpu = 2048 0.0002 0.6 Delivery Failure Rate Delivery Failure Rate 0.5 0.0001 0.4 0.0000 0.3 0.2 −0.0001 0.1 0.0 −0.0002 16 64 256 16 64 256 16 64 256 16 64 256 Num Processes Num Processes Num Processes Num Processes (a) Distribution of Delivery Failure Rate for each (b) Distribution of Delivery Failure Rate for each snapshot, without outliers. snapshot, with outliers. Figure A.8: Distribution of Delivery Failure Rate for individual snapshot measurements for weak scaling experiment (Section 2.3.6). Lower is better. 134 Ordinary Least Squares Regression Cpus Per Node = 1 1e8 Cpus Per Node = 4 625000 1.50 Latency Walltime Inlet (ns) 600000 Num Simels Per Cpu = 1 1.25 575000 1.00 Estimated Statistic = Latency Walltime Inlet (ns) Mean | Num Processes = 16, 64, 256 Cpus Per Node = 1 Cpus Per Node = 4 550000 0.75 0 100000 525000 0.50 −5000 Num Simels Per Cpu = 1 Absolute Effect Size −10000 0 500000 0.25 −15000 0.00 −100000 475000 −20000 −25000 −200000 1e7 1e6 −30000 −300000 Num Simels Per Cpu = 2048 2.8 Latency Walltime Inlet (ns) 2.0 −35000 2.6 1e6 350000 1.5 7 Num Simels Per Cpu = 2048 2.4 300000 6 Absolute Effect Size 1.0 250000 2.2 5 200000 0.5 2.0 4 150000 3 0.0 1.8 100000 2 1 50000 2 3 4 2 3 4 Log Num Processes Log Num Processes 0 0 (a) Complete ordinary least squares regression plot. (b) Estimated regression coefficient for complete re- Observations are means per replicate. gression. Zero corresponds to no effect. Ordinary Least Squares Regression Cpus Per Node = 1 1e8 Cpus Per Node = 4 625000 1.50 Latency Walltime Inlet (ns) 600000 Num Simels Per Cpu = 1 1.25 Estimated Statistic = Latency Walltime Inlet (ns) Mean | Num Processes = 64, 256 575000 1.00 Cpus Per Node = 1 Cpus Per Node = 4 550000 50000 0.75 200000 Num Simels Per Cpu = 1 525000 Absolute Effect Size 40000 0.50 150000 500000 30000 0.25 100000 475000 20000 0.00 50000 1e7 1e6 10000 0 0 Num Simels Per Cpu = 2048 2.8 Latency Walltime Inlet (ns) −50000 2.0 1e7 2.6 0 1.4 Num Simels Per Cpu = 2048 1.5 −50000 2.4 1.2 Absolute Effect Size −100000 2.2 1.0 1.0 −150000 0.8 2.0 0.5 0.6 −200000 1.8 0.4 −250000 0.2 2 3 4 2 3 4 −300000 Log Num Processes Log Num Processes 0.0 (c) Piecewise ordinary least squares regression plot. (d) Estimated regression coefficient for rightmost par- Observations are means per replicate. tial regression. Zero corresponds to no effect. Figure A.9: Ordinary least squares regressions of Latency Walltime Inlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). 
Lower is better. Top row shows complete regression and bottom row shows piecewise regression. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256. 135 Ordinary Least Squares Regression Cpus Per Node = 1 Cpus Per Node = 4 9.5 9.0 Latency Simsteps Outlet Num Simels Per Cpu = 1 9 8.5 Estimated Statistic = Latency Simsteps Outlet Mean | Num Processes = 16, 64, 256 Cpus Per Node = 1 Cpus Per Node = 4 8 8.0 0.0 7.5 −0.2 0.04 Num Simels Per Cpu = 1 7 Absolute Effect Size 7.0 −0.4 0.02 6.5 −0.6 6 0.00 6.0 −0.8 −1.0 −0.02 12 −1.2 1.50 −0.04 Num Simels Per Cpu = 2048 Latency Simsteps Outlet −1.4 10 1.45 4.0 0.05 8 1.40 Num Simels Per Cpu = 2048 3.5 0.04 6 1.35 Absolute Effect Size 3.0 0.03 4 1.30 2.5 0.02 2.0 2 1.25 0.01 1.5 0 1.20 0.00 1.0 0.5 −0.01 2 3 4 2 3 4 Log Num Processes Log Num Processes 0.0 −0.02 (a) Complete ordinary least squares regression plot. (b) Estimated regression coefficient for complete re- Observations are means per replicate. gression. Zero corresponds to no effect. Ordinary Least Squares Regression Cpus Per Node = 1 Cpus Per Node = 4 9.5 9.0 Latency Simsteps Outlet Num Simels Per Cpu = 1 9 Estimated Statistic = Latency Simsteps Outlet Mean | Num Processes = 64, 256 8.5 Cpus Per Node = 1 Cpus Per Node = 4 0.4 8 8.0 0.04 Num Simels Per Cpu = 1 7.5 0.2 Absolute Effect Size 0.02 7 7.0 0.0 6.5 0.00 6 6.0 −0.2 −0.02 −0.4 −0.04 12 1.50 Num Simels Per Cpu = 2048 −0.6 Latency Simsteps Outlet 10 1.45 8 0.04 1.40 Num Simels Per Cpu = 2048 8 0.02 Absolute Effect Size 1.35 6 6 0.00 1.30 4 4 1.25 −0.02 2 1.20 −0.04 2 2 3 4 2 3 4 −0.06 Log Num Processes Log Num Processes 0 (c) Piecewise ordinary least squares regression plot. (d) Estimated regression coefficient for rightmost par- Observations are means per replicate. tial regression. Zero corresponds to no effect. Figure A.10: Ordinary least squares regressions of Latency Simsteps Outlet against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256. 136 Ordinary Least Squares Regression Cpus Per Node = 1 1e8 Cpus Per Node = 4 625000 1.50 Latency Walltime Outlet (ns) Num Simels Per Cpu = 1 600000 1.25 575000 1.00 Estimated Statistic = Latency Walltime Outlet (ns) Mean | Num Processes = 16, 64, 256 550000 0.75 Cpus Per Node = 1 Cpus Per Node = 4 0 525000 0.50 −5000 0.04 Num Simels Per Cpu = 1 Absolute Effect Size −10000 500000 0.25 0.02 −15000 475000 0.00 −20000 0.00 −25000 1e7 1e6 −0.02 −30000 Latency Walltime Outlet (ns) −0.04 Num Simels Per Cpu = 2048 2.0 2.8 −35000 2.6 1e6 1.5 350000 7 Num Simels Per Cpu = 2048 2.4 300000 1.0 6 Absolute Effect Size 250000 2.2 5 200000 0.5 4 2.0 150000 3 0.0 1.8 100000 2 1 50000 2 3 4 2 3 4 Log Num Processes Log Num Processes 0 0 (a) Complete ordinary least squares regression plot. (b) Estimated regression coefficient for complete re- Observations are means per replicate. gression. Zero corresponds to no effect. 
Figure A.11: Ordinary least squares regressions of Latency Walltime Outlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are means per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.12: Ordinary least squares regressions of Latency Simsteps Inlet against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are means per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.13: Ordinary least squares regressions of Simstep Period Outlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are means per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.14: Ordinary least squares regressions of Delivery Failure Rate against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are means per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Ordinary least squares regression estimates relationship between independent variable and mean of response variable. Error bands and bars are 95% confidence intervals. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.15: Quantile Regressions of Latency Simsteps Outlet against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are medians per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Quantile regression estimates relationship between independent variable and median of response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.
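The quantile regressions of Figures A.15 through A.19 target the conditional median rather than the conditional mean, which makes them less sensitive to the heavy-tailed outliers visible in the distribution figures above. A minimal sketch of this variant, again assuming statsmodels and an illustrative DataFrame layout with synthetic stand-in data:

```python
# Minimal sketch of a median (quantile) regression against log4 process
# count; the DataFrame `df` and its column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# stand-in data: 3 processor counts x 10 replicates x 4 snapshots
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "num_processes": np.repeat([16, 64, 256], 40),
    "replicate": np.tile(np.repeat(np.arange(10), 4), 3),
    "latency_simsteps_outlet": rng.lognormal(1.0, 0.5, 120),
})

# one observation per replicate: the median over that replicate's snapshots
medians = (
    df.groupby(["num_processes", "replicate"])["latency_simsteps_outlet"]
    .median()
    .reset_index()
)
medians["log_num_processes"] = np.log(medians["num_processes"]) / np.log(4)

# q=0.5 fits the conditional median rather than the conditional mean
fit = smf.quantreg(
    "latency_simsteps_outlet ~ log_num_processes", medians
).fit(q=0.5)
print(fit.params["log_num_processes"])          # estimated slope
print(fit.conf_int().loc["log_num_processes"])  # 95% confidence interval
```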
Figure A.16: Quantile Regressions of Latency Walltime Outlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are medians per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Quantile regression estimates relationship between independent variable and median of response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.17: Quantile Regressions of Delivery Clumpiness against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are medians per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Quantile regression estimates relationship between independent variable and median of response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.18: Quantile Regressions of Simstep Period Inlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are medians per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Quantile regression estimates relationship between independent variable and median of response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Figure A.19: Quantile Regressions of Simstep Period Outlet (ns) against log processor count for weak scaling experiment (Section 2.3.6). Lower is better. Top row shows complete regression and bottom row shows piecewise regression; observations are medians per replicate, and companion panels show the estimated regression coefficient for the complete and rightmost partial regressions, where zero corresponds to no effect. Quantile regression estimates relationship between independent variable and median of response variable. Note that log is base 4, so processor counts correspond to 16, 64, and 256.

Table A.1: Full Ordinary Least Squares Regression results of Latency Walltime Inlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | - | 1 | 1 | 16/64/256 | -19 000 [-35 000, -2 400] | -0.033 [-0.062, -0.0043] | 30 | 0.026
mean | + | 1 | 2048 | 16/64/256 | 5.5e+06 [3.5e+06, 7.5e+06] | 2.5 [1.6, 3.4] | 30 | 4.8e-06
mean | 0 | 4 | 1 | 16/64/256 | -110 000 [-350 000, 120 000] | -0.13 [-0.4, 0.14] | 30 | 0.33
mean | + | 4 | 2048 | 16/64/256 | 230 000 [110 000, 340 000] | 0.11 [0.057, 0.17] | 30 | 0.0003
mean | - | 1 | 1 | 16/64 | -63 000 [-92 000, -34 000] | -0.11 [-0.16, -0.061] | 20 | 0.00024
mean | + | 1 | 2048 | 16/64 | 300 000 [170 000, 430 000] | 0.14 [0.077, 0.2] | 20 | 0.00014
mean | 0 | 4 | 1 | 16/64 | -320 000 [-900 000, 260 000] | -0.37 [-1, 0.3] | 20 | 0.26
mean | + | 4 | 2048 | 16/64 | 660 000 [530 000, 800 000] | 0.33 [0.27, 0.4] | 20 | 5.5e-09
mean | 0 | 1 | 1 | 64/256 | 26 000 [-990, 52 000] | 0.046 [-0.0018, 0.093] | 20 | 0.058
mean | + | 1 | 2048 | 64/256 | 1.1e+07 [6.5e+06, 1.5e+07] | 4.9 [3, 6.8] | 20 | 3.9e-05
mean | 0 | 4 | 1 | 64/256 | 93 000 [-44 000, 230 000] | 0.11 [-0.051, 0.27] | 20 | 0.17
mean | - | 4 | 2048 | 64/256 | -210 000 [-310 000, -110 000] | -0.11 [-0.16, -0.055] | 20 | 0.00039

Table A.2: Full Ordinary Least Squares Regression results of Latency Simsteps Outlet against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | - | 1 | 1 | 16/64/256 | -1.1 [-1.4, -0.73] | -0.13 [-0.17, -0.087] | 30 | 5.5e-07
mean | + | 1 | 2048 | 16/64/256 | 2.9 [1.8, 4] | 2.2 [1.3, 3] | 30 | 1.1e-05
mean | NaN | 4 | 1 | 16/64/256 | inf [nan, nan] | inf [nan, nan] | 30 | nan
mean | 0 | 4 | 2048 | 16/64/256 | 0.017 [-0.017, 0.052] | 0.013 [-0.012, 0.038] | 30 | 0.3
mean | - | 1 | 1 | 16/64 | -2 [-2.7, -1.4] | -0.24 [-0.32, -0.17] | 20 | 1.9e-06
mean | - | 1 | 2048 | 16/64 | -0.092 [-0.17, -0.013] | -0.069 [-0.13, -0.0097] | 20 | 0.025
mean | - | 4 | 1 | 16/64 | -1.4 [-2.1, -0.71] | -0.17 [-0.26, -0.087] | 20 | 0.00053
mean | 0 | 4 | 2048 | 16/64 | 0.048 [-0.035, 0.13] | 0.036 [-0.026, 0.097] | 20 | 0.24
mean | 0 | 1 | 1 | 64/256 | -0.099 [-0.56, 0.36] | -0.012 [-0.067, 0.043] | 20 | 0.65
mean | + | 1 | 2048 | 64/256 | 5.8 [3.6, 8] | 4.4 [2.7, 6.1] | 20 | 3.8e-05
mean | NaN | 4 | 1 | 64/256 | inf [nan, nan] | inf [nan, nan] | 20 | nan
mean | 0 | 4 | 2048 | 64/256 | -0.013 [-0.066, 0.04] | -0.0095 [-0.049, 0.03] | 20 | 0.62
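In these tables, the effect sign column condenses each regression to "+", "-", "0", or "NaN". A plausible reading, consistent with the rows shown here, is that the sign simply records whether the coefficient's 95% confidence interval excludes zero at the stated p < 0.05 level; the helper below is a hypothetical illustration of that rule, not code from the original analysis.

```python
# Hypothetical helper illustrating how the effect-sign column could be
# derived from a regression coefficient's 95% confidence interval.
import math

def effect_sign(ci_lower: float, ci_upper: float) -> str:
    if not (math.isfinite(ci_lower) and math.isfinite(ci_upper)):
        return "NaN"  # multicollinearity or inf/NaN observations
    if ci_lower > 0:
        return "+"    # significantly positive at p < 0.05
    if ci_upper < 0:
        return "-"    # significantly negative at p < 0.05
    return "0"        # interval spans zero: no significant effect

# e.g., the first row of Table A.1, with 95% CI [-35 000, -2 400]
assert effect_sign(-35_000, -2_400) == "-"
# a CI spanning zero yields "0"
assert effect_sign(-0.017, 0.052) == "0"
```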
Table A.3: Full Ordinary Least Squares Regression results of Latency Walltime Outlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | - | 1 | 1 | 16/64/256 | -20 000 [-36 000, -3 000] | -0.035 [-0.064, -0.0054] | 30 | 0.022
mean | + | 1 | 2048 | 16/64/256 | 5.5e+06 [3.5e+06, 7.4e+06] | 2.5 [1.6, 3.4] | 30 | 5e-06
mean | NaN | 4 | 1 | 16/64/256 | inf [nan, nan] | inf [nan, nan] | 30 | nan
mean | + | 4 | 2048 | 16/64/256 | 230 000 [110 000, 340 000] | 0.11 [0.056, 0.17] | 30 | 0.00034
mean | - | 1 | 1 | 16/64 | -65 000 [-95 000, -36 000] | -0.12 [-0.17, -0.063] | 20 | 0.00021
mean | + | 1 | 2048 | 16/64 | 290 000 [160 000, 430 000] | 0.13 [0.073, 0.19] | 20 | 0.00021
mean | 0 | 4 | 1 | 16/64 | -2.5e+06 [-7.2e+06, 2.2e+06] | -0.82 [-2.4, 0.73] | 20 | 0.28
mean | + | 4 | 2048 | 16/64 | 670 000 [530 000, 810 000] | 0.33 [0.26, 0.4] | 20 | 8e-09
mean | 0 | 1 | 1 | 64/256 | 26 000 [-1 500, 53 000] | 0.045 [-0.0026, 0.093] | 20 | 0.062
mean | + | 1 | 2048 | 64/256 | 1.1e+07 [6.5e+06, 1.5e+07] | 4.8 [2.9, 6.7] | 20 | 4e-05
mean | NaN | 4 | 1 | 64/256 | inf [nan, nan] | inf [nan, nan] | 20 | nan
mean | - | 4 | 2048 | 64/256 | -210 000 [-320 000, -110 000] | -0.11 [-0.16, -0.054] | 20 | 0.00045

Table A.4: Full Ordinary Least Squares Regression results of Delivery Clumpiness against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | - | 1 | 1 | 16/64/256 | -0.036 [-0.045, -0.026] | -0.044 [-0.056, -0.033] | 30 | 1.6e-08
mean | - | 1 | 2048 | 16/64/256 | -0.03 [-0.044, -0.015] | -0.078 [-0.12, -0.041] | 30 | 0.00021
mean | - | 4 | 1 | 16/64/256 | -0.021 [-0.033, -0.0087] | -0.033 [-0.052, -0.014] | 30 | 0.0016
mean | 0 | 4 | 2048 | 16/64/256 | 0.028 [-0.0021, 0.058] | 0.1 [-0.0077, 0.21] | 30 | 0.067
mean | - | 1 | 1 | 16/64 | -0.055 [-0.074, -0.036] | -0.068 [-0.092, -0.044] | 20 | 1.1e-05
mean | 0 | 1 | 2048 | 16/64 | -0.034 [-0.07, 0.001] | -0.092 [-0.19, 0.0027] | 20 | 0.056
mean | 0 | 4 | 1 | 16/64 | -0.022 [-0.053, 0.0087] | -0.034 [-0.082, 0.013] | 20 | 0.15
mean | 0 | 4 | 2048 | 16/64 | 0.038 [-0.038, 0.11] | 0.14 [-0.14, 0.41] | 20 | 0.31
mean | - | 1 | 1 | 64/256 | -0.017 [-0.033, -0.0002] | -0.021 [-0.041, -0.00025] | 20 | 0.048
mean | 0 | 1 | 2048 | 64/256 | -0.025 [-0.052, 0.003] | -0.065 [-0.14, 0.008] | 20 | 0.078
mean | - | 4 | 1 | 64/256 | -0.02 [-0.035, -0.0039] | -0.031 [-0.055, -0.0061] | 20 | 0.017
mean | 0 | 4 | 2048 | 64/256 | 0.018 [-0.035, 0.071] | 0.065 [-0.13, 0.25] | 20 | 0.48
Table A.5: Full Ordinary Least Squares Regression results of Simstep Period Inlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | + | 1 | 1 | 16/64/256 | 8 800 [7 400, 10 000] | 0.12 [0.11, 0.14] | 30 | 1.4e-13
mean | + | 1 | 2048 | 16/64/256 | 150 000 [97 000, 190 000] | 0.085 [0.056, 0.11] | 30 | 1.5e-06
mean | NaN | 4 | 1 | 16/64/256 | nan [nan, nan] | nan [nan, nan] | 30 | nan
mean | + | 4 | 2048 | 16/64/256 | 160 000 [89 000, 220 000] | 0.1 [0.058, 0.15] | 30 | 5.6e-05
mean | + | 1 | 1 | 16/64 | 12 000 [9 200, 15 000] | 0.17 [0.13, 0.21] | 20 | 4.5e-08
mean | + | 1 | 2048 | 16/64 | 360 000 [350 000, 380 000] | 0.21 [0.2, 0.22] | 20 | 7e-21
mean | NaN | 4 | 1 | 16/64 | -inf [nan, nan] | nan [nan, nan] | 20 | nan
mean | + | 4 | 2048 | 16/64 | 450 000 [430 000, 480 000] | 0.3 [0.28, 0.31] | 20 | 6.2e-20
mean | + | 1 | 1 | 64/256 | 5 600 [3 700, 7 500] | 0.08 [0.052, 0.11] | 20 | 8.8e-06
mean | - | 1 | 2048 | 64/256 | -72 000 [-83 000, -61 000] | -0.042 [-0.049, -0.036] | 20 | 5.8e-11
mean | NaN | 4 | 1 | 64/256 | inf [nan, nan] | nan [nan, nan] | 20 | nan
mean | - | 4 | 2048 | 64/256 | -140 000 [-160 000, -120 000] | -0.093 [-0.11, -0.079] | 20 | 4.4e-11

Table A.6: Full Ordinary Least Squares Regression results of Latency Simsteps Inlet against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | - | 1 | 1 | 16/64/256 | -1 [-1.4, -0.7] | -0.13 [-0.17, -0.086] | 30 | 5.7e-07
mean | + | 1 | 2048 | 16/64/256 | 2.8 [1.7, 3.9] | 2.2 [1.4, 3.1] | 30 | 1.1e-05
mean | - | 4 | 1 | 16/64/256 | -0.64 [-0.97, -0.3] | -0.079 [-0.12, -0.037] | 30 | 0.00055
mean | 0 | 4 | 2048 | 16/64/256 | 0.016 [-0.016, 0.047] | 0.012 [-0.012, 0.036] | 30 | 0.32
mean | - | 1 | 1 | 16/64 | -2 [-2.6, -1.4] | -0.24 [-0.31, -0.17] | 20 | 2e-06
mean | - | 1 | 2048 | 16/64 | -0.081 [-0.16, -0.0058] | -0.064 [-0.12, -0.0045] | 20 | 0.036
mean | - | 4 | 1 | 16/64 | -1.4 [-2.1, -0.69] | -0.17 [-0.26, -0.085] | 20 | 0.00055
mean | 0 | 4 | 2048 | 16/64 | 0.043 [-0.033, 0.12] | 0.032 [-0.025, 0.09] | 20 | 0.25
mean | 0 | 1 | 1 | 64/256 | -0.09 [-0.53, 0.35] | -0.011 [-0.065, 0.043] | 20 | 0.67
mean | + | 1 | 2048 | 64/256 | 5.7 [3.5, 8] | 4.5 [2.8, 6.2] | 20 | 3.8e-05
mean | 0 | 4 | 1 | 64/256 | 0.1 [-0.39, 0.6] | 0.013 [-0.049, 0.074] | 20 | 0.67
mean | 0 | 4 | 2048 | 64/256 | -0.011 [-0.061, 0.038] | -0.0086 [-0.046, 0.029] | 20 | 0.64
Table A.7: Full Ordinary Least Squares Regression results of Simstep Period Outlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | + | 1 | 1 | 16/64/256 | 8 700 [7 400, 10 000] | 0.13 [0.11, 0.14] | 30 | 1.2e-13
mean | + | 1 | 2048 | 16/64/256 | 150 000 [98 000, 190 000] | 0.087 [0.058, 0.12] | 30 | 1.2e-06
mean | NaN | 4 | 1 | 16/64/256 | nan [nan, nan] | nan [nan, nan] | 30 | nan
mean | + | 4 | 2048 | 16/64/256 | 150 000 [87 000, 220 000] | 0.1 [0.058, 0.15] | 30 | 5.5e-05
mean | + | 1 | 1 | 16/64 | 12 000 [9 100, 15 000] | 0.17 [0.13, 0.21] | 20 | 4.2e-08
mean | + | 1 | 2048 | 16/64 | 360 000 [350 000, 380 000] | 0.22 [0.21, 0.22] | 20 | 8.8e-21
mean | NaN | 4 | 1 | 16/64 | -inf [nan, nan] | nan [nan, nan] | 20 | nan
mean | + | 4 | 2048 | 16/64 | 440 000 [420 000, 470 000] | 0.3 [0.28, 0.31] | 20 | 3.4e-19
mean | + | 1 | 1 | 64/256 | 5 600 [3 700, 7 500] | 0.081 [0.053, 0.11] | 20 | 7.1e-06
mean | - | 1 | 2048 | 64/256 | -69 000 [-79 000, -59 000] | -0.041 [-0.047, -0.035] | 20 | 2.2e-11
mean | NaN | 4 | 1 | 64/256 | inf [nan, nan] | nan [nan, nan] | 20 | nan
mean | - | 4 | 2048 | 64/256 | -140 000 [-160 000, -120 000] | -0.092 [-0.11, -0.078] | 20 | 6.5e-11

Table A.8: Full Ordinary Least Squares Regression results of Delivery Failure Rate against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
mean | NaN | 1 | 1 | 16/64/256 | 0 [0, 0] | nan [nan, nan] | 30 | nan
mean | + | 1 | 2048 | 16/64/256 | 0.0015 [0.0011, 0.0018] | -13 [-9.6, -16] | 30 | 2.5e-09
mean | + | 4 | 1 | 16/64/256 | 0.015 [0.0088, 0.02] | 0.15 [0.091, 0.21] | 30 | 1.7e-05
mean | 0 | 4 | 2048 | 16/64/256 | -0.00075 [-0.0016, 9.3e-05] | -0.49 [-1, 0.06] | 30 | 0.079
mean | NaN | 1 | 1 | 16/64 | 0 [0, 0] | nan [nan, nan] | 20 | nan
mean | 0 | 1 | 2048 | 16/64 | 8.6e-06 [-0.00013, 0.00015] | -0.073 [1.1, -1.3] | 20 | 0.9
mean | 0 | 4 | 1 | 16/64 | -5.7e-08 [-0.012, 0.012] | -5.8e-07 [-0.12, 0.12] | 20 | 1
mean | 0 | 4 | 2048 | 16/64 | -0.00047 [-0.0026, 0.0016] | -0.31 [-1.7, 1.1] | 20 | 0.64
mean | NaN | 1 | 1 | 64/256 | 0 [0, 0] | nan [nan, nan] | 20 | nan
mean | + | 1 | 2048 | 64/256 | 0.0029 [0.0026, 0.0032] | -25 [-23, -28] | 20 | 6.7e-14
mean | + | 4 | 1 | 64/256 | 0.029 [0.025, 0.033] | 0.3 [0.26, 0.34] | 20 | 1.7e-11
mean | 0 | 4 | 2048 | 64/256 | -0.001 [-0.0026, 0.00055] | -0.66 [-1.7, 0.36] | 20 | 0.19
Table A.9: Full Quantile Regression results of Latency Walltime Inlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | - | 1 | 1 | 16/64/256 | -26 000 [-47 000, -6 000] | -0.049 [-0.087, -0.011] | 30 | 0.013
median | 0 | 1 | 2048 | 16/64/256 | 61 000 [-29 000, 150 000] | 0.029 [-0.014, 0.072] | 30 | 0.18
median | 0 | 4 | 1 | 16/64/256 | -1 400 [-23 000, 20 000] | -0.003 [-0.05, 0.044] | 30 | 0.9
median | + | 4 | 2048 | 16/64/256 | 200 000 [25 000, 380 000] | 0.11 [0.014, 0.21] | 30 | 0.027
median | - | 1 | 1 | 16/64 | -63 000 [-110 000, -18 000] | -0.12 [-0.2, -0.034] | 20 | 0.0083
median | + | 1 | 2048 | 16/64 | 260 000 [30 000, 500 000] | 0.12 [0.014, 0.23] | 20 | 0.029
median | 0 | 4 | 1 | 16/64 | -43 000 [-110 000, 21 000] | -0.095 [-0.24, 0.046] | 20 | 0.17
median | + | 4 | 2048 | 16/64 | 690 000 [290 000, 1.1e+06] | 0.38 [0.16, 0.6] | 20 | 0.002
median | 0 | 1 | 1 | 64/256 | 2 900 [-27 000, 33 000] | 0.0055 [-0.051, 0.062] | 20 | 0.84
median | 0 | 1 | 2048 | 64/256 | -96 000 [-230 000, 35 000] | -0.045 [-0.11, 0.017] | 20 | 0.14
median | 0 | 4 | 1 | 64/256 | 24 000 [-3 100, 51 000] | 0.052 [-0.0068, 0.11] | 20 | 0.08
median | 0 | 4 | 2048 | 64/256 | -130 000 [-310 000, 48 000] | -0.071 [-0.17, 0.026] | 20 | 0.14

Table A.10: Full Quantile Regression results of Latency Simsteps Outlet against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | - | 1 | 1 | 16/64/256 | -1.1 [-1.6, -0.55] | -0.13 [-0.2, -0.069] | 30 | 0.00025
median | 0 | 1 | 2048 | 16/64/256 | -0.037 [-0.08, 0.0057] | -0.029 [-0.063, 0.0045] | 30 | 0.087
median | - | 4 | 1 | 16/64/256 | -0.47 [-0.76, -0.18] | -0.068 [-0.11, -0.026] | 30 | 0.0027
median | 0 | 4 | 2048 | 16/64/256 | -0.003 [-0.063, 0.057] | -0.0022 [-0.047, 0.042] | 30 | 0.92
median | - | 1 | 1 | 16/64 | -2 [-3, -1.1] | -0.25 [-0.38, -0.13] | 20 | 0.00033
median | 0 | 1 | 2048 | 16/64 | -0.096 [-0.24, 0.051] | -0.075 [-0.19, 0.04] | 20 | 0.19
median | - | 4 | 1 | 16/64 | -1.1 [-1.7, -0.43] | -0.16 [-0.25, -0.063] | 20 | 0.0025
median | 0 | 4 | 2048 | 16/64 | -0.0014 [-0.17, 0.17] | -0.001 [-0.12, 0.12] | 20 | 0.99
median | 0 | 1 | 1 | 64/256 | -0.32 [-0.81, 0.18] | -0.04 [-0.1, 0.022] | 20 | 0.19
median | 0 | 1 | 2048 | 64/256 | -0.0014 [-0.061, 0.058] | -0.0011 [-0.048, 0.046] | 20 | 0.96
median | 0 | 4 | 1 | 64/256 | -0.017 [-0.43, 0.4] | -0.0024 [-0.063, 0.058] | 20 | 0.93
median | 0 | 4 | 2048 | 64/256 | -0.0068 [-0.081, 0.068] | -0.005 [-0.06, 0.05] | 20 | 0.85
Table A.11: Full Quantile Regression results of Latency Walltime Outlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | - | 1 | 1 | 16/64/256 | -26 000 [-47 000, -4 900] | -0.048 [-0.086, -0.0091] | 30 | 0.017
median | 0 | 1 | 2048 | 16/64/256 | 60 000 [-32 000, 150 000] | 0.028 [-0.015, 0.071] | 30 | 0.19
median | 0 | 4 | 1 | 16/64/256 | -3 400 [-26 000, 19 000] | -0.0074 [-0.056, 0.041] | 30 | 0.75
median | + | 4 | 2048 | 16/64/256 | 200 000 [13 000, 380 000] | 0.11 [0.0073, 0.21] | 30 | 0.036
median | - | 1 | 1 | 16/64 | -69 000 [-120 000, -21 000] | -0.13 [-0.22, -0.038] | 20 | 0.0077
median | + | 1 | 2048 | 16/64 | 260 000 [7 700, 520 000] | 0.12 [0.0036, 0.24] | 20 | 0.044
median | 0 | 4 | 1 | 16/64 | -44 000 [-110 000, 20 000] | -0.096 [-0.24, 0.044] | 20 | 0.17
median | + | 4 | 2048 | 16/64 | 710 000 [300 000, 1.1e+06] | 0.38 [0.16, 0.6] | 20 | 0.0017
median | 0 | 1 | 1 | 64/256 | 3 100 [-26 000, 33 000] | 0.0057 [-0.049, 0.06] | 20 | 0.83
median | 0 | 1 | 2048 | 64/256 | -100 000 [-240 000, 38 000] | -0.048 [-0.11, 0.018] | 20 | 0.14
median | 0 | 4 | 1 | 64/256 | 24 000 [-9 100, 57 000] | 0.052 [-0.02, 0.12] | 20 | 0.14
median | 0 | 4 | 2048 | 64/256 | -150 000 [-310 000, 13 000] | -0.08 [-0.17, 0.0072] | 20 | 0.07

Table A.12: Full Quantile Regression results of Delivery Clumpiness against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | - | 1 | 1 | 16/64/256 | -0.038 [-0.051, -0.024] | -0.047 [-0.063, -0.03] | 30 | 2.7e-06
median | - | 1 | 2048 | 16/64/256 | -0.031 [-0.055, -0.0064] | -0.079 [-0.14, -0.017] | 30 | 0.015
median | - | 4 | 1 | 16/64/256 | -0.024 [-0.035, -0.014] | -0.036 [-0.052, -0.021] | 30 | 5.5e-05
median | 0 | 4 | 2048 | 16/64/256 | 0.019 [-0.024, 0.061] | 0.057 [-0.072, 0.19] | 30 | 0.37
median | - | 1 | 1 | 16/64 | -0.057 [-0.087, -0.027] | -0.07 [-0.11, -0.033] | 20 | 0.00091
median | 0 | 1 | 2048 | 16/64 | -0.042 [-0.11, 0.024] | -0.11 [-0.28, 0.063] | 20 | 0.2
median | - | 4 | 1 | 16/64 | -0.033 [-0.062, -0.0034] | -0.049 [-0.093, -0.0052] | 20 | 0.031
median | 0 | 4 | 2048 | 16/64 | 0.03 [-0.14, 0.2] | 0.091 [-0.41, 0.6] | 20 | 0.71
median | 0 | 1 | 1 | 64/256 | -0.017 [-0.043, 0.0085] | -0.021 [-0.053, 0.011] | 20 | 0.18
median | 0 | 1 | 2048 | 64/256 | -0.024 [-0.072, 0.024] | -0.063 [-0.19, 0.061] | 20 | 0.3
median | 0 | 4 | 1 | 64/256 | -0.014 [-0.036, 0.0071] | -0.022 [-0.054, 0.011] | 20 | 0.18
median | 0 | 4 | 2048 | 64/256 | 0.015 [-0.04, 0.07] | 0.045 [-0.12, 0.21] | 20 | 0.58
Table A.13: Full Quantile Regression results of Simstep Period Inlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | + | 1 | 1 | 16/64/256 | 8 600 [6 900, 10 000] | 0.12 [0.098, 0.15] | 30 | 6.7e-11
median | + | 1 | 2048 | 16/64/256 | 140 000 [33 000, 240 000] | 0.081 [0.019, 0.14] | 30 | 0.012
median | 0 | 4 | 1 | 16/64/256 | 2 000 [-470, 4 400] | 0.03 [-0.0072, 0.067] | 30 | 0.11
median | + | 4 | 2048 | 16/64/256 | 140 000 [27 000, 250 000] | 0.087 [0.017, 0.16] | 30 | 0.016
median | + | 1 | 1 | 16/64 | 10 000 [7 000, 13 000] | 0.15 [0.1, 0.19] | 20 | 2.9e-06
median | + | 1 | 2048 | 16/64 | 360 000 [330 000, 380 000] | 0.21 [0.2, 0.22] | 20 | 2.7e-17
median | 0 | 4 | 1 | 16/64 | 4 900 [-3 300, 13 000] | 0.074 [-0.051, 0.2] | 20 | 0.23
median | + | 4 | 2048 | 16/64 | 370 000 [320 000, 410 000] | 0.23 [0.2, 0.26] | 20 | 6.8e-12
median | + | 1 | 1 | 64/256 | 6 400 [3 700, 9 000] | 0.091 [0.053, 0.13] | 20 | 9e-05
median | - | 1 | 2048 | 64/256 | -79 000 [-95 000, -63 000] | -0.046 [-0.056, -0.037] | 20 | 6.5e-09
median | 0 | 4 | 1 | 64/256 | 300 [-2 800, 3 400] | 0.0045 [-0.043, 0.052] | 20 | 0.85
median | - | 4 | 2048 | 64/256 | -91 000 [-120 000, -59 000] | -0.058 [-0.078, -0.037] | 20 | 1.2e-05

Table A.14: Full Quantile Regression results of Latency Simsteps Inlet against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | - | 1 | 1 | 16/64/256 | -1.1 [-1.6, -0.62] | -0.14 [-0.2, -0.079] | 30 | 6.6e-05
median | 0 | 1 | 2048 | 16/64/256 | -0.034 [-0.074, 0.0068] | -0.027 [-0.06, 0.0055] | 30 | 0.1
median | - | 4 | 1 | 16/64/256 | -0.42 [-0.72, -0.12] | -0.063 [-0.11, -0.018] | 30 | 0.0074
median | 0 | 4 | 2048 | 16/64/256 | -0.0031 [-0.059, 0.053] | -0.0023 [-0.044, 0.04] | 30 | 0.91
median | - | 1 | 1 | 16/64 | -1.9 [-2.7, -1.1] | -0.24 [-0.35, -0.14] | 20 | 0.00017
median | 0 | 1 | 2048 | 16/64 | -0.089 [-0.23, 0.051] | -0.072 [-0.19, 0.041] | 20 | 0.2
median | - | 4 | 1 | 16/64 | -1.1 [-1.7, -0.36] | -0.16 [-0.26, -0.054] | 20 | 0.005
median | 0 | 4 | 2048 | 16/64 | -0.0057 [-0.15, 0.14] | -0.0043 [-0.12, 0.11] | 20 | 0.94
median | 0 | 1 | 1 | 64/256 | -0.39 [-0.85, 0.064] | -0.051 [-0.11, 0.0082] | 20 | 0.088
median | 0 | 1 | 2048 | 64/256 | -0.0032 [-0.063, 0.057] | -0.0026 [-0.051, 0.046] | 20 | 0.91
median | 0 | 4 | 1 | 64/256 | 0.00035 [-0.42, 0.42] | 5.2e-05 [-0.062, 0.062] | 20 | 1
median | 0 | 4 | 2048 | 64/256 | 0.0012 [-0.073, 0.076] | 0.00089 [-0.055, 0.057] | 20 | 0.97
Table A.15: Full Quantile Regression results of Simstep Period Outlet (ns) against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | + | 1 | 1 | 16/64/256 | 8 500 [6 800, 10 000] | 0.12 [0.099, 0.15] | 30 | 3.2e-11
median | + | 1 | 2048 | 16/64/256 | 140 000 [35 000, 240 000] | 0.083 [0.021, 0.14] | 30 | 0.011
median | 0 | 4 | 1 | 16/64/256 | 1 700 [-460, 3 800] | 0.026 [-0.0071, 0.059] | 30 | 0.12
median | + | 4 | 2048 | 16/64/256 | 140 000 [28 000, 240 000] | 0.088 [0.018, 0.16] | 30 | 0.015
median | + | 1 | 1 | 16/64 | 10 000 [7 000, 13 000] | 0.14 [0.1, 0.19] | 20 | 1.3e-06
median | + | 1 | 2048 | 16/64 | 350 000 [330 000, 370 000] | 0.21 [0.19, 0.22] | 20 | 1e-16
median | 0 | 4 | 1 | 16/64 | 4 700 [-3 000, 12 000] | 0.072 [-0.047, 0.19] | 20 | 0.22
median | + | 4 | 2048 | 16/64 | 370 000 [320 000, 420 000] | 0.24 [0.21, 0.27] | 20 | 6.6e-12
median | + | 1 | 1 | 64/256 | 6 500 [4 000, 9 000] | 0.094 [0.057, 0.13] | 20 | 4.1e-05
median | - | 1 | 2048 | 64/256 | -75 000 [-94 000, -57 000] | -0.045 [-0.056, -0.034] | 20 | 9.9e-08
median | 0 | 4 | 1 | 64/256 | 630 [-2 200, 3 500] | 0.0097 [-0.034, 0.054] | 20 | 0.65
median | - | 4 | 2048 | 64/256 | -90 000 [-120 000, -60 000] | -0.058 [-0.078, -0.039] | 20 | 6.6e-06

Table A.16: Full Quantile Regression results of Delivery Failure Rate against log processor count (Section 2.3.7). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.
Columns: statistic | effect sign | Cpus Per Node | Num Simels Per Cpu | Num Processes | absolute effect size [95% CI] | relative effect size [95% CI] | n | p.
median | NaN | 1 | 1 | 16/64/256 | 0 [nan, nan] | nan [nan, nan] | 30 | nan
median | NaN | 1 | 2048 | 16/64/256 | 0 [nan, nan] | nan [nan, nan] | 30 | nan
median | + | 4 | 1 | 16/64/256 | 0.019 [0.0076, 0.029] | 0.34 [0.14, 0.54] | 30 | 0.0016
median | NaN | 4 | 2048 | 16/64/256 | 0 [nan, nan] | nan [nan, nan] | 30 | nan
median | NaN | 1 | 1 | 16/64 | 0 [nan, nan] | nan [nan, nan] | 20 | nan
median | NaN | 1 | 2048 | 16/64 | 0 [nan, nan] | nan [nan, nan] | 20 | nan
median | 0 | 4 | 1 | 16/64 | 0.026 [-0.023, 0.074] | 0.47 [-0.41, 1.4] | 20 | 0.28
median | NaN | 4 | 2048 | 16/64 | 0 [nan, nan] | nan [nan, nan] | 20 | nan
median | NaN | 1 | 1 | 64/256 | 0 [nan, nan] | nan [nan, nan] | 20 | nan
median | NaN | 1 | 2048 | 64/256 | 0 [nan, nan] | nan [nan, nan] | 20 | nan
median | + | 4 | 1 | 64/256 | 0.018 [0.0038, 0.032] | 0.33 [0.069, 0.59] | 20 | 0.016
median | NaN | 4 | 2048 | 64/256 | 0 [nan, nan] | nan [nan, nan] | 20 | nan

A.2 Computation vs. Communication

This section provides full results from computation vs. communication experiments discussed in Section 2.3.3.
Figure A.20: Distribution of Latency Walltime Inlet (ns) for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.21: Distribution of Latency Simsteps Outlet for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.22: Distribution of Latency Walltime Outlet (ns) for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.23: Distribution of Delivery Clumpiness for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.24: Distribution of Simstep Period Inlet (ns) for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.25: Distribution of Latency Simsteps Inlet for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.26: Distribution of Simstep Period Outlet (ns) for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.

Figure A.27: Distribution of Delivery Failure Rate for individual snapshot measurements for computation vs. communication experiment (Section 2.3.3). Lower is better. Panels show the distribution for each snapshot without and with outliers.
166 Ordinary Least Squares Regression Log Latency Walltime Inlet (ns) 20 18 Estimated Statistic = Latency Walltime Inlet (ns) Mean 16 0.04 Absolute Effect Size 0.02 14 0.00 12 −0.02 0 1 2 3 4 −0.04 Log Compute Work (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression Log Latency Walltime Inlet (ns) 20 18 Estimated Statistic = Latency Walltime Inlet (ns) Median 16 40 Absolute Effect Size 14 30 20 12 10 0 1 2 3 4 Log Compute Work 0 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.28: Regressions of Latency Walltime Inlet (ns) against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Ordinary least squares regression (top row) estimates relationship between dependent variable and mean of response variable. Quantile regression (bottom row) estimates relationship between independent variable and median of response variable. Error bands and bars are 95% confidence intervals. 167 Ordinary Least Squares Regression Log Latency Simsteps Outlet 4 Estimated Statistic = Latency Simsteps Outlet Mean 3 0.04 2 Absolute Effect Size 0.02 1 0.00 0 −0.02 0 1 2 3 4 −0.04 Log Compute Work (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression Log Latency Simsteps Outlet 4 3 Estimated Statistic1e−6 = Latency Simsteps Outlet Median 1.5 2 1.0 Absolute Effect Size 1 0.5 0.0 0 −0.5 0 1 2 3 4 −1.0 Log Compute Work −1.5 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.29: Regressions of Latency Simsteps Outlet against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Ordinary least squares regression (top row) estimates relationship between dependent variable and mean of response variable. Quantile regression (bottom row) estimates relationship between independent variable and median of response variable. Error bands and bars are 95% confidence intervals. 168 Ordinary Least Squares Regression Log Latency Walltime Outlet (ns) 20 18 Estimated Statistic = Latency Walltime Outlet (ns) Mean 16 0.04 Absolute Effect Size 0.02 14 0.00 12 −0.02 0 1 2 3 4 −0.04 Log Compute Work (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression Log Latency Walltime Outlet (ns) 20 18 Estimated Statistic = Latency Walltime Outlet (ns) Median 16 40 Absolute Effect Size 14 30 20 12 10 0 1 2 3 4 Log Compute Work 0 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.30: Regressions of Latency Walltime Outlet (ns) against log computational intensity for computa- tion vs. communication experiment (Section 2.3.3). Lower is better. Ordinary least squares regression (top row) estimates relationship between dependent variable and mean of response variable. 
Quantile regression (bottom row) estimates relationship between independent variable and median of response variable. Error bands and bars are 95% confidence intervals. 169 Ordinary Least Squares Regression 1.0 Delivery Clumpiness Estimated Statistic = Delivery Clumpiness Mean 0.00 0.8 −0.05 0.6 Absolute Effect Size −0.10 0.4 −0.15 0.2 −0.20 0.0 0 1 2 3 4 −0.25 Log Compute Work (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression 1.0 Delivery Clumpiness Estimated Statistic = Delivery Clumpiness Median 0.8 0.00 0.6 −0.05 Absolute Effect Size 0.4 −0.10 0.2 −0.15 0.0 −0.20 −0.25 0 1 2 3 4 Log Compute Work −0.30 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.31: Regressions of Delivery Clumpiness against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Ordinary least squares regression (top row) esti- mates relationship between dependent variable and mean of response variable. Quantile regression (bottom row) estimates relationship between independent variable and median of response variable. Error bands and bars are 95% confidence intervals. 170 Ordinary Least Squares Regression Log Simstep Period Inlet (ns) 20.0 17.5 Estimated Statistic = Simstep Period Inlet (ns) Mean 15.0 35 30 Absolute Effect Size 12.5 25 20 10.0 15 7.5 10 0 1 2 3 4 5 Log Compute Work 0 (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression Log Simstep Period Inlet (ns) 20.0 17.5 Estimated Statistic = Simstep Period Inlet (ns) Median 15.0 30 12.5 25 Absolute Effect Size 20 10.0 15 7.5 10 0 1 2 3 4 5 Log Compute Work 0 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.32: Regressions of Simstep Period Inlet (ns) against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Ordinary least squares regression (top row) estimates relationship between dependent variable and mean of response variable. Quantile regression (bottom row) estimates relationship between independent variable and median of response variable. Error bands and bars are 95% confidence intervals. 171 Ordinary Least Squares Regression Log Latency Simsteps Inlet 4 Estimated Statistic = Latency Simsteps Inlet Mean 3 0.04 2 Absolute Effect Size 0.02 1 0.00 0 −0.02 0 1 2 3 4 −0.04 Log Compute Work (a) Ordinary least squares regression plot. Observa-(b) Estimated regression coefficient for ordinary least tions are means per replicate. squares regression. Zero corresponds to no effect. Quantile Regression Log Latency Simsteps Inlet 4 3 Estimated Statistic = Latency Simsteps Inlet Median 1e−6 1.5 2 1.0 Absolute Effect Size 1 0.5 0.0 0 −0.5 0 1 2 3 4 −1.0 Log Compute Work −1.5 (c) Quantile regression plot. Observations are medians(d) Estimated regression coefficient for quantile regres- per replicate. sion. Zero corresponds to no effect. Figure A.33: Regressions of Latency Simsteps Inlet against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. 
Figure A.34: Regressions of Simstep Period Outlet (ns) against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Panel layout and interpretation as in Figure A.28.

Figure A.35: Regressions of Delivery Failure Rate against log computational intensity for computation vs. communication experiment (Section 2.3.3). Lower is better. Panel layout and interpretation as in Figure A.28.
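The tables that follow summarize these regressions numerically: for each quality of service metric, a signed significance call, absolute and relative effect sizes with 95% confidence intervals, the sample size n, and a p value. A plausible reading (an assumption for illustration, not a confirmed detail of the original pipeline) is that each row is read off a fitted regression, with the sign set to + or − only when the coefficient is significant at p < 0.05 and 0 otherwise, as sketched below on hypothetical data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-replicate data: clumpiness declining with compute work.
rng = np.random.default_rng(1)
df = pd.DataFrame({"log_compute_work": np.tile(np.arange(5.0), 10)})
df["delivery_clumpiness"] = (
    0.8 - 0.05 * df["log_compute_work"] + rng.normal(0.0, 0.02, len(df))
)

fit = smf.ols("delivery_clumpiness ~ log_compute_work", data=df).fit()

coef = fit.params["log_compute_work"]                      # absolute effect size
lo, hi = fit.conf_int(alpha=0.05).loc["log_compute_work"]  # 95% CI bounds
p = fit.pvalues["log_compute_work"]

# Effect sign as tabulated: +/- when significant at p < 0.05, else 0.
sign = "0" if p >= 0.05 else ("+" if coef > 0 else "-")
print(f"{coef:.3g} [{lo:.3g}, {hi:.3g}] n={len(df)} p={p:.2g} sign={sign}")
```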
Table A.17: Full Ordinary Least Squares Regression results of quality of service metrics against log computational intensity for computation vs. communication experiment (Section 2.3.3). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | mean | NaN | 1 | 1 | 2 | inf [nan, nan] | inf [nan, nan] | 50 | nan
Latency Walltime Outlet (ns) | mean | NaN | 1 | 1 | 2 | inf [nan, nan] | inf [nan, nan] | 50 | nan
Latency Simsteps Inlet | mean | NaN | 1 | 1 | 2 | inf [nan, nan] | inf [nan, nan] | 50 | nan
Latency Simsteps Outlet | mean | NaN | 1 | 1 | 2 | inf [nan, nan] | inf [nan, nan] | 50 | nan
Delivery Failure Rate | mean | NaN | 1 | 1 | 2 | 0 [0, 0] | nan [nan, nan] | 50 | nan
Delivery Clumpiness | mean | - | 1 | 1 | 2 | -0.25 [-0.28, -0.23] | -0.26 [-0.29, -0.24] | 50 | 2.8e-28
Simstep Period Inlet (ns) | mean | + | 1 | 1 | 2 | 37 [36, 37] | 0.0025 [0.0024, 0.0026] | 50 | 7.1e-55
Simstep Period Outlet (ns) | mean | + | 1 | 1 | 2 | 36 [35, 37] | 0.0025 [0.0024, 0.0025] | 50 | 1.1e-54

Table A.18: Full Quantile Regression results of quality of service metrics against log computational intensity for computation vs. communication experiment (Section 2.3.3). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | median | + | 1 | 1 | 2 | 45 [45, 45] | 7.2e-05 [7.2e-05, 7.2e-05] | 50 | 5.8e-107
Latency Walltime Outlet (ns) | median | + | 1 | 1 | 2 | 45 [45, 45] | 7.2e-05 [7.2e-05, 7.2e-05] | 50 | 1.8e-103
Latency Simsteps Inlet | median | 0 | 1 | 1 | 2 | 6.1e-08 [-1.5e-06, 1.6e-06] | 1.5e-09 [-3.5e-08, 3.8e-08] | 50 | 0.94
Latency Simsteps Outlet | median | 0 | 1 | 1 | 2 | 6.1e-08 [-1.4e-06, 1.6e-06] | 1.4e-09 [-3.5e-08, 3.7e-08] | 50 | 0.94
Delivery Failure Rate | median | NaN | 1 | 1 | 2 | 0 [nan, nan] | nan [nan, nan] | 50 | nan
Delivery Clumpiness | median | - | 1 | 1 | 2 | -0.26 [-0.31, -0.22] | -0.27 [-0.32, -0.23] | 50 | 1.5e-16
Simstep Period Inlet (ns) | median | + | 1 | 1 | 2 | 30 [30, 30] | 0.0021 [0.0021, 0.0021] | 50 | 1.9e-139
Simstep Period Outlet (ns) | median | + | 1 | 1 | 2 | 30 [30, 30] | 0.0021 [0.0021, 0.0021] | 50 | 1.1e-143

A.3 Intranode vs Internode

This section provides full results from intranode vs. internode experiments discussed in Section 2.3.4.

Figure A.36: Distribution of Latency Walltime Inlet (ns) for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels: (a) without outliers; (b) with outliers; x-axis: 0 = intranode, 1 = internode.

Figure A.37: Distribution of Latency Simsteps Outlet for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.
Figure A.38: Distribution of Latency Walltime Outlet (ns) for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.

Figure A.39: Distribution of Delivery Clumpiness for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.

Figure A.40: Distribution of Simstep Period Inlet (ns) for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.

Figure A.41: Distribution of Latency Simsteps Inlet for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.

Figure A.42: Distribution of Simstep Period Outlet (ns) for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.

Figure A.43: Distribution of Delivery Failure Rate for individual snapshot measurements for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels as in Figure A.36.
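The paired panels above show each distribution twice, once with box-plot fliers suppressed and once with them drawn, since rare extreme snapshots would otherwise compress the interquartile detail. A minimal sketch of this with/without-outliers view, assuming seaborn and hypothetical lognormal latency data (the column names are stand-ins), follows.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical snapshot measurements for two treatments (0 = intranode,
# 1 = internode); the heavier lognormal tail stands in for outlier snapshots.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "treatment": np.repeat([0, 1], 500),
    "latency_walltime_inlet_ns": np.concatenate([
        rng.lognormal(10.0, 0.3, 500),  # intranode: lower, tighter latency
        rng.lognormal(12.0, 0.6, 500),  # internode: higher, heavier tail
    ]),
})

fig, (ax_trim, ax_full) = plt.subplots(1, 2, figsize=(8, 4))
# Panel (a): hide outlier points beyond the whiskers.
sns.boxplot(data=df, x="treatment", y="latency_walltime_inlet_ns",
            showfliers=False, ax=ax_trim)
# Panel (b): the same distributions with outliers drawn.
sns.boxplot(data=df, x="treatment", y="latency_walltime_inlet_ns",
            showfliers=True, ax=ax_full)
plt.show()
```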
Figure A.44: Regressions of Latency Walltime Inlet (ns) against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panels: (a) ordinary least squares regression plot, observations are means per replicate; (b) estimated ordinary least squares regression coefficient, where zero corresponds to no effect; (c) quantile regression plot, observations are medians per replicate; (d) estimated quantile regression coefficient, where zero corresponds to no effect. Ordinary least squares regression (top row) estimates the relationship between the categorical independent variable and the mean of the response variable; quantile regression (bottom row) estimates the relationship between the categorical independent variable and the median of the response variable. Error bands and bars are 95% confidence intervals.

Figure A.45: Regressions of Latency Simsteps Outlet against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.

Figure A.46: Regressions of Latency Walltime Outlet (ns) against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.
Figure A.47: Regressions of Delivery Clumpiness against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.

Figure A.48: Regressions of Simstep Period Inlet (ns) against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.

Figure A.49: Regressions of Latency Simsteps Inlet against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.
Figure A.50: Regressions of Simstep Period Outlet (ns) against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.

Figure A.51: Regressions of Delivery Failure Rate against categorically coded treatment for intranode vs. internode experiment (Section 2.3.4). Lower is better. Panel layout and interpretation as in Figure A.44.
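In Figures A.44 through A.51, the treatment enters the regression as a dummy-coded 0/1 variable, so the fitted slope is simply the estimated intranode-to-internode contrast. The sketch below illustrates this with statsmodels on hypothetical per-replicate data; the variable names and magnitudes are stand-ins.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-replicate means with a dummy-coded treatment:
# 0 = intranode, 1 = internode, as in the plots above.
rng = np.random.default_rng(1)
df = pd.DataFrame({"internode": np.repeat([0, 1], 10)})
df["latency_walltime_inlet_ns"] = (
    20_000 + 550_000 * df["internode"] + rng.normal(0, 30_000, len(df))
)

# With a single 0/1 regressor, the OLS slope is the difference between
# treatment group means; quantile regression at q=0.5 gives the
# analogous contrast between group medians.
ols_fit = smf.ols("latency_walltime_inlet_ns ~ internode", data=df).fit()
quant_fit = smf.quantreg(
    "latency_walltime_inlet_ns ~ internode", data=df
).fit(q=0.5)

print(ols_fit.params["internode"], quant_fit.params["internode"])
```

This is why the coefficient panels in these figures read directly as treatment effects: zero means no estimated difference between the two conditions.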
Table A.19: Full Ordinary Least Squares Regression results of quality of service metrics against log processor count for weak scaling experiment (Section 2.3.6). Listed results include both piecewise and complete regression. Ordinary least squares regression estimates the relationship between the independent variable and the mean of the response variable; quantile regression estimates the relationship between the independent variable and the median of the response variable. Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | mean | + | 1-2 | 1 | 2 | 600 000 [530 000, 660 000] | 77 [68, 85] | 20 | 2.1e-13
Latency Walltime Outlet (ns) | mean | + | 1-2 | 1 | 2 | 590 000 [530 000, 650 000] | 77 [68, 85] | 20 | 1.1e-13
Latency Simsteps Inlet | mean | + | 1-2 | 1 | 2 | 41 [36, 45] | 41 [36, 45] | 20 | 1.9e-13
Latency Simsteps Outlet | mean | + | 1-2 | 1 | 2 | 40 [36, 45] | 40 [36, 45] | 20 | 1e-13
Delivery Failure Rate | mean | - | 1-2 | 1 | 2 | -0.33 [-0.35, -0.31] | -1 [-1.1, -0.94] | 20 | 4e-18
Delivery Clumpiness | mean | + | 1-2 | 1 | 2 | 0.94 [0.93, 0.96] | 68 [67, 69] | 20 | 2.6e-30
Simstep Period Inlet (ns) | mean | + | 1-2 | 1 | 2 | 5 500 [5 100, 5 900] | 0.6 [0.56, 0.65] | 20 | 1.3e-16
Simstep Period Outlet (ns) | mean | + | 1-2 | 1 | 2 | 5 400 [5 100, 5 800] | 0.6 [0.56, 0.64] | 20 | 7.1e-17

Table A.20: Full Quantile Regression results of quality of service metrics against log processor count for weak scaling experiment (Section 2.3.6). Listed results include both piecewise and complete regression. Ordinary least squares regression estimates the relationship between the independent variable and the mean of the response variable; quantile regression estimates the relationship between the independent variable and the median of the response variable. Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | median | + | 1-2 | 1 | 2 | 550 000 [530 000, 560 000] | 78 [76, 81] | 20 | 2.6e-23
Latency Walltime Outlet (ns) | median | + | 1-2 | 1 | 2 | 540 000 [530 000, 560 000] | 78 [76, 80] | 20 | 1.5e-24
Latency Simsteps Inlet | median | + | 1-2 | 1 | 2 | 37 [35, 39] | 38 [36, 39] | 20 | 2.1e-19
Latency Simsteps Outlet | median | + | 1-2 | 1 | 2 | 37 [35, 39] | 37 [35, 39] | 20 | 6.7e-19
Delivery Failure Rate | median | - | 1-2 | 1 | 2 | -0.32 [-0.32, -0.32] | -1 [-1, -0.99] | 20 | 5.7e-39
Delivery Clumpiness | median | + | 1-2 | 1 | 2 | 0.96 [0.95, 0.97] | 610 [600, 620] | 20 | 4.6e-31
Simstep Period Inlet (ns) | median | + | 1-2 | 1 | 2 | 5 600 [5 200, 6 000] | 0.62 [0.58, 0.66] | 20 | 1.3e-16
Simstep Period Outlet (ns) | median | + | 1-2 | 1 | 2 | 5 500 [5 200, 5 800] | 0.61 [0.58, 0.65] | 20 | 3.7e-18
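Tables A.19 and A.20 list both piecewise and complete regressions. One plausible mechanic, sketched below under the assumption that the piecewise fits simply split the processor-count range (the split point and data here are hypothetical), is to run the same regression on each subrange and on the full range.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical weak-scaling observations over log processor count.
rng = np.random.default_rng(1)
df = pd.DataFrame({"log_cpus": np.repeat(np.arange(7.0), 5)})
df["simstep_period_ns"] = (
    5_000 + 600 * df["log_cpus"] + rng.normal(0, 200, len(df))
)

# Complete regression over the full range...
complete = smf.ols("simstep_period_ns ~ log_cpus", data=df).fit()

# ...and piecewise regressions over subranges, e.g., splitting where
# scaling crosses from within-node to across-node (threshold assumed).
threshold = 3.0
lower = smf.ols("simstep_period_ns ~ log_cpus",
                data=df[df["log_cpus"] <= threshold]).fit()
upper = smf.ols("simstep_period_ns ~ log_cpus",
                data=df[df["log_cpus"] >= threshold]).fit()

for name, fit in [("complete", complete), ("lower", lower), ("upper", upper)]:
    print(name, fit.params["log_cpus"])
```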
A.4 Multithreading vs Multiprocessing

This section provides full results from multithreading vs. multiprocessing experiments discussed in Section 2.3.5.

Figure A.52: Distribution of Latency Walltime Inlet (ns) for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels: (a) without outliers; (b) with outliers; x-axis: 0 = multithreading, 1 = multiprocessing.

Figure A.53: Distribution of Latency Simsteps Outlet for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.54: Distribution of Latency Walltime Outlet (ns) for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.55: Distribution of Delivery Clumpiness for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.56: Distribution of Simstep Period Inlet (ns) for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.57: Distribution of Latency Simsteps Inlet for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.58: Distribution of Simstep Period Outlet (ns) for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.

Figure A.59: Distribution of Delivery Failure Rate for individual snapshot measurements for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panels as in Figure A.52.
Figure A.60: Regressions of Latency Walltime Inlet (ns) against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.44; here, 0 = multithreading and 1 = multiprocessing.

Figure A.61: Regressions of Latency Simsteps Outlet against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.

Figure A.62: Regressions of Latency Walltime Outlet (ns) against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.
Figure A.63: Regressions of Delivery Clumpiness against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.

Figure A.64: Regressions of Simstep Period Inlet (ns) against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.

Figure A.65: Regressions of Latency Simsteps Inlet against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.
Figure A.66: Regressions of Simstep Period Outlet (ns) against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.

Figure A.67: Regressions of Delivery Failure Rate against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Lower is better. Panel layout and interpretation as in Figure A.60.
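The tables that follow report absolute effect sizes in each metric's native units alongside relative effect sizes. One plausible normalization, assumed here purely for illustration and not confirmed by the source, divides the fitted coefficient by the model's baseline (intercept) response; under that reading, near-zero baselines would explain the extreme or infinite relative values in some rows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-replicate data: 0 = multithreading, 1 = multiprocessing.
rng = np.random.default_rng(1)
df = pd.DataFrame({"multiprocessing": np.repeat([0, 1], 10)})
df["simstep_period_ns"] = (
    4_600 + 4_500 * df["multiprocessing"] + rng.normal(0, 150, len(df))
)

fit = smf.ols("simstep_period_ns ~ multiprocessing", data=df).fit()

absolute_effect = fit.params["multiprocessing"]
# Assumed normalization: express the effect relative to the baseline
# (intercept) response of the reference treatment.
relative_effect = absolute_effect / fit.params["Intercept"]
print(absolute_effect, relative_effect)
```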
Table A.21: Full Ordinary Least Squares Regression results of quality of service metrics against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | mean | 0 | 2 | 1 | 1/2 | -440 000 [-1e+06, 170 000] | -0.98 [-2.3, 0.37] | 20 | 0.15
Latency Walltime Outlet (ns) | mean | 0 | 2 | 1 | 1/2 | -450 000 [-1.1e+06, 160 000] | -0.98 [-2.3, 0.35] | 20 | 0.14
Latency Simsteps Inlet | mean | 0 | 2 | 1 | 1/2 | -76 [-180, 28] | -0.99 [-2.3, 0.36] | 20 | 0.14
Latency Simsteps Outlet | mean | 0 | 2 | 1 | 1/2 | -77 [-180, 27] | -0.99 [-2.3, 0.34] | 20 | 0.14
Delivery Failure Rate | mean | + | 2 | 1 | 1/2 | 0.38 [0.33, 0.44] | -1.4e+06 [-1.2e+06, -1.5e+06] | 20 | 6e-12
Delivery Clumpiness | mean | - | 2 | 1 | 1/2 | -0.53 [-0.62, -0.44] | -0.94 [-1.1, -0.78] | 20 | 5.8e-10
Simstep Period Inlet (ns) | mean | + | 2 | 1 | 1/2 | 4 500 [4 200, 4 700] | 0.97 [0.91, 1] | 20 | 1.5e-18
Simstep Period Outlet (ns) | mean | + | 2 | 1 | 1/2 | 4 400 [4 100, 4 700] | 0.95 [0.89, 1] | 20 | 7.4e-17

Table A.22: Full Quantile Regression results of quality of service metrics against categorically coded treatment for multithreading vs. multiprocessing experiment (Section 2.3.5). Significance level p < 0.05 used. Inf or NaN values may occur due to multicollinearity or due to inf or NaN observations.

Metric | Statistic | Effect Sign | Num Nodes | Num Processes | Num Simels per CPU | Absolute Effect Size [95% CI] | Relative Effect Size [95% CI] | n | p
Latency Walltime Inlet (ns) | median | 0 | 2 | 1 | 1/2 | 2 700 [-180, 5 600] | 0.51 [-0.034, 1.1] | 20 | 0.064
Latency Walltime Outlet (ns) | median | 0 | 2 | 1 | 1/2 | 2 500 [-350, 5 400] | 0.47 [-0.064, 1] | 20 | 0.081
Latency Simsteps Inlet | median | 0 | 2 | 1 | 1/2 | -0.29 [-0.69, 0.12] | -0.25 [-0.6, 0.1] | 20 | 0.15
Latency Simsteps Outlet | median | 0 | 2 | 1 | 1/2 | -0.3 [-0.72, 0.11] | -0.26 [-0.62, 0.099] | 20 | 0.14
Delivery Failure Rate | median | + | 2 | 1 | 1/2 | 0.38 [0.37, 0.38] | inf [inf, inf] | 20 | 2e-27
Delivery Clumpiness | median | - | 2 | 1 | 1/2 | -0.53 [-0.59, -0.47] | -0.97 [-1.1, -0.87] | 20 | 2.6e-13
Simstep Period Inlet (ns) | median | + | 2 | 1 | 1/2 | 4 500 [4 000, 4 900] | 0.96 [0.87, 1.1] | 20 | 2.3e-14
Simstep Period Outlet (ns) | median | + | 2 | 1 | 1/2 | 4 400 [4 000, 4 800] | 0.94 [0.86, 1] | 20 | 1.3e-14

A.5 With lac-417 vs. Sans lac-417

This section provides full results from faulty hardware experiments discussed in Section 2.3.7.

Figure A.68: Distribution of Latency Walltime Inlet (ns) for individual snapshot measurements for faulty hardware experiment (Section 2.3.7). Lower is better. Panels: (a) without outliers; (b) with outliers; x-axis: 0 = with lac-417, 1 = sans lac-417.
Figure A.69: Distribution of Latency Simsteps Outlet for individual snapshot measurements for faulty hardware experiment (Section 2.3.7). Lower is better. Panels as in Figure A.68.

Figure A.70: Distribution of Latency Walltime Outlet (ns) for individual snapshot measurements for faulty hardware experiment (Section 2.3.7). Lower is better. Panels as in Figure A.68.

Figure A.71: Distribution of Delivery Clumpiness for individual snapshot measurements for faulty hardware experiment (Section 2.3.7). Lower is better. Panels as in Figure A.68.

(a) Distribution of Simstep Period Inlet