EXTRACTING STRUCTURE AND FUNCTION FROM COMPLEX SYSTEMS USING INFORMATION-THEORETIC TOOLS By Nitash C G A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy 2022 ABSTRACT One of the primary areas of scientific research is understanding how complex systems work, both structurally and functionally. In the natural world, complex systems are very high dimensional, with many interacting parts, making studying them difficult and in some cases nearly impossible. Due to the complexity of these systems, a lot of modern research focuses on studying these systems from a computational viewpoint. While this necessarily abstracts away from the true system, we attempt to represent the salient aspects of the system in order to better understand it. Results from such computational studies can yield insight into the natural system, and actively constrain the research space by suggesting hypotheses that can be tested. In this work, I investigate the structure and function of two seemingly disparate complex digital systems. I begin with an investigation of the structure and function of an evolved cognitive architecture, and look at how this structure is affected by environmental changes by developing some new metrics to classify cognitive systems. I then look at the structure of the primordial fitness landscape in a different digital system, and use techniques inspired by information theory to understand the structure of this landscape. I first look at the role of historical contingency in the evolution of life by studying how the structure of this fitness landscape affects the evolutionary trajectories of life. I then investigate how information is encoded in the primordial fitness landscape. I then extend this analysis by developing a general approach for calculating the information content of individual sequences, and use them to analyze the primordial landscape. Finally, I validate this information-theoretic technique by predicting the effects of mutations on the function of a specific protein, and show that this technique can outperform the current state of the art approaches. To my mother, for everything they have done. iii ACKNOWLEDGEMENTS I would like to thank my advisor, Christoph Adami, for going above and beyond their role as a mentor. This dissertation would have not been possible without their guidance and support. Arend Hintze, who was also my advisor for some years, for teaching me to always ask the hard questions and not giving up till I answered them. Charles Ofria, for valuable guidance during my years doing research, and for the frequent fascinating discussions about C++ arcana. Bill Punch, for giving me the confidence in myself to be a teacher, and instilling in me the belief that teaching always involves learning new things. I would like to thank the members of my lab, Ali Tehrani, Jory Schossau, Clifford Bohm, Vincent Ragusa, Douglas Kirkpatrick, Thomas LaBar, and Acacia Ackles, for not only help- ing out with science stuff but also for being thoroughly fun people to hang out with. Alok Singh, for introducing me to the world of research. Durga Bhavani, for kindling in me the joy of discovery, and showing me that it’s never too late to be a student. Vijay Borges, for believing that I could succeed, and helping me in every possible way they could. Budhaji Sir, for my earliest memory of realizing that there are usually multiple ways of solving the same problem. My parents and sister, for always being there whenever I needed them. iv TABLE OF CONTENTS Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Chapter 2: The cognitive structure and function of brains evolved in periodically changing environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 3: The structure of a primordial fitness landscape . . . . . . . . . . . . . . 26 Chapter 4: Information scores for individual sequences . . . . . . . . . . . . . . . . . 51 Chapter 5: Predicting mutation effects using information scores . . . . . . . . . . . . 74 Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 v Chapter 1 Introduction The natural world is a complex dynamic system with a very high dimensionality, and it can be challenging to understand how all the different parts interact with each other. There are a number of aspects of natural systems that we do not yet fully understand, and in my research I attempt to address a couple of them. At one end of the spectrum is abiogenesis: the spontaneous emergence of self-replicating organisms. At the other end of the spectrum is a relatively recent evolved organ - the brain, which is the computational processing unit that allows organisms to solve highly complex problems and is the defining feature of what are often considered to be the most advanced forms of life on this planet. I investigate these problems using mathematical and computational tools, and show how these seemingly disparate systems can be understood using very similar computational and mathematical techniques. As with all scientific investigations, the first course of action is to create a model that abstracts away the irrelevant details of the system that we want to investigate, while keeping just enough of the details to be able to observe the desired behaviour. These models of the systems can be represented in various forms; by observing the system in a petri dish, which is a microcosm of the natural system we want to understand; by modeling the dynamics of the system mathematically; by agent-based simulations, i.e. simulating the system in a computer. In my work I focus primarily on the last of these approaches, namely digital simulations, but also use mathematical models in order to better understand the behaviour exhibited in the simulations. 1.1 How brains work The functioning of brains (and central nervous systems in general) allows organisms to solve highly complicated tasks and enable complex social dynamics within species, which increase fitness substantially. The human brain is arguably the best example of a complex logic processing unit, and understanding how the brain works is one of the most important 1 questions that we want to answer. While understanding the structure and function of this fascinating organ is a worthwhile endeavour in and of itself, the practical application of such an understanding would be the ability to design and create artificial systems with similar properties. General purpose intelligence in artificial systems would be a paradigm-altering technology, with probably more impact on our society than any preceding technological innovation, and would allow us to solve a multitude of different tasks that currently require human beings to undertake. Human brains (and brains in general) are extremely complex organs with lots of moving parts. While there is a lot of regularity in the structure of the brain, and we are able to identify some functional regions of the brain (e.g., we can identify the sections of the brain that are specifically involved in vision), there are still many gaps in our understanding of how the regions work, and how they work in conjunction with other sections of the brain. In recent times, technologies such as MRI scans have allowed us to look at extremely small regions of the brain with extremely high precision (Nowogrodzki, 2018). Observing the changes in neuronal activity when the organism is subjected to different external stimuli gives insight into the structure of the brain, and how the different parts interact with each other. However, generating these neuronal recordings and analysing them is quite expensive, and we are still able to only look at a very tiny proportion of the entire brain at a given time. In chapter 2 I use a digital cognitive architecture called Markov Brains in order to in- vestigate some aspects of how brains work. Markov Brains are known to exhibit complex problem solving behaviour, so understanding such digital cognitive architectures can give us insight into the structure of natural cognitive architectures (brains). In particular, in the work I have done I look at how periodic changes in the environment of a species affects their cognitive architecture. We know that many species have gone through multiple radical transitions in their environments, and that the structure and function of their brains have evolved in response to these changes. Understanding how environmental transitions affect 2 the structure and function of digital cognitive architectures can yield insights into how the structure and function of brains have been affected by environmental transitions in their evolutionary history. 1.2 The origin of life Ever since Darwin introduced the theory of evolution we have had an explanation for how complex organisms and behaviours can emerge. However, the theory of evolution presupposes an original self-replicator that was the ancestor of life on Earth, and we do not yet have a good theory that explains how this original self-replicator arose. There is no available fossil record of the early progenitors of life, unlike the fossil record left behind by more recent organisms. In addition, our understanding of molecular biochemistry is insufficient to explain the spontaneous emergence of life from non-replicating molecules. We also have only a single data point to work with, and while it is not unreasonable to assume that life has emerged multiple times in the history of the universe, the emergence of life on Earth is the only instance of this occurring for which we have concrete evidence. All these factors make understanding the origin of the first self-replicator (abiogenesis) very difficult. There are a number of questions regarding abiogenesis that we would like to answer. The most obvious question is the precise details of how life emerged on earth, but this knowledge alone would not be a general theory that could predict the likelihood of spontaneous emer- gence of life given an underlying biochemistry. A general theory of abiogenesis would enable us to answer fundamental questions that go beyond the single instance of abiogenesis that we know has happened on earth. For example, one of these questions is the role of historical contingency in the emergence of life. As Gould asked (Gould, 1990), if Earth’s history were to be replayed a thousand times, how often would it result in a biosphere? Is the emergence of a biosphere inevitable given the underlying chemistry, or was it an extremely unlikely occurrence? Also, would the resulting biochemistry be very similar to the one we have now, or would it be radically different? There are a number of different approaches that address the question of abiogenesis in 3 a way that is removed from the particularities of Earth’s bio-chemistry, which include tools from computer science, information theory, and statistical physics. The intent behind these approaches is to abstract away from the specific details of Earth’s chemistry in order to reach a more fundamental understanding of the nature of abiogenesis. Of course, the ideas and theories developed by these, largely mathematical, approaches would be used to inform the investigation of the spontaneous emergence of life on Earth itself. In chapter 3 I use the digital system Avida to generate a complete fitness landscape of self-replicating organisms. Since the Avida system has been shown to exhibit many of the fundamental properties of biological evolution, it is reasonable to investigate how life could have emerged in this digital system. The complete fitness landscapes of primordial self- replicators can be used to shed light on the structure of biological self-replicator landscapes, even though we cannot generate a complete fitness landscape for biological systems. The digital fitness landscape is used in this work to study the role of historical contingency in the origin of life - in a digital system the question asked by Gould is answerable: we can replay the evolution of life as many times as we want, and look at how similar, or different the resulting digital biospheres are from each other. We can use the structure of the digital primordial landscape to see how the mutational landscape of self-replicators affect the outcome of evolution. The ideas developed in chapter 3 can perhaps guide investigations into the nature of how life emerged on Earth. I also look at how information is encoded in the self-replicating Avidian sequences, where the information that is encoded is exactly the information the sequence has about its ability to undergo self-replication. 1.3 Predicting function from sequence Biomolecular sequences such as DNA sequences typically have some function in their environment, i.e., they encode the “instructions” necessary to perform some task that is necessary for the organism’s fitness. This function is not necessarily a binary function, i.e., different sequences that perform the same function may not be equally efficient at performing that function. Being able to predict the level of functionality of a particular sequence is an 4 active area of research, as this ability has a large number of applications. For example, in the design of HIV anti-viral drugs it is important to determine how effective some particular drug will be in reducing the activity of the HIV protease. Testing every candidate sequence in vivo (i.e., in a natural organism) is infeasible, and reducing the candidate search space effectively is extremely valuable. There are other domains outside the are of molecular biology where the ability to predict function is useful. An example would be neuronal firing patterns; the activity of neurons in different brain regions can be represented as sequences which encode how the brain is processing a particular set of inputs, and mapping the function of the brain from neuronal firings can give us a deeper understanding of how brains work. Currently there are a number of different approaches used to predict the function of a biomolecular sequence. One approach is to model the underlying biochemistry and then simulate the activity of these molecules in a computational system. This approach relies on being able to model the underlying chemistry accurately, which is difficult, while also keeping the model computationally tractable, which becomes increasingly harder as more aspects of the underlying biochemistry are incorporated into the model. Other approaches typically use a large number of existing sequences (usually a multiple sequence alignment (MSA)) that are known to have the particular function that is of interest, in order to train a model that “learns” how the function is encoded in the sequences (Hopf et al., 2017; Riesselman et al., 2018). These approaches have had moderate success, but have a number of issues: they tend to be unable to extract epistatic effects in the sequences, at least when there are multiple sites in the sequences that are in epistasis. Deep learning models have been proposed that take into account higher order correlations in the sequences, but they are still unable to precisely modulate the degree of correlation; they learn the entire data set taking into account all correlations at once. These models also run into the issue of being computationally expensive as they involve millions of parameters that need to be fit to the training data. The deep learning models also have the fundamental limitation of not being able to distinguish between the signal and noise in the training data set and these models 5 typically need to be tweaked in order to separate out the noise from the signal. In chapter 4 I develop a novel technique of predicting function from individual sequences that relies on the observation that the function of a sequence is exactly the “information” the sequence has about how to perform the function in its environment. This observation allows established information-theoretic analyses, which have traditionally only been used to cap- ture the information content of entire ensembles, to be extended for the purpose of capturing the information content of individual sequences by assigning each sequence an “information score”. This information-theoretic approach addresses many of the issues prevalent in clas- sic approaches; the information-theoretic approach is computationally cheaper than other approaches, and does a good job of distinguishing between noise and signal in the training data, since the information content is being explicitly extracted from the training data. I also show that this approach recapitulates the information content of the entire training data when the information score of all sequences are aggregated. In chapter 5 I validate this approach on a real world data set - the binding ability of the WW domain of the YAP-1 protein to bind to its polyproline peptide ligand (Araya et al., 2012), and show that the information-theoretic approach outperforms the best known existing approaches. 1.4 The utility of using digital systems Observing systems in the wild can be extremely difficult and expensive, if not outright impossible in some cases. For starters, not all aspects of the system are easily observable, especially on very small scales of time and size. In addition, replicability can be hard to achieve, and perfect replicability is impossible. Mathematical models, on the other hand are very precise and completely replicable. They have been used to study natural systems to great effect (Wigner, 1990). However, there are limits on the kinds of dynamics that can be modeled mathematically, and for high dimensional systems with many variables, completely analyzing the system is intractable. An important aspect that mathematical models have difficulty with is stochasticity. Random events are pervasive in natural systems, and can have profound effects on the behaviour of 6 systems, which are not easily modeled. Digital systems have none of the issues mentioned above. With the power of modern computers, extremely complicated systems with many moving parts can be simulated in reasonable amounts of time. This approach has been used successfully in various fields (Adami et al., 2016). Additionally, every aspect of the system can be analyzed in as much depth as we care about, a benefit that is a natural consequence of being able to perfectly replicate the observed behaviour. Random events can be simulated with ease, and because they are fundamentally only pseudo-random, they can be replicated with ease. As a result, simulating complex systems in a digital environment has a number of benefits, and is the reason I use these systems in my research. 1.5 The utility of information theory Among the mathematical tools we have available, information theory is an extremely powerful approach, as at its core it attempts to describe the information content of systems. While this theory has been known in some form or the other for several decades (Shannon, 1948), it has notoriously been an under-appreciated tool when analyzing complex systems. In particular, classical formulations have only been able to describe the information content of entire ensembles, rather than individual components of the system (Szostak, 2003). Applying this theory to be able to capture the information content of components of a system has never been successfully done. In the work done in chapter 4 I develop a formulation that can be used to describe the information content of individual components of a system, and show that this can be useful in a variety of ways. By using the power of digital systems to be able to generate large amounts of data (in particular, complete landscapes, which is impossible for non-digital systems), I validate the formulations that I developed. 7 Chapter 2 The cognitive structure and function of brains evolved in periodically changing environments This chapter is published in (CG et al., 2018), and is a result of work done in collaboration with Barbara Lundrigan, Laura Smale, and Arend Hintze. 2.1 Abstract Natural organisms have transitioned from one niche to another over the course of evolu- tion and have adapted accordingly. In particular, if these transition go back and forth be- tween two niches repeatedly, such as transitioning between diurnal and nocturnal lifestyles, this should over time result in adaptations that are beneficial to both environments. Fur- thermore, they should also adapt to the transitions themselves. Here we answer how Markov Brains, which are an analogue to natural brains, change structurally and functionally when experiencing periodic changes. We show that if environments change sufficiently fast, the structural components that form the brains become useful in both environments. However, brains evolve to perform different computations while using the same components, and thus have computational structures that are multifunctional. 2.2 Introduction Changing fitness landscapes affect evolutionary dynamics profoundly (Wilke et al., 2001a; Laakso et al., 2001). Slow changes can favour specialists, while generalists can have an advantage in environments that change often (Travis and Travis, 2002; Li and Wilke, 2004). At the genetic level, changing environments modify the effect of mutations, and can alter their rate of fixation. A mutation that is beneficial in one environment could be even more beneficial in the next, thereby accelerating fixation. Conversely, a deleterious mutation effect can be mitigated by an environmental change, thus allowing valley crossings to occur more often. As the fossil record makes abundantly clear, the history of the planet is peppered with 8 sudden, drastic environmental changes that impose vastly different constraints on surviv- ing species, e.g. The Great Oxidation Event. Another example is the mass extinction of dinosaurs in the cretaceous, which allowed previously nocturnal rodents to fill the diurnal niche recently vacated by sauropods (Gerkema et al., 2013; Maor et al., 2017). Transitioning from a nocturnal to a diurnal lifestyle, or vice versa, is arguably one of the toughest transi- tions to make for a vertebrate. A transition of this magnitude requires that not only entire metabolic pathways adapt to the changed temperature, but also that the sensory modalities adapt to cope with different constraints on foraging. A nocturnal lifestyle either requires extreme adaptations of the vision system to deal with the lack of light, or requires that organisms rely on other sensors, such as olfaction or echolocation in the case of bats. On the other hand, the light of day allows distant objects to be detected easily, emphasizing the importance of vision. On the downside, the higher temperature during the day causes odors to evaporate more quickly, reducing the importance of olfaction. Lastly, a new envi- ronment entails new predators (who may already have adapted to this environment), and hence the entire behavioral repertoire with respect to predation response must be adapted. As a consequence, we expect to see wide ranging structural and functional changes in the neural architecture of organisms that have to adapt to such drastic changes. Additionally, the changes between noctural and diurnal preference is also believed to drive speciation (San- tini et al., 2015). We expect that periodic transitions not only force the organism to evolve to both environments, but may also cause the organism to adapt to the transitions them- selves (as observed in the digital evolutionary model from (Li and Wilke, 2004)). Here, we are specifically interested in these repeated changes, since they should drastically affect the organization and topology of the evolved neural architectures. There have been several computational approaches that have tried to address this ques- tion. For example, it has been shown that neural modularity can mitigate catastrophic forgetting (Ellefsen et al., 2015), i.e. modular structures allow organisms to retain cognitive functions when experiencing a different environment for a number of generations. At the 9 same time, it is not clear that changing environments necessarily promote modular struc- tures (Hüsken et al., 2002; Schlosser and Wagner, 2004; Kashtan and Alon, 2005; Hintze and Adami, 2008). At the genetic level, it has been shown that an organism’s reproductive strat- egy affects the modularity and epistasis of the underlying genetic architecture (Misevic et al., 2006). At the moment, it is unclear if these results will translate to neural architectures. Here we study how an agent controlled by a Markov Brain (Edlund et al., 2011; Marstaller et al., 2013; Hintze et al., 2017) evolves to forage in an environment that periodically changes over the course of evolution. In one environment, the agent can “see” the food, while in the other the agent can only “smell” the food. We show that periodic transitions cause agents to rearrange their neural architecture in such a way that their structural components become useful in both environments - a phenomenon we term structural amalgamation. However, we also show that the structurally amalgamated brains do not create functionally overlapping modules. The brain states that an agent experiences are mostly unique to each environment, and are not shared between environments; the degree to which functional states are shared we term functional amalgamation. 2.3 Materials and Methods 2.3.1 Evolutionary Framework All experiments were performed using the MABE (Bohm et al., 2017) framework. This allows us to evolve different neural substrates, in different environments, and using different selection methods, and allows the results of these experiments to be compared directly. The specific MABE modules used in these experiments are detailed below. 2.3.2 Markov Brains Using natural organisms for an evolutionary experiment such as we are going to perform here, would be extremely time consuming, inconvenient, and costly. As an alternative, we use evolvable Markov Brains (MBs) (Edlund et al., 2011; Marstaller et al., 2013), which are networks that can be composed of probabilistic and deterministic logic gates (for a detailed technical description of Markov Brains see (Hintze et al., 2017)). Here we only use 10 deterministic gates. Each MB is encoded by a genome, which contains genes, and each gene encodes one of the logic gates by specifying what function it performs and how it connects to other gates, the sensors, and the actuators of the agent the MB controls. Figure 2.1 shows the schematic of an MB, and Figure 2.2 shows the encoding that is used to construct MBs from the genome. Among other applications, MBs have been used to understand the evolution of animal behavior (Olson et al., 2013), decision making (Kvam et al., 2015), and rates of evolutionary adaptation in different contexts (Schossau et al., 2016). These MBs are used to control an agent that interacts with the environment using different kinds of sensors, actuators, and internal machinery. We study how these MBs respond to evolutionary pressures, as an abstract model of a natural brain. Clearly, neurons, tissues, sensory cells, and muscles are very different from logic gates and the binary sensors and actuators we use here. However, the analogy between these systems can be drawn on a different level; both systems have components that are involved in different processes, and we are interested in how evolution affects how these components are used, and in turn how their use affects their future evolution. In natural organisms we can ablate cells (say in the visual cortex), and observe if the loss of these cells modulates behavior. Similarly, we can toggle logic gates on and off (like a cell ablation or genetic knockout), which allows us to draw analogous conclusions about the function of those logic gates. As a consequence, our results will not make specific implications about natural neural architecture, but rather how functionality is distributed over the components optimized by selection. 2.3.3 Environment Agents are tasked with solving a foraging problem and their performance is measured by how much food they can collect over a specified period of time. The environment is a square 2D grid with periodic boundary conditions (effectively turning the grid into a torus). There are 20 pieces of food randomly distributed over the grid, and the size of the grid is 10 × 10. When an agent is directly on top of a resource, it can consume it. To keep the quantity of food constant, each time a food is consumed, a new piece of food is placed randomly 11 Figure 2.1: Schematic view of a Markov Brain Network. The sensory inputs, hidden states, and motor outputs at time point t at the top are read by two gates. The result of their computation propagates to new states at time point t + 1. In case two gates try to write into the same state, their outputs will be integrated by applying a logical OR operation. Figure 2.2: Encoding of a Markov Brain. A string of numbers is used as the genome. The genome is read from end to end, and specific pairs of numbers (42,213 for deterministic logic gates) indicate the start of a “gene”. Each gene encodes the connections of one logic gate: red sites are used for the number of inputs, green for the number of outputs , blue for the actual addresses of the states the gate reads from, and yellow for the actual addresses of the states the gate writes into. The remaining cyan colored sites are used to specify the logic of the gate. into the environment on an empty space. An agent can take one of 4 different actions at every time step, namely, move forward, turn 45 degrees left or right, or attempt to consume food, which is only successful if food is directly under them. The environment can be of two different types, which we call day and night. During the day, agents can see food and not smell the food, while during the night, agents can only smell food and not see it. Sensors are implemented as four separate binary inputs for each type of environment. In the case of vision, the sensors light up if the agent is directly facing the food, and within 4 steps of the food. The number of sensors that light up is equal to 4 minus the number of steps 12 Day: Night: Sensors: : Sensor Off : Sensor On : Agent : Food Figure 2.3: Environment types as perceived by an agent. An agent is represented by a squiggle and food by a star. The Day panel shows the agent’s line of sight with blue squares. The Night panel similarly shows the region that can be smelled by the agent with blue squares. The sensors that light up in the respective environments are shown below each panel. from the closest food (see Figure 2.3 day). In the case of olfaction, the sensors light up if the agent is within a Manhattan distance of 4 from the food, in any direction. Similarly to the day environment sensors, the number of sensors that light up in the night environment is equal to 4 minus the Manhattan distance to the closest food (see Figure 2.3 night). In either environment, if all the corresponding sensors of an agent light up, this indicates that the agent is standing directly on top of the food. This allows agents to have similar sensory information that correspond to both the environmental types. Agents experience an environment for 100 time-steps and can collect food during this time. The environment can technically change from day to night in two different ways: either during the lifetime of the agent, which allows them to experience both types of environments during their lifetime, or only over generational time-scales. Here we are interested in the changes to the neural architecture and topology resulting from generational changes to the 13 environment, and as a result, individual agents will only ever encounter a single type of environment during their lifetimes. 2.3.4 Selection Method Populations of agents were allowed to evolve for 10,000 generations in periodically cy- cling environments. At each generation, agents were selected to propagate offspring to the next generation, proportionally to their ability to collect food, which implements a Moran process (Moran, 1958). 2.3.5 Amalgamation To understand how the architecture and function of the evolved MBs are affected by the periodic environmental changes, we perform two kinds of analyses on these agents. Knockout Experiments and Structural Amalgamation After agents were allowed to evolve, their performance along the line of decent (LOD) (Lenski et al., 2003) (the path from a random organism in the final population traced over its an- cestry, back to the start population), was reconstructed. Each agent on the LOD was then tested 50 times over 500 time steps, both in a purely day environment and in a purely night environment in order to more accurately assess an individual’s ability to forage in these dif- ferent types of environments. This also defined the baseline of performance for subsequent knockout experiments. Each agent has a set of logic gates that define what we call structural components. Obvi- ously each gate also conveys a function, but we assess the functional contribution differently (see functional amalgamation). After all, different functions could be implemented on the same computational structure, analogous to how computers are capable of running different programs using the same hardware. We test the contribution of each gene by sequentially removing one gate after another and testing how the agent performs in both the day and night environments. The loss (or gain) in performance, for each individually knocked out component, defines the extent to which it is involved in computations pertaining to the day or night type of environment, and to what degree this gene is involved in both computations. 14 Modularity measures the degree of connectivity of a component within a module as op- posed to its connectivity to the remaining components. Since evolved MBs have to solve at most two tasks, we would a priori expect them to have either one or two modules. How- ever, typical modularity measures do not differentiate between having only one module or no modules at all (for an elaboration on this topic see (Hintze and Adami, 2010)). We will therefore quantify the “Amalgamation” of the evolved brains and not the modularity. Specifically, we want to quantify to what degree the genes that determine the function and connectivity of logic gates in MBs, are involved in diurnal or nocturnal foraging, and if they are needed in both environments. If all gates are used in both environments we argue that the brain is fully amalgamated. To that end, performance was measured in both environments as Wday and Wnight . After that, each component was knocked out one after another, and the performance after each knockout in each environment was assessed as Kday,i and Knight,i (where i indicates the index of each component). We can now quantify the contribution of each component i to the performance of day and night as: ∆Wday,i = Wday − Kday,i (2.1) ∆Wnight,i = Wnight − Knight,i (2.2) A gate (i) that contributes to only one type of environment, say the day, will have a high ∆Wday,i and a low ∆Wnight,i . A gene that contributes to both types of environments, will however result in both delta values being high, while an obsolete or redundant gene will show no effect on the deltas. Additionally, we observed cases where, when genes are deleted, the performance of the agent in one or both environments increased, effectively creating negative values for ∆Wnight,i or ∆Wday,i . Since we are only concerned with the degree of the effect the deleted gate has, we used the absolute of these delta values. The overlap or degree of amalgamation for each gene Ai can now be quantified as the lower of the values |∆Wnight,i | 15 and |∆Wday,i | for each gene i. The total degree of structural amalgamation (AS ), normalized to the number of genes N is thus: N 1 X AS = min(|∆Wday,i |, |∆Wnight,i |) (2.3) N i=0 A brain that has high AS is a highly amalgamated brain, in that more of its structural components are involved in the function of both types of environments. State Transition Experiments and Functional Amalgamation We also analyzed the “experience” that brains have in both types of environments. The experience of a brain can be captured by recording the internal states of a brain. This is analogous to experiments done on natural organisms’ brain states (Werner, 2009). In a nutshell, the state a brain is in at every moment defines the experience the brain has. Since we are working with digital organisms, we recorded the internal states of a brain at every single update as well as the states of the environment. We then reconstructed the brain state transition graph. This graph shows how each brain state transitions to the next one, based both on the sensory inputs the agent receives, and what the agent does. In this graph, each node represents a brain state (experience) and edges show the transitions from state to state given the inputs. By running the agent in different environments, we color coded which node was visited in what environment. We expect some states (experiences) to be visited only in one environment, while other states might be visited in both environments. This allows us to quantify how much each state of the brain is involved in experiencing and processing each environment (e.g. Fig 2.9). Functional amalgamation (AF ) is therefore V (the number of states visited in both environments) over N (the total number of states experienced): V AF = (2.4) N 16 2.4 Results We evolved populations of 100 agents starting with the day or night environment and switched the environments in periodic conditions (no switches, every 1 to 30 generations, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 500, 750, 1000, 2500, and 5000 generations) over 10,000 generations. For each experimental condition, 40 replicates were performed. For each replicate the line of decent (LOD) was reconstructed (Lenski et al., 2003) and each agent on the LOD was tested in the day and night environment again. When evolving agents in only the day or night environment we find that performance over evolutionary time converges to a maximum that is different for the day and night environment (see Figure 2.4). We conclude that in this computational model vision is the superior sense, since it does not require agents to forage for food as much as olfaction requires them to. This also shows that evolving from the night to the day environment is easier than the other way around, which creates an asymmetry within the experiment. However, this is also true for natural evolution, where the transitions between environments are often easier in one direction than the other. Generally, agents experiencing environmental changes very often (up to about changes every 10 generations) keep improving their performance regardless of the transitions over the entire course of the experiment (see Figure 2.5 panel 1 and 2). Agents whose environment changes less often (every 20 to 50 generations) improve their performance less well. They encounter an occasional drop in fitness once in a new environment (see Figure 2.5 panel 3). However, they still manage to optimize their performance overall, just not to the degree they would have in a constant environment (compare to Figure 2.4) or if switches occurred more frequently (see Figure 2.5 panel 1 and 2). Lastly, if environmental switches happen rarely (above 100 generations) agents keep experiencing a drop in fitness once the environment switches, and they are not capable of improving their general performance further over time 17 80 80 70 2 70 1 60 60 50 50 W 40 W 40 30 30 20 20 10 10 0 0 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 Generations Generations Figure 2.4: Average fitness of agents from the LOD evolved in only the night environment (panel 1) or the day environment (panel 2) without any periodic changes. The fitness mea- sured in the day environment is shown as a dashed line. The fitness in the night environment is shown as a solid line. Gray shadows indicate the standard error over all 40 replicates. (see Figure 2.5 panel 4). For an overview of the attained fitness in the last generation, i.e. the performance after 10,000 generations, for different periods between switches, see Figure 2.6. These results confirm that when environments change rapidly enough agents can not only adapt to one environment at a time, but also adapt to the transition itself (Li and Wilke, 2004). We did not measure if the loss of function when transitioning to a new environ- ment is in itself adaptive i.e. if the rate of function loss decreases over time. However, how this adaptation works in agents controlled by Markov Brains is likely to be different than in (Li and Wilke, 2004). This previous work used AVIDA (Ofria and Wilke, 2004) which evolves sequences of computer commands that control each individual. Rapidly changing environments cause those commands beneficial in one environment to be interspersed with commands that are beneficial in the other environment. Markov Brains do not have this sequential structure but instead form networks, which in our opinion much more closely re- semble actual neural structures. The question now is how Markov Brains change structurally 18 and functionally under these conditions. 2.4.1 Structural Amalgamation We defined the degree of structural amalgamation as the degree to which the computa- tional components (logic gates) are used in both environments. We find that at the end of evolution, the degree of structural amalgamation depends strongly on the speed at which environments switch periodically (see Figure 2.7). This result shows that fast changing environments indeed cause structural components of the brain to be used in both environments, while no changes or rarely changing environments do not cause structural amalgamation. This is not too surprising once the selection pressures these agents experience are considered. After a switch in environment, the selection pressure for components involved in the previous environment is lost, and thus genes (gates) start to drift. The only way this drift could be halted is by ensuring that these components are now also useful in the new environment. 2.4.2 Functional Amalgamation As explained before, amalgamating the structural components of the brain does not imply that functions have to overlap or become intertwined as well. At the same time, structural amalgamation also does not preclude this from happening. We find that the degree of functional amalgamation does not depend on the frequency of periodic transitions (see Figure 2.8) and is generally low (less than 30% of states are experienced in both environment). 2.5 Discussion We showed that MBs evolved to control an agent in periodically switching environments, adapt not only to each environment anew after a switch, but also experience structural and functional reorganization due to the switches. As long as environments change frequently enough (less than 50 generations in each environment) agents do not experience a significant loss in fitness after a switch, and can improve fitness over many switches. Natural organisms probably experience spontaneous or periodic switches on much larger timescales, however natural organisms certainly evolve much slower than the digital ones we used here. We argue 19 that these two effects balance each other and that our results apply to natural organisms qualitatively if not necessarily quantitatively. The difference between structural and functional amalgamation is an important obser- vation. While agents amalgamate their components (genes/gates) to be used in both en- vironments, they evolve to have state to state transitions organized so that states used in one context are separated from states used in another. In hindsight this might not be too surprising, since this form of organization separates mental states from each other as much as possible; environments are experienced differently instead of similarly. Also, at every time point, MBs can only be in one particular state, and a new input will change the state of the brain to a new state, given the current one. As a consequence, specific states can only be reached by a specific sequence of inputs. When the agent finds itself in a new environment that works differently than the one it evolved in, it is unlikely that a similar sequence of inputs will ever be observed. Instead, a different stream of inputs will necessarily lead to a different sequence of brain states. What is particularly surprising, is the finding that even brains that have never experienced a different environment, also do not have significant functional amalgamation (see Figure 2.7 no transitions). Brains that have only evolved in one environment have no direct selection pressure against functional amalgamation. As a consequence, there should be no particular preference or aversion to functional amalgamation, however we still observe that only a few states are shared between both environments. This suggests that agents experience the environment they are evolved in properly, but when they encounter a new environment, they are experiencing distinctively different states. Again, figuratively speaking, when you see something unknown, instead of trying to explain how this new perception fits to something you already know you instead conclude that you haven’t seen this before. The state to state transitions for evolved MBs resemble this principle. Observe that this is at best a loose analogy, albeit an interesting one. We predict that sufficiently short periodic environmental changes should have a similar 20 amalgamating effect on the structure of natural brains while preserving functional separation. Analyzing the anatomy of natural brains across differently evolved species should reveal this principle to be applicable to natural organisms as well. Further investigations will consider the effect of periodic and aperiodic switching of en- vironments on the evolvability and robustness of the computational systems. Similarly, we are curious about the effect of periodic transitions on agents that experience both the day and night environment during their lifetime. These studies will shed more light on the effect of evolutionary transitions on neural architectures. 21 80 80 70 1 70 2 60 60 50 50 W 40 W 40 30 30 20 20 10 10 0 0 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 Generations Generations 80 80 70 3 70 4 60 60 50 50 W 40 W 40 30 30 20 20 10 10 0 0 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 Generations Generations Figure 2.5: Average fitness of agents from the LOD evolved with periodic environmental switches 1) every 4 generations, 2) every 10, 3) every 50, and 4) every 100 generations. The fitness measured in the day environment is shown as a dashed line. The fitness in the night environment is shown as a solid line. Gray shadows indicate the standard error over all 40 replicates. 22 70 60 50 W 40 30 20 10 0 0 20 40 60 80 100 Period Figure 2.6: The average fitness of organisms starting their evolution in the day environment. The solid line shows their performance in the night environments, and the dashed line their performance in the day environment for different periods between switches (x axis) after 10,000 generations. Each data point is the average of 40 replicate experiments, error bars show the standard error. 17.5 15.0 12.5 10.0 AS 7.5 5.0 2.5 0.0 0 20 40 60 80 100 Period Figure 2.7: Structural amalgamation (AS ) for agents evolved under different periodic condi- tions (x axes). Agents that started to evolve in the night environment are shown as a solid line, and agents that started to evolve in the day environment are shown as dashed lines. Error bars indicate the standard error. Observe that the left most x-value (0) represents agents that never experienced any environmental change, and thus don’t exhibit any struc- tural amalgamation. 23 0.350 0.325 0.300 0.275 AF 0.250 0.225 0.200 0.175 0 20 40 60 80 100 Period Figure 2.8: Functional amalgamation (AF ) for agents evolved under different periodic con- ditions (x axes). Agents that started to evolve in the night environment are shown as a solid line, and agents that started to evolve in the day environment are shown as dashed lines. Error bars indicate the standard error. 24 0 V2 O0,V1,O1,V0 35_R 35_E V1,V2 V1,V2,V0 0 35_M V0,V1,V2 0,O1,V3 0,V3 V2 O2 3_M 0,V2,O3 O2 O3 4_M O1 O1,O2 O3,0 99_L 100_L O1 O3,0 O2 O2 V3,O3,0,O2 O2 100_M O2 O1 V2,V1 O3 O1 0,V3,O3 99_M 98_M O1,O0 O2 99_E O2,O0 O0,O1 O2 34_M O2 65_R O2 3_R Figure 2.9: A state to state transition graph for an agent evolved only in the night envi- ronment. The nodes represent individual brain states and their corresponding actions (E eating, M for moving forward, R for turning right, and L for turning left). Arrows are labeled according to the sensory inputs experienced during the lifetime of the agent. A label starting with O indicates an olfactory input, while V indicates a vision signal. The number trailing the O or V specifies the distance of the food. After testing the agent multiple times in both environments, the states experienced during the day environment are colored in blue, and the night environment colored in red. The states visited in both environments are colored in green. The width of the edges indicates how often each transition was observed relatively to each other. Agents start in state 0. 25 Chapter 3 The structure of a primordial fitness landscape This chapter is published in (CG et al., 2017) and (Nitash and Adami, 2021), and adapted from both publications. The work presented here was done in collaboration with Thomas LaBar, Arend Hintze, and Christoph Adami. 3.1 Abstract While all organisms on Earth descend from a common ancestor, there is no consensus on whether the origin of this ancestral self-replicator was a one-off event or whether it was only the final survivor of multiple origins. Although resolution of this mystery would enhance our knowledge of the fundamentals of biology, including the likelihood of life on other planets, the number of uncertainties surrounding the origin of life on Earth impede progress. For example, the possible varieties of ancestral self-replicators and their likelihood of spontaneous emergence and persistence are unknown. Here we use the digital evolution system Avida to study the origin of self-replicating computer programs. By using a computational system, we avoid many of the uncertainties inherent in any biochemical system of self-replicators (while running the risk of ignoring a fundamental aspect of biochemistry). First, we generated the exhaustive set of minimal-genome self-replicators and analyzed the network structure of this fitness landscape. We then examined the evolvability of these self-replicators and found that the evolvability of a self-replicator is dependent on its genomic architecture. Finally, we looked at the differential ability of replicators to take over the population when competed against each other, akin to a primordial-soup model of biogenesis, and found that the probability of a self-replicator out-competing the others is not uniform. Instead, progenitor (most-recent common ancestor) genotypes are clustered in a small region of the replicator space. Our results demonstrate how computational systems can be used as test systems for hypotheses concerning the origin of life. 26 3.2 Introduction There is perhaps no topic in biology more fascinating–and yet more mysterious–than the origin of life. With only one example of organic life to date, we have no way of knowing whether the appearance of life on Earth was an extraordinarily rare event, or it if was a commonplace occurrence that was unavoidable given Earth’s chemistry. Were we to replay Earth’s history one thousand times (Gould, 1990), how often would it result in a biosphere? And among the cases where life emerged, how different or how similar would the emergent biochemistries be? The role of historical contingency has been studied extensively in the evolution of life (see, e.g., (Blount et al., 2008) and references therein). Here we endeavour to ask an even more fun- damental question: What is the role of historical contingency in the origin of life? The best evidence suggests that the first self-replicators were RNA-based (Gilbert, 1986; Robertson and Joyce, 2012), although other first self-replicators have been proposed (Leslie E, 2004). Given the large number of uncertainties concerning the possible biochemistry that would lead to the origin of self-replication and life, either on Earth or other planets, researchers have begun to study the process of emergence in an abstract manner. Tools from com- puter science (Pargellis, 1996; Hutton, 2002; Pargellis, 2003; Dorn et al., 2011; Walker and Davies, 2013; Greenbaum and Pargellis, 2017), information theory (Walker, 2014; Adami, 2015; Adami and LaBar, 2017a; Davies and Walker, 2016), and statistical physics (England, 2013; Mathis et al., 2015) have been used in an attempt to understand life and its origins at a fundamental level, removed from the peculiarities of any particular chemistry. Investi- gations along those lines may reveal to us general laws governing the emergence of life that are obscured by the n = 1 nature of our current evidence, point us to experiments that probe such putative laws, and get us closer to understand the inevitability–or perhaps the elusiveness–of life itself (Cronin and Walker, 2016). 27 At the heart of understanding the interplay between historical contingency and the origin of life lies the structure of the fitness landscapes of these first replicators, and how that landscape shapes the biomolecules’ subsequent evolution. While the fitness landscapes of some RNA-based genotypes have been mapped (Jiménez et al., 2013; Petrie and Joyce, 2014) (and other RNA replicators have been evolved experimentally (Mills et al., 1967)), in all such cases evolution already had the chance to shape the landscape for these organisms and “dictate”, as it were, the sequences most conducive for evolution. The structure of primordial fitness landscapes, in comparison, is entirely unknown. While we know, for example, that in realistic landscapes highly fit sequences are genetically close to other highly fit sequences (this is the essence of Kauffman’s “Central Massif” hypoth- esis (Kauffman, 1993), see also (Østman et al., 2010)), we suspect that this convenient property–which makes fitness landscapes “traversable” (Østman et al., 2010)–is an outcome of evolution, in particular the evolution of evolvability. What about primordial landscapes not shaped by evolution? How often are self-replicators in the neighborhood of other self- replicators? Are self-replicators evenly distributed among sequences, or are there (as in the landscapes of evolved sequences) vast areas devoid of self-replicators and rare (genetic) areas that teem with life? Can evolution easily take hold on such primordial landscapes? These are fundamental questions, and they are central to our quest to understand life’s origins. If the fitness landscape consist of isolated fitness networks, as found in some mod- ern RNA fitness landscapes (Jiménez et al., 2013; Petrie and Joyce, 2014), then one may expect the effects of historical contingency to be strong, and the future evolution of life to depend on the characteristics of the first replicator. However, if there exist “neutral net- works” that connect genotypes across the fitness landscape (as found in computational RNA landscapes (Huynen et al., 1996)) then the effect of history may be diminished. Can we learn more about these options? Recently, we have used the digital evolution platform Avida as a model system to study questions concerning the origin of life (Ofria et al., 2009a). In Avida, a population of self- 28 replicating computer programs undergo mutation and selection, and are thus undergoing Darwinian evolution explicitly (Pennock, 2007). Because the genomic content required for self-replication is non-trivial, most avidian genomes are non-viable, in the sense that they can- not form “colonies” and thus propagate information in time. Thus, viable self-replicators are rare in Avida, with their exact abundance dependent on their information content (Adami, 2015; Adami and LaBar, 2017a). Further work on these rare self-replicators showed that while most of them were evolvable to some degree, their ability to improve in replication speed or evolve complex traits greatly varied (LaBar et al., 2015). Furthermore, the capabil- ity of avidian self-replicators to evolve greater complexity was determined by the algorithm they used for replication, suggesting that the future evolution of life in this digital world would be highly contingent on the original self-replicator (LaBar et al., 2016). However, all of this research was performed without a complete knowledge of the underlying fitness landscape, by sampling billions of sequences of a specific genome-size class, and testing their capacity to self-replicate. Sequences used to seed evolution experiments in Avida are usually hand-written (Adami, 1998, 2006), for the simple reason that it was assumed that they would be impossible to find by chance. Indeed, a typical hand-written ancestral replicator of length 15 instructions is so rare–were it the only replicator among sequences of that length–that it would take a thousand processors, executing a million sequences per second each in parallel, about 50,000 years of search to find it (Adami and LaBar, 2017a). However, it turns out that shorter self-replicators exist in Avida. An exhaustive search of all 11,881,376 sequences of length L = 5, as well as all 308,915,776 sequences of length L = 6 previously revealed no self- replicators (Adami and LaBar, 2017a). However, in that investigation six replicators of length L = 8 turned up in a random search of a billion sequences of that length, suggesting that perhaps there are replicators among the 8 billion or so sequences of length L = 7. Here, we confirm that the smallest replicator in Avida must have 8 instructions by testing all L = 7 sequences, but also report mapping the entirety of the L = 8 landscape (268 ≈ 29 209 × 109 sequences) to investigate the fitness landscape of primordial self-replicators of that length. Mapping all sequences in this space allows us to determine the relatedness of self- replicators and study whether they occur in clusters or evenly in sequence space, all without the usual bias of studying only sequences that are among the “chosen” already. Of the almost 209 billion possible genomes, we found that precisely 9141 could undergo self-replication and reproduction, and thus propagate their information forward in time in a noisy environment. We found that these 914 primordial replicators are not uniformly distributed across ge- netic space, but instead cluster into two broad groups (discovered earlier in larger self- replicators (LaBar et al., 2016)) that form 13 main clusters. By analyzing how these groups (and clusters) evolve, we are able to study how the primordial landscape shapes the evo- lutionary landscape, and how chance events early in evolutionary history can shape future evolution. 3.3 Methods 3.3.1 Avida We used Avida (version 2.14) as our computational system to study the origin of self- replication. Avida is a digital evolution system in which a population of computer programs compete for the system resources needed to reproduce (see (Ofria et al., 2009a) for a full description of Avida). Each of these programs is self-replicating and consists of a genome of computer instructions that encode for replication. During this asexual reproduction process, mutations can occur, altering the speed at which these programs reproduce. As faster replicators will out-reproduce slower replicators, selection then leads to the spread of faster replicators. Because avidian populations undergo Darwinian evolution, Avida has been used to explore many complex evolutionary processes (Lenski et al., 1999; Adami et al., 2000; Wilke et al., 2001b; Chow et al., 2004; Covert et al., 2013; Goldsby et al., 2014; Zaman et al., 2014). The individual computer programs in Avida are referred to as avidians. Each avidian 1 The sequences of all replicators can be downloaded from 10.6084/m9.figshare.4551559. 30 organism is essentially a stack based program, where values can be pushed and popped on to the stack, serving as a form of memory (see Fig 3.1). They consist of a genome of computer instructions and different containers to store numbers. Each genome has a defined start point and instructions are sequentially executed throughout the avidian’s lifetime. Some of these instructions allow the avidian to start the replication process, copy their genome into a new daughter avidian, and divide into two avidians (see (LaBar et al., 2016) for the full Avida instruction set). During this replication process, mutations can occur, causing the daughter avidian’s genome to differ from its parent. These mutations can have two broad phenotypic outcomes. First, mutations can alter the number of instruction executions re- quired for replication; these mutations can increase or decrease replication speed and thus fitness. Second, the fixation of multiple mutations can lead to the evolution of complex traits in Avida. These traits are the ability to input binary numbers from the Avida environment, perform Boolean calculations on these numbers, and then output the result of those calcu- lations. In the experiments described here, avidians could evolve any of the nine one- and two-input logic functions (Not, Nand, OrNot, And, Or, AndNot, Nor, Xor, and Equals). This is usually referred to as the “logic-9” environment (Lenski et al., 2003). The ability to perform the above Boolean logic calculations (possess any of these nine traits), increases its bearer’s replication speed by increasing the number of genome instruc- tions the bearer can execute per unit of time. The more instructions an avidian can execute during a unit of time, the fewer units of time that are required for self-replication. These units of time are referred to as updates (they are different from generations). During each update, the entire population will execute 30N instructions, where N is the current popula- tion size. The ability to execute one instruction is called a “Single Instruction Processing” unit, or SIP. If the population is monoclonal, each avidian will receive, on average, 30 SIPs. However, every avidian also has a merit which determines how many SIPs they receive per update. The greater the merit, the more SIPs that individual receives. The ability to per- form the nine calculations multiply an individual’s merit by the following values: Not and 31 Figure 3.1: Schematic of an Avidian organism, showing the CPU (logic unit), the stack (memory), the genome, and the input-output interface to the environment. Nand: 2, OrNot and And: 4, AndNot and OR: 8, Nor and Xor: 16, and Equals: 32. The Avida world consists of a fixed-size toroidal grid of cells. The total number of cells sets the maximum population size. Each cell can be occupied by at most one avidian. After successful reproduction, a new avidian is placed into one of the world’s cells. In a well-mixed population, any cell in the population may be chosen. In a population with spatial structure, the new avidian is placed into one of the nine cells neighboring the parent avidian (including the cell occupied by the parent). If there are empty cells available, the new avidian occupies one of these cells. If all possible cells are occupied, a cell is chosen at random, its occupant removed from the population, and the new avidian then occupies this cell. This random removal implements a form of genetic drift in Avida. For the experiments performed here, the population structure was spatial. 3.3.2 Experimental Design In order to map the entire Avida fitness landscape, we constructed all 268 ≈ 2.09 × 1011 genomes and analyzed whether they could self-replicate. This operation was performed by running these genomes through Avida’s Analyze Mode (described in the Data Analysis 32 section) and checking whether these genomes gave their bearer non-zero fitness, and whether they were viable. Next, we described the fitness landscape by looking for the presence of genotype clusters among the discovered self-replicators. We constructed a network of the fitness landscape where each genotype is a node and the length between two nodes is the square of the Hamming distance between the genotypes. We also examined the frequency of single instruction motifs (monomers), as well as double instruction motifs (dimers). To test the evolvability of the 914 self-replicators, we evolved 10 monoclonal populations of each replicator with 3,600 individuals for 2 × 104 updates in the logic-9 environment (see above). Point mutations occurred at a rate of 7.5 × 10−3 mutations per copied instruction, while single-instruction insertion and deletion mutations both occurred at a rate of 5 × 10−2 mutations per division. At the end of each population’s evolution, we analyzed the most abundant genotype from each population. In order to test the role of historical contingency when the appearance of self-replicators was frequent, we ran experiments where we evolved all 914 self-replicators in the same pop- ulation (a “primordial soup” of replicators). In each population, we placed 10 individuals of each self-replicator. The ancestral population then had 9140 individuals and could expand to 104 individuals at maximum capacity. These populations evolved for 5 × 104 updates in the logic-9 environment. Mutation rates were the same as in the previous evolvability exper- iments. This experiment was performed in 200 replicates. To identify the ancestral genotype that outcompeted all of the other genotypes, we isolated the most abundant genotype at the end of the experiment and traced its evolutionary history back to its original ancestor. 3.3.3 Data analysis Statistics on different avidians were calculated using Avida’s Analyze Mode. In Analyze Mode, a single genotype is examined in isolation as it executes the instructions in its genome, runs through its life-cycle, and possibly creates an offspring. This confers on experimenters the ability to calculate the fitness for an individual avidian (number per offspring generated per unit time) and examine other characteristics, such as whether it can reproduce per- 33 fectly (all offspring are genetically identical to each other and the mother genome) or which traits this avidian possesses. Analyze Mode was also used to calculate quantities such as genome size. Avida’s analyze mode code is available along with the entire Avida software at https://github.com/devosoft/avida. Across-population means and standard errors were calculated using the NumPy (Van Der Walt et al., 2011) Python software package. The clusters of replicators were rendered using Neato, which is an undirected graph embedder that creates a layout similar to that of Multi-Dimensional Scaling (Gansner and North, 2000). Figures were plotted using the Matplotlib Python package (Hunter et al., 2007). 3.4 Results 3.4.1 Structure of the Fitness Landscape Of the 268 (approximately 209 billion) genomes with 8 instructions, we found 914 that could self-replicate. We also searched for self-replicators with seven-instruction genomes but found none, establishing that L = 8 is the minimal self-replicator length in Avida. By discov- ering all self-replicators in this fitness landscape, we can now calculate the precise information content required for self-replication in Avida, using previously-established methods (Adami, 2015), as − log26 ( 914 268 ) ≈ 5.9 mers (a “mer” is a unit of entropy or information, normalized by the number of states that each instruction can take on, see (Adami, 2004)). Our previous estimate (Adami and LaBar, 2017a) of the information content of length-8 replicators, based on finding 8 replicators among a billion random samples, was 5.81 ± 0.13 mers. To study the genetic structure of these replicators, we obtained the distribution of instruc- tions (monomers) across the replicators’ genomes (Fig. 3.2a). This distribution is biased, as every single replicator contained at least the three instructions required for replication: h-copy, h-alloc, and h-divide (denoted by v, w, and x, respectively, see the mapping between instructions and the letter mnemonic in Table 1 in the Appendix). In addition, 75% of repli- cators have a b (nop-B), an f (if-label), and a g (mov-head) instruction, while 25% have a c (nop-C), an h (jmp-head), and an r (swap) instruction in their sequence. We also analyzed 34 A B monomer dimer Figure 3.2: A: Distribution of monomers/single instructions (i.e., proportion of self- replicators containing a given monomer). B: Distribution of dimers (pairs of instructions). Dimers are ordered lexicographically on the x-axis (the proportion of fg, gb, rc, and hc dimers are labeled). the distribution of sequential instruction pairs (dimers) and found that while most dimers do not occur in any self-replicators, the dimers fg and gb occur in approximately 70% of the replicators (Fig. 3.2b) and are highly over-represented . Other dimers such as rc, hc, and dimers containing f,g,b,c,v,w, and x occur in approximately 20%-30% of replicators. If there were no constraint on the genetic architecture, we would expect self-replicators to be distributed uniformly across the fitness landscape. However, we found instead that self- replicators are not distributed uniformly in the landscape, but are grouped into 41 distinct genotype clusters, shown in Fig. 3.3. The dimer distribution function we analyzed above separates primordial self-replicators into two major categories: those that carry fg/gb motifs (“fg-replicators” for short), as opposed to those carrying hc/rc motifs (“hc-replicators”) instead. This separation into two classes was noted earlier from a smaller sample of the landscape (LaBar et al., 2015, 2016), which we corroborate here. By scanning the entire landscape we can confirm that these two types are the only types of self-replicators in the landscape, and the clusters of genotypes are homogeneous in the sense that fg-replicators and hc-replicators do not intermix (Fig. 3.3). Fig. 3.4 shows four examples of clusters pulled from the landscape, showing that they are 35 tightly interconnected. Many self-replicators are isolated and 20 of these clusters consist of only 1 genotype. However, most self-replicators are located in large clusters. Almost 75% of the self-replicators are located in four major clusters with 212, 199, 165, and 95 genotypes each, and almost 96% are contained within the 13 clusters that have at least 14 members. There is thus a distinct gap in the cluster size distribution, with small clusters ranging from 1-3 connected members, while the next largest size class is 14. We find that clusters of replicators are highly connected among each other, with a degree distribution that is sharply peaked around the mean degree of a cluster (see Fig. 3.5), which is similar to what is seen in neutral networks of random RNA structures (Aguirre et al., 2011). We find that fg-replicators form the denser clusters. The 914 self-replicators we found vary in fitness, but consistently we find that the fittest self-replicators contain the fg/gb motifs and many of the lowest fitness self-replicators con- tain the hc/rc motifs. In Fig. 3.6 we show the fitness as a function of the MDS-coordinate. In that figure, color denotes fitness according to the scale on the right. The highest peaks and plateaus all belong to fg-replicators. The hc-replicators appear as a valley (dark blue) bordering the group of fg-replicators. 3.4.2 Self-Replicator Evolvability In order to explore the subsequent role of historical contingency after the emergence of life, we tested the evolvability of all 914 self-replicators. First, we evolved each replicator separately. Almost all self-replicators could evolve increased fitness (Fig. 3.7B). However, there was a wide range of mean relative fitness; fg-replicators clearly undergo more adaptation than hc-replicators. To explain why fg-replicators were more evolvable, we first looked at the evolution of genome size. Replicators with the fg/gb motifs grew larger genomes than replicators with the hc/rc motifs (Fig. 3.7c). As larger genomes can allow for the evolution of novel traits in Avida, and thus fitness increases, we next checked whether the fg-replicators had evolved more computational traits than the hc-replicators. In Avida, traits are snippets 36 802 864 24 54 804 821 822 631 794 806 114 84 149 92 796 656 633 808 638 820 630 903 76 43 816 810 635 65 792 623 632 858 103 125 133 607 636 814 818 627 859 857 800 615 887 32 141 639 617 843 812 628 860 798 634 849 852 754 609 848 573 349 657 637 52 90 823 347 851 41 825 581 654 350 644 621 565 855 575 629 655 283 337 611 346 844 862 861 569 672 642 348 101 682 850 756 651 271 131 671 619 307 112 853 571 752 683 652 650 265 625 259 22 670 319 846 885 678 645 783 301 847 147 589 577 750 648 643 845 30 640 775 785 452 675 680 277 82 567 641 63 874 673 613 624 313 856 854 901 668 653 779 780 871 674 647 777 325 579 608 445 74 139 586 665 614 123 649 646 289 331 439 342 679 677 778 789 295 681 61 392 866 116 685 620 443 446 343 344 386 86 626 776 669 618 80 88 557 442 341 905 782 784 773 496 684 666 658 498 435 453 404 408 676 610 145 72 447 45 94 26 502 454 436 622 616 129 774 516 305 281 402 659 612 39 50 20 786 492 441 396 34 67 787 494 143 781 410 78 438 440 415 687 500 329 663 664 591 899 110 508 287 299 317 406 105 99 504 444 667 388 135 698 701 512 417 394 593 559 510 689 690 882 137 514 448 323 263 335 456 127 788 437 56 15 398 590 121 518 506 604 695 692 28 490 553 311 892 257 390 151 660 662 686 702 537 554 563 696 547 400 842 339 766 761 704 602 605 771 833 412 726 269 275 599 309 291 770 541 450 293 544 836 413 661 321 731 600 758 826 725 594 693 261 765 769 451 253 830 327 768 837 414 596 757 540 545 827 273 315 539 538 809 730 736 699 705 303 764 763 838 603 449 53 148 140 803 807 255 762 835 828 728 729 368 372 790 555 546 832 759 549 548 723 739 297 87 42 31 831 801 724 732 267 285 60 109 23 840 817 813 727 378 772 767 760 543 113 829 381 536 886 873 597 371 542 753 722 370 38 839 841 793 815 811 733 333 279 19 71 902 64 834 734 561 383 128 102 345 551 550 574 713 377 144 79 863 797 576 749 737 735 898 356 552 124 819 374 373 365 75 716 132 91 83 805 578 357 799 566 718 488 380 324 795 791 717 120 27 98 418 49 362 294 428 431 709 564 318 570 751 719 376 379 360 430 588 721 369 824 740 136 881 353 420 361 427 423 146 470 464 276 288 270 312 40 585 711 714 469 375 367 352 712 708 720 425 900 130 715 382 338 421 870 568 384 354 432 429 426 138 580 467 457 351 258 73 111 710 738 463 884 744 875 489 300 422 743 459 260 254 330 424 572 278 358 755 366 364 264 62 707 332 100 51 869 582 314 419 81 21 747 742 461 460 306 434 466 462 320 363 359 336 433 876 355 865 741 583 302 284 326 282 409 29 745 746 403 89 748 465 468 485 308 505 556 389 393 122 458 272 407 700 867 906 476 480 266 515 691 340 688 888 296 558 401 387 14 697 478 560 491 405 93 290 517 509 706 482 486 472 411 703 455 134 595 907 196 513 399 397 85 894 471 395 606 694 3 910 481 474 487 188 216 503 499 592 524 601 184 521 142 904 57 187 497 262 391 385 115 868 533 416 316 598 106 475 483 495 104 35 484 511 895 195 192 532 44 197 507 298 33 0 117 493 523 520 274 562 322 46 473 477 479 501 77 66 890 889 534 587 36 193 525 878 9 189 214 190 181 126 58 6 210 250 256 334 16 200 204 527 150 205 529 17 185 535 286 584 107 95 68 186 198 163 328 891 55 25 201 252 280 179 528 310 872 12 208 165 530 211 194 191 519 883 202 212 169 152 877 268 292 304 4 155 531 96 207 69 209 526 242 1 213 159 157 522 908 911 154 203 171 177 231 247 879 909 206 222 913 215 168 221 234 173 246 153 7 199 219 249 240 108 10 47 244 232 37 172 180 893 118 164 236 59 220 167 156 248 161 897 251 241 237 239 13 119 912 225 175 18 174 170 227 233 182 230 235 5 48 162 229 217 243 11 245 238 160 166 218 228 97 224 896 8 158 880 178 226 176 70 2 223 183 Figure 3.3: The complete fitness landscape of all 914 length-8 replicators. The replicators are colored by the class of motifs they contain (fg replicators are colored in red, while hc replicators are colored in blue.) The relative position between any pair of nodes reflects their distance in Hamming space, displayed via multi-dimensional scaling (MDS). As a con- sequence, it appears as if blue and red clusters are linked, which is not the case. One isolated fg-replicator (red) is close to an hc-replicator cluster (blue), but is not connected to it. All visible edges are between nodes that have a Hamming distance of 1 (i.e. they are a point mutation away from each other). 37 A B 119 83 64 82 5 68 8 11 38 80 149 144 138 16 44 12 93 123 36 76 115 124 160 52 104 101 111 94 14 121 136 91 20 10 146 18 51 73 157 159 33 5 19 110 27 26 81 148 46 142 152 112 21 9 89 116 85 118 25 3 86 134 72 53 127 3 11 103 99 0 15 7 96 23 16 106 145 95 13 164 151 19 158 50 29 31 150 65 56 77 141 120 18 143 7 131 6 129 117 126 41 43 97 156 130 113 98 6 102 32 34 8 107 139 154 74 17 24 147 66 75 2 100 20 114 22 84 109 162 90 55 35 42 70 17 140 128 57 78 40 87 21 1 105 15 2 132 13 10 137 48 22 1 9 79 39 61 155 67 59 135 133 92 14 125 62 49 4 30 153 12 69 88 161 45 54 4 122 37 0 163 58 60 108 63 71 47 28 C D 49 5 79 18 83 44 86 23 58 22 14 65 84 31 12 3 71 51 64 33 36 13 38 89 30 32 60 87 19 27 94 53 76 10 9 7 4 42 26 46 13 43 67 69 34 40 10 28 29 52 8 6 45 75 48 56 72 63 3 1 78 50 17 0 61 81 77 21 88 2 5 16 7 20 39 82 74 90 54 8 12 37 15 55 1 59 95 41 0 85 57 68 11 24 92 35 70 47 11 6 14 25 62 73 93 2 4 66 80 9 91 Figure 3.4: Four clusters from the full landscape of self-replicators of L = 8. A: A 23-node cluster of hc-replicators, B: the third-largest cluster in the network: an fg-replicator cluster with 165 members. C: Another large fg-replicator cluster with 96 genotypes. D: A 15-node hc-replicator cluster. 38 0.35 0.30 proportion of nodes 0.25 0.20 0.15 0.10 0.05 0.00 0 5 10 15 20 25 30 35 40 node degree Figure 3.5: Edge distribution of all replicators in the fitness landscape of L = 8. As each cluster has a particular edge distribution, the distributions of the two different kinds of replicators (fg-types and hc-types) do not overlap. Red: fg-replicators, blue: hc-replicators. of code that allow the avidian to gain energy from the environment, by performing logic operations on binary numbers that the environment provides (see Methods). Replicators with the fg/gb motifs did evolve more novel traits than replicators with the hc/rc motifs (Fig. 3.7D). In fact, only fg-replicators evolved traits in these experiments. Finally, we looked at the effect of historical contingency when all 914 replicators were competed against each other in one population. After 50,000 updates, we identify the most abundant genotype in 200 replicate experiments and reconstruct the line-of-descent to determine which of the replicators gave rise to it (we call that replicator the “progenitor”). Most replicators did not emerge as the progenitor of life in these experiments (Fig. 3.8). Three genotypes, vvwfgxgb, vwvfgxgb, and wvvfgxgb, outcompete the other genotypes in 37, 49, and 45 populations out of 200, respectively, or in about 65% of the competitions. The other progenitors of life were not distributed randomly among the other self-replicators either; most of them were present in the same clusters as the three genotypes from above.Thus, while history is a factor in which of the replicators becomes the seed of all life in these experiments, 39 Figure 3.6: Ancestral fitness of all primordial self-replicators of L = 8, where x-y coordinates are the same as the network in Fig 3.3. 40 Figure 3.7: Fitness and other characteristics of all L = 8 self-replicators before and after evolution. A: Ancestral fitness of all replicators. B: Log mean relative fitness after 2 × 104 updates of evolution. C: Final genome size after 2 × 104 updates of evolution. D: Number of evolved traits after 2 × 104 updates of evolution. In all plots, fg-replicators are in red and hc-replicators are in blue. Error bars (black) are twice the standard error of the mean. All plots are sorted in increasing order. 41 more than half the time the progenitor is one of the three highest-fitness sequences. Thus, life predominantly originates from the highest peaks of the primordial landscape. 3.5 Information content of self-replicators of length 9 Having done a complete mapping of the fitness landscape for length 8 self-replicators, we then generated a complete mapping of the fitness landscape of length 9 self-replicators, as this was reasonably tractable, taking about 3-4 months of compute time on a cluster of machines. The goal here was to have a richer landscape that could be used to do more statistical analyses, and perhaps provide further insight into these primordial landscapes. A complete mapping of the length 9 self-replicator landscape revealed exactly 36,171 sequences that could self-replicate, which is indeed a much richer dataset, and one of the largest complete landscape mappings given the 269 possible sequences of length 9. The information content in these ensembles is one of the simplest measures that can be computed. As defined by Szostak (Szostak, 2003), these are the proportion of sequences in an ensemble that carry some particular fitness trait. More precisely, the information content is the proportion of sequences that carry the trait, relative to the total number of possible sequences. I(Eθ ) = −logF (E ≥ θ) (3.1) where F (E ≥ θ) is just the proportion Nθ /N , Nθ being the count of functional sequences, and N the total number of possible sequences. This can be rewritten as I(Eθ ) = L − logD Nθ , where L is the length of the sequences (in these ensembles, it’s typically assumed that all sequences are of equal length), and D is the size of the alphabet from which these sequences are constructed. This measure is actually an approximation to the the Shannon information content of a biomolecular sequence (Adami and Cerf, 2000), but is a convenient to way to compute the value. This measure requires a complete mapping of the fitness landscape which is not possible in general. There are approximations that can be done to estimate the value, but here we have a complete mapping, which sidesteps this issue completely. Computing the 42 802 864 24 54 804 821 822 631 794 806 114 84 149 92 796 656 633 808 638 820 630 903 76 43 816 810 635 65 792 623 632 858 103 125 133 607 636 814 818 627 859 857 800 615 887 32 141 639 617 843 812 628 860 798 634 849 852 754 609 848 573 349 657 637 52 90 823 347 851 41 825 581 654 350 644 621 565 855 575 629 655 283 337 611 346 844 862 861 569 672 642 348 101 682 850 756 651 271 131 671 619 307 112 853 571 752 683 652 650 265 625 259 22 670 319 846 885 678 645 783 301 847 147 589 577 750 648 643 845 30 640 775 785 452 675 680 277 82 567 641 63 874 673 613 624 313 856 854 901 668 653 779 780 871 674 647 777 325 579 608 445 74 139 586 665 614 123 649 646 289 331 439 342 679 677 778 789 295 681 61 392 866 116 685 620 443 446 343 344 386 86 626 776 669 618 80 88 557 442 341 905 782 784 773 496 684 666 658 498 435 453 404 408 676 610 145 72 447 45 94 26 502 454 436 622 616 129 774 516 305 281 402 659 612 39 50 20 786 492 441 396 34 67 787 494 143 781 410 78 438 440 415 687 500 329 663 664 591 899 110 508 287 299 317 406 105 99 504 444 667 388 135 698 701 512 417 394 593 559 510 689 690 882 137 514 448 323 263 335 456 127 788 437 56 15 398 590 121 518 506 604 695 692 28 490 553 311 892 257 390 151 660 662 686 702 537 554 563 696 547 400 842 339 766 761 704 602 605 771 833 412 726 269 275 599 309 291 770 541 450 293 544 836 413 661 321 731 600 758 826 725 594 693 261 765 769 451 253 830 327 768 837 414 596 757 540 545 827 273 315 539 538 809 730 736 699 705 303 764 763 838 603 449 53 148 140 803 807 255 762 835 828 728 729 368 372 790 555 546 832 759 549 548 723 739 297 87 42 31 831 801 724 732 267 285 60 109 23 840 817 813 727 378 772 767 760 543 113 829 381 536 886 873 597 371 542 753 722 370 38 839 841 793 815 811 733 333 279 19 71 902 64 834 734 561 383 128 102 345 551 550 574 713 377 144 79 863 797 576 749 737 735 898 356 552 124 819 374 373 365 75 716 132 91 83 805 578 357 799 566 718 488 380 324 795 791 717 120 27 98 418 49 362 294 428 431 709 564 318 570 751 719 376 379 360 430 588 721 369 824 740 136 881 353 420 361 427 423 146 470 464 276 288 270 312 40 585 711 714 469 375 367 352 712 708 720 425 900 130 715 382 338 421 870 568 384 354 432 429 426 138 580 467 457 351 258 73 111 710 738 463 884 744 875 489 300 422 743 459 260 254 330 424 572 278 358 755 366 364 264 62 707 332 100 51 869 582 314 419 81 21 747 742 461 460 306 434 466 462 320 363 359 336 433 876 355 865 741 583 302 284 326 282 409 29 745 746 403 89 748 465 468 485 308 505 556 389 393 122 458 272 407 700 867 906 476 480 266 515 691 340 688 888 296 558 401 387 14 697 478 560 491 405 93 290 517 509 706 482 486 472 411 703 455 134 595 907 196 513 399 397 85 894 471 395 606 694 3 910 481 474 487 188 216 503 499 592 524 601 184 521 142 904 57 187 497 262 391 385 115 868 533 416 316 598 106 475 483 495 104 35 484 511 895 195 192 532 44 197 507 298 33 0 117 493 523 520 274 562 322 46 473 477 479 501 77 66 890 889 534 587 36 193 525 878 9 189 214 190 181 126 58 6 210 250 256 334 16 200 204 527 150 205 529 17 185 535 286 584 107 95 68 186 198 163 328 891 55 25 201 252 280 179 528 310 872 12 208 165 530 211 194 191 519 883 202 212 169 152 877 268 292 304 4 155 531 96 207 69 209 526 242 1 213 159 157 522 908 911 154 203 171 177 231 247 879 909 206 222 913 215 168 221 234 173 246 153 7 199 219 249 240 108 10 47 244 232 37 172 180 893 118 164 236 59 220 167 156 248 161 897 251 241 237 239 13 119 912 225 175 18 174 170 227 233 182 230 235 5 48 162 229 217 243 11 245 238 160 166 218 228 97 224 896 8 158 880 178 226 176 70 2 223 183 Figure 3.8: Location of “progenitors” (ancestral types that were the origin of an evolved population 50,000 updates later) in the primordial landscape. Replicators that were never the ancestor genotype of the entire population are in grey. Those that outcompete all other genotypes in fewer than 6 (out of 200) competitions are colored in green. The three genomes that eventually become the ancestor of life in over 130 competitions are in orange. 43 information content for length 8 sequences shows I8 = −log26 (914/268 ) ≈ 5.91 mers where “mers” is the unit of information/entropy relative to the entropy of a sequence, which is computed by simply taking the value to the logarithm of base D. Doing this for length 9 gives I9 = −log26 (36, 171/269 ) ≈ 5.77 mers which is slightly less information content than for length 8 sequences. There are some differences between the fitness landscapes for length 8 and 9 self-replicators. There is no longer a clear separation between f g, and rc replicators as there was in length 8 - there are self-replicating sequences of length 9 that contain both these motifs. We con- structed networks similar to that of length 8, but here rotations were accounted for. By rotations we mean that if one self-replicating sequence could be rotated to produce another self-replicator, they’re considered to be the same sequence. Under this analysis, we found 30,547 self-replicators, which formed a total of 84 distinct clusters. A very large proportion of these replicators turned out to be in a single cluster of size 27,753, which accounts for 94% of the sequences. This suggests that as the size of the sequences increases, there is increased connectivity, i.e. there are more, and shorter paths between pairs of sequences. Such a large network structure is ripe to investigate whether it contains any graph-theoretic properties similar to the properties of networks seen in the real world, such as RNA sequences. 3.6 Information encoding Next, we wanted to look at how information is encoded within these sequences. To do this, we calculate the density of replicators that are some distance n from any particular sequence i. This can be computed easily as Pn (i) k=0 Nv (k) ρi (n) = Pn L  (3.2) k k=0 k 25 44 Figure 3.9: The log of the fraction of functional sequences ϕ(n) at mutational depth n for length 9 replicators. The red line shows the theoretical lower limit for a ϕ(n) with perfectly compressed information (no viable mutational neighbors). (i) where Nv (k) is the number of self-replicators that are up to n single-site mutations away from the sequence i. The right hand side is scaled by nk=0 Lk 25k which is just the total P  number of possible single-site mutants at the same distance n to give the density ρi (n). Plotting the log of this value ϕ(n) = log(ρi (n)) against n shows some interesting proper- ties of how information is encoded in these sequences. Fig 3.9 shows the information density measure for all 36,171 self-replicator sequences (not considering rotations) of length 9, and Fig 3.10 shows the curves for the most robust and most fragile sequences, along with the average information density of these sequences. There’s clearly quite a lot of variability in how information is encoded. There are some replicators that are very robust, i.e. they have a lot of single site mutational neighbours (up 45 n 1 2 3 4 5 6 7 8 9 0 −1 −2 −3 Φ(n) −4 −5 −6 Figure 3.10: ϕ(n) for the most fragile (dotted line) and the most robust (dashed line) repli- cator. The average ϕ(n) across all replicators is indicated by the solid line. 46 to 74 neighbours as mentioned previously), and these replicators show antagonistic epistasis for the first few mutations, but then show synergistic epistasis as the number of possible mutations increases. Some replicators are not robust, i.e. they have few, if any mutational neighbours. Some of these sequences are maximally compressed all the way to 4 mutations, which can be seen by the intersection with the red line in Fig 3.9. All sequences eventually show synergistic epistasis, and by definition all the curves end at the same point, denoted by ϕ(9) = −I9 . 3.7 Discussion Here, we tested the role of fitness landscape structure and historical contingency in the origin of self-replication in the digital evolution system Avida. We characterized the complete fitness landscape of all minimal-genome self-replicators and found that viable genotypes form clusters in the fitness landscape. These self-replicators can be separated into two replication classes, as we previously found for self-replicators with larger genomes (LaBar et al., 2016). We also found that one of these replication classes (the fg-replicators) is more evolvable than the other, although the evolvability of each genotype varies. Finally, we show that, when all self-replicators are competed against each other in a digital “primordial soup”, three genotypes win over 65% of the competitions and many of the other “winners” come from the same genotype cluster. In a previous study with Avida, we found that 6 out of 109 spontaneously-emergent genomes with 8 instructions could self-replicate (Adami and LaBar, 2017a). Here, we found that 914 out of ≈ 2.8 × 1011 genomes could replicate, consistent with our previous re- sults. This concordance suggests that the information-theoretic theory of the emergence of life, originally proposed by Adami (Adami, 2015) and tested with Avida by Adami and LaBar (Adami and LaBar, 2017a), can accurately explain the likelihood of the chance emer- gence of life. Thus, the emergence of self-replication, and life is dependent on the information required for such life. By enumerating all of the length-8 self-replicators, we were able to show that self- 47 replicators are not uniformly distributed across the fitness landscape and that viable geno- types cluster together. The size of these clusters varies: there are few clusters with many genotypes and many clusters with few genotypes, but the cluster size distribution has a gap. The edge distribution of the clusters is similar to what has been found in random RNA structures, and the mean degree differs between replicator types. Genotypes with different replication mechanisms were in different clusters with no evo- lutionary trajectory between the two. Empirical studies of RNA-based fitness landscapes, biochemical model systems for the origin of life, also show that these landscapes consist of isolated fitness peaks with many non-viable genotypes (Jiménez et al., 2013; Petrie and Joyce, 2014). The fact that both RNA-based landscapes (Jiménez et al., 2013; Petrie and Joyce, 2014) and these digital landscapes have similar structures suggests that the evolu- tionary patterns we see in these Avida experiments may be similar to those one would have seen in the origin of life on Earth. The presence of isolated genotype clusters in both digi- tal and RNA fitness landscapes further suggests that the identity of the first self-replicator may determine life’s future evolution, as other evolutionary trajectories are not accessible. However, if populations can evolve larger genomes, non-accessible evolutionary trajectories may later become accessible, as mathematical results on the structure of high-dimensional fitness landscapes suggest (Gavrilets, 1997). To test for the effects of historical contingency in the origin of self-replication in Avida, we evolved all of the 914 replicators in an environment where they could increase in genome size and evolve novel traits. Previously, we found that the evolvability of spontaneously- emergent self-replicators varied and was determined by their replication mechanism (LaBar et al., 2016). However, those genotypes possessed fixed-length genomes of 15 instructions. Here, we confirmed that the genotype of the first self-replicator, and more specifically the replication mechanism of the first replicator, determine the future evolution of novel traits in Avida. The fg-replicators showed high rates of trait evolution, while hc-replicators failed to evolve novel traits in most populations. However, we did not detect any trade-off in 48 evolvability, as we previously found (LaBar et al., 2016). This difference is likely due to their differences in capacity to increase in genome size, as genome size increases enhance the evolution of novel traits and fitness increases in Avida (Gupta et al., 2016; LaBar and Adami, 2016). Would a similar dynamic occur in a hypothetical population of RNA- based replicators? While experimental evolution of RNA replicators has been performed, the selective environments resulted in genome size decreases (Mills et al., 1967). It is unknown how simple RNA replicators vary in their evolvability. We also performed experiments to test for the role of historical contingency in scenarios where any self-replicator could become the progenitor of digital life. Here, we found that only three self-replicators (or their neighbors in the fitness landscape) became the last com- mon ancestor in the majority of populations. This suggests a lack of contingency in the ancestral self-replicator, but emphasizes the role of the ancestral genotype in determining its future evolution. If life emerges rarely, then its future evolution will be determined by the specific genotype that first emerges, as shown from our first set of evolvability experiments (Fig. 3.7). However, if simple self-replicators emerge frequently, then the future evolution is determined by the evolvability of the fittest replicators, a sort of clonal interference (Gerrish and Lenski, 1998) among possible progenitors of life. In this case, the self-replicators that most successfully invaded the population happened to also be of the type that evolved the largest genomes and most complex traits. However, it can be imagined that the opposite trend could occur (LaBar et al., 2016), and then the progenitor of life would limit the future evolution of biological complexity. 3.8 Conclusions In this work we have performed the first complete mapping of a primordial sequence landscape in which replicators are extremely rare (about one replicator per 200 million sequences) and found two functionally inequivalent classes of replicators that differ in their fitness as well as evolvability, and that form distinct clusters in sequence space. In direct evolutionary competition, only the highest-fitness sequences manage to repeatedly become 49 the common ancestor of all life in this microcosm, showing that despite significant diversity of replicators, historical contingency plays only a minor role during early evolution. While it is unclear how the results we obtained in this digital microcosm generalize to a biochemical microcosms, we are confident that they can guide our thinking about primor- dial fitness landscapes. The functional sequences we discovered here are extremely rare, but likely not as rare as putative biochemical primordial replicators. However, from a purely statistical point of view, it is unlikely that a primordial landscape consisting of sequences that are several orders of magnitude more rare would look qualitatively different, not would we expect our results concerning historical contingency to change significantly. After all, random functional RNA sequences (but not replicators, of course) within a computational world (Aguirre et al., 2011), chosen only for their ability to fold, show similar clustering and degree distributions as we find here. Follow-up experiments in the much larger L = 9 landscape (currently under way) will reveal which aspects of the landscape are specific, and which ones are germane, in this digital microcosm. A comparison between fitness landscapes across a variety of evolutionary systems, both digital (Pargellis and Greenbaum, 2016) and biochemical (Jiménez et al., 2013), will further elucidate commonalities expected for simple self-replicators. As the landscapes for these simple self-replicators are mapped, we expect general properties of primordial fitness landscapes to emerge, regardless of the nature of the replicator. As long as primordial self-replicators anywhere in the universe consist of linear heteroplymers that encode the information necessary to replicate, studies with digi- tal microcosms can give us clues about the origin of life that experiments with terrestrian biochemistry cannot deliver. 50 Chapter 4 Information scores for individual sequences This chapter is published in (Adami and CG, 2022) and is a result of work done in collaboration with Christoph Adami. 4.1 Abstract The information content of symbolic sequences (such as nucleic- or amino acid sequences, but also neuronal firings or strings of letters) can be calculated from an ensemble of such se- quences, but because information cannot be assigned to single sequences, we cannot correlate information to other observables attached to the sequence. Here we show that an informa- tion score obtained from multivariate (multiple-variable) correlations within sequences of a “training” ensemble can be used to predict observables of out-of-sample sequences with an accuracy that scales with the complexity of correlations, showing that functional information emerges from a hierarchy of multi-variable correlations. 4.2 Introduction The term “functional information” refers to the information that is necessary and suffi- cient to predict the function of a symbolic sequence within a particular environment. Typ- ically, the sequence could be nucleic or amino acids, but it also could be a sequence of neuronal firings or even a string of letters written in a particular alphabet. The function could be the biochemical activity (for biological sequences), or the level of performance for a neuronal network that produces the firings. For words in a language, the functionality could be simply the likelihood that the word is understood by someone familiar with the language. In some sense, we often simply use the word “information” to indicate functional infor- mation. For example, the hemoglobin DNA sequence carries the information necessary to make the protein hemoglobin that (given the right cellular environment) is functional in the sense that it binds oxygen with a particular affinity. For recordings from animal brains, the functionality of the sequence of neuronal activations could be the accuracy with which the 51 animal performs a rewarded task. Functional information can be defined quantitatively in terms of the likelihood that a random sequence performs the function or task at the specified level (Szostak, 2003; Hazen et al., 2007). Specifically, if we require a particular sequence to perform function E at least at the level θ, then functional information about E as defined by Szostak is given by I(Eθ ) = − log F (E ≥ θ) , (4.1) where F (E ≥ θ) is the fraction of all sequences (say, given a particular length n) that perform the function at a level at least θ. Such a definition of functional information makes eminent sense (it implies that functional sequences are exponentially rare) and accurately links information content to functionality. For example, in the evolution of RNA molecules that encode an enzymatic activity (Carothers et al., 2004), functional information is correlated to both activity and structural complexity. In the evolution of HIV protease (Gupta and Adami, 2016) functional information can be seen to decrease when anti-viral drugs are given (reflecting the reduced activity of the protease), then to increase again as anti-viral resistance emerges. In fact, functional information defined in this manner turns out to be just the coarse-grained Shannon information content of the sequences (Adami and LaBar, 2017b). A shortcoming of this measure is that only large groups of sequences can be characterized in this manner, because accurately estimating the fraction F (E ≥ θ) necessitates large ensembles of functional sequences. Here we develop an approach that allows us to attach a score to every individual sequence in an ensemble, which then allows us to predict the level of functionality of sequences not in the ensemble. In particular, this approach allows us to study how functional information emerges from multi-site correlations within the sequence. 52 4.3 Information content of functional symbolic sequences We now define the information content of a set of sequences ⃗s(k) of length L that form an ensemble S, given a particular environment within which they are functional (Adami and Cerf, 2000; Adami, 2004, 2012). The sequences ⃗s(k) are written in terms of the state of a joint (k) (k) (k) random variable S = S1 S2 · · · SL (one for each monomer), that is, ⃗s(k) = (s1 , s2 , · · · , sL ), while the environment can be thought of as the state of an environment variable, that is, E = e. To define the information content of sequences, we need to define another ensemble: the ensemble of functional sequences Se , which is constrained by requiring function within environment e. Typically, the unconstrained ensemble is all the possible sequences of a given length L, while the ensemble Se consists of only those sequences that are functional in e at a specified level.With this notation, the information content can be written as I(S : E = e) = H(S) − H(S|e) . (4.2) where H(S|e) is the entropy of sequences in Se : an entropy conditioned on the environmental variable taking on the state E = e. Further, H(S) is the Shannon entropy of the uncon- strained ensemble, which, if sequences of length L are drawn from an alphabet of D symbols, is simply H(S) = L log D. The entropy of the constrained ensemble requires the knowledge of the likelihood pk to find a functional sequence s(k) within the constrained ensemble. Thus, Eq. (4.2), being the difference between the uncertainty about a non-functional ensemble and the uncertainty of the functional ensemble, simply reflects the knowledge of the function that constrains the ensemble: it is information about how to carry out the function. Then, we can calculate the Shannon entropy of sequences in Se as XNe H(S|e) = − pk log pk , (4.3) k 53 where Ne is the number of different functional sequences within Se . If we coarse-grain entropy (4.3) by assuming that each sequence appears with equal probability, then pk = 1/Ne and   L Ne I(S : e) = log D − log Ne = − log , (4.4) DL Ne which is Szostak’s functional information since F = DL . Because most functional sequences of interest are exponentially rare, finding a sufficient number of them to accurately estimate the fraction F (E ≥ θ) = Ne /DL is difficult or nearly impossible. For example, while tens of thousands of variants of the 99-mer HIV-1 protease are known, they are not obtained by randomly sampling the 99-mer space, and therefore cannot be used to calculate functional information in this manner. However, it is possible to relate the information I(S : e) to monomer, dimer, and tri-mer entropies (and so on) as follows. In general, a multi-variable entropy such as H(S|e) = H(S1 S2 · · · SL ) can be decomposed into multi-mer correlation entropies using Fano’s entropy and information decomposition theorems (Fano, 1961)). Specifically1 XL X n m−1 H(S1 · · · SL ) = (−1) I(Sσ(1) : · · · : Sσ(m) ) . (4.5) m=1 σ(1)···σ(m) and XL XL I(S1 : · · · : SL ) = (−1)m−1 H(Sσ(1) · · · Sσ(m) ) . (4.6) m=1 σ(1)···σ(m) Here, we write I for informations (shared entropies) for ease of reading. In these formulæ, σ(i) stands for a permutation of the index i. For example, the entropy decomposition theorem 1 Fano’s decomposition theorems are just specific information-theoretic examples of the more general inclusion-exclusion theorem that has been known for some time. For example (Sylvester, 1883) refers to it as a “théorème logique bien connu”. 54 (4.5) can also be written as X L XL H(S1 · · · SL ) = H(Si ) − I(Si : Sj ) i=1 iAAACJ3icbZBNSwMxEIaz9avWr6pHL0ERBKHselCPRS96U7BVcEuZzU7b0CS7JFm1LP0Xnr2qv8ab2KN/RExbD9r6QuDhnRlm8kap4Mb6/sArzMzOzS8UF0tLyyura+X1jbpJMs2wxhKR6JsIDAqusGa5FXiTagQZCbyOuqfD+vUdasMTdWV7KTYktBVvcQbWWc3yRihAtQXS81CPoOnMHb/ij0SnIfiBnep2uP84qPYumuWvME5YJlFZJsCY28BPbSMHbTkT2C+FmcEUWBfaeOtQgUTTyEe39+muc2LaSrR7ytKR+3siB2lMT0auU4LtmMna0Py3FkVyYrVtHTdyrtLMomLjza1MUJvQYTI05hqZFT0HwDR3x1PWAQ3MuvxKocJ7lkgJKs7DGEU/D9PhD0H0XWDBZDzTUD+oBIeV4NIld0LGKpItsk32SECOSJWckQtSI4w8kCfyTF68V+/Ne/c+xq0F72dmk/yR9/kNm6SqSQ== n AAACFHicbVDLSsNAFJ34rPFVdekmWARXJXGhbsSiG5ct2Ac0oUwmt+3QmUmYmSgl5Avcqj/gb7gTcedW/Blx+lho64ELh3Pu5d57woRRpV33y1pYXFpeWS2s2esbm1vbxZ3dhopTSaBOYhbLVogVMCqgrqlm0EokYB4yaIaDq5HfvAWpaCxu9DCBgOOeoF1KsDZSTXSKJbfsjuHME29KShfv9nny/GlXO8VvP4pJykFowrBSbc9NdJBhqSlhkNt+qiDBZIB70DZUYA4qyMaH5s6hUSKnG0tTQjtj9fdEhrlSQx6aTo51X816I/FfLwz5zGrdPQsyKpJUgyCTzd2UOTp2RjE4EZVANBsagomk5niH9LHERJuwbF/AHYk5xyLK/AhYnvnJ6EPMchOYNxvPPGkcl72TsldzS5VLNEEB7aMDdIQ8dIoq6BpVUR0RBOgePaBH68l6sV6tt0nrgjWd2UN/YH38AOQNozQ= Figure 4.2: Mean information scores (Shannon information) as a function of the degree of correlations taken into account. ⟨I1 ⟩ corresponds to the information disregarding pair-wise correlations between sites, while ⟨I⟩2 disregards third-order correlations, and so on. The red line is the exact information I9 = 5.77 mers. of 10−6 throughout, but note that the results on classification accuracy scale only extremely weakly over the 12 orders of magnitude of pseudocounts we tested. Generally speaking, an infinitesimal pseudocount moves information scores of un-represented symbols far out into the negative, and energy scores far to the positive. The pseudocount for higher-order PWMs (we construct those explicitly up to n = 4, where n is the order of correlations that are included) is adjusted so that the marginal probabilities remain correct. In order to predict the function of an arbitrary sequence in light of an MSA of N ≤ Ne functional ones, we need to construct a classifier in such a manner that, given a particular threshold τ , the score divides functional from non-functional sequences. For energy scores, we would deem those with a score E(⃗s) < τ as functional, while for information scores we would deeem those with I(⃗s) > τ to be replicators. Note that because we have exhaustively classified this dataset already, we can determine the accuracy of classification (given a subset of sequences N < Ne for a given classifier) with perfect accuracy. 67 In principle, we can construct classifiers directly from the information score functions (4.44), but due to the fact that the correction terms to I1 involve the summation of many terms with alternating signs, the pseudodocount prevents these classifiers from being effec- tive8 . Indeed, it appears to be impossible to construct pseudocounts in a manner that does not introduce spurious correlations at all levels n (while it is of course possible to make sure no spurious correlations are introduced for a chosen given level). However, it turns out that sums over the multiple-site energy and information functionals E(si sj · · · ) and R(si sj · · · ) do work well, because they do not rely on cancellations. For example, we can define the following information classifiers (similar classifiers can be built from energy functionals): X L D1 (⃗s) = R(si ) , (4.48) i=1 X L D2 (⃗s) = R(si sj ) , (4.49) iAAACHHicbVC7SgNBFJ31GeMrxtJmMQixCbsp1DKghWUE85BsCLOzN8mQeSwzs0pY8g0WturHiJ3YCn6IrTh5FJp44MLhnHu5954wZlQbz/t0lpZXVtfWMxvZza3tnd3cXr6uZaII1IhkUjVDrIFRATVDDYNmrADzkEEjHJyP/cYtKE2luDbDGNoc9wTtUoKNlW4C1ZfFi075uJMreCVvAneR+DNSqOSD4tfLfVDt5L6DSJKEgzCEYa1bvhebdoqVoYTBKBskGmJMBrgHLUsF5qDb6eTgkXtklcjtSmVLGHei/p5IMdd6yEPbybHp63lvLP7rhSGfW226Z+2UijgxIMh0czdhrpHuOA43ogqIYUNLMFHUHu+SPlaYGBtaNhBwRyTnWERpEAEbpUE8/hCzkQ3Mn49nkdTLJf+k5F/Z5Mpoigw6QIeoiHx0iiroElVRDRHE0QN6RE/Os/PqvDnv09YlZzazj/7A+fgBY3+mCA== 0.8 ⇢(D2 ) 0.6 TPR 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 D2 AAACFnicbVA9SwNBEN2LGmP8ilraHAbBKtylUMuAFpYJmg/IhbC3N0mW7O4du3tKOPITLGyi/8B/YSeCla34X8S9JIUmPhh4vDfDzDw/YlRpx/m0Miura9n13EZ+c2t7Z7ewt99QYSwJ1EnIQtnysQJGBdQ11QxakQTMfQZNf3iR+s1bkIqG4kaPIuhw3Be0RwnWRrq+7Ja7haJTcqawl4k7J8VKtvb1Pnl4rnYL314QkpiD0IRhpdquE+lOgqWmhME478UKIkyGuA9tQwXmoDrJ9NSxfWyUwO6F0pTQ9lT9PZFgrtSI+6aTYz1Qi14q/uv5Pl9YrXvnnYSKKNYgyGxzL2a2Du00CDugEohmI0MwkdQcb5MBlphoE1feE3BHQs6xCBIvADZOvCj9ELOxCcxdjGeZNMol97Tk1kxyZTRDDh2iI3SCXHSGKugKVVEdEdRH92iCHq0n68V6td5mrRlrPnOA/sD6+AEGlqRo FPR Figure 4.4: (a) ROC plot for classifier D2 , constructed from 1% of the Ne sequences in Se , and evaluated on 100,000 uniformly sampled non-functional single-point mutants of func- tional sequences, as the threshold parameter τ is varied from τ = 45 to τ = −211. True positives are functional. The diagonal line is the ROC of a random (unspecific) classifier. (b) The distribution of functional sequences Se (blue) and of the non-functional single-mutant sequences S1 , both evaluated by D2 . Note that unlike in Fig. 4.3, the distributions are not scaled. we have shown that defining energy and information functionals built from multi-variate position weight matrices allow us to capture the correlations inherent in multiple sequence alignments in a model-free manner. Standard approaches to fitting MSAs rely on expensive inverse models to fit the bi-partite correlations in the data, but must reduce the alphabet size considerably to reduce the number of parameters in the Potts or Ising models. It is clear from this analysis that such a fit is entirely unnecessary, as the marginal probabilities can be extracted from the data as long as the maximum entropy assumption is used on single sites, pairs of sites, triples, and so on. We tested this approach on the largest computational genotype-phenotype map created to date. Having a complete genotype-phenotype map allows us to calculate the informa- tion content of functional sequences exactly, and study how this information is built up by lower-order correlations. Further, a complete data set helps in creating ensembles that can be used to test the classification accuracy of function prediction. In particular, we used the model-free approach to construct classifiers that can distinguish functional from non- 71 (a) (b) (c) Figure 4.5: Quality of classifiers (AUC of the ROC curves) when testing against different test sets, as a function of the training set size (in percentage of the full set of size Ne ). Note log scale. In all panels, red points and lines refer to the first-order classifiers D1 , green dots and lines depict the performance of second-order classifiers D2 , while the third and fourth-order classifiers are in blue, and purple, respectively. Error bars are standard error over ten replicates (except for N = Ne ). (a) Q1 is the quality score of classifiers for a test set made from 31,291 non-replicating one-mutants of the functional set S2 . (b) Q2 quantifies the quality of the classifiers when tested on the two-mutant non-functional set S2 . (c) QR measures the performance of the classifiers when tested on 31,291 randomly generated non- functional of 9-mers. 72 functional sequences even when the non-functional sequences are single-site polymorphisms of the functional ones. Of course, this approach needs to be tested on biological data sets with significantly longer sequences. However, since the computational cost of creating these complex classifiers scales mainly with the level of multivariate correlations that are included, we expect to be able to classify sequences of several hundreds of sites with full alphabet size (for example, D = 20 or 21 for proteins) taking into account all tri-partite correlations. For binary data such as neuronal spike trains, we expect to be able to handle sequences of thousands of sites, possibly including even more than 4-part correlations. Sequence models of the Ising or Potts type perform markedly better than the stan- dard Hidden Markov Models (HMMs) that have been developed and used since the early 1970s (Rabiner, 1989), because they can take into account arbitrary correlations between pairs of symbols, rather than just those between adjacent symbols in HMMs. However, going beyond pair-wise correlations is deemed impossible for Ising models because of the associated explosion in the number of parameters. While more general pattern classification task (of which sequence classification is but a subset) can nowadays be achieved by training deep neural networks on annotated data, these methods have also displayed severe draw- backs (Nguyen et al., 2015; Jo and Bengio, 2018), for example by being easily fooled by detracting patterns. Furthermore, training such models on a large corpus of data is expen- sive. The model-free classifiers that we have described here are different from deep networks in important ways, as they do not require any training, and their vulnerability to overfitting is easily controlled by adjusting the level of correlation n to be included to the size of the MSA. 73 Chapter 5 Predicting mutation effects using information scores The work presented in this chapter was done in collaboration with Christoph Adami. The results described here are in the process of being written up for publication. 5.1 Introduction In chapter 4 I demonstrated how an information-theoretic approach could be used to successfully classify sequences based on the information content of individual sequences in an ensemble of sequences sharing a common property. This validation of the approach was done on a synthetic data set, as this allowed me to comprehensively understand the dynamics of the system, as well as allowing me to generate a complete landscape of sequences satisfying a property. Neither of these advantages are applicable to biological data - generating complete fitness landscapes is impossible for natural systems, and our collective understanding of the complex dynamics of biological systems is incomplete to say the least. Nonetheless, being able to apply the information-theoretic approaches to such data is the real test of the utility of such an approach. In this chapter, I investigate whether the techniques developed in previous chapters can be used to predict the effects of mutations on the ability of a biomolecular sequence to perform its function. Being able to predict the phenotypic effects of mutations is very valuable, e.g., being able to test drug binding efficacy, predicting the ability of a protein to bind to an enzyme, etc (Romero et al., 2015; Currin et al., 2015; Roscoe et al., 2013; Ostrov et al., 2016; Lek et al., 2016). In particular, I look at a biological dataset that contains the effects of mutating the sites of the WW domain of the human YAP65 protein (hYAP65) on the protein’s ability to bind to its polyproline peptide ligand (Araya et al., 2012). I apply the information-theoretic tools developed in previous chapters on this data, and show that it predicts the effects of mutations with an accuracy comparable to the state of the art techniques, such as deep mutational scanning, models that incorporate pairwise correlations, such as EVmutation (Hopf et al., 2017), as well as deep-learning approaches, such as DeepSequence (Riesselman et al., 2018). 74 5.2 Motivation and current approaches A very hard problem in biology is the prediction of the effects of mutations on the functionality of sequences. These sequences could be RNA, DNA, etc, and their function could be a myriad of different things: binding to a molecule, performing enzymatic reactions, etc. There are a number of different approaches that have been taken to study this problem. The most straightforward approach is to perform in vitro experiments by mutating individual sites of the genotype, and assaying its resultant fitness, often referred to as Deep Mutational scans (Fowler and Fields, 2014). While this is a legitimate approach, it suffers from being extremely expensive to perform these experiments. In addition, the potential search space is prohibitively large; nature has had many millennia to search through potential mutations and preserve the ones that still maintain their fitness - requiring time and resources that are quite simply unavailable to human researchers. This is certainly true with the current technological capabilities in this field, but even with significant advances the search space will likely remain intractable. Due to the limitations of the current approach, analyses of this kind has turned to com- putational techniques. Computational simulations/experiments, while obviously not being a perfect representation of the underlying biology, are nonetheless very helpful in understand- ing the dynamics of systems. While computational techniques must necessarily abstract the underlying system, those techniques can be used to substantially narrow down the search space, which can then be explored more thoroughly by experiments in vitro. Attempts have been made in this area to model the underlying chemistry of bio-molecules, and to predict the functionality of specific sequences using these models (Kircher et al., 2014; Sim et al., 2012). These models have had some limited success in certain areas, but are hindered by the fact that we don’t yet fully understand the chemical properties, and the complex multi-dimensional interactions of large biomolecules. 75 As the quality of the model improves, the computational resources scale up prohibitively as well, making this an infeasible approach for many analyses. An early use of computational techniques has been to build large scale statistical models of the system being investigated. This is done by training a model using known variations of a particular sequence. This approach has had a moderate amount of success, but it notably fails when there are epistatic effects between sites in the sequences, as is typically the case with biomolecular sequences (Breen et al., 2012). In addition, this approach requires enormous training data set sizes, often in the millions of sequences. This kind of data is not always available, and without such training sets, such approaches do not have as high a degree of accuracy as we would like. Work has been done to take into account epistatic effects between pairs of residues (Figli- uzzi et al., 2016; Lapedes et al., 2012; Hopf et al., 2017). This approach has a higher degree of predictive accuracy than models that do not take into account these higher order inter- actions. However, there is clear evidence that shows epistatic effects of even higher orders are common in how proteins/RNAs work, and these models are not able to account for these effects. In particular, naive extensions of these approaches are not scalable - a linear increase in the number of potentially interacting sites leads to an exponential number of variables that need to be fit. The “pairwise” estimation models used in this way simply can- not be extended to consider triplet, quadruple, or higher order interactions. As it happens, these models are actually approximations to a much more general model (the information- theoretic model described in Chapter 4), and in fact follow from the model used there. The information-theoretic model does not suffer from the same intractability as these models because it requires no fitting to the data. Recently, work has been done to try and capture higher order interactions in biomolec- ular data by fitting the data to a non-linear multi-variable model (referred to as a deep latent-variable model) (Riesselman et al., 2018). This approach is similar to the pairwise model, i.e., the EVmutation (Hopf et al., 2017) approach that uses a log-ratio heuristic 76 that considers pairwise residues to predict the effects of mutations. This model extends the EVmutation approach by introducing a nonlinear latent variable model that is intended to capture higher order constraints on the sequences in the training data. This work has shown to be an improvement over the pairwise model, as is expected since more correlations are being taken into account. This approach has a number of issues as well - training the model is computationally very expensive and involves a very large number of training parameters, the training size needed is relatively high, and as with other deep learning models, it suffers from overfitting to the training data. Deep learning models often use a number of techniques that incorporate additional data that is gathered from other data sources. This additional in- formation, e.g., the likelihood of particular amino acid substitutions, is not directly available in the training data set, and yet is crucial for the deep learning models to achieve reasonable prediction accuracy. The issues that deep learning techniques suffer from arise from the fact that the models being trained are unable to distinguish between signal and noise in the training data, and many of the additional corrections to the model are needed precisely to account for this inability to distinguish between noise and signal. In the following sections, I show how the information-theoretic approach is a simpler, computationally cheaper, and more understandable model, which needs only the information available in the training data to achieve prediction accuracy on par with or better than the best existing approaches. 5.3 Using Information theory to predict mutation effects In this work, I apply the information-theoretic approach described in Chapter 4 to predict effect of single substitution mutations on the ability of the YAP-1 (hYAP65) WW protein 1 domain to bind to their polyproline peptide ligand (Araya et al., 2012). The training data for the mutation effects was generated as multiple sequence alignments of the protein family using the default five search iterations of the profile HMM homology search tool jackhammer against the UniRef100 database of non-redundant protein sequences. Length- 1 YAP-1 stands for “Yes Associated Protein” which is a transcriptional co-activator. The WW domain is a modular protein domain that is repeated multiple times, and has a tendency to bind to proline-rich proteins. (Yagi et al., 1999) 77 normalized bit scores were used to threshold sequence similarity instead of E-values, in order to control for comparable evolutionary depth across different families. The precise details of how the multiple sequence alignments were generated is described in the supplementary information of (Riesselman et al., 2018). For this project I used the exact multiple sequence alignments that were provided in the supplementary information of (Riesselman et al., 2018) as FASTA files in the a2m format. To compare the quality of the information-theoretic approach, I use the provided mutation data in (Riesselman et al., 2018) containing data for 363 total mutations at various sites. The data also include the prediction data for the existing approaches, i.e., those taking into account the frequencies of occurrences of amino acids at single sites (Araya et al., 2012), the approach in (Hopf et al., 2017) taking into account epistatic effects of pairs of amino acids, and the deep learning approach (Riesselman et al., 2018) that tries to take into account all correlations in the training data simultaneously. There are some significant advantages to using the information-theoretic approach to predict the effect of mutations: This technique is computationally cheap, both in terms of time and memory. The time complexity of this approach scales polynomially with an increase in the order of correlation. The space (memory) complexity is more of a concern, as it scales polynomially in terms of both the length of the sequences and the alphabet size. A number of implementation opti- mizations, such as using more efficient data structures can potentially reduce the memory usage. One of the other possible techniques is to store only a subset of the entries in the position weight matrices that correspond to frequently seen symbol-site pairs, and recom- puting the infrequently seen entries only when they are needed by the information scoring function. Approaches that optimize for space will often result in additional time being spent on computation, and the optimal space-time trade-off will depend on the particular data set being analyzed. This approach is based in fundamental theories that have been developed a while ago, and are fully understood, allowing a better understanding of the information that is actually 78 being learned from the training data. This is especially an improvement over the deep- learning approaches, which have notoriously been ”black-box” systems, where it is not clear what information is being learned from the training data. I can precisely choose which level of correlations I want to use to predict mutation effects. While this is certainly true of the approaches that precisely use single, and pairwise correla- tions, I can use the same approach in all cases, by simply adjusting the level of correlation that I am interested in. Current models take into account either single site information, or pairwise correlations, and in the case of deep-learning models all correlations are taken into account. In my method, I can adjust the order of correlations precisely, using the same method in all cases. The information-theoretic approach extracts exactly the information about function that is encoded in the sequences, unlike other approaches that will also extract the noise in the sequence ensemble. Alternative approaches inherently are unable to distinguish between signal and noise in the data, and additional effort has to be undertaken to compensate for this. One common adjustment that is employed is to take into account data from other multiple sequence alignments. There are various kinds of additional information that are incorporated, such as the a-priori likelihood of a particular amino acid substitution, which are provided by PAM (point-accepted-mutation) and BLOSUM (BLOcks SUbstitution Matrix) matrices. In the information-theoretic approach I describe here, the only training data used is the multiple sequence alignment for the protein family under consideration, and this approach is still able to extract the information about mutation effects with an accuracy that outperforms the best deep learning models that are currently known. 5.4 Methods To perform the prediction of mutation effects, I used the same tool that was used to analyze the length-9 self-replicator landscape described in Chapter 4. The initial training data set was provided in (Riesselman et al., 2018) as a multiple sequence alignment in a2m FASTA format. The sequences are aligned relative to the wild-type sequence, with insertions 79 represented with the ‘-’ symbol. Since the mutations are relative to the wild-type sequence, it does not matter whether the insertions are considered when generating the position weight matrices. The residues represented with lower-case alphabets were also ignored, and were also ignored by the other techniques used to predict mutation effects. The experimentally observed mutation effects were also provided, which I used to evaluate the efficacy of the information-theoretic approach. In addition, the (Riesselman et al., 2018) paper also included the predicted effects of each mutation for the single-site, pairwise, and deep learning models, which I used as a comparison to my technique. The quality of the predictions was computed using the Spearman Rank Correlation coefficient between the model predictions and the experimentally observed effects, and I used the same correlation coefficient for my analysis as a one-to-one comparison. For the deep-learning approach, the individual mutation effect data was not provided, but the Spearman Rank Correlation was provided. The scatter plots for the single-site and pairwise models are shown in Fig 5.1. The values of the experimental data on the y-axis of the scatter plots is the function of the mutated sequence divided by the function of the wild-type sequence, which indicates that values above 1 are beneficial mutations while values below 1 are deleterious mutations. Similarly, on the x-axis the values plotted are the predicted function of the mutated sequence subtracted by the wild-type sequence, which indicates that values above 0 are predicted to be beneficial mutations while values below 0 are predicted to be deleterious mutations. The Spearman Rank Correlation coefficient is 0.582 for the independent model, and 0.565 for pairwise model, with a p-value ≤ 2.2e−16 in both cases. Note that the prediction accuracy of the independent model slightly outperforms the prediction accuracy of the pairwise model in this data set, though in general the independent model tends to have a lower prediction accuracy than the pairwise model. 5.4.1 Baseline scoring function The first approach I used was exactly as that described in Chapter 4. I scanned the mul- tiple sequence alignment for the YAP-1 WW domain proteins, and generated an information score matrix, where the entries in the matrix (for the first order correlations) are 80 3 3 Experimental Data Experimental Data 2 2 1 1 0 0 −12 −8 −4 0 −8 −4 0 Independent Pairwise (EVmutation) Figure 5.1: Scatter plot showing the correlation between the predicted mutation effects, using only single site information (top panel), and using the correlations between pairs of amino acids (using EVmutation) (bottom panel). x-axis is the predicted mutation effect, and the y-axis is the experimentally observed mutation effect. The blue line is the best linear fit to the data. The horizontal line intercepting the y-axis at 1 indicates the cutoff above which the mutations are beneficial, and the vertical line intercepting the x-axis at 0 indicates the cutoff above which the mutation effect is predicted to be beneficial. The Spearman Rank Correlation coefficient is 0.582, and 0.565 for the independent and pairwise methods respectively, with a p-value ≤ 2.2e−16 for both.   (1) pi (sa ) Ria = log , (5.1) 1/D where pi (sa ) is the probability of seeing the symbol sa at site position i.The information score matrices for higher orders of correlation are produced similarly. This information score matrix was then used to score each individual mutation. For each specified mutation, the wild-type was mutated to the amino acid at the given site, and I scored the mutated sequence using the scoring functions for the first 3 orders of correlations. 81 X L D1 (⃗s) = R(si ) , (5.2) i=1 X L D2 (⃗s) = R(si sj ) , (5.3) i