EXTRACTING STRUCTURE AND FUNCTION
         FROM COMPLEX SYSTEMS
 USING INFORMATION-THEORETIC TOOLS
                         By
                     Nitash C G
               A DISSERTATION
                   Submitted to
           Michigan State University
    in partial fulfillment of the requirements
                 for the degree of
   Computer Science - Doctor of Philosophy
                        2022


                                         ABSTRACT
    One of the primary areas of scientific research is understanding how complex systems
work, both structurally and functionally. In the natural world, complex systems are very
high dimensional, with many interacting parts, making studying them difficult and in some
cases nearly impossible. Due to the complexity of these systems, a lot of modern research
focuses on studying these systems from a computational viewpoint. While this necessarily
abstracts away from the true system, we attempt to represent the salient aspects of the
system in order to better understand it. Results from such computational studies can yield
insight into the natural system, and actively constrain the research space by suggesting
hypotheses that can be tested.
    In this work, I investigate the structure and function of two seemingly disparate complex
digital systems. I begin with an investigation of the structure and function of an evolved
cognitive architecture, and look at how this structure is affected by environmental changes
by developing some new metrics to classify cognitive systems. I then look at the structure
of the primordial fitness landscape in a different digital system, and use techniques inspired
by information theory to understand the structure of this landscape. I first look at the role
of historical contingency in the evolution of life by studying how the structure of this fitness
landscape affects the evolutionary trajectories of life. I then investigate how information
is encoded in the primordial fitness landscape. I then extend this analysis by developing a
general approach for calculating the information content of individual sequences, and use
them to analyze the primordial landscape. Finally, I validate this information-theoretic
technique by predicting the effects of mutations on the function of a specific protein, and
show that this technique can outperform the current state of the art approaches.


To my mother, for everything they have done.
                     iii


                                ACKNOWLEDGEMENTS
    I would like to thank my advisor, Christoph Adami, for going above and beyond their
role as a mentor. This dissertation would have not been possible without their guidance and
support.
    Arend Hintze, who was also my advisor for some years, for teaching me to always ask
the hard questions and not giving up till I answered them.
    Charles Ofria, for valuable guidance during my years doing research, and for the frequent
fascinating discussions about C++ arcana.
    Bill Punch, for giving me the confidence in myself to be a teacher, and instilling in me
the belief that teaching always involves learning new things.
    I would like to thank the members of my lab, Ali Tehrani, Jory Schossau, Clifford Bohm,
Vincent Ragusa, Douglas Kirkpatrick, Thomas LaBar, and Acacia Ackles, for not only help-
ing out with science stuff but also for being thoroughly fun people to hang out with.
    Alok Singh, for introducing me to the world of research.
    Durga Bhavani, for kindling in me the joy of discovery, and showing me that it’s never
too late to be a student.
    Vijay Borges, for believing that I could succeed, and helping me in every possible way
they could.
    Budhaji Sir, for my earliest memory of realizing that there are usually multiple ways of
solving the same problem.
    My parents and sister, for always being there whenever I needed them.
                                               iv


                             TABLE OF CONTENTS
Chapter 1: Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  1
Chapter 2: The cognitive structure and function of brains evolved in periodically
           changing environments . . . . . . . . . . . . . . . . . . . . . . . . . . . .    8
Chapter 3: The structure of a primordial fitness landscape . . . . . . . . . . . . . .     26
Chapter 4: Information scores for individual sequences . . . . . . . . . . . . . . . . .   51
Chapter 5: Predicting mutation effects using information scores . . . . . . . . . . . .    74
Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  92
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     95
                                             v


                                         Chapter 1
                                       Introduction
    The natural world is a complex dynamic system with a very high dimensionality, and it
can be challenging to understand how all the different parts interact with each other. There
are a number of aspects of natural systems that we do not yet fully understand, and in my
research I attempt to address a couple of them. At one end of the spectrum is abiogenesis:
the spontaneous emergence of self-replicating organisms. At the other end of the spectrum
is a relatively recent evolved organ - the brain, which is the computational processing unit
that allows organisms to solve highly complex problems and is the defining feature of what
are often considered to be the most advanced forms of life on this planet. I investigate
these problems using mathematical and computational tools, and show how these seemingly
disparate systems can be understood using very similar computational and mathematical
techniques.
    As with all scientific investigations, the first course of action is to create a model that
abstracts away the irrelevant details of the system that we want to investigate, while keeping
just enough of the details to be able to observe the desired behaviour. These models of the
systems can be represented in various forms; by observing the system in a petri dish, which
is a microcosm of the natural system we want to understand; by modeling the dynamics
of the system mathematically; by agent-based simulations, i.e. simulating the system in a
computer. In my work I focus primarily on the last of these approaches, namely digital
simulations, but also use mathematical models in order to better understand the behaviour
exhibited in the simulations.
1.1      How brains work
    The functioning of brains (and central nervous systems in general) allows organisms to
solve highly complicated tasks and enable complex social dynamics within species, which
increase fitness substantially. The human brain is arguably the best example of a complex
logic processing unit, and understanding how the brain works is one of the most important
                                                1


questions that we want to answer. While understanding the structure and function of this
fascinating organ is a worthwhile endeavour in and of itself, the practical application of such
an understanding would be the ability to design and create artificial systems with similar
properties. General purpose intelligence in artificial systems would be a paradigm-altering
technology, with probably more impact on our society than any preceding technological
innovation, and would allow us to solve a multitude of different tasks that currently require
human beings to undertake.
    Human brains (and brains in general) are extremely complex organs with lots of moving
parts. While there is a lot of regularity in the structure of the brain, and we are able to
identify some functional regions of the brain (e.g., we can identify the sections of the brain
that are specifically involved in vision), there are still many gaps in our understanding of
how the regions work, and how they work in conjunction with other sections of the brain.
In recent times, technologies such as MRI scans have allowed us to look at extremely small
regions of the brain with extremely high precision (Nowogrodzki, 2018). Observing the
changes in neuronal activity when the organism is subjected to different external stimuli
gives insight into the structure of the brain, and how the different parts interact with each
other. However, generating these neuronal recordings and analysing them is quite expensive,
and we are still able to only look at a very tiny proportion of the entire brain at a given
time.
    In chapter 2 I use a digital cognitive architecture called Markov Brains in order to in-
vestigate some aspects of how brains work. Markov Brains are known to exhibit complex
problem solving behaviour, so understanding such digital cognitive architectures can give
us insight into the structure of natural cognitive architectures (brains). In particular, in
the work I have done I look at how periodic changes in the environment of a species affects
their cognitive architecture. We know that many species have gone through multiple radical
transitions in their environments, and that the structure and function of their brains have
evolved in response to these changes. Understanding how environmental transitions affect
                                               2


the structure and function of digital cognitive architectures can yield insights into how the
structure and function of brains have been affected by environmental transitions in their
evolutionary history.
1.2      The origin of life
    Ever since Darwin introduced the theory of evolution we have had an explanation for how
complex organisms and behaviours can emerge. However, the theory of evolution presupposes
an original self-replicator that was the ancestor of life on Earth, and we do not yet have
a good theory that explains how this original self-replicator arose. There is no available
fossil record of the early progenitors of life, unlike the fossil record left behind by more
recent organisms. In addition, our understanding of molecular biochemistry is insufficient
to explain the spontaneous emergence of life from non-replicating molecules. We also have
only a single data point to work with, and while it is not unreasonable to assume that life
has emerged multiple times in the history of the universe, the emergence of life on Earth is
the only instance of this occurring for which we have concrete evidence. All these factors
make understanding the origin of the first self-replicator (abiogenesis) very difficult.
    There are a number of questions regarding abiogenesis that we would like to answer. The
most obvious question is the precise details of how life emerged on earth, but this knowledge
alone would not be a general theory that could predict the likelihood of spontaneous emer-
gence of life given an underlying biochemistry. A general theory of abiogenesis would enable
us to answer fundamental questions that go beyond the single instance of abiogenesis that
we know has happened on earth. For example, one of these questions is the role of historical
contingency in the emergence of life. As Gould asked (Gould, 1990), if Earth’s history were
to be replayed a thousand times, how often would it result in a biosphere? Is the emergence
of a biosphere inevitable given the underlying chemistry, or was it an extremely unlikely
occurrence? Also, would the resulting biochemistry be very similar to the one we have now,
or would it be radically different?
    There are a number of different approaches that address the question of abiogenesis in
                                               3


a way that is removed from the particularities of Earth’s bio-chemistry, which include tools
from computer science, information theory, and statistical physics. The intent behind these
approaches is to abstract away from the specific details of Earth’s chemistry in order to reach
a more fundamental understanding of the nature of abiogenesis. Of course, the ideas and
theories developed by these, largely mathematical, approaches would be used to inform the
investigation of the spontaneous emergence of life on Earth itself.
    In chapter 3 I use the digital system Avida to generate a complete fitness landscape of
self-replicating organisms. Since the Avida system has been shown to exhibit many of the
fundamental properties of biological evolution, it is reasonable to investigate how life could
have emerged in this digital system. The complete fitness landscapes of primordial self-
replicators can be used to shed light on the structure of biological self-replicator landscapes,
even though we cannot generate a complete fitness landscape for biological systems. The
digital fitness landscape is used in this work to study the role of historical contingency
in the origin of life - in a digital system the question asked by Gould is answerable: we
can replay the evolution of life as many times as we want, and look at how similar, or
different the resulting digital biospheres are from each other. We can use the structure of
the digital primordial landscape to see how the mutational landscape of self-replicators affect
the outcome of evolution. The ideas developed in chapter 3 can perhaps guide investigations
into the nature of how life emerged on Earth. I also look at how information is encoded in
the self-replicating Avidian sequences, where the information that is encoded is exactly the
information the sequence has about its ability to undergo self-replication.
1.3      Predicting function from sequence
    Biomolecular sequences such as DNA sequences typically have some function in their
environment, i.e., they encode the “instructions” necessary to perform some task that is
necessary for the organism’s fitness. This function is not necessarily a binary function, i.e.,
different sequences that perform the same function may not be equally efficient at performing
that function. Being able to predict the level of functionality of a particular sequence is an
                                               4


active area of research, as this ability has a large number of applications. For example, in the
design of HIV anti-viral drugs it is important to determine how effective some particular drug
will be in reducing the activity of the HIV protease. Testing every candidate sequence in vivo
(i.e., in a natural organism) is infeasible, and reducing the candidate search space effectively
is extremely valuable. There are other domains outside the are of molecular biology where
the ability to predict function is useful. An example would be neuronal firing patterns; the
activity of neurons in different brain regions can be represented as sequences which encode
how the brain is processing a particular set of inputs, and mapping the function of the brain
from neuronal firings can give us a deeper understanding of how brains work.
    Currently there are a number of different approaches used to predict the function of a
biomolecular sequence. One approach is to model the underlying biochemistry and then
simulate the activity of these molecules in a computational system. This approach relies
on being able to model the underlying chemistry accurately, which is difficult, while also
keeping the model computationally tractable, which becomes increasingly harder as more
aspects of the underlying biochemistry are incorporated into the model. Other approaches
typically use a large number of existing sequences (usually a multiple sequence alignment
(MSA)) that are known to have the particular function that is of interest, in order to train
a model that “learns” how the function is encoded in the sequences (Hopf et al., 2017;
Riesselman et al., 2018). These approaches have had moderate success, but have a number
of issues: they tend to be unable to extract epistatic effects in the sequences, at least when
there are multiple sites in the sequences that are in epistasis. Deep learning models have
been proposed that take into account higher order correlations in the sequences, but they
are still unable to precisely modulate the degree of correlation; they learn the entire data set
taking into account all correlations at once. These models also run into the issue of being
computationally expensive as they involve millions of parameters that need to be fit to the
training data. The deep learning models also have the fundamental limitation of not being
able to distinguish between the signal and noise in the training data set and these models
                                                  5


typically need to be tweaked in order to separate out the noise from the signal.
    In chapter 4 I develop a novel technique of predicting function from individual sequences
that relies on the observation that the function of a sequence is exactly the “information” the
sequence has about how to perform the function in its environment. This observation allows
established information-theoretic analyses, which have traditionally only been used to cap-
ture the information content of entire ensembles, to be extended for the purpose of capturing
the information content of individual sequences by assigning each sequence an “information
score”. This information-theoretic approach addresses many of the issues prevalent in clas-
sic approaches; the information-theoretic approach is computationally cheaper than other
approaches, and does a good job of distinguishing between noise and signal in the training
data, since the information content is being explicitly extracted from the training data. I
also show that this approach recapitulates the information content of the entire training
data when the information score of all sequences are aggregated. In chapter 5 I validate this
approach on a real world data set - the binding ability of the WW domain of the YAP-1
protein to bind to its polyproline peptide ligand (Araya et al., 2012), and show that the
information-theoretic approach outperforms the best known existing approaches.
1.4      The utility of using digital systems
    Observing systems in the wild can be extremely difficult and expensive, if not outright
impossible in some cases. For starters, not all aspects of the system are easily observable,
especially on very small scales of time and size. In addition, replicability can be hard to
achieve, and perfect replicability is impossible.
    Mathematical models, on the other hand are very precise and completely replicable.
They have been used to study natural systems to great effect (Wigner, 1990). However,
there are limits on the kinds of dynamics that can be modeled mathematically, and for high
dimensional systems with many variables, completely analyzing the system is intractable.
An important aspect that mathematical models have difficulty with is stochasticity. Random
events are pervasive in natural systems, and can have profound effects on the behaviour of
                                               6


systems, which are not easily modeled.
    Digital systems have none of the issues mentioned above. With the power of modern
computers, extremely complicated systems with many moving parts can be simulated in
reasonable amounts of time. This approach has been used successfully in various fields
(Adami et al., 2016). Additionally, every aspect of the system can be analyzed in as much
depth as we care about, a benefit that is a natural consequence of being able to perfectly
replicate the observed behaviour. Random events can be simulated with ease, and because
they are fundamentally only pseudo-random, they can be replicated with ease. As a result,
simulating complex systems in a digital environment has a number of benefits, and is the
reason I use these systems in my research.
1.5      The utility of information theory
    Among the mathematical tools we have available, information theory is an extremely
powerful approach, as at its core it attempts to describe the information content of systems.
While this theory has been known in some form or the other for several decades (Shannon,
1948), it has notoriously been an under-appreciated tool when analyzing complex systems. In
particular, classical formulations have only been able to describe the information content of
entire ensembles, rather than individual components of the system (Szostak, 2003). Applying
this theory to be able to capture the information content of components of a system has never
been successfully done.
    In the work done in chapter 4 I develop a formulation that can be used to describe the
information content of individual components of a system, and show that this can be useful in
a variety of ways. By using the power of digital systems to be able to generate large amounts
of data (in particular, complete landscapes, which is impossible for non-digital systems), I
validate the formulations that I developed.
                                               7


                                         Chapter 2
       The cognitive structure and function of brains evolved in
                       periodically changing environments
    This chapter is published in (CG et al., 2018), and is a result of work done in collaboration
with Barbara Lundrigan, Laura Smale, and Arend Hintze.
2.1     Abstract
    Natural organisms have transitioned from one niche to another over the course of evolu-
tion and have adapted accordingly. In particular, if these transition go back and forth be-
tween two niches repeatedly, such as transitioning between diurnal and nocturnal lifestyles,
this should over time result in adaptations that are beneficial to both environments. Fur-
thermore, they should also adapt to the transitions themselves. Here we answer how Markov
Brains, which are an analogue to natural brains, change structurally and functionally when
experiencing periodic changes. We show that if environments change sufficiently fast, the
structural components that form the brains become useful in both environments. However,
brains evolve to perform different computations while using the same components, and thus
have computational structures that are multifunctional.
2.2     Introduction
    Changing fitness landscapes affect evolutionary dynamics profoundly (Wilke et al., 2001a;
Laakso et al., 2001). Slow changes can favour specialists, while generalists can have an
advantage in environments that change often (Travis and Travis, 2002; Li and Wilke, 2004).
At the genetic level, changing environments modify the effect of mutations, and can alter
their rate of fixation. A mutation that is beneficial in one environment could be even more
beneficial in the next, thereby accelerating fixation. Conversely, a deleterious mutation effect
can be mitigated by an environmental change, thus allowing valley crossings to occur more
often.
    As the fossil record makes abundantly clear, the history of the planet is peppered with
                                                8


sudden, drastic environmental changes that impose vastly different constraints on surviv-
ing species, e.g. The Great Oxidation Event. Another example is the mass extinction of
dinosaurs in the cretaceous, which allowed previously nocturnal rodents to fill the diurnal
niche recently vacated by sauropods (Gerkema et al., 2013; Maor et al., 2017). Transitioning
from a nocturnal to a diurnal lifestyle, or vice versa, is arguably one of the toughest transi-
tions to make for a vertebrate. A transition of this magnitude requires that not only entire
metabolic pathways adapt to the changed temperature, but also that the sensory modalities
adapt to cope with different constraints on foraging. A nocturnal lifestyle either requires
extreme adaptations of the vision system to deal with the lack of light, or requires that
organisms rely on other sensors, such as olfaction or echolocation in the case of bats. On
the other hand, the light of day allows distant objects to be detected easily, emphasizing
the importance of vision. On the downside, the higher temperature during the day causes
odors to evaporate more quickly, reducing the importance of olfaction. Lastly, a new envi-
ronment entails new predators (who may already have adapted to this environment), and
hence the entire behavioral repertoire with respect to predation response must be adapted.
As a consequence, we expect to see wide ranging structural and functional changes in the
neural architecture of organisms that have to adapt to such drastic changes. Additionally,
the changes between noctural and diurnal preference is also believed to drive speciation (San-
tini et al., 2015). We expect that periodic transitions not only force the organism to evolve
to both environments, but may also cause the organism to adapt to the transitions them-
selves (as observed in the digital evolutionary model from (Li and Wilke, 2004)). Here, we
are specifically interested in these repeated changes, since they should drastically affect the
organization and topology of the evolved neural architectures.
    There have been several computational approaches that have tried to address this ques-
tion. For example, it has been shown that neural modularity can mitigate catastrophic
forgetting (Ellefsen et al., 2015), i.e. modular structures allow organisms to retain cognitive
functions when experiencing a different environment for a number of generations. At the
                                                9


same time, it is not clear that changing environments necessarily promote modular struc-
tures (Hüsken et al., 2002; Schlosser and Wagner, 2004; Kashtan and Alon, 2005; Hintze and
Adami, 2008). At the genetic level, it has been shown that an organism’s reproductive strat-
egy affects the modularity and epistasis of the underlying genetic architecture (Misevic et al.,
2006). At the moment, it is unclear if these results will translate to neural architectures.
    Here we study how an agent controlled by a Markov Brain (Edlund et al., 2011; Marstaller
et al., 2013; Hintze et al., 2017) evolves to forage in an environment that periodically changes
over the course of evolution. In one environment, the agent can “see” the food, while in the
other the agent can only “smell” the food. We show that periodic transitions cause agents to
rearrange their neural architecture in such a way that their structural components become
useful in both environments - a phenomenon we term structural amalgamation. However, we
also show that the structurally amalgamated brains do not create functionally overlapping
modules. The brain states that an agent experiences are mostly unique to each environment,
and are not shared between environments; the degree to which functional states are shared
we term functional amalgamation.
2.3       Materials and Methods
2.3.1       Evolutionary Framework
    All experiments were performed using the MABE (Bohm et al., 2017) framework. This
allows us to evolve different neural substrates, in different environments, and using different
selection methods, and allows the results of these experiments to be compared directly. The
specific MABE modules used in these experiments are detailed below.
2.3.2       Markov Brains
    Using natural organisms for an evolutionary experiment such as we are going to perform
here, would be extremely time consuming, inconvenient, and costly. As an alternative, we
use evolvable Markov Brains (MBs) (Edlund et al., 2011; Marstaller et al., 2013), which
are networks that can be composed of probabilistic and deterministic logic gates (for a
detailed technical description of Markov Brains see (Hintze et al., 2017)). Here we only use
                                                 10


deterministic gates. Each MB is encoded by a genome, which contains genes, and each gene
encodes one of the logic gates by specifying what function it performs and how it connects
to other gates, the sensors, and the actuators of the agent the MB controls. Figure 2.1
shows the schematic of an MB, and Figure 2.2 shows the encoding that is used to construct
MBs from the genome. Among other applications, MBs have been used to understand the
evolution of animal behavior (Olson et al., 2013), decision making (Kvam et al., 2015), and
rates of evolutionary adaptation in different contexts (Schossau et al., 2016). These MBs
are used to control an agent that interacts with the environment using different kinds of
sensors, actuators, and internal machinery. We study how these MBs respond to evolutionary
pressures, as an abstract model of a natural brain. Clearly, neurons, tissues, sensory cells,
and muscles are very different from logic gates and the binary sensors and actuators we use
here. However, the analogy between these systems can be drawn on a different level; both
systems have components that are involved in different processes, and we are interested in
how evolution affects how these components are used, and in turn how their use affects their
future evolution. In natural organisms we can ablate cells (say in the visual cortex), and
observe if the loss of these cells modulates behavior. Similarly, we can toggle logic gates
on and off (like a cell ablation or genetic knockout), which allows us to draw analogous
conclusions about the function of those logic gates. As a consequence, our results will not
make specific implications about natural neural architecture, but rather how functionality is
distributed over the components optimized by selection.
2.3.3      Environment
    Agents are tasked with solving a foraging problem and their performance is measured by
how much food they can collect over a specified period of time. The environment is a square
2D grid with periodic boundary conditions (effectively turning the grid into a torus). There
are 20 pieces of food randomly distributed over the grid, and the size of the grid is 10 × 10.
When an agent is directly on top of a resource, it can consume it. To keep the quantity
of food constant, each time a food is consumed, a new piece of food is placed randomly
                                             11


Figure 2.1: Schematic view of a Markov Brain Network. The sensory inputs, hidden states,
and motor outputs at time point t at the top are read by two gates. The result of their
computation propagates to new states at time point t + 1. In case two gates try to write
into the same state, their outputs will be integrated by applying a logical OR operation.
Figure 2.2: Encoding of a Markov Brain. A string of numbers is used as the genome. The
genome is read from end to end, and specific pairs of numbers (42,213 for deterministic logic
gates) indicate the start of a “gene”. Each gene encodes the connections of one logic gate:
red sites are used for the number of inputs, green for the number of outputs , blue for the
actual addresses of the states the gate reads from, and yellow for the actual addresses of the
states the gate writes into. The remaining cyan colored sites are used to specify the logic of
the gate.
into the environment on an empty space. An agent can take one of 4 different actions at
every time step, namely, move forward, turn 45 degrees left or right, or attempt to consume
food, which is only successful if food is directly under them. The environment can be of two
different types, which we call day and night. During the day, agents can see food and not
smell the food, while during the night, agents can only smell food and not see it. Sensors
are implemented as four separate binary inputs for each type of environment. In the case
of vision, the sensors light up if the agent is directly facing the food, and within 4 steps
of the food. The number of sensors that light up is equal to 4 minus the number of steps
                                                12


               Day:                                  Night:
               Sensors:
                       : Sensor Off     : Sensor On      : Agent       : Food
Figure 2.3: Environment types as perceived by an agent. An agent is represented by a
squiggle and food by a star. The Day panel shows the agent’s line of sight with blue squares.
The Night panel similarly shows the region that can be smelled by the agent with blue
squares. The sensors that light up in the respective environments are shown below each
panel.
from the closest food (see Figure 2.3 day). In the case of olfaction, the sensors light up if
the agent is within a Manhattan distance of 4 from the food, in any direction. Similarly to
the day environment sensors, the number of sensors that light up in the night environment
is equal to 4 minus the Manhattan distance to the closest food (see Figure 2.3 night). In
either environment, if all the corresponding sensors of an agent light up, this indicates that
the agent is standing directly on top of the food. This allows agents to have similar sensory
information that correspond to both the environmental types.
    Agents experience an environment for 100 time-steps and can collect food during this
time. The environment can technically change from day to night in two different ways: either
during the lifetime of the agent, which allows them to experience both types of environments
during their lifetime, or only over generational time-scales. Here we are interested in the
changes to the neural architecture and topology resulting from generational changes to the
                                              13


environment, and as a result, individual agents will only ever encounter a single type of
environment during their lifetimes.
2.3.4      Selection Method
    Populations of agents were allowed to evolve for 10,000 generations in periodically cy-
cling environments. At each generation, agents were selected to propagate offspring to the
next generation, proportionally to their ability to collect food, which implements a Moran
process (Moran, 1958).
2.3.5      Amalgamation
    To understand how the architecture and function of the evolved MBs are affected by the
periodic environmental changes, we perform two kinds of analyses on these agents.
    Knockout Experiments and Structural Amalgamation
    After agents were allowed to evolve, their performance along the line of decent (LOD) (Lenski
et al., 2003) (the path from a random organism in the final population traced over its an-
cestry, back to the start population), was reconstructed. Each agent on the LOD was then
tested 50 times over 500 time steps, both in a purely day environment and in a purely night
environment in order to more accurately assess an individual’s ability to forage in these dif-
ferent types of environments. This also defined the baseline of performance for subsequent
knockout experiments.
    Each agent has a set of logic gates that define what we call structural components. Obvi-
ously each gate also conveys a function, but we assess the functional contribution differently
(see functional amalgamation). After all, different functions could be implemented on the
same computational structure, analogous to how computers are capable of running different
programs using the same hardware. We test the contribution of each gene by sequentially
removing one gate after another and testing how the agent performs in both the day and
night environments. The loss (or gain) in performance, for each individually knocked out
component, defines the extent to which it is involved in computations pertaining to the day
or night type of environment, and to what degree this gene is involved in both computations.
                                                14


    Modularity measures the degree of connectivity of a component within a module as op-
posed to its connectivity to the remaining components. Since evolved MBs have to solve at
most two tasks, we would a priori expect them to have either one or two modules. How-
ever, typical modularity measures do not differentiate between having only one module or
no modules at all (for an elaboration on this topic see (Hintze and Adami, 2010)). We
will therefore quantify the “Amalgamation” of the evolved brains and not the modularity.
Specifically, we want to quantify to what degree the genes that determine the function and
connectivity of logic gates in MBs, are involved in diurnal or nocturnal foraging, and if they
are needed in both environments.
    If all gates are used in both environments we argue that the brain is fully amalgamated.
To that end, performance was measured in both environments as Wday and Wnight . After
that, each component was knocked out one after another, and the performance after each
knockout in each environment was assessed as Kday,i and Knight,i (where i indicates the
index of each component). We can now quantify the contribution of each component i to
the performance of day and night as:
                                    ∆Wday,i = Wday − Kday,i                               (2.1)
                                ∆Wnight,i = Wnight − Knight,i                             (2.2)
    A gate (i) that contributes to only one type of environment, say the day, will have a high
∆Wday,i and a low ∆Wnight,i . A gene that contributes to both types of environments, will
however result in both delta values being high, while an obsolete or redundant gene will show
no effect on the deltas. Additionally, we observed cases where, when genes are deleted, the
performance of the agent in one or both environments increased, effectively creating negative
values for ∆Wnight,i or ∆Wday,i . Since we are only concerned with the degree of the effect
the deleted gate has, we used the absolute of these delta values. The overlap or degree of
amalgamation for each gene Ai can now be quantified as the lower of the values |∆Wnight,i |
                                              15


and |∆Wday,i | for each gene i. The total degree of structural amalgamation (AS ), normalized
to the number of genes N is thus:
                                       N
                                   1 X
                            AS =          min(|∆Wday,i |, |∆Wnight,i |)                 (2.3)
                                   N i=0
    A brain that has high AS is a highly amalgamated brain, in that more of its structural
components are involved in the function of both types of environments.
    State Transition Experiments and Functional Amalgamation
    We also analyzed the “experience” that brains have in both types of environments. The
experience of a brain can be captured by recording the internal states of a brain. This
is analogous to experiments done on natural organisms’ brain states (Werner, 2009). In a
nutshell, the state a brain is in at every moment defines the experience the brain has. Since
we are working with digital organisms, we recorded the internal states of a brain at every
single update as well as the states of the environment. We then reconstructed the brain state
transition graph. This graph shows how each brain state transitions to the next one, based
both on the sensory inputs the agent receives, and what the agent does. In this graph, each
node represents a brain state (experience) and edges show the transitions from state to state
given the inputs. By running the agent in different environments, we color coded which node
was visited in what environment. We expect some states (experiences) to be visited only in
one environment, while other states might be visited in both environments. This allows us
to quantify how much each state of the brain is involved in experiencing and processing each
environment (e.g. Fig 2.9). Functional amalgamation (AF ) is therefore V (the number of
states visited in both environments) over N (the total number of states experienced):
                                                   V
                                            AF =                                        (2.4)
                                                   N
                                                16


2.4     Results
    We evolved populations of 100 agents starting with the day or night environment and
switched the environments in periodic conditions (no switches, every 1 to 30 generations,
40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 500, 750, 1000, 2500, and 5000 generations) over
10,000 generations. For each experimental condition, 40 replicates were performed. For each
replicate the line of decent (LOD) was reconstructed (Lenski et al., 2003) and each agent on
the LOD was tested in the day and night environment again.
    When evolving agents in only the day or night environment we find that performance
over evolutionary time converges to a maximum that is different for the day and night
environment (see Figure 2.4). We conclude that in this computational model vision is the
superior sense, since it does not require agents to forage for food as much as olfaction requires
them to. This also shows that evolving from the night to the day environment is easier than
the other way around, which creates an asymmetry within the experiment. However, this is
also true for natural evolution, where the transitions between environments are often easier
in one direction than the other.
    Generally, agents experiencing environmental changes very often (up to about changes
every 10 generations) keep improving their performance regardless of the transitions over the
entire course of the experiment (see Figure 2.5 panel 1 and 2). Agents whose environment
changes less often (every 20 to 50 generations) improve their performance less well. They
encounter an occasional drop in fitness once in a new environment (see Figure 2.5 panel 3).
However, they still manage to optimize their performance overall, just not to the degree they
would have in a constant environment (compare to Figure 2.4) or if switches occurred more
frequently (see Figure 2.5 panel 1 and 2). Lastly, if environmental switches happen rarely
(above 100 generations) agents keep experiencing a drop in fitness once the environment
switches, and they are not capable of improving their general performance further over time
                                               17


      80                                             80
      70   2                                         70   1
      60                                             60
      50                                             50
W     40                                       W     40
      30                                             30
      20                                             20
      10                                             10
       0                                              0
           0   2000 4000 6000 8000 10000                  0   2000 4000 6000 8000 10000
                 Generations                                    Generations
Figure 2.4: Average fitness of agents from the LOD evolved in only the night environment
(panel 1) or the day environment (panel 2) without any periodic changes. The fitness mea-
sured in the day environment is shown as a dashed line. The fitness in the night environment
is shown as a solid line. Gray shadows indicate the standard error over all 40 replicates.
   (see Figure 2.5 panel 4).
   For an overview of the attained fitness in the last generation, i.e. the performance after
10,000 generations, for different periods between switches, see Figure 2.6.
   These results confirm that when environments change rapidly enough agents can not only
adapt to one environment at a time, but also adapt to the transition itself (Li and Wilke,
2004). We did not measure if the loss of function when transitioning to a new environ-
ment is in itself adaptive i.e. if the rate of function loss decreases over time. However, how
this adaptation works in agents controlled by Markov Brains is likely to be different than
in (Li and Wilke, 2004). This previous work used AVIDA (Ofria and Wilke, 2004) which
evolves sequences of computer commands that control each individual. Rapidly changing
environments cause those commands beneficial in one environment to be interspersed with
commands that are beneficial in the other environment. Markov Brains do not have this
sequential structure but instead form networks, which in our opinion much more closely re-
semble actual neural structures. The question now is how Markov Brains change structurally
                                             18


and functionally under these conditions.
2.4.1       Structural Amalgamation
    We defined the degree of structural amalgamation as the degree to which the computa-
tional components (logic gates) are used in both environments. We find that at the end of
evolution, the degree of structural amalgamation depends strongly on the speed at which
environments switch periodically (see Figure 2.7).
    This result shows that fast changing environments indeed cause structural components of
the brain to be used in both environments, while no changes or rarely changing environments
do not cause structural amalgamation. This is not too surprising once the selection pressures
these agents experience are considered. After a switch in environment, the selection pressure
for components involved in the previous environment is lost, and thus genes (gates) start to
drift. The only way this drift could be halted is by ensuring that these components are now
also useful in the new environment.
2.4.2       Functional Amalgamation
    As explained before, amalgamating the structural components of the brain does not imply
that functions have to overlap or become intertwined as well. At the same time, structural
amalgamation also does not preclude this from happening. We find that the degree of
functional amalgamation does not depend on the frequency of periodic transitions (see Figure
2.8) and is generally low (less than 30% of states are experienced in both environment).
2.5       Discussion
    We showed that MBs evolved to control an agent in periodically switching environments,
adapt not only to each environment anew after a switch, but also experience structural and
functional reorganization due to the switches. As long as environments change frequently
enough (less than 50 generations in each environment) agents do not experience a significant
loss in fitness after a switch, and can improve fitness over many switches. Natural organisms
probably experience spontaneous or periodic switches on much larger timescales, however
natural organisms certainly evolve much slower than the digital ones we used here. We argue
                                               19


that these two effects balance each other and that our results apply to natural organisms
qualitatively if not necessarily quantitatively.
    The difference between structural and functional amalgamation is an important obser-
vation. While agents amalgamate their components (genes/gates) to be used in both en-
vironments, they evolve to have state to state transitions organized so that states used in
one context are separated from states used in another. In hindsight this might not be too
surprising, since this form of organization separates mental states from each other as much
as possible; environments are experienced differently instead of similarly. Also, at every time
point, MBs can only be in one particular state, and a new input will change the state of the
brain to a new state, given the current one. As a consequence, specific states can only be
reached by a specific sequence of inputs. When the agent finds itself in a new environment
that works differently than the one it evolved in, it is unlikely that a similar sequence of
inputs will ever be observed. Instead, a different stream of inputs will necessarily lead to a
different sequence of brain states.
    What is particularly surprising, is the finding that even brains that have never experienced
a different environment, also do not have significant functional amalgamation (see Figure 2.7
no transitions). Brains that have only evolved in one environment have no direct selection
pressure against functional amalgamation. As a consequence, there should be no particular
preference or aversion to functional amalgamation, however we still observe that only a
few states are shared between both environments. This suggests that agents experience the
environment they are evolved in properly, but when they encounter a new environment, they
are experiencing distinctively different states. Again, figuratively speaking, when you see
something unknown, instead of trying to explain how this new perception fits to something
you already know you instead conclude that you haven’t seen this before. The state to state
transitions for evolved MBs resemble this principle. Observe that this is at best a loose
analogy, albeit an interesting one.
    We predict that sufficiently short periodic environmental changes should have a similar
                                                20


amalgamating effect on the structure of natural brains while preserving functional separation.
Analyzing the anatomy of natural brains across differently evolved species should reveal this
principle to be applicable to natural organisms as well.
    Further investigations will consider the effect of periodic and aperiodic switching of en-
vironments on the evolvability and robustness of the computational systems. Similarly, we
are curious about the effect of periodic transitions on agents that experience both the day
and night environment during their lifetime. These studies will shed more light on the effect
of evolutionary transitions on neural architectures.
                                              21


      80                                            80
      70   1                                        70   2
      60                                            60
      50                                            50
W     40                                      W     40
      30                                            30
      20                                            20
      10                                            10
       0                                             0
           0   2000 4000 6000 8000 10000                 0   2000 4000 6000 8000 10000
                 Generations                                   Generations
      80                                            80
      70   3                                        70   4
      60                                            60
      50                                            50
W     40                                      W     40
      30                                            30
      20                                            20
      10                                            10
       0                                             0
           0   2000 4000 6000 8000 10000                 0   2000 4000 6000 8000 10000
                 Generations                                   Generations
Figure 2.5: Average fitness of agents from the LOD evolved with periodic environmental
switches 1) every 4 generations, 2) every 10, 3) every 50, and 4) every 100 generations. The
fitness measured in the day environment is shown as a dashed line. The fitness in the night
environment is shown as a solid line. Gray shadows indicate the standard error over all 40
replicates.
                                            22


       70
       60
       50
  W    40
       30
       20
       10
        0
              0            20            40             60            80           100
                                               Period
Figure 2.6: The average fitness of organisms starting their evolution in the day environment.
The solid line shows their performance in the night environments, and the dashed line their
performance in the day environment for different periods between switches (x axis) after
10,000 generations. Each data point is the average of 40 replicate experiments, error bars
show the standard error.
       17.5
       15.0
       12.5
       10.0
  AS
        7.5
        5.0
        2.5
        0.0
                  0         20            40            60            80           100
                                               Period
Figure 2.7: Structural amalgamation (AS ) for agents evolved under different periodic condi-
tions (x axes). Agents that started to evolve in the night environment are shown as a solid
line, and agents that started to evolve in the day environment are shown as dashed lines.
Error bars indicate the standard error. Observe that the left most x-value (0) represents
agents that never experienced any environmental change, and thus don’t exhibit any struc-
tural amalgamation.
                                               23


       0.350
       0.325
       0.300
       0.275
  AF   0.250
       0.225
       0.200
       0.175
                0            20            40            60           80           100
                                                Period
Figure 2.8: Functional amalgamation (AF ) for agents evolved under different periodic con-
ditions (x axes). Agents that started to evolve in the night environment are shown as a solid
line, and agents that started to evolve in the day environment are shown as dashed lines.
Error bars indicate the standard error.
                                             24


                                                           0
                                                 V2         O0,V1,O1,V0
                                 35_R                    35_E
                                   V1,V2     V1,V2,V0
                             0   35_M     V0,V1,V2     0,O1,V3
                                    0,V3 V2                     O2
                                        3_M    0,V2,O3
                                                     O2
                                                 O3          4_M
                                                        O1        O1,O2
                                            O3,0   99_L            100_L        O1
                                           O3,0            O2     O2               V3,O3,0,O2
                                                   O2     100_M      O2   O1
                               V2,V1                      O3        O1
                                     0,V3,O3       99_M              98_M
                                                                 O1,O0       O2
                                                     99_E
                                                       O2,O0 O0,O1      O2
                                                               34_M
                                                                    O2
                                                                    65_R
                                                                       O2
                                                                     3_R
Figure 2.9: A state to state transition graph for an agent evolved only in the night envi-
ronment. The nodes represent individual brain states and their corresponding actions (E
eating, M for moving forward, R for turning right, and L for turning left). Arrows are labeled
according to the sensory inputs experienced during the lifetime of the agent. A label starting
with O indicates an olfactory input, while V indicates a vision signal. The number trailing
the O or V specifies the distance of the food. After testing the agent multiple times in both
environments, the states experienced during the day environment are colored in blue, and
the night environment colored in red. The states visited in both environments are colored
in green. The width of the edges indicates how often each transition was observed relatively
to each other. Agents start in state 0.
                                                             25


                                         Chapter 3
              The structure of a primordial fitness landscape
    This chapter is published in (CG et al., 2017) and (Nitash and Adami, 2021), and adapted
from both publications. The work presented here was done in collaboration with Thomas
LaBar, Arend Hintze, and Christoph Adami.
3.1      Abstract
    While all organisms on Earth descend from a common ancestor, there is no consensus on
whether the origin of this ancestral self-replicator was a one-off event or whether it was only
the final survivor of multiple origins. Although resolution of this mystery would enhance our
knowledge of the fundamentals of biology, including the likelihood of life on other planets,
the number of uncertainties surrounding the origin of life on Earth impede progress. For
example, the possible varieties of ancestral self-replicators and their likelihood of spontaneous
emergence and persistence are unknown. Here we use the digital evolution system Avida to
study the origin of self-replicating computer programs. By using a computational system, we
avoid many of the uncertainties inherent in any biochemical system of self-replicators (while
running the risk of ignoring a fundamental aspect of biochemistry). First, we generated
the exhaustive set of minimal-genome self-replicators and analyzed the network structure
of this fitness landscape. We then examined the evolvability of these self-replicators and
found that the evolvability of a self-replicator is dependent on its genomic architecture.
Finally, we looked at the differential ability of replicators to take over the population when
competed against each other, akin to a primordial-soup model of biogenesis, and found
that the probability of a self-replicator out-competing the others is not uniform. Instead,
progenitor (most-recent common ancestor) genotypes are clustered in a small region of the
replicator space. Our results demonstrate how computational systems can be used as test
systems for hypotheses concerning the origin of life.
                                                26


3.2      Introduction
    There is perhaps no topic in biology more fascinating–and yet more mysterious–than the
origin of life. With only one example of organic life to date, we have no way of knowing
whether the appearance of life on Earth was an extraordinarily rare event, or it if was a
commonplace occurrence that was unavoidable given Earth’s chemistry. Were we to replay
Earth’s history one thousand times (Gould, 1990), how often would it result in a biosphere?
And among the cases where life emerged, how different or how similar would the emergent
biochemistries be?
    The role of historical contingency has been studied extensively in the evolution of life (see,
e.g., (Blount et al., 2008) and references therein). Here we endeavour to ask an even more fun-
damental question: What is the role of historical contingency in the origin of life? The best
evidence suggests that the first self-replicators were RNA-based (Gilbert, 1986; Robertson
and Joyce, 2012), although other first self-replicators have been proposed (Leslie E, 2004).
Given the large number of uncertainties concerning the possible biochemistry that would
lead to the origin of self-replication and life, either on Earth or other planets, researchers
have begun to study the process of emergence in an abstract manner. Tools from com-
puter science (Pargellis, 1996; Hutton, 2002; Pargellis, 2003; Dorn et al., 2011; Walker and
Davies, 2013; Greenbaum and Pargellis, 2017), information theory (Walker, 2014; Adami,
2015; Adami and LaBar, 2017a; Davies and Walker, 2016), and statistical physics (England,
2013; Mathis et al., 2015) have been used in an attempt to understand life and its origins
at a fundamental level, removed from the peculiarities of any particular chemistry. Investi-
gations along those lines may reveal to us general laws governing the emergence of life that
are obscured by the n = 1 nature of our current evidence, point us to experiments that
probe such putative laws, and get us closer to understand the inevitability–or perhaps the
elusiveness–of life itself (Cronin and Walker, 2016).
                                                27


    At the heart of understanding the interplay between historical contingency and the origin
of life lies the structure of the fitness landscapes of these first replicators, and how that
landscape shapes the biomolecules’ subsequent evolution. While the fitness landscapes of
some RNA-based genotypes have been mapped (Jiménez et al., 2013; Petrie and Joyce,
2014) (and other RNA replicators have been evolved experimentally (Mills et al., 1967)), in
all such cases evolution already had the chance to shape the landscape for these organisms
and “dictate”, as it were, the sequences most conducive for evolution.
    The structure of primordial fitness landscapes, in comparison, is entirely unknown. While
we know, for example, that in realistic landscapes highly fit sequences are genetically close
to other highly fit sequences (this is the essence of Kauffman’s “Central Massif” hypoth-
esis (Kauffman, 1993), see also (Østman et al., 2010)), we suspect that this convenient
property–which makes fitness landscapes “traversable” (Østman et al., 2010)–is an outcome
of evolution, in particular the evolution of evolvability. What about primordial landscapes
not shaped by evolution? How often are self-replicators in the neighborhood of other self-
replicators? Are self-replicators evenly distributed among sequences, or are there (as in the
landscapes of evolved sequences) vast areas devoid of self-replicators and rare (genetic) areas
that teem with life? Can evolution easily take hold on such primordial landscapes?
    These are fundamental questions, and they are central to our quest to understand life’s
origins. If the fitness landscape consist of isolated fitness networks, as found in some mod-
ern RNA fitness landscapes (Jiménez et al., 2013; Petrie and Joyce, 2014), then one may
expect the effects of historical contingency to be strong, and the future evolution of life to
depend on the characteristics of the first replicator. However, if there exist “neutral net-
works” that connect genotypes across the fitness landscape (as found in computational RNA
landscapes (Huynen et al., 1996)) then the effect of history may be diminished. Can we learn
more about these options?
    Recently, we have used the digital evolution platform Avida as a model system to study
questions concerning the origin of life (Ofria et al., 2009a). In Avida, a population of self-
                                               28


replicating computer programs undergo mutation and selection, and are thus undergoing
Darwinian evolution explicitly (Pennock, 2007). Because the genomic content required for
self-replication is non-trivial, most avidian genomes are non-viable, in the sense that they can-
not form “colonies” and thus propagate information in time. Thus, viable self-replicators are
rare in Avida, with their exact abundance dependent on their information content (Adami,
2015; Adami and LaBar, 2017a). Further work on these rare self-replicators showed that
while most of them were evolvable to some degree, their ability to improve in replication
speed or evolve complex traits greatly varied (LaBar et al., 2015). Furthermore, the capabil-
ity of avidian self-replicators to evolve greater complexity was determined by the algorithm
they used for replication, suggesting that the future evolution of life in this digital world
would be highly contingent on the original self-replicator (LaBar et al., 2016). However,
all of this research was performed without a complete knowledge of the underlying fitness
landscape, by sampling billions of sequences of a specific genome-size class, and testing their
capacity to self-replicate.
    Sequences used to seed evolution experiments in Avida are usually hand-written (Adami,
1998, 2006), for the simple reason that it was assumed that they would be impossible to
find by chance. Indeed, a typical hand-written ancestral replicator of length 15 instructions
is so rare–were it the only replicator among sequences of that length–that it would take a
thousand processors, executing a million sequences per second each in parallel, about 50,000
years of search to find it (Adami and LaBar, 2017a). However, it turns out that shorter
self-replicators exist in Avida. An exhaustive search of all 11,881,376 sequences of length
L = 5, as well as all 308,915,776 sequences of length L = 6 previously revealed no self-
replicators (Adami and LaBar, 2017a). However, in that investigation six replicators of
length L = 8 turned up in a random search of a billion sequences of that length, suggesting
that perhaps there are replicators among the 8 billion or so sequences of length L = 7.
    Here, we confirm that the smallest replicator in Avida must have 8 instructions by testing
all L = 7 sequences, but also report mapping the entirety of the L = 8 landscape (268 ≈
                                                29


209 × 109 sequences) to investigate the fitness landscape of primordial self-replicators of that
length. Mapping all sequences in this space allows us to determine the relatedness of self-
replicators and study whether they occur in clusters or evenly in sequence space, all without
the usual bias of studying only sequences that are among the “chosen” already. Of the almost
209 billion possible genomes, we found that precisely 9141 could undergo self-replication and
reproduction, and thus propagate their information forward in time in a noisy environment.
     We found that these 914 primordial replicators are not uniformly distributed across ge-
netic space, but instead cluster into two broad groups (discovered earlier in larger self-
replicators (LaBar et al., 2016)) that form 13 main clusters. By analyzing how these groups
(and clusters) evolve, we are able to study how the primordial landscape shapes the evo-
lutionary landscape, and how chance events early in evolutionary history can shape future
evolution.
3.3       Methods
3.3.1       Avida
     We used Avida (version 2.14) as our computational system to study the origin of self-
replication. Avida is a digital evolution system in which a population of computer programs
compete for the system resources needed to reproduce (see (Ofria et al., 2009a) for a full
description of Avida). Each of these programs is self-replicating and consists of a genome of
computer instructions that encode for replication. During this asexual reproduction process,
mutations can occur, altering the speed at which these programs reproduce. As faster
replicators will out-reproduce slower replicators, selection then leads to the spread of faster
replicators. Because avidian populations undergo Darwinian evolution, Avida has been used
to explore many complex evolutionary processes (Lenski et al., 1999; Adami et al., 2000;
Wilke et al., 2001b; Chow et al., 2004; Covert et al., 2013; Goldsby et al., 2014; Zaman
et al., 2014).
     The individual computer programs in Avida are referred to as avidians. Each avidian
   1
     The sequences of all replicators can be downloaded from 10.6084/m9.figshare.4551559.
                                                     30


organism is essentially a stack based program, where values can be pushed and popped on to
the stack, serving as a form of memory (see Fig 3.1). They consist of a genome of computer
instructions and different containers to store numbers. Each genome has a defined start
point and instructions are sequentially executed throughout the avidian’s lifetime. Some of
these instructions allow the avidian to start the replication process, copy their genome into
a new daughter avidian, and divide into two avidians (see (LaBar et al., 2016) for the full
Avida instruction set). During this replication process, mutations can occur, causing the
daughter avidian’s genome to differ from its parent. These mutations can have two broad
phenotypic outcomes. First, mutations can alter the number of instruction executions re-
quired for replication; these mutations can increase or decrease replication speed and thus
fitness. Second, the fixation of multiple mutations can lead to the evolution of complex traits
in Avida. These traits are the ability to input binary numbers from the Avida environment,
perform Boolean calculations on these numbers, and then output the result of those calcu-
lations. In the experiments described here, avidians could evolve any of the nine one- and
two-input logic functions (Not, Nand, OrNot, And, Or, AndNot, Nor, Xor, and Equals).
This is usually referred to as the “logic-9” environment (Lenski et al., 2003).
    The ability to perform the above Boolean logic calculations (possess any of these nine
traits), increases its bearer’s replication speed by increasing the number of genome instruc-
tions the bearer can execute per unit of time. The more instructions an avidian can execute
during a unit of time, the fewer units of time that are required for self-replication. These
units of time are referred to as updates (they are different from generations). During each
update, the entire population will execute 30N instructions, where N is the current popula-
tion size. The ability to execute one instruction is called a “Single Instruction Processing”
unit, or SIP. If the population is monoclonal, each avidian will receive, on average, 30 SIPs.
However, every avidian also has a merit which determines how many SIPs they receive per
update. The greater the merit, the more SIPs that individual receives. The ability to per-
form the nine calculations multiply an individual’s merit by the following values: Not and
                                                31


Figure 3.1: Schematic of an Avidian organism, showing the CPU (logic unit), the stack
(memory), the genome, and the input-output interface to the environment.
Nand: 2, OrNot and And: 4, AndNot and OR: 8, Nor and Xor: 16, and Equals: 32.
    The Avida world consists of a fixed-size toroidal grid of cells. The total number of cells
sets the maximum population size. Each cell can be occupied by at most one avidian. After
successful reproduction, a new avidian is placed into one of the world’s cells. In a well-mixed
population, any cell in the population may be chosen. In a population with spatial structure,
the new avidian is placed into one of the nine cells neighboring the parent avidian (including
the cell occupied by the parent). If there are empty cells available, the new avidian occupies
one of these cells. If all possible cells are occupied, a cell is chosen at random, its occupant
removed from the population, and the new avidian then occupies this cell. This random
removal implements a form of genetic drift in Avida. For the experiments performed here,
the population structure was spatial.
3.3.2      Experimental Design
    In order to map the entire Avida fitness landscape, we constructed all 268 ≈ 2.09 × 1011
genomes and analyzed whether they could self-replicate. This operation was performed
by running these genomes through Avida’s Analyze Mode (described in the Data Analysis
                                                 32


section) and checking whether these genomes gave their bearer non-zero fitness, and whether
they were viable. Next, we described the fitness landscape by looking for the presence of
genotype clusters among the discovered self-replicators. We constructed a network of the
fitness landscape where each genotype is a node and the length between two nodes is the
square of the Hamming distance between the genotypes. We also examined the frequency of
single instruction motifs (monomers), as well as double instruction motifs (dimers).
    To test the evolvability of the 914 self-replicators, we evolved 10 monoclonal populations
of each replicator with 3,600 individuals for 2 × 104 updates in the logic-9 environment (see
above). Point mutations occurred at a rate of 7.5 × 10−3 mutations per copied instruction,
while single-instruction insertion and deletion mutations both occurred at a rate of 5 × 10−2
mutations per division. At the end of each population’s evolution, we analyzed the most
abundant genotype from each population.
    In order to test the role of historical contingency when the appearance of self-replicators
was frequent, we ran experiments where we evolved all 914 self-replicators in the same pop-
ulation (a “primordial soup” of replicators). In each population, we placed 10 individuals of
each self-replicator. The ancestral population then had 9140 individuals and could expand
to 104 individuals at maximum capacity. These populations evolved for 5 × 104 updates in
the logic-9 environment. Mutation rates were the same as in the previous evolvability exper-
iments. This experiment was performed in 200 replicates. To identify the ancestral genotype
that outcompeted all of the other genotypes, we isolated the most abundant genotype at the
end of the experiment and traced its evolutionary history back to its original ancestor.
3.3.3      Data analysis
    Statistics on different avidians were calculated using Avida’s Analyze Mode. In Analyze
Mode, a single genotype is examined in isolation as it executes the instructions in its genome,
runs through its life-cycle, and possibly creates an offspring. This confers on experimenters
the ability to calculate the fitness for an individual avidian (number per offspring generated
per unit time) and examine other characteristics, such as whether it can reproduce per-
                                                33


fectly (all offspring are genetically identical to each other and the mother genome) or which
traits this avidian possesses. Analyze Mode was also used to calculate quantities such as
genome size. Avida’s analyze mode code is available along with the entire Avida software at
https://github.com/devosoft/avida.
    Across-population means and standard errors were calculated using the NumPy (Van
Der Walt et al., 2011) Python software package. The clusters of replicators were rendered
using Neato, which is an undirected graph embedder that creates a layout similar to that
of Multi-Dimensional Scaling (Gansner and North, 2000). Figures were plotted using the
Matplotlib Python package (Hunter et al., 2007).
3.4      Results
3.4.1       Structure of the Fitness Landscape
    Of the 268 (approximately 209 billion) genomes with 8 instructions, we found 914 that
could self-replicate. We also searched for self-replicators with seven-instruction genomes but
found none, establishing that L = 8 is the minimal self-replicator length in Avida. By discov-
ering all self-replicators in this fitness landscape, we can now calculate the precise information
content required for self-replication in Avida, using previously-established methods (Adami,
2015), as − log26 ( 914
                     268
                         ) ≈ 5.9 mers (a “mer” is a unit of entropy or information, normalized
by the number of states that each instruction can take on, see (Adami, 2004)). Our previous
estimate (Adami and LaBar, 2017a) of the information content of length-8 replicators, based
on finding 8 replicators among a billion random samples, was 5.81 ± 0.13 mers.
    To study the genetic structure of these replicators, we obtained the distribution of instruc-
tions (monomers) across the replicators’ genomes (Fig. 3.2a). This distribution is biased,
as every single replicator contained at least the three instructions required for replication:
h-copy, h-alloc, and h-divide (denoted by v, w, and x, respectively, see the mapping between
instructions and the letter mnemonic in Table 1 in the Appendix). In addition, 75% of repli-
cators have a b (nop-B), an f (if-label), and a g (mov-head) instruction, while 25% have a c
(nop-C), an h (jmp-head), and an r (swap) instruction in their sequence. We also analyzed
                                                  34


       A                                                B
                    monomer                                            dimer
Figure 3.2: A: Distribution of monomers/single instructions (i.e., proportion of self-
replicators containing a given monomer). B: Distribution of dimers (pairs of instructions).
Dimers are ordered lexicographically on the x-axis (the proportion of fg, gb, rc, and hc
dimers are labeled).
the distribution of sequential instruction pairs (dimers) and found that while most dimers
do not occur in any self-replicators, the dimers fg and gb occur in approximately 70% of the
replicators (Fig. 3.2b) and are highly over-represented . Other dimers such as rc, hc, and
dimers containing f,g,b,c,v,w, and x occur in approximately 20%-30% of replicators.
    If there were no constraint on the genetic architecture, we would expect self-replicators to
be distributed uniformly across the fitness landscape. However, we found instead that self-
replicators are not distributed uniformly in the landscape, but are grouped into 41 distinct
genotype clusters, shown in Fig. 3.3.
    The dimer distribution function we analyzed above separates primordial self-replicators
into two major categories: those that carry fg/gb motifs (“fg-replicators” for short), as
opposed to those carrying hc/rc motifs (“hc-replicators”) instead. This separation into two
classes was noted earlier from a smaller sample of the landscape (LaBar et al., 2015, 2016),
which we corroborate here. By scanning the entire landscape we can confirm that these two
types are the only types of self-replicators in the landscape, and the clusters of genotypes are
homogeneous in the sense that fg-replicators and hc-replicators do not intermix (Fig. 3.3).
Fig. 3.4 shows four examples of clusters pulled from the landscape, showing that they are
                                                35


tightly interconnected.
     Many self-replicators are isolated and 20 of these clusters consist of only 1 genotype.
However, most self-replicators are located in large clusters. Almost 75% of the self-replicators
are located in four major clusters with 212, 199, 165, and 95 genotypes each, and almost
96% are contained within the 13 clusters that have at least 14 members. There is thus a
distinct gap in the cluster size distribution, with small clusters ranging from 1-3 connected
members, while the next largest size class is 14.
     We find that clusters of replicators are highly connected among each other, with a degree
distribution that is sharply peaked around the mean degree of a cluster (see Fig. 3.5), which
is similar to what is seen in neutral networks of random RNA structures (Aguirre et al.,
2011). We find that fg-replicators form the denser clusters.
     The 914 self-replicators we found vary in fitness, but consistently we find that the fittest
self-replicators contain the fg/gb motifs and many of the lowest fitness self-replicators con-
tain the hc/rc motifs. In Fig. 3.6 we show the fitness as a function of the MDS-coordinate.
In that figure, color denotes fitness according to the scale on the right. The highest peaks
and plateaus all belong to fg-replicators. The hc-replicators appear as a valley (dark blue)
bordering the group of fg-replicators.
3.4.2       Self-Replicator Evolvability
     In order to explore the subsequent role of historical contingency after the emergence of
life, we tested the evolvability of all 914 self-replicators. First, we evolved each replicator
separately. Almost all self-replicators could evolve increased fitness (Fig. 3.7B). However,
there was a wide range of mean relative fitness; fg-replicators clearly undergo more adaptation
than hc-replicators. To explain why fg-replicators were more evolvable, we first looked at
the evolution of genome size. Replicators with the fg/gb motifs grew larger genomes than
replicators with the hc/rc motifs (Fig. 3.7c). As larger genomes can allow for the evolution
of novel traits in Avida, and thus fitness increases, we next checked whether the fg-replicators
had evolved more computational traits than the hc-replicators. In Avida, traits are snippets
                                                 36


                                                                                                                                                                                                                  802       864
                                                                                                                                       24        54                                                                                  804
                                                                                                                                                                                                                                                              821
                                                                                                                                                                                                                                                    822                                                                  631
                                                                                                                                                                                                             794         806
                                                                                                                        114         84            149        92                                                                          796                                                656       633
                                                                                                                                                                                                           808                                                                                                                   638
                                                                                                                                                                                                                                                              820            630
                                                                                                                     903                             76         43                                                         816              810                                                     635
                                                                                                                                    65                                                                         792                                                                                                           623
                                                                                                                                                                                                                                                                                      632
                                                                                                                                                                                                    858
                                                                                                                       103                         125          133                                                                                                                                           607                       636
                                                                                                                                                                                                                                   814      818                             627
                                                                                                                                                                              859         857                        800                                                                                                                         615
                                                                                                                                887
                                                                                                                                          32             141                                                                                                                                                          639          617
                                                                                                                                                                                                            843                        812                                                        628
                                                                                                                                                                  860                                                         798                                                  634
                                                                                                                                                                                        849       852                                                             754                                                                                     609
                                                                                                                                                                              848                                                                        573
                                                                                                                                        349                                                                                                                                                                                657          637
                                                                                                                                                                                                             52       90                       823
                                                                                                 347                                                                 851                     41                                825                                        581                                                                                   654
                                                                                                                                                                                                                                                                                                          350                                                               644
                                                                                                                                                                                                                                                                                                                                     621
                                                                                                                                     565                                                              855                                        575                                                               629                                                               655
                                                                                                                           283                  337                                                                                                                                                                                                    611
                                                                                                        346                                                                   844         862                                861                                                  569            672                                                               642
                                                                                              348                                                                                                                101                                                                                                            682
                                                                                                                                                                     850                                                                                   756                                                                                                                               651
                                                                                                                                                         271                                                                           131                                                                         671                              619
                                                                                                                307                                                                       112           853                                     571                        752               683                                                              652         650
                                                                                                                              265                                                                                                                                                                                                         625
                                                                                                                                         259                                                                                  22                                                                       670
                                                                                                                                                             319      846                                                             885                                                                                  678                                                                      645
                                                                                 783                     301                                                                    847       147                                                  589              577                 750                                                                   648                     643
                                                                                                                                                                                                   845        30                                                                                                                                                   640
                                                                       775                785                                                      452                                                                                                                                       675            680
                                                                                                                          277                                                                                            82                                                567                                                                    641
                                                                                                                                                                                                                                      63                874                                                           673                                                     613          624
                                                                                                          313                                                                 856       854        901                                                                                                                            668                                                                653
                                                                779                             780                                                                                                                                           871                                              674                                                        647
                                                                                 777                                                       325                                                                                                                   579                                                                                               608
                                                                                                                                                                    445                                       74        139                                               586                                                                  665                                                            614
                                                                                                                                                                                                                                 123                                                                                                                                        649        646
                                                                                                                 289       331                        439                                342
                                                                                                                                                                                                                                                                                                       679          677
                                                              778                         789                                         295                                                                                                                                                                                       681
                                61                                                                                                                                                                                                             392                866       116                                                          685                                                                      620
                                                                                                                                                                443         446                      343              344                                386                          86                                                                                                     626
                                                                         776                                                                                                                                                                                                                   669                                                                         618
                        80              88                                                                                   557                   442                                     341
                                                                                                                                                                                                                                                                                                         905
                                                              782                  784         773                 496                                                                                                                                                                                                                  684        666     658
                                                                                                          498                                                                    435                           453                     404                        408                                             676                                                                                            610
                145                              72                                                                                                           447                                                                                                                 45       94                                  26
                                                                                                                                502
                                                                                                                                                    454                436                                                                                                                                                                                                     622            616
                                  129                                774                        516                                                                                              305       281                                       402                                                                                         659                                                        612
              39         50                 20                                         786                                                492                                      441                                                                          396                                          34                         67
                                                             787                                                      494                                                                                                                                                                 143
                                                                              781                                                                                                                                                          410                                                      78
                                                                                                                                                         438       440                                                                                                           415                                                                                                                                     687
                                                                                                                                 500                                                                                 329                                                                                             663                         664                                                          591
                 899                                110                                       508                                                                                          287       299                      317                                      406                                                     105
                                 99                                                                            504                                                            444                                                                                                                                                      667
                                                                                                                                                                                                                                                388                                                        135                                                      698          701
                                                                                                                                              512                                                                                      417                394                                                                                                                             593
                                                                                                                          559        510                                                                                                                                                                                                                689                                           690
                        882                      137                                              514                                                         448                             323          263               335                                      456                                               127
                                                                                   788                                                                                  437                                                                                                                        56                                  15
                                                                                                                                                                                                                                                398                                                                                                                                590
                                 121                                                                                   518        506                                                                                                                                                                                                                      604         695                  692
                                         28                                                              490                                  553                                                   311                                                                                                     892
                                                                                                                                                                                                                    257                                            390                                               151       660         662                                                              686            702
                                                                                                                                                              537                          554                                    563                                                                                                                                                                                               696
                                                                                                                                                                        547                                                                               400                          842
                               339                                                                                    766      761                                                                                                                                                                                                                     704               602           605
                                                                                               771                                                                                                                                                                           833                           412                                                  726
                                                                                                                                                                                                             269        275                                                                                                                                                                        599
          309      291                                                                                     770                                      541                                  450                                     293
                                                                                                                                                                               544                                                                                 836                                                  413             661
                           321                                                                                                                                                                                                                                                                                                                    731                                                              600
                                                                                                                       758                                                                                                                                                                        826                                                                           725                                          594      693
  261                                                                                  765                                      769                                                                  451        253                                                                830
           327                                                                                          768                                                                                                                                               837                                                414                                                                         596
                                                                                                                                         757       540                545                                                                                              827
                          273      315                                                                                                                       539                  538                                                                                                                                  809                     730      736                                                 699     705
  303                                                                                 764                              763                                                                                                                                                                  838                                                                                                                                   603
                                                                                                                                                                                            449                 53      148      140                                                                         803                 807
                  255                                                                             762                                                                                                                                                       835                 828                                                                                   728                          729
                                               368      372                                                                        790          555                    546                                                                                            832
                                                                                                              759                                            549                 548                                                                                                                                                            723                                                                 739
                         297                                                    87                                                                                                                    42                         31                                                      831        801                                                   724                                               732
   267    285                                                          60                109                                                                                                                                              23       840                                                              817         813                                       727
                                           378                                                              772        767       760                                          543           113                                                                                829
                                  381                                                                                                                                536                                                886                                                                                                                 873                                                                           597
                                                               371                                                                                     542                                                                                                                                                                                            753     722
                                                      370                38                                                                                                                                                                                 839                         841       793      815            811                                                                           733
       333               279                                                           19           71                                                                                                                              902        64                     834                                                                                                   734
                 561                     383                                                                  128                                                                             102               345
                                                                                                                                                   551                       550                                                                                                                                                       574                                                                       713
                                                                   377        144                                       79                                                                                                                                                              863                 797                                     576            749               737         735
                                                                                               898                                       356                   552                                                                                   124                                                                       819
                                                  374      373                                                                  365                                                                                                         75                                                                                                                                                                                716
                                                                                                                                                                                                                         132
                                                                                                                                                                                                        91                         83                                                                                805                 578
                                                                                                                                                     357                                                                                                                                           799                                                        566                                                 718
                   488                                              380                                                                                                            324                                                                                                   795                791                                                                                         717
                                                                                     120         27            98                                                                                                                                        418
                                                                                                                           49       362                                                      294                                              428                 431                                                                                                             709
                                                                                                                                                                         564                           318                                                                                                                              570       751                                                                                     719
                                                    376     379                                                                                             360                                                                      430                                                                                                                          588                                                             721
                                        369                                                                                                                                                                                                                                                        824                                                                                                      740
                                                                                              136            881                             353                                                                                                        420
                                                                                                                                  361                                                                                                         427                     423      146
                    470         464                                                                                                                                      276       288        270       312                                                                              40                                                             585                         711                                  714
           469                                       375       367                                                                                      352                                                                                                                                                                                                               712                708                                         720
                                                                                                                                                                                                                                 425                                                              900      130                                                                                          715
                                                                                       382         338                                                                                                                                                   421                                                                   870      568
                                                                                                                                    384      354                                                                        432                  429                  426       138                                                                                  580
                           467              457                                                                                                                  351                                           258                                                                      73                          111                                                                                         710             738
       463                                                                                                                                                                                                                                                                                        884                                                                              744                                                         875
                    489                                                                                                                                                                     300                                  422                                                                                                                                                            743
                                    459                                                 260       254                                                                      330                                                                        424                                                                                    572
                                                                                                            278                                        358                                                                                                                                                                                              755
                                                                                                                              366        364                                                                   264                                                             62                                                                                                                                    707
                                                                                332                                                                                                                                                                                                                           100         51                                      869                                                                582
                                                                                                                     314                                                                                                                     419                                         81       21                                                                      747          742
         461                460                                                                                                                                                  306                                                434
                    466             462                                                           320                                                     363       359                         336                                                        433                                                                                                                                                                                       876
                                                                                                                                               355                                                                                                                                                                                  865                                                          741                                         583
                                                                               302      284                 326                                                                                          282                                                                 409                                        29                                                                                745           746
                                                                                                                                                                                                                                                                                       403           89                                                                        748
            465             468               485                                                                       308                                         505                556                                                           389            393                                       122
                    458                                                                              272                                                                                                                                                                      407                                                                                      700                                                867             906
                                                         476                 480     266                                                                 515                                                                                                                                                                                                                            691
                                                                                                                                                                                                                                                                                       340                                                                                     688                                                                 888
                                                                                                                       296                                                   558                                                                  401             387                                                               14                       697
                                                                 478                                          560                                                                          491                                                                                                  405                        93
                                                                                               290                                                517             509                                                                                                                                                                                706
                                                   482                    486        472                                                                                                                                                                                                                   411                                                                                703
                                                                                                                                                                                                                                                                               455                                                         134                           595                                                           907
                                                                                                                    196                                                              513                                                                 399                                  397                     85                                                                                             894
                                                                                               471                                                                                                                                                                  395                                                                                606      694                                           3               910
                                              481         474           487                              188                   216                503               499                                                                                                                                                                                                                    592
                                                                                                                                                                                                                  524                                                                                                                                                             601
                                                                                                                                        184                                                                                521                                                                                                 142            904                                                    57
                                                                                                                     187                                                     497                                                                  262                                391      385                115                                                                                                                     868
                                                                                                                                                                                                 533                                                            416         316                                                                                        598                                                      106
                                                475                        483                                                                        495                                                                                                                                              104                                                                                                           35
                                                                                     484                                                                                               511                                                                                                                                                                                            895
                                                                                                195                                 192                                                                   532                                                                                                                                   44
                                                                                                               197                                                       507                                                                   298                                                                  33                                                                              0       117
                                                                                                                                                               493                                                  523      520                         274            562        322                                                                                                                                           46
                                                     473         477         479                                                                      501                                                                                                                                               77                    66      890                                                                                                 889
                                                                                                                                                                                                                                      534                                                                                                               587           36
                                                                                                                                          193                                                          525                                                                                                                                                                    878                                      9
                                                                                                      189                214      190                                      181                                                                                                                                     126                                                                      58        6
                                                                                            210                                                                                      250                                                         256                                         334                                                                                                                               16
                                                                     200       204                                                                                                                                     527                                                                                                            150
                                                            205                                                                                                                                                                 529                                                                                                                          17
                                                                                                                                          185                                                           535                                                                286                                                                   584                               107                        95      68
                                                                                                                 186                                   198          163                                                                                         328                                      891        55        25
                                                                                                        201                                                                                                                                  252                                             280
                                                                                                                                                                                             179                  528                                  310                                                                                                         872                         12
                                                                 208                                                                                                             165                                       530
                                                                         211                                                194       191                                                                                           519                                                                                               883
                                                                                           202               212                                                   169                                                                                                                                    152                                             877
                                                                                                                                                                                                                                                                268       292        304                                                                                               4
                                                                                                                                                                                                 155        531                                                                                                                                                              96
                                                                 207                                                                                                                                                                                                                                                                              69
                                                                                 209                                                                                                                                           526                     242                                                                                                                                       1
                                                                                                       213                                                 159                      157                               522                                                                                          908                                      911
                                                                                                                                                                                                                                                                                     154
                                                                                             203                                                                         171                         177                                     231                 247                                                                                                          879
                                                                                                                                                                                                                                                                                                                                          909
                                                                                  206                                                        222                                                                                                                                                                              913
                                                                         215                                168                                          221                                                                    234
                                                                                                                                                                                               173           246                                                              153                                                                                                              7
                                                                                                   199                             219                                 249                                                                                           240                           108                                                         10                     47
                                                                                                                                                                                                                                         244               232                                                 37
                                                                                172      180                                                                                                                                                                                                                                                         893                118
                                                                                                              164                                                                                                          236                                                          59
                                                                                                                                           220                                         167
                                                                                                                       156      248                            161                                                                                                                                                                                                               897
                                                                       251                                                                                                                                                              241          237            239                                                13
                                                                                                                                                                                                                                                                                                119        912
                                                                                                                                                      225                    175                                                                                                   18
                                                                                        174                170                           227                                                                                   233
                                                                        182                                                                                         230                                                                                        235
                                                                                                                                                                                                                                                                                             5                       48
                                                                                                                     162                           229                           217                                                               243                        11
                                                                                                                                                                                           245                                          238
                                                                                                      160
                                                                             166                                                        218                       228                                                                                                                                          97
                                                                                                                                                                           224                                                                                                  896                  8
                                                                                      158                                                                                                                                                                           880
                                                                                                                178
                                                                                                                                                       226
                                                                                                176                                                                                                                                                                                       70            2
                                                                                                                                                                   223
                                                                                                                                                                          183
Figure 3.3: The complete fitness landscape of all 914 length-8 replicators. The replicators
are colored by the class of motifs they contain (fg replicators are colored in red, while hc
replicators are colored in blue.) The relative position between any pair of nodes reflects
their distance in Hamming space, displayed via multi-dimensional scaling (MDS). As a con-
sequence, it appears as if blue and red clusters are linked, which is not the case. One isolated
fg-replicator (red) is close to an hc-replicator cluster (blue), but is not connected to it. All
visible edges are between nodes that have a Hamming distance of 1 (i.e. they are a point
mutation away from each other).
                                                                                                                                                                                                                       37


    A                                                                                                                        B                                                                 119        83          64
                                                                                                                                                                                       82              5       68          8
                         11                                                                                                                                           38
                                                                                                                                                                                                80                      149        144    138
          16                                                                                                                                                  44
                                                                                                     12
                                                                                                                                                                               93                     123       36                                   76
                                                                                                                                                                                         115                              124
                                                                                                                                                                                                                                            160
                                                                                                                                                       52           104
                                                                                                                                                                                101               111                                                          94
                                    14                                                                                                                                                                      121           136      91
                                                                                       20                         10                   146
                                                                                                                                                18
                                                                                                                                                                        51       73          157                                               159       33
      5     19
                                                                                                                                                          110                                           27
                                                                                                                                                                                                                26
                                                                                                                                                                                                                          81
                                                                                                                                                                                                                                  148                              46
                                                                                                                                         142                                        152                                                  112      21        9
                                                                                                                                                89             116
                                                                                                                                                                                                              85
                                                                                                                                                                           118                                        25                                                    3
                                                                                                                                      86                134                                                                                   72          53     127
                                3                                                                                                                               11
                                                                                                                                                                                                      103                                                                        99
    0                                                                                                     15            7            96
                                                                                                                                             23                          16                                       106      145
                                                                                                                                                                                                                                                      95
                                                          13                                                                                            164
                                                                                                                                                                       151
                                                                                                                                                                                 19       158      50     29
                                                                                                                                                                                                                                      31     150
                                                                                                                                                                                                                                                               65          56
                                                                                                                                                                                                                                                                                  77
                                                                                                                               141                                                                                        120
                18                                                                                                                                                                                               143                7               131
                                                                                         6                                             129
                                                                                                                                               117            126
                                                                                                                                                                                41         43
                                                                                                                                                                                                   97    156                                                            130
                                                                                                                                                                                                                                                                                 113
                                                                                                                               98                                                                                                    6                       102
                                                                                                                                                                                                        32             34
       8                                                                                                                                                107        139
                                                                                                                                                                                                 154                                                                          74
                                                                                                                     17                         24                                      147                                                                            66
                                                                                                                                                                                75                                                        2        100
                                                                                                                                  20                                                                           114
                                 22                                                                                                       84
                                                                                                                                                           109
                                                                                                                                                                      162
                                                                                                                                                                                        90
                                                                                                                                                                                                                                 55
                                                                                                                                                                                                                                                                35
                                                                                                                                                                                                                                                                         42
                                                                                                                                                                                                                        70                   17        140
                                                                                                                                                                                                  128           57                    78
                                                                                                                                                    40              87
         21                                                                        1                                                                                         105         15
                                                                                                             2                                   132          13                                        10                               137      48          22          1
                      9                                                                                                                                                         79
                                                                                                                                                                                        39
                                                                                                                                                                                                 61                               155
                                                                                                                                                                                                                                                  67
                                                                                                                                                                                                                                                                      59
                                                                                                                                                                                                                 135                                       133
                                                                                                                                                  92      14                                           125                 62            49
                                                                                                4
                                                                                                                                                                     30
                                                                                                                                                                                153         12                                                                     69
                                                                                                                                                      88                                                 161          45                        54       4
                                                                                                                                                             122
                                                                                                                                                                       37              0                                       163
                                                                                                                                                                                                   58          60
                                                                                                                                                                                                                           108
                                                                                                                                                                                          63              71
                                                                                                                                                                                                   47
                                                                                                                                                                                                                  28
    C                                                                                                                        D
                                                        49
                                              5
                                  79                            18
                                                                                 83       44
                             86           23
                                                     58
                                                                22
                                                                                                                                                                                                                14
                                                   65
                                                                         84              31
                                                                                                  12
                                                                                                                                                                                 3
                                71                             51
                                                                         64        33
                                                                                                      36                                                                                                                                       13
                    38                  89
                                                        30
                           32                                   60      87                               19
                                                27                               94
             53
                   76
                                                         10
                                                                                                                                                        9                            7                                  4
                                  42                                    26                46      13
                                            43                                  67
            69
                   34
                            40
                                                                                                                                                                                                                                                                   10
                                     28                                         29
                                                                                                       52
              8                                                                               6              45
                    75
                            48        56                                     72
                                                                                      63
                                                                                                        3
                                                                                                                                                                                                                                                1
              78
                    50
                                                                                                              17                                    0
                                   61                                                        81
                                                                                     77
                                           21
                                                                                                      88
                                                                                                                2                                                                  5
                           16                                                7
                                                   20                39
                   82              74
                                                                                             90           54
                                                                                                                                                                                                                    8                                       12
                                                           37                     15                           55
                                                  1                       59                         95
                                                                  41
                 0
                        85
                                 57
                                                                                         68
                                                                                                                                                                11
                                                                   24                           92
                      35
                                          70        47                       11
                                                                                                                                                                                                                                       6
                              14
                                                              25      62
                                                                                      73       93
                                                                                                                                                                                                    2
                                        4       66
                                                                                      80
                                                                    9        91
Figure 3.4: Four clusters from the full landscape of self-replicators of L = 8. A: A 23-node
cluster of hc-replicators, B: the third-largest cluster in the network: an fg-replicator cluster
with 165 members. C: Another large fg-replicator cluster with 96 genotypes. D: A 15-node
hc-replicator cluster.
                                                                                                                          38


                                   0.35
                                   0.30
                 proportion of nodes
                                   0.25
                                   0.20
                                   0.15
                                   0.10
                                   0.05
                                   0.00 0 5 10 15 20 25 30 35 40
                                              node degree
Figure 3.5: Edge distribution of all replicators in the fitness landscape of L = 8. As each
cluster has a particular edge distribution, the distributions of the two different kinds of
replicators (fg-types and hc-types) do not overlap. Red: fg-replicators, blue: hc-replicators.
of code that allow the avidian to gain energy from the environment, by performing logic
operations on binary numbers that the environment provides (see Methods). Replicators
with the fg/gb motifs did evolve more novel traits than replicators with the hc/rc motifs
(Fig. 3.7D). In fact, only fg-replicators evolved traits in these experiments. Finally, we looked
at the effect of historical contingency when all 914 replicators were competed against each
other in one population. After 50,000 updates, we identify the most abundant genotype
in 200 replicate experiments and reconstruct the line-of-descent to determine which of the
replicators gave rise to it (we call that replicator the “progenitor”).
   Most replicators did not emerge as the progenitor of life in these experiments (Fig. 3.8).
Three genotypes, vvwfgxgb, vwvfgxgb, and wvvfgxgb, outcompete the other genotypes in
37, 49, and 45 populations out of 200, respectively, or in about 65% of the competitions. The
other progenitors of life were not distributed randomly among the other self-replicators either;
most of them were present in the same clusters as the three genotypes from above.Thus, while
history is a factor in which of the replicators becomes the seed of all life in these experiments,
                                                 39


Figure 3.6: Ancestral fitness of all primordial self-replicators of L = 8, where x-y coordinates
are the same as the network in Fig 3.3.
                                               40


Figure 3.7: Fitness and other characteristics of all L = 8 self-replicators before and after
evolution. A: Ancestral fitness of all replicators. B: Log mean relative fitness after 2 × 104
updates of evolution. C: Final genome size after 2 × 104 updates of evolution. D: Number
of evolved traits after 2 × 104 updates of evolution. In all plots, fg-replicators are in red and
hc-replicators are in blue. Error bars (black) are twice the standard error of the mean. All
plots are sorted in increasing order.
                                               41


more than half the time the progenitor is one of the three highest-fitness sequences. Thus,
life predominantly originates from the highest peaks of the primordial landscape.
3.5      Information content of self-replicators of length 9
     Having done a complete mapping of the fitness landscape for length 8 self-replicators,
we then generated a complete mapping of the fitness landscape of length 9 self-replicators,
as this was reasonably tractable, taking about 3-4 months of compute time on a cluster of
machines. The goal here was to have a richer landscape that could be used to do more
statistical analyses, and perhaps provide further insight into these primordial landscapes. A
complete mapping of the length 9 self-replicator landscape revealed exactly 36,171 sequences
that could self-replicate, which is indeed a much richer dataset, and one of the largest
complete landscape mappings given the 269 possible sequences of length 9.
     The information content in these ensembles is one of the simplest measures that can be
computed. As defined by Szostak (Szostak, 2003), these are the proportion of sequences in
an ensemble that carry some particular fitness trait. More precisely, the information content
is the proportion of sequences that carry the trait, relative to the total number of possible
sequences.
                                    I(Eθ ) = −logF (E ≥ θ)                                (3.1)
where F (E ≥ θ) is just the proportion Nθ /N , Nθ being the count of functional sequences,
and N the total number of possible sequences.
     This can be rewritten as I(Eθ ) = L − logD Nθ , where L is the length of the sequences (in
these ensembles, it’s typically assumed that all sequences are of equal length), and D is the
size of the alphabet from which these sequences are constructed.
     This measure is actually an approximation to the the Shannon information content of
a biomolecular sequence (Adami and Cerf, 2000), but is a convenient to way to compute
the value. This measure requires a complete mapping of the fitness landscape which is not
possible in general. There are approximations that can be done to estimate the value, but
here we have a complete mapping, which sidesteps this issue completely. Computing the
                                              42


                                                                                                                                                                                                                  802       864
                                                                                                                                       24        54                                                                                  804
                                                                                                                                                                                                                                                              821
                                                                                                                                                                                                                                                    822                                                                  631
                                                                                                                                                                                                             794         806
                                                                                                                        114         84            149        92                                                                          796                                                656       633
                                                                                                                                                                                                           808                                                                                                                   638
                                                                                                                                                                                                                                                              820            630
                                                                                                                     903                             76         43                                                         816              810                                                     635
                                                                                                                                    65                                                                         792                                                                                                           623
                                                                                                                                                                                                                                                                                      632
                                                                                                                                                                                                    858
                                                                                                                       103                         125          133                                                                                                                                           607                       636
                                                                                                                                                                                                                                   814      818                             627
                                                                                                                                                                              859         857                        800                                                                                                                         615
                                                                                                                                887
                                                                                                                                          32             141                                                                                                                                                          639          617
                                                                                                                                                                                                            843                        812                                                        628
                                                                                                                                                                  860                                                         798                                                  634
                                                                                                                                                                                        849       852                                                             754                                                                                     609
                                                                                                                                                                              848                                                                        573
                                                                                                                                        349                                                                                                                                                                                657          637
                                                                                                                                                                                                             52       90                       823
                                                                                                 347                                                                 851                     41                                825                                        581                                                                                   654
                                                                                                                                                                                                                                                                                                          350                                                               644
                                                                                                                                                                                                                                                                                                                                     621
                                                                                                                                     565                                                              855                                        575                                                               629                                                               655
                                                                                                                           283                  337                                                                                                                                                                                                    611
                                                                                                        346                                                                   844         862                                861                                                  569            672                                                               642
                                                                                              348                                                                                                                101                                                                                                            682
                                                                                                                                                                     850                                                                                   756                                                                                                                               651
                                                                                                                                                         271                                                                           131                                                                         671                              619
                                                                                                                307                                                                       112           853                                     571                        752               683                                                              652         650
                                                                                                                              265                                                                                                                                                                                                         625
                                                                                                                                         259                                                                                  22                                                                       670
                                                                                                                                                             319      846                                                             885                                                                                  678                                                                      645
                                                                                 783                     301                                                                    847       147                                                  589              577                 750                                                                   648                     643
                                                                                                                                                                                                   845        30                                                                                                                                                   640
                                                                       775                785                                                      452                                                                                                                                       675            680
                                                                                                                          277                                                                                            82                                                567                                                                    641
                                                                                                                                                                                                                                      63                874                                                           673                                                     613          624
                                                                                                          313                                                                 856       854        901                                                                                                                            668                                                                653
                                                                779                             780                                                                                                                                           871                                              674                                                        647
                                                                                 777                                                       325                                                                                                                   579                                                                                               608
                                                                                                                                                                    445                                       74        139                                               586                                                                  665                                                            614
                                                                                                                                                                                                                                 123                                                                                                                                        649        646
                                                                                                                 289       331                        439                                342
                                                                                                                                                                                                                                                                                                       679          677
                                                              778                         789                                         295                                                                                                                                                                                       681
                                61                                                                                                                                                                                                             392                866       116                                                          685                                                                      620
                                                                                                                                                                443         446                      343              344                                386                          86                                                                                                     626
                                                                         776                                                                                                                                                                                                                   669                                                                         618
                        80              88                                                                                   557                   442                                     341
                                                                                                                                                                                                                                                                                                         905
                                                              782                  784         773                 496                                                                                                                                                                                                                  684        666     658
                                                                                                          498                                                                    435                           453                     404                        408                                             676                                                                                            610
                145                              72                                                                                                           447                                                                                                                 45       94                                  26
                                                                                                                                502
                                                                                                                                                    454                436                                                                                                                                                                                                     622            616
                                  129                                774                        516                                                                                              305       281                                       402                                                                                         659                                                        612
              39         50                 20                                         786                                                492                                      441                                                                          396                                          34                         67
                                                             787                                                      494                                                                                                                                                                 143
                                                                              781                                                                                                                                                          410                                                      78
                                                                                                                                                         438       440                                                                                                           415                                                                                                                                     687
                                                                                                                                 500                                                                                 329                                                                                             663                         664                                                          591
                 899                                110                                       508                                                                                          287       299                      317                                      406                                                     105
                                 99                                                                            504                                                            444                                                                                                                                                      667
                                                                                                                                                                                                                                                388                                                        135                                                      698          701
                                                                                                                                              512                                                                                      417                394                                                                                                                             593
                                                                                                                          559        510                                                                                                                                                                                                                689                                           690
                        882                      137                                              514                                                         448                             323          263               335                                      456                                               127
                                                                                   788                                                                                  437                                                                                                                        56                                  15
                                                                                                                                                                                                                                                398                                                                                                                                590
                                 121                                                                                   518        506                                                                                                                                                                                                                      604         695                  692
                                         28                                                              490                                  553                                                   311                                                                                                     892
                                                                                                                                                                                                                    257                                            390                                               151       660         662                                                              686            702
                                                                                                                                                              537                          554                                    563                                                                                                                                                                                               696
                                                                                                                                                                        547                                                                               400                          842
                               339                                                                                    766      761                                                                                                                                                                                                                     704               602           605
                                                                                               771                                                                                                                                                                           833                           412                                                  726
                                                                                                                                                                                                             269        275                                                                                                                                                                        599
          309      291                                                                                     770                                      541                                  450                                     293
                                                                                                                                                                               544                                                                                 836                                                  413             661
                           321                                                                                                                                                                                                                                                                                                                    731                                                              600
                                                                                                                       758                                                                                                                                                                        826                                                                           725                                          594      693
  261                                                                                  765                                      769                                                                  451        253                                                                830
           327                                                                                          768                                                                                                                                               837                                                414                                                                         596
                                                                                                                                         757       540                545                                                                                              827
                          273      315                                                                                                                       539                  538                                                                                                                                  809                     730      736                                                 699     705
  303                                                                                 764                              763                                                                                                                                                                  838                                                                                                                                   603
                                                                                                                                                                                            449                 53      148      140                                                                         803                 807
                  255                                                                             762                                                                                                                                                       835                 828                                                                                   728                          729
                                               368      372                                                                        790          555                    546                                                                                            832
                                                                                                              759                                            549                 548                                                                                                                                                            723                                                                 739
                         297                                                    87                                                                                                                    42                         31                                                      831        801                                                   724                                               732
   267    285                                                          60                109                                                                                                                                              23       840                                                              817         813                                       727
                                           378                                                              772        767       760                                          543           113                                                                                829
                                  381                                                                                                                                536                                                886                                                                                                                 873                                                                           597
                                                               371                                                                                     542                                                                                                                                                                                            753     722
                                                      370                38                                                                                                                                                                                 839                         841       793      815            811                                                                           733
       333               279                                                           19           71                                                                                                                              902        64                     834                                                                                                   734
                 561                     383                                                                  128                                                                             102               345
                                                                                                                                                   551                       550                                                                                                                                                       574                                                                       713
                                                                   377        144                                       79                                                                                                                                                              863                 797                                     576            749               737         735
                                                                                               898                                       356                   552                                                                                   124                                                                       819
                                                  374      373                                                                  365                                                                                                         75                                                                                                                                                                                716
                                                                                                                                                                                                                         132
                                                                                                                                                                                                        91                         83                                                                                805                 578
                                                                                                                                                     357                                                                                                                                           799                                                        566                                                 718
                   488                                              380                                                                                                            324                                                                                                   795                791                                                                                         717
                                                                                     120         27            98                                                                                                                                        418
                                                                                                                           49       362                                                      294                                              428                 431                                                                                                             709
                                                                                                                                                                         564                           318                                                                                                                              570       751                                                                                     719
                                                    376     379                                                                                             360                                                                      430                                                                                                                          588                                                             721
                                        369                                                                                                                                                                                                                                                        824                                                                                                      740
                                                                                              136            881                             353                                                                                                        420
                                                                                                                                  361                                                                                                         427                     423      146
                    470         464                                                                                                                                      276       288        270       312                                                                              40                                                             585                         711                                  714
           469                                       375       367                                                                                      352                                                                                                                                                                                                               712                708                                         720
                                                                                                                                                                                                                                 425                                                              900      130                                                                                          715
                                                                                       382         338                                                                                                                                                   421                                                                   870      568
                                                                                                                                    384      354                                                                        432                  429                  426       138                                                                                  580
                           467              457                                                                                                                  351                                           258                                                                      73                          111                                                                                         710             738
       463                                                                                                                                                                                                                                                                                        884                                                                              744                                                         875
                    489                                                                                                                                                                     300                                  422                                                                                                                                                            743
                                    459                                                 260       254                                                                      330                                                                        424                                                                                    572
                                                                                                            278                                        358                                                                                                                                                                                              755
                                                                                                                              366        364                                                                   264                                                             62                                                                                                                                    707
                                                                                332                                                                                                                                                                                                                           100         51                                      869                                                                582
                                                                                                                     314                                                                                                                     419                                         81       21                                                                      747          742
         461                460                                                                                                                                                  306                                                434
                    466             462                                                           320                                                     363       359                         336                                                        433                                                                                                                                                                                       876
                                                                                                                                               355                                                                                                                                                                                  865                                                          741                                         583
                                                                               302      284                 326                                                                                          282                                                                 409                                        29                                                                                745           746
                                                                                                                                                                                                                                                                                       403           89                                                                        748
            465             468               485                                                                       308                                         505                556                                                           389            393                                       122
                    458                                                                              272                                                                                                                                                                      407                                                                                      700                                                867             906
                                                         476                 480     266                                                                 515                                                                                                                                                                                                                            691
                                                                                                                                                                                                                                                                                       340                                                                                     688                                                                 888
                                                                                                                       296                                                   558                                                                  401             387                                                               14                       697
                                                                 478                                          560                                                                          491                                                                                                  405                        93
                                                                                               290                                                517             509                                                                                                                                                                                706
                                                   482                    486        472                                                                                                                                                                                                                   411                                                                                703
                                                                                                                                                                                                                                                                               455                                                         134                           595                                                           907
                                                                                                                    196                                                              513                                                                 399                                  397                     85                                                                                             894
                                                                                               471                                                                                                                                                                  395                                                                                606      694                                           3               910
                                              481         474           487                              188                   216                503               499                                                                                                                                                                                                                    592
                                                                                                                                                                                                                  524                                                                                                                                                             601
                                                                                                                                        184                                                                                521                                                                                                 142            904                                                    57
                                                                                                                     187                                                     497                                                                  262                                391      385                115                                                                                                                     868
                                                                                                                                                                                                 533                                                            416         316                                                                                        598                                                      106
                                                475                        483                                                                        495                                                                                                                                              104                                                                                                           35
                                                                                     484                                                                                               511                                                                                                                                                                                            895
                                                                                                195                                 192                                                                   532                                                                                                                                   44
                                                                                                               197                                                       507                                                                   298                                                                  33                                                                              0       117
                                                                                                                                                               493                                                  523      520                         274            562        322                                                                                                                                           46
                                                     473         477         479                                                                      501                                                                                                                                               77                    66      890                                                                                                 889
                                                                                                                                                                                                                                      534                                                                                                               587           36
                                                                                                                                          193                                                          525                                                                                                                                                                    878                                      9
                                                                                                      189                214      190                                      181                                                                                                                                     126                                                                      58        6
                                                                                            210                                                                                      250                                                         256                                         334                                                                                                                               16
                                                                     200       204                                                                                                                                     527                                                                                                            150
                                                            205                                                                                                                                                                 529                                                                                                                          17
                                                                                                                                          185                                                           535                                                                286                                                                   584                               107                        95      68
                                                                                                                 186                                   198          163                                                                                         328                                      891        55        25
                                                                                                        201                                                                                                                                  252                                             280
                                                                                                                                                                                             179                  528                                  310                                                                                                         872                         12
                                                                 208                                                                                                             165                                       530
                                                                         211                                                194       191                                                                                           519                                                                                               883
                                                                                           202               212                                                   169                                                                                                                                    152                                             877
                                                                                                                                                                                                                                                                268       292        304                                                                                               4
                                                                                                                                                                                                 155        531                                                                                                                                                              96
                                                                 207                                                                                                                                                                                                                                                                              69
                                                                                 209                                                                                                                                           526                     242                                                                                                                                       1
                                                                                                       213                                                 159                      157                               522                                                                                          908                                      911
                                                                                                                                                                                                                                                                                     154
                                                                                             203                                                                         171                         177                                     231                 247                                                                                                          879
                                                                                                                                                                                                                                                                                                                                          909
                                                                                  206                                                        222                                                                                                                                                                              913
                                                                         215                                168                                          221                                                                    234
                                                                                                                                                                                               173           246                                                              153                                                                                                              7
                                                                                                   199                             219                                 249                                                                                           240                           108                                                         10                     47
                                                                                                                                                                                                                                         244               232                                                 37
                                                                                172      180                                                                                                                                                                                                                                                         893                118
                                                                                                              164                                                                                                          236                                                          59
                                                                                                                                           220                                         167
                                                                                                                       156      248                            161                                                                                                                                                                                                               897
                                                                       251                                                                                                                                                              241          237            239                                                13
                                                                                                                                                                                                                                                                                                119        912
                                                                                                                                                      225                    175                                                                                                   18
                                                                                        174                170                           227                                                                                   233
                                                                        182                                                                                         230                                                                                        235
                                                                                                                                                                                                                                                                                             5                       48
                                                                                                                     162                           229                           217                                                               243                        11
                                                                                                                                                                                           245                                          238
                                                                                                      160
                                                                             166                                                        218                       228                                                                                                                                          97
                                                                                                                                                                           224                                                                                                  896                  8
                                                                                      158                                                                                                                                                                           880
                                                                                                                178
                                                                                                                                                       226
                                                                                                176                                                                                                                                                                                       70            2
                                                                                                                                                                   223
                                                                                                                                                                          183
Figure 3.8: Location of “progenitors” (ancestral types that were the origin of an evolved
population 50,000 updates later) in the primordial landscape. Replicators that were never
the ancestor genotype of the entire population are in grey. Those that outcompete all other
genotypes in fewer than 6 (out of 200) competitions are colored in green. The three genomes
that eventually become the ancestor of life in over 130 competitions are in orange.
                                                                                                                                                                                                                       43


information content for length 8 sequences shows
                               I8 = −log26 (914/268 ) ≈ 5.91 mers
where “mers” is the unit of information/entropy relative to the entropy of a sequence, which
is computed by simply taking the value to the logarithm of base D. Doing this for length 9
gives
                             I9 = −log26 (36, 171/269 ) ≈ 5.77 mers
which is slightly less information content than for length 8 sequences.
    There are some differences between the fitness landscapes for length 8 and 9 self-replicators.
There is no longer a clear separation between f g, and rc replicators as there was in length
8 - there are self-replicating sequences of length 9 that contain both these motifs. We con-
structed networks similar to that of length 8, but here rotations were accounted for. By
rotations we mean that if one self-replicating sequence could be rotated to produce another
self-replicator, they’re considered to be the same sequence. Under this analysis, we found
30,547 self-replicators, which formed a total of 84 distinct clusters. A very large proportion
of these replicators turned out to be in a single cluster of size 27,753, which accounts for 94%
of the sequences. This suggests that as the size of the sequences increases, there is increased
connectivity, i.e. there are more, and shorter paths between pairs of sequences. Such a large
network structure is ripe to investigate whether it contains any graph-theoretic properties
similar to the properties of networks seen in the real world, such as RNA sequences.
3.6      Information encoding
    Next, we wanted to look at how information is encoded within these sequences. To do
this, we calculate the density of replicators that are some distance n from any particular
sequence i. This can be computed easily as
                                              Pn       (i)
                                                 k=0 Nv (k)
                                     ρi (n) = Pn      L
                                                                                           (3.2)
                                                           k
                                                 k=0 k 25
                                               44


Figure 3.9: The log of the fraction of functional sequences ϕ(n) at mutational depth n for
length 9 replicators. The red line shows the theoretical lower limit for a ϕ(n) with perfectly
compressed information (no viable mutational neighbors).
          (i)
where Nv (k) is the number of self-replicators that are up to n single-site mutations away
from the sequence i. The right hand side is scaled by nk=0 Lk 25k which is just the total
                                                           P         
number of possible single-site mutants at the same distance n to give the density ρi (n).
    Plotting the log of this value ϕ(n) = log(ρi (n)) against n shows some interesting proper-
ties of how information is encoded in these sequences. Fig 3.9 shows the information density
measure for all 36,171 self-replicator sequences (not considering rotations) of length 9, and
Fig 3.10 shows the curves for the most robust and most fragile sequences, along with the
average information density of these sequences.
    There’s clearly quite a lot of variability in how information is encoded. There are some
replicators that are very robust, i.e. they have a lot of single site mutational neighbours (up
                                                45


                                                    n
                              1      2    3     4     5    6    7     8     9
                        0
                       −1
                       −2
                       −3
              Φ(n)
                       −4
                       −5
                       −6
Figure 3.10: ϕ(n) for the most fragile (dotted line) and the most robust (dashed line) repli-
cator. The average ϕ(n) across all replicators is indicated by the solid line.
                                             46


to 74 neighbours as mentioned previously), and these replicators show antagonistic epistasis
for the first few mutations, but then show synergistic epistasis as the number of possible
mutations increases. Some replicators are not robust, i.e. they have few, if any mutational
neighbours. Some of these sequences are maximally compressed all the way to 4 mutations,
which can be seen by the intersection with the red line in Fig 3.9. All sequences eventually
show synergistic epistasis, and by definition all the curves end at the same point, denoted
by ϕ(9) = −I9 .
3.7       Discussion
    Here, we tested the role of fitness landscape structure and historical contingency in the
origin of self-replication in the digital evolution system Avida. We characterized the complete
fitness landscape of all minimal-genome self-replicators and found that viable genotypes form
clusters in the fitness landscape. These self-replicators can be separated into two replication
classes, as we previously found for self-replicators with larger genomes (LaBar et al., 2016).
We also found that one of these replication classes (the fg-replicators) is more evolvable than
the other, although the evolvability of each genotype varies. Finally, we show that, when
all self-replicators are competed against each other in a digital “primordial soup”, three
genotypes win over 65% of the competitions and many of the other “winners” come from
the same genotype cluster.
    In a previous study with Avida, we found that 6 out of 109 spontaneously-emergent
genomes with 8 instructions could self-replicate (Adami and LaBar, 2017a). Here, we found
that 914 out of ≈ 2.8 × 1011 genomes could replicate, consistent with our previous re-
sults. This concordance suggests that the information-theoretic theory of the emergence
of life, originally proposed by Adami (Adami, 2015) and tested with Avida by Adami and
LaBar (Adami and LaBar, 2017a), can accurately explain the likelihood of the chance emer-
gence of life. Thus, the emergence of self-replication, and life is dependent on the information
required for such life.
    By enumerating all of the length-8 self-replicators, we were able to show that self-
                                                  47


replicators are not uniformly distributed across the fitness landscape and that viable geno-
types cluster together. The size of these clusters varies: there are few clusters with many
genotypes and many clusters with few genotypes, but the cluster size distribution has a gap.
The edge distribution of the clusters is similar to what has been found in random RNA
structures, and the mean degree differs between replicator types.
    Genotypes with different replication mechanisms were in different clusters with no evo-
lutionary trajectory between the two. Empirical studies of RNA-based fitness landscapes,
biochemical model systems for the origin of life, also show that these landscapes consist
of isolated fitness peaks with many non-viable genotypes (Jiménez et al., 2013; Petrie and
Joyce, 2014). The fact that both RNA-based landscapes (Jiménez et al., 2013; Petrie and
Joyce, 2014) and these digital landscapes have similar structures suggests that the evolu-
tionary patterns we see in these Avida experiments may be similar to those one would have
seen in the origin of life on Earth. The presence of isolated genotype clusters in both digi-
tal and RNA fitness landscapes further suggests that the identity of the first self-replicator
may determine life’s future evolution, as other evolutionary trajectories are not accessible.
However, if populations can evolve larger genomes, non-accessible evolutionary trajectories
may later become accessible, as mathematical results on the structure of high-dimensional
fitness landscapes suggest (Gavrilets, 1997).
    To test for the effects of historical contingency in the origin of self-replication in Avida,
we evolved all of the 914 replicators in an environment where they could increase in genome
size and evolve novel traits. Previously, we found that the evolvability of spontaneously-
emergent self-replicators varied and was determined by their replication mechanism (LaBar
et al., 2016). However, those genotypes possessed fixed-length genomes of 15 instructions.
Here, we confirmed that the genotype of the first self-replicator, and more specifically the
replication mechanism of the first replicator, determine the future evolution of novel traits
in Avida. The fg-replicators showed high rates of trait evolution, while hc-replicators failed
to evolve novel traits in most populations. However, we did not detect any trade-off in
                                                48


evolvability, as we previously found (LaBar et al., 2016). This difference is likely due to
their differences in capacity to increase in genome size, as genome size increases enhance
the evolution of novel traits and fitness increases in Avida (Gupta et al., 2016; LaBar
and Adami, 2016). Would a similar dynamic occur in a hypothetical population of RNA-
based replicators? While experimental evolution of RNA replicators has been performed, the
selective environments resulted in genome size decreases (Mills et al., 1967). It is unknown
how simple RNA replicators vary in their evolvability.
    We also performed experiments to test for the role of historical contingency in scenarios
where any self-replicator could become the progenitor of digital life. Here, we found that
only three self-replicators (or their neighbors in the fitness landscape) became the last com-
mon ancestor in the majority of populations. This suggests a lack of contingency in the
ancestral self-replicator, but emphasizes the role of the ancestral genotype in determining its
future evolution. If life emerges rarely, then its future evolution will be determined by the
specific genotype that first emerges, as shown from our first set of evolvability experiments
(Fig. 3.7). However, if simple self-replicators emerge frequently, then the future evolution is
determined by the evolvability of the fittest replicators, a sort of clonal interference (Gerrish
and Lenski, 1998) among possible progenitors of life. In this case, the self-replicators that
most successfully invaded the population happened to also be of the type that evolved the
largest genomes and most complex traits. However, it can be imagined that the opposite
trend could occur (LaBar et al., 2016), and then the progenitor of life would limit the future
evolution of biological complexity.
3.8      Conclusions
    In this work we have performed the first complete mapping of a primordial sequence
landscape in which replicators are extremely rare (about one replicator per 200 million
sequences) and found two functionally inequivalent classes of replicators that differ in their
fitness as well as evolvability, and that form distinct clusters in sequence space. In direct
evolutionary competition, only the highest-fitness sequences manage to repeatedly become
                                               49


the common ancestor of all life in this microcosm, showing that despite significant diversity
of replicators, historical contingency plays only a minor role during early evolution.
    While it is unclear how the results we obtained in this digital microcosm generalize to a
biochemical microcosms, we are confident that they can guide our thinking about primor-
dial fitness landscapes. The functional sequences we discovered here are extremely rare, but
likely not as rare as putative biochemical primordial replicators. However, from a purely
statistical point of view, it is unlikely that a primordial landscape consisting of sequences
that are several orders of magnitude more rare would look qualitatively different, not would
we expect our results concerning historical contingency to change significantly. After all,
random functional RNA sequences (but not replicators, of course) within a computational
world (Aguirre et al., 2011), chosen only for their ability to fold, show similar clustering
and degree distributions as we find here. Follow-up experiments in the much larger L = 9
landscape (currently under way) will reveal which aspects of the landscape are specific, and
which ones are germane, in this digital microcosm. A comparison between fitness landscapes
across a variety of evolutionary systems, both digital (Pargellis and Greenbaum, 2016) and
biochemical (Jiménez et al., 2013), will further elucidate commonalities expected for simple
self-replicators. As the landscapes for these simple self-replicators are mapped, we expect
general properties of primordial fitness landscapes to emerge, regardless of the nature of
the replicator. As long as primordial self-replicators anywhere in the universe consist of
linear heteroplymers that encode the information necessary to replicate, studies with digi-
tal microcosms can give us clues about the origin of life that experiments with terrestrian
biochemistry cannot deliver.
                                               50


                                          Chapter 4
                 Information scores for individual sequences
    This chapter is published in (Adami and CG, 2022) and is a result of work done in
collaboration with Christoph Adami.
4.1      Abstract
    The information content of symbolic sequences (such as nucleic- or amino acid sequences,
but also neuronal firings or strings of letters) can be calculated from an ensemble of such se-
quences, but because information cannot be assigned to single sequences, we cannot correlate
information to other observables attached to the sequence. Here we show that an informa-
tion score obtained from multivariate (multiple-variable) correlations within sequences of a
“training” ensemble can be used to predict observables of out-of-sample sequences with an
accuracy that scales with the complexity of correlations, showing that functional information
emerges from a hierarchy of multi-variable correlations.
4.2      Introduction
    The term “functional information” refers to the information that is necessary and suffi-
cient to predict the function of a symbolic sequence within a particular environment. Typ-
ically, the sequence could be nucleic or amino acids, but it also could be a sequence of
neuronal firings or even a string of letters written in a particular alphabet. The function
could be the biochemical activity (for biological sequences), or the level of performance for a
neuronal network that produces the firings. For words in a language, the functionality could
be simply the likelihood that the word is understood by someone familiar with the language.
    In some sense, we often simply use the word “information” to indicate functional infor-
mation. For example, the hemoglobin DNA sequence carries the information necessary to
make the protein hemoglobin that (given the right cellular environment) is functional in the
sense that it binds oxygen with a particular affinity. For recordings from animal brains, the
functionality of the sequence of neuronal activations could be the accuracy with which the
                                                51


animal performs a rewarded task. Functional information can be defined quantitatively in
terms of the likelihood that a random sequence performs the function or task at the specified
level (Szostak, 2003; Hazen et al., 2007). Specifically, if we require a particular sequence to
perform function E at least at the level θ, then functional information about E as defined
by Szostak is given by
                                 I(Eθ ) = − log F (E ≥ θ) ,                                 (4.1)
where F (E ≥ θ) is the fraction of all sequences (say, given a particular length n) that perform
the function at a level at least θ.
    Such a definition of functional information makes eminent sense (it implies that functional
sequences are exponentially rare) and accurately links information content to functionality.
For example, in the evolution of RNA molecules that encode an enzymatic activity (Carothers
et al., 2004), functional information is correlated to both activity and structural complexity.
In the evolution of HIV protease (Gupta and Adami, 2016) functional information can be seen
to decrease when anti-viral drugs are given (reflecting the reduced activity of the protease),
then to increase again as anti-viral resistance emerges. In fact, functional information defined
in this manner turns out to be just the coarse-grained Shannon information content of the
sequences (Adami and LaBar, 2017b).
    A shortcoming of this measure is that only large groups of sequences can be characterized
in this manner, because accurately estimating the fraction F (E ≥ θ) necessitates large
ensembles of functional sequences. Here we develop an approach that allows us to attach a
score to every individual sequence in an ensemble, which then allows us to predict the level
of functionality of sequences not in the ensemble. In particular, this approach allows us to
study how functional information emerges from multi-site correlations within the sequence.
                                                52


4.3      Information content of functional symbolic sequences
    We now define the information content of a set of sequences ⃗s(k) of length L that form
an ensemble S, given a particular environment within which they are functional (Adami and
Cerf, 2000; Adami, 2004, 2012). The sequences ⃗s(k) are written in terms of the state of a joint
                                                                              (k)  (k)          (k)
random variable S = S1 S2 · · · SL (one for each monomer), that is, ⃗s(k) = (s1 , s2 , · · · , sL ),
while the environment can be thought of as the state of an environment variable, that is,
E = e. To define the information content of sequences, we need to define another ensemble:
the ensemble of functional sequences Se , which is constrained by requiring function within
environment e. Typically, the unconstrained ensemble is all the possible sequences of a given
length L, while the ensemble Se consists of only those sequences that are functional in e at
a specified level.With this notation, the information content can be written as
                              I(S : E = e) = H(S) − H(S|e) .                                   (4.2)
where H(S|e) is the entropy of sequences in Se : an entropy conditioned on the environmental
variable taking on the state E = e. Further, H(S) is the Shannon entropy of the uncon-
strained ensemble, which, if sequences of length L are drawn from an alphabet of D symbols,
is simply H(S) = L log D. The entropy of the constrained ensemble requires the knowledge
of the likelihood pk to find a functional sequence s(k) within the constrained ensemble. Thus,
Eq. (4.2), being the difference between the uncertainty about a non-functional ensemble and
the uncertainty of the functional ensemble, simply reflects the knowledge of the function
that constrains the ensemble: it is information about how to carry out the function.
    Then, we can calculate the Shannon entropy of sequences in Se as
                                              XNe
                                  H(S|e) = −      pk log pk ,                                  (4.3)
                                               k
                                               53


where Ne is the number of different functional sequences within Se . If we coarse-grain entropy
(4.3) by assuming that each sequence appears with equal probability, then pk = 1/Ne and
                                                                           
                                             L                         Ne
                           I(S : e) = log D − log Ne = − log                     ,                 (4.4)
                                                                       DL
                                                                Ne
which is Szostak’s functional information since F =             DL
                                                                   .
     Because most functional sequences of interest are exponentially rare, finding a sufficient
number of them to accurately estimate the fraction F (E ≥ θ) = Ne /DL is difficult or nearly
impossible. For example, while tens of thousands of variants of the 99-mer HIV-1 protease
are known, they are not obtained by randomly sampling the 99-mer space, and therefore
cannot be used to calculate functional information in this manner. However, it is possible
to relate the information I(S : e) to monomer, dimer, and tri-mer entropies (and so on) as
follows.
     In general, a multi-variable entropy such as H(S|e) = H(S1 S2 · · · SL ) can be decomposed
into multi-mer correlation entropies using Fano’s entropy and information decomposition
theorems (Fano, 1961)). Specifically1
                                       XL             X n
                                               m−1
                   H(S1 · · · SL ) =      (−1)                 I(Sσ(1) : · · · : Sσ(m) ) .         (4.5)
                                       m=1         σ(1)···σ(m)
and
                                         XL               XL
                   I(S1 : · · · : SL ) =    (−1)m−1               H(Sσ(1) · · · Sσ(m) ) .          (4.6)
                                         m=1          σ(1)···σ(m)
Here, we write I for informations (shared entropies) for ease of reading. In these formulæ,
σ(i) stands for a permutation of the index i. For example, the entropy decomposition theorem
   1
     Fano’s decomposition theorems are just specific information-theoretic examples of the more general
inclusion-exclusion theorem that has been known for some time. For example (Sylvester, 1883) refers to it
as a “théorème logique bien connu”.
                                                     54


(4.5) can also be written as
                                 X L             XL
               H(S1 · · · SL ) =      H(Si ) −       I(Si : Sj )
                                  i=1            i<j
                                    L
                                                                                               (4.7)
                                  X
                               +       I(Si : Sj : Sk ) − · · · (−1)L−1 I(S1 : · · · : SL ) .
                                 i<j<k
These theorems allow us to approximate the exact information content I(S : e). To start
with, if we neglect all correlations between variables, then the first-order approximation to
H(S|e) is
                                                     X L
                                         H1 (S|e) =        H(Si ) ,                            (4.8)
                                                      i=1
where H(Si ) is the (functional) monomer entropy of site i, that is,
                                                 X
                                   H(Si ) = −        pi (si ) log pi (si ) ,                   (4.9)
                                                  si
with pi (si ) the likelihood to find symbol si (one out of D possible ones) at position i.
    If we keep correlations between pairs of sites but neglect correlations between three sites,
we obtain the second-order approximation of the entropy as
                                            X L              X L
                               H2 (S|e) =        H(Si ) −        I(Si : Sj ) ,                (4.10)
                                             i=1             i<j
where I(Si : Sj ) is the entropy shared between variables Si and Sj . We can deduce how this
is written in terms of marginal entropies (entropies obtained by summing over one or more
of the monomers) using the decomposition theorem (4.6) for L = 2 as
                              I(Si : Sj ) = H(Si ) + H(Sj ) − H(Si Sj ) ,                     (4.11)
the standard Shannon formula for information.
                                                       55


     Because H(Si ) and I(Si : Sj ) can be estimated from finite ensembles of functional se-
quences, approximations such as (4.10) are sufficient to estimate the information content
I(S : e) as long as correlations between three or more variables can be neglected (see,
e.g., (Gupta and Adami, 2016)).
     While these approximations are useful for calculating the information content of sequences
from finite ensembles, they do not allow us to correlate individual characters of a sequence (for
example, its particular level of functionality) to information because, as we have emphasized
repeatedly, information is not a property of a single sequence. However, each sequence is
also a pattern, so it should be possible in principle to relate the individual pattern in the
sequence to its function. Indeed, methods of this sort have been developed before: they
use sophisticated techniques to link the patterns to function, either by regression (Hinkley
et al., 2011; Kouyos et al., 2011) or by using a multiple-sequence alignment (MSA) to train
a probabilistic model (usually called “Potts” or “Ising” model) (Schneidman et al., 2006;
Mora and Bialek, 2011; Ferguson et al., 2013; Biswas et al., 2021). We now show that there
is a much simpler approach that allows us to link patterns to function directly, without
regression or fitting a Potts model.
4.4      Model-free linkage of pattern to function
     A common technique to model correlations between variables in a sequence ⃗s = (s1 , s2 · · · , sL )
is the so-called Potts or Ising model Hamiltonian (energy functional)2 that can be written
as (Schneidman et al., 2006; Mora and Bialek, 2011)
                                          X L              X L
                                 E(⃗s) =        hi (si ) +      Jij (si , sj ) .                       (4.12)
                                           i=1              i<j
Here as before, the si are the sequence monomers taken from an alphabet of size D. For
neuronal firings, the si are binary, while for much of the literature on molecules, the si
are reduced to a small alphabet, for example si = 0 for wild-type monomer, and si = 1
   2
     Technically speaking, the model is neither a Potts nor an Ising model, as those models only have nearest-
neighbor interactions. It is best called an infinite-range Ising-like model.
                                                         56


                                                            L
                                                               2
for a mutant. The L · D parameters hi (si ) and             2
                                                                D parameters Jij (si , sj ) are Lagrange
multipliers obtained by maximizing the entropy
                                                  D L
                                                 X
                                      H(⃗s) = −       pk log pk ,                                  (4.13)
                                                  k=1
constrained to reproduce the monomer and dimer probabilities p(si ) and p(si , sj ) that can be
obtained from the MSA. The probability pk = P (⃗s(k) ) is incidentally the same as defined in
(4.3): it is the likelihood to find sequence ⃗s(k) by chance. Maximization under the constraints
yields the familiar Boltzmann distribution
                                                 1 −E(⃗s)
                                          pk =     e        ,                                      (4.14)
                                                Z
with E(⃗s) from (4.12) above, and where
                                                          (k) )
                                              X
                                          Z=        e−E(⃗s                                         (4.15)
                                                k
is the partition function.
    Now, let’s take a closer look at the entropy H(⃗s). Using (4.14), we can write this as
                                 X L X                    XX
               H(⃗s) = log Z +            h(σi )p(si ) +              p(si , sj )Jij (si , sj ) .  (4.16)
                                  i=1  si                  i<j si ,sj
It is clear that the right-hand-side of Eq. (4.16) is an approximation of the “true” entropy
H(⃗s), fitted to the MSA using the first and second-order Lagrange multipliers. Higher-
order terms are neglected. But it is a strange formula, because log Z also appears on the
right-hand-side, and we don’t have an approximation for that. But it turns out that we
can approximate the left-hand-side using averages taken from the MSA as well, up to any
order we want. This is possible because we can decompose H(⃗z) exactly using the entropy
decomposition theorem, for example written in the form (4.7). Using this approximation,
                                                    57


we will be able to determine log Z, and eliminate the Lagrange multipliers.
     Because the Potts model parameters only consider up to bi-partite correlations, we will
neglect terms of order three or higher in (4.7) for now, but it is clear that we could keep
them as long as we fit an appropriate third-order term in the Hamiltonian (4.12). Using
pi (si ) directly from the MSA, along with the pij (si , sj ), and so forth, we can insert those
terms in (4.16) to yield
                                                            XX                                                        
                   XX                                                                             pi (si )pj (sj )
 log Z = −               pi (si ) log pi (si ) + h(si ) +                      pij (si , sj ) log                  − Jij
                   i  si                                           i<j si ,sj
                                                                                                   pij (si , sj )
               X                  X                                  
           =         H(i) − ⟨hi ⟩ −              H(i : j) + ⟨Jij ⟩                                                   (4.17)
                 i                       i<j
In (4.17), we defined the average Lagrange coefficients
                                                    X
                                         ⟨hi ⟩ =          pi (si )h(si )                                             (4.18)
                                                      si
and
                                             X
                                  ⟨Jij ⟩ =          pij (si , sj )Jij (si , sj ) .                                   (4.19)
                                             si ,sj
We also introduced the simplified notation H(i) ≡ H(Si ) and H(i : j) ≡ H(Si : Sj ) and so
on, which we will use from now on.
     In order to understand the term log Z, we turn to Berg-von Hippel (BvH) theory (Berg
and von Hippel, 1987, 1988). BvH theory defines an energy functional for a DNA sequence
that is related to the binding affinity of a transcription factor using a MSA. Indeed, BvH
theory is very similar to the max-entropy approach from (Mora and Bialek, 2011), except
that it assumes that the distribution at each site is at maximum entropy. In fact, the energy
score of a DNA binding site is given precisely by an energy functional of the form (4.12), but
without the second-order term (as correlations between sites are close to irrelevant in DNA
                                                           58


binding).
      Unlike in the Potts-model approach, the counterparts to hi (si ) are not fitted to the MSA
in the BvH approach. Rather, they are directly extracted from the MSA in terms of a matrix
as3
                                              (1)           pi (s0 )
                                          Mia = λ log                  ,                                   (4.20)
                                                            pi (sa )
because this is the expression that maximizes the single-site entropy (subject to constraints).
We can see this by writing
                                                                   (1)
                                                                M
                                                                  ia
                                         pi (sa ) = pi (s0 )e−    λ      .                                 (4.21)
In Eq. (4.20), λ is a Lagrange multiplier that allows us to set the scale of the energy just as
we did in the earlier maximization (we set λ = 1 here). Furthermore, pi (s0 ) is the probability
(likelihood) to find the consensus monomer at site i. Obviously, the consensus monomer is
the one that has the highest frequency in the MSA at that position. This assures that the
consensus sequence is the one with vanishing energy (the “ground state”). The matrix M (1)
is called the “position-weight matrix” (PWM) of a sequence (Stormo, 2000), and it is fairly
successful at predicting DNA binding sites in a genome (see, e.g., (Brown and Callan Jr.,
2004)), and even at detecting correlations between binding sites (Clifford and Adami, 2015).
      We now introduce a second-order PWM (as BvH in fact did in the appendix to (Berg
and von Hippel, 1987), but for adjacent monomers only)
                                                                       
                                          (2)             pij (s0 )
                                       Mijab   = log                       .                               (4.22)
                                                        pij (sa , sb )
Just as before, pij (sa , sb ) is the likelihood to find symbol combination sa , sb at the pair of
    3
      We changed notation slightly in Eq. (4.20), from pi (si ) in Eq. (4.9) to pi (sa ) here, to conform to BvH
theory. The difference is that in BvH theory it is convenient to index the symbol, so that it is obvious that
pi (sa ) is a matrix element, while this was implicit in the form pi (si ). In the latter formulation, rather than
summing over an index we sum over the symbols si .
                                                        59


sites i, j, while pij (s0 ) is the likelihood to find the consensus pair there4 . Defining this matrix
implicitly assumes that any pair of monomers is also in statistical equilibrium. While assum-
ing that even higher-order correlations are in statistical equilibrium might be increasingly
doubtful, it is certainly more likely than assuming that the entire sequence is at maximum
entropy, as (4.14) does.
     We can now use BvH theory to calculate log Z and insert it into Eq. (4.17). In standard
BvH theory, log Z is given by (recall we are setting λ = 1 here)
                                                      X  L
                                        log Z = −           log pi (s0 ) ,                              (4.23)
                                                       i=1
but this is only the first-order approximation. In general, for any sequence ⃗s we must have
                                                           1
                                            Z=                       ,                                  (4.24)
                                                   p0 (s1 · · · sL )
which we can expand so that5
                                    L                    L                          
                                  X                    X                pij (s0 )
                     − log Z =         log pi (s0 ) −        log                        + ··· .         (4.25)
                                   i=1                  i<j
                                                                    pi (s0 )pj (s0 )
Inserting this into (4.17) we can now identify
                                 ⟨hi ⟩ = H(i) + log pi (s0 ) ,
                                                                                      
                                                                         pij (s0 )
                                ⟨Jij ⟩ = −H(i : j) − log                                  .             (4.26)
                                                                     pi (s0 )pj (s0 )
This implies that in the BvH model, there is a relationship between the energy contribution
to a site ⟨hi ⟩ and the entropy H(i) of that site, and it is precisely the relationship that is
   4                                                                                 (ij)
     A more precise (but also more cumbersome) notation would be pij (s0 ), to indicate that in pair-wise
probabilities, s0 refers to the consensus dimer.
   5
     This is the expansion of a joint probability that is at the root of Fano’s entropy and information decom-
position theorems (4.5) and (4.6).
                                                           60


 summarized by the thermodynamical Helmholtz free-energy relation
                                         F = E − TS ,                                  (4.27)
 where F = − log Z is the free energy, E is the average energy, and S is the Gibbs entropy
 (T , of course, is the temperature). As a consequence, we see that the parameters of the
 Potts model are given entirely by quantities that can be extracted directly from the MSA,
 without the need of any optimization:
                                                           (1)
                                            hi (sa ) = Mia ,                           (4.28)
                                                           (2)
                                      Jij (sa , sb ) = Mijab .                         (4.29)
 Thus, the energy of a particular sequence E(⃗s) can now be calculated from the first- and
 second-order PWMs (just as in BvH theory), without fitting any parameter to a model.
 In the next section, we extend this approach to take into account arbitrary-order correla-
 tions between monomers, in order to construct predictive energy and information scores for
 arbitrary sequences.
 4.5      Information and energy scores
      Using the PWM (4.20) obtained from an alignment of functional sequences in an ensemble
 Se , we can define the first-order energy score of an arbitrary sequence ⃗s(k) as
                                   E1 (⃗s(k) ) = Tr(M (1) Gk ) ,                       (4.30)
 where Gk is the D × L matrix representing sequence ⃗s(k) , which has a ‘1’ at position i if
 symbol a is found there, and a ‘0’ otherwise. With this definition, the consensus sequence
⃗s0 has E1 (⃗s0 ) = 0, and sequences that carry symbols at that position that rarely occur in
 the MSA score badly (have high energy). For cases where the symbol does not occur at all
 in the MSA at that position, we have to introduce the concept of pseudocounts in order to
                                                    61


deal with zeros in the denominator of (4.20), as we discuss in more detail below.
     The average first-order energy score over the ensemble Se then becomes
                                                         X  L
                                  ⟨E1 ⟩e = H1 (S|e) +          log p0 (si ) ,                        (4.31)
                                                            i
where H1 (S|e) is the first-order approximation to the Shannon entropy of the functional
ensemble Se , defined in (4.8). This is the first-order version of the Helmholtz free energy
relationship.
     In preparation of a discussion of information scores, let us also introduce the first-order
information matrix6
                                                                
                                           (1)          pi (sa )
                                         Ria    = log              ,                                 (4.32)
                                                         1/D
defined in such a way that its average is the first-order Shannon information (if we take logs
to the base of the alphabet D)
                                                        XL
                                        ⟨R1 ⟩e = L −          H(i) .                                 (4.33)
                                                          i
The corresponding information score for sequence ⃗sk is
                                        R1 (⃗s(k) ) = Tr(R(1) Gk ) ,                                 (4.34)
We now construct energy and information scores that include higher-order corrections. In
   6
     This is in fact Schneider’s information score introduced in (Schneider, 1997), but with a uniform prior
1/D and without finite sample size correction, which we will take care of using pseudocounts.
                                                       62


general,
                             X L            X L
           E(s1 · · · sL ) =      E(si ) −        E(si : sj )
                              i=1            i<j
                                L
                                                                                                  (4.35)
                              X
                           +       E(si : sj : sk ) − · · · + (−1)L−1 E(si : sj : · · · sL ) .
                             i<j<k
Here,
                                                      pi (s0 )
                                   E(si ) = log                 ,                                 (4.36)
                                                       pi (si )
                                                          pij (s0 )
                                  E(si sj ) = log                                                 (4.37)
                                                      pi (si )pj (sj )
                               E(si : sj ) = E(si ) + E(sj ) − E(si sj ) .                        (4.38)
Analogous relations hold for information scores for monomers, dimers, and so on. We now
construct approximations to the exact energy score (4.35). The first-order approximation
was already given in (4.30), but we rewrite it here:
                                      X L
              E1 (s1 · · · sL ) =          E(si ) .                                               (4.39)
                                      i=1
                                      X L             XL
              E2 (s1 · · · sL ) =          E(si ) −        E(si : sj ) ,                          (4.40)
                                      i=1             i<j
                                      X L             XL                   X L
              E3 (s1 · · · sL ) =          E(si ) −        E(si : sj ) +         E(si : sj : sk ) (4.41)
                                      i=1             i<j                 i<j<k
We can rewrite the correction terms as a function of joint energy scores using the decompo-
sition theorem. For example, using (4.38), we find
                                     X L             XL
                E2 (s1 · · · sL ) =        E(si ) −       (E(si ) + E(sj ) − E(si sj ))
                                     i=1             i<j
                                       L                             L
                                                                                                  (4.42)
                                     X                             X
                                  =        E(si sj ) − (L − 2)          E(si ) .
                                     i<j                            i=1
                                                       63


Likewise, if we neglect fourth-order correlations,
                            L                             L                                    L
                         X                              X                 1                   X
    E3 (s1 · · · sL ) =          E(si sj sk ) − (L − 3)      E(si sj ) + (L − 2)(L − 3)            E(si ) . (4.43)
                        i<j<k                            i<j
                                                                          2                   i=1
Turning now to information scores (here, R(si ) = log [Dpi (si )], R(si sj ) = log [D2 pij (si sj )]
and so on, and logs are to the base D) we can define
                                          X L
                   I1 (s1 · · · sL ) =         R(si )                                                       (4.44)
                                           i=1
                                          X L           X L
                   I2 (s1 · · · sL ) =         R(si ) −      R(si : sj )                                    (4.45)
                                           i=1          i<j
                                          X L           X L                  XL
                   I3 (s1 · · · sL ) =         R(si ) −      R(si : sj ) +        R(si : sj : sk )          (4.46)
                                           i=1          i<j                 i<j<k
where the shared information scores R(si : sj ) etc. are defined analogously to the shared
energy scores. Averaging these over the sequences in the functional ensemble Se will return
the approximations (4.8), (4.10), and so on.
4.6      Application to sequences of self-replicating programs
    In the previous section we saw that energy and information scores are built up hier-
archically, adding pairwise, three-part, and higher correlations until the full functionality
emerges. But what level of correlations do we need to keep track of in order to accurately
capture a sequence’s functionality? To test this, we will use a data set of sequences in
which correlations are important, but is also large enough so that higher-order PWMs are
statistically significant. Previously, we analyzed a set of 36,171 sequences of self-replicating
computer programs from the digital life system Avida (see, e.g., (Adami, 1998, 2006; Ofria
et al., 2009b)). Avida simulates a computer within a standard computer, within which
self-replicating programs written in a custom programming language live and thrive. The
programming language typically uses only 26 instructions (conveniently labeled by the low-
ercase letters a-z) that are chosen in such a way that programs are highly evolvable. Because
                                                          64


Figure 4.1: The four largest connected components of the network of self-replicators comprise
31,592 sequences (about 87% of the total. In these networks, two replicators are joined by
an edge if they differ by exactly one point-mutation.
the replication of instructions is noisy (giving rise to mutations), populations of programs
evolve and adapt to user-specified fitness landscapes. The Ne = 36, 171 sequences that we
analyze here represent the complete set of all possible self-replicating programs of length
L = 9 (out of a possible 269 ≈ 5.43 × 1012 ) obtained via an exhaustive search (C G and
Adami, 2021). “Self-replication” is defined here as the capacity to form a growing colony
of programs (in the absence of mutations) and does not necessarily require perfect copy-
ing of information. Understanding the structure of the fitness landscape of self-replicators
is an important element in trying to understand how replication could have emerged from
non-replication, i.e., the origin of life (Adami and LaBar, 2017b; C G et al., 2017).
    The 36,171 sequences form complex clusters in genotype space (see Fig. 4.1). We can
immediately calculate the information content of these 9-mers using Szostak’s formula (4.1),
                                               65


which gives7
                                              36, 171
                                  I9 = log26            ≈ 5.77 mers                                  (4.47)
                                                269
that is, about 64% of the sequence is information (the smallest possible replicator in this
world has L = 8, with an even more compressed code (C G and Adami, 2021).
4.6.1       Emergence of information from multi-partite correlations
     We can now ask how this information emerges from correlations, by averaging the infor-
mation scores (4.44) over Se . In Fig. 4.2 we show the approximations to the exact information
score (4.47) as a function of the order n of the approximation. As expected from the infor-
mation decomposition theorem, the approximations move closer to the exact value as more
correlations are incorporated.
4.6.2       Prediction of function from information scores
     In this section we will test how much “correlation information” (information stored in
correlations between sites, as opposed to bias in per-site frequency) is necessary to distinguish
functional from non-functional sequences. Because we will now be scoring sequences that
are not in the set Se , we have to introduce pseudocounts in the PWMs. As is clear from
(4.20), if a particular symbol sa at position i does not occur in the MSA, then the maximum
likelihood estimator of the probability pi (sa ) vanishes, leading to an infinite matrix element.
As a consequence, the score of a sequence that contains such a symbol at the position is
not well-defined. A common cure for this defect is to introduce pseudocounts, which add a
constant to every possible symbol count. While the common Laplace pseudocount adds 1
to each count (na → na + 1 and therefore pi (sa ) → (na + 1)/(Ne + D), the pseudocount is
often used as a variable that can be adapted to the statistics of the dataset in question (see,
e.g., (Nemenman et al., 2002; Nemenman, 2011)). In what follows, we adopt a pseudocount
   7
     Choosing a base for the logarithm gives units to entropy and information. In the following, we will take
logs to the base of the alphabet, giving units of “mers”. An n-mer has n mers of potential information
(entropy).
                                                      66


                 hIin
                 <latexit sha1_base64="h8wtGw4ZMzqTlfb4VyBeV//vmvM=">AAACJ3icbZBNSwMxEIaz9avWr6pHL0ERBKHselCPRS96U7BVcEuZzU7b0CS7JFm1LP0Xnr2qv8ab2KN/RExbD9r6QuDhnRlm8kap4Mb6/sArzMzOzS8UF0tLyyura+X1jbpJMs2wxhKR6JsIDAqusGa5FXiTagQZCbyOuqfD+vUdasMTdWV7KTYktBVvcQbWWc3yRihAtQXS81CPoOnMHb/ij0SnIfiBnep2uP84qPYumuWvME5YJlFZJsCY28BPbSMHbTkT2C+FmcEUWBfaeOtQgUTTyEe39+muc2LaSrR7ytKR+3siB2lMT0auU4LtmMna0Py3FkVyYrVtHTdyrtLMomLjza1MUJvQYTI05hqZFT0HwDR3x1PWAQ3MuvxKocJ7lkgJKs7DGEU/D9PhD0H0XWDBZDzTUD+oBIeV4NIld0LGKpItsk32SECOSJWckQtSI4w8kCfyTF68V+/Ne/c+xq0F72dmk/yR9/kNm6SqSQ==</latexit>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <latexit sha1_base64="H/4j0qDymMY5B+UBdVGpkYp0dPU=">AAACFHicbVDLSsNAFJ34rPFVdekmWARXJXGhbsSiG5ct2Ac0oUwmt+3QmUmYmSgl5Avcqj/gb7gTcedW/Blx+lho64ELh3Pu5d57woRRpV33y1pYXFpeWS2s2esbm1vbxZ3dhopTSaBOYhbLVogVMCqgrqlm0EokYB4yaIaDq5HfvAWpaCxu9DCBgOOeoF1KsDZSTXSKJbfsjuHME29KShfv9nny/GlXO8VvP4pJykFowrBSbc9NdJBhqSlhkNt+qiDBZIB70DZUYA4qyMaH5s6hUSKnG0tTQjtj9fdEhrlSQx6aTo51X816I/FfLwz5zGrdPQsyKpJUgyCTzd2UOTp2RjE4EZVANBsagomk5niH9LHERJuwbF/AHYk5xyLK/AhYnvnJ6EPMchOYNxvPPGkcl72TsldzS5VLNEEB7aMDdIQ8dIoq6BpVUR0RBOgePaBH68l6sV6tt0nrgjWd2UN/YH38AOQNozQ=</latexit>
Figure 4.2: Mean information scores (Shannon information) as a function of the degree of
correlations taken into account. ⟨I1 ⟩ corresponds to the information disregarding pair-wise
correlations between sites, while ⟨I⟩2 disregards third-order correlations, and so on. The red
line is the exact information I9 = 5.77 mers.
of 10−6 throughout, but note that the results on classification accuracy scale only extremely
weakly over the 12 orders of magnitude of pseudocounts we tested. Generally speaking, an
infinitesimal pseudocount moves information scores of un-represented symbols far out into
the negative, and energy scores far to the positive. The pseudocount for higher-order PWMs
(we construct those explicitly up to n = 4, where n is the order of correlations that are
included) is adjusted so that the marginal probabilities remain correct.
    In order to predict the function of an arbitrary sequence in light of an MSA of N ≤ Ne
functional ones, we need to construct a classifier in such a manner that, given a particular
threshold τ , the score divides functional from non-functional sequences. For energy scores,
we would deem those with a score E(⃗s) < τ as functional, while for information scores we
would deeem those with I(⃗s) > τ to be replicators. Note that because we have exhaustively
classified this dataset already, we can determine the accuracy of classification (given a subset
of sequences N < Ne for a given classifier) with perfect accuracy.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                67


     In principle, we can construct classifiers directly from the information score functions
(4.44), but due to the fact that the correction terms to I1 involve the summation of many
terms with alternating signs, the pseudodocount prevents these classifiers from being effec-
tive8 . Indeed, it appears to be impossible to construct pseudocounts in a manner that does
not introduce spurious correlations at all levels n (while it is of course possible to make sure
no spurious correlations are introduced for a chosen given level). However, it turns out that
sums over the multiple-site energy and information functionals E(si sj · · · ) and R(si sj · · · )
do work well, because they do not rely on cancellations. For example, we can define the
following information classifiers (similar classifiers can be built from energy functionals):
                                                X L
                                  D1 (⃗s) =          R(si ) ,                               (4.48)
                                                 i=1
                                                X L
                                  D2 (⃗s) =          R(si sj ) ,                            (4.49)
                                                 i<j
                                                  XL
                                  D3 (⃗s) =            R(si sj sk ) ,                       (4.50)
                                                i<j<k
                                                   X L
                                  D4 (⃗s) =              R(si sj sk sm ) .                  (4.51)
                                                i<j<k<m
To test the quality of the classifiers, we create test data containing 105 sequences. Specifically,
we create three ensembles: SR with randomly generated sequences (all sequences are L = 9)
that we made sure are all non-functional, as well as a set of one-mutants S1 and two-mutants
S2 that are one or two mutations away from functional sequences, but known not to be
functional. We expect the one-mutant set in particular to be the hardest to classify.
     We show typical density distributions of scores for the classifiers in Fig. 4.3. These
classifiers were all “trained” on the full set Se , meaning, the PWMs M (n) were created with
the entire set of replicators in the MSA. As expected, the set of random sequences has a
density distribution (purple in Figs. 4.3(a-d)) far removed from the functional sequences
   8
     This holds equally when using energy score functions (4.39).
                                                     68


(light blue), but the one-mutant and two-mutant sequences have a significant overlap with
functional sequences if D1 is used as a classifier. Including higher-order correlations reduces
this overlap, until the overlap (mis-classification) almost completely disappears in D4 .
    We can characterize the accuracy of the classifier as a function of the size of the train-
ing set by measuring the quality score of an ROC (receiver operating characteristic, see,
e.g., (Fawcett, 2006)) plot. ROCs are generated by plotting the false positive rate (FPR)
against the true positive rate (TPR) as a function of the threshold τ in the classifier. A typ-
ical ROC (for classifier D2 trained on 1% of Se used on 100,000 non-functional single-point
mutants of the functional set) is shown in Fig. 4.4. The quality score Q of the classifier is
obtained by calculating the area under the curve (AUC) of the classifier, as compared to the
AUC=0.5 of a random classifier where the TPR increases linearly with the FPR.
    It is clear from Fig. 4.5 that all classifiers work well even when only 0.1% of the true
replicators (36 sequences) are used to create the MSA, but for training sets that small,
the classifiers D3 and D4 that use three-partite and four-partite correlations perform less
well than D1 and D2 on the test data for one- and two mutants of functional sequences,
which can be attributed to overfitting. Indeed, for data sets this small, the equilibrium
assumption behind the max entropy approach will be violated for those PWMs. However,
for larger MSAs, including higher-order correlations significantly improves the performance
of the classifier. When using the full set of replicators, taking into account up to four-way
correlations appears to capture almost all of the information encoded in the sequences. On
random sequences, all classifiers perform close to perfectly even on the smallest training
set size, and including multivariate correlations is clearly overkill. The quality of classifiers
based on energy functionals is the same as for those described here (data not shown), as
they are intimately connected via the Helmholtz free energy relation.
4.7      Discussion
    Fano’s entropy and information decomposition theorems allow us to understand how
functional information emerges from the combination of multivariate correlations. Here
                                                69


                (a)                                             (b)
                (c)                                             (d)
Figure 4.3: Density plots of scores generated by the classifiers (4.48) to (4.51), with PWMs
using all sequences in Se and a pseudocount 10−6 . (a) Density of scores obtained with D1 for
true sequences (Se , light blue), false sequences (SR ), purple), single mutant false (S1 , red),
and double-mutant false (S2 , green). Note that for visibility, the distributions are all scaled
so that their maximum is at 1. (b) Density of scores for classifier D2 . (c) Density of scores
for classifier D2 . Density of scores for classifier D3 .(d) Density of scores for classifier D4 .
                                                  70


       1.0
               (a)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      (b)
                                                      <latexit sha1_base64="QSvrwiupUMqKivwgjTm2/Sj2Zyw=">AAACHHicbVC7SgNBFJ31GeMrxtJmMQixCbsp1DKghWUE85BsCLOzN8mQeSwzs0pY8g0WturHiJ3YCn6IrTh5FJp44MLhnHu5954wZlQbz/t0lpZXVtfWMxvZza3tnd3cXr6uZaII1IhkUjVDrIFRATVDDYNmrADzkEEjHJyP/cYtKE2luDbDGNoc9wTtUoKNlW4C1ZfFi075uJMreCVvAneR+DNSqOSD4tfLfVDt5L6DSJKEgzCEYa1bvhebdoqVoYTBKBskGmJMBrgHLUsF5qDb6eTgkXtklcjtSmVLGHei/p5IMdd6yEPbybHp63lvLP7rhSGfW226Z+2UijgxIMh0czdhrpHuOA43ogqIYUNLMFHUHu+SPlaYGBtaNhBwRyTnWERpEAEbpUE8/hCzkQ3Mn49nkdTLJf+k5F/Z5Mpoigw6QIeoiHx0iiroElVRDRHE0QN6RE/Os/PqvDnv09YlZzazj/7A+fgBY3+mCA==</latexit>
       0.8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ⇢(D2 )
       0.6
 TPR
       0.4
       0.2
       0.0
           0.0     0.2      0.4     0.6    0.8    1.0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            D2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            <latexit sha1_base64="pRtMLjoT6XaVOLbVBu4BvKxBiBE=">AAACFnicbVA9SwNBEN2LGmP8ilraHAbBKtylUMuAFpYJmg/IhbC3N0mW7O4du3tKOPITLGyi/8B/YSeCla34X8S9JIUmPhh4vDfDzDw/YlRpx/m0Miura9n13EZ+c2t7Z7ewt99QYSwJ1EnIQtnysQJGBdQ11QxakQTMfQZNf3iR+s1bkIqG4kaPIuhw3Be0RwnWRrq+7Ja7haJTcqawl4k7J8VKtvb1Pnl4rnYL314QkpiD0IRhpdquE+lOgqWmhME478UKIkyGuA9tQwXmoDrJ9NSxfWyUwO6F0pTQ9lT9PZFgrtSI+6aTYz1Qi14q/uv5Pl9YrXvnnYSKKNYgyGxzL2a2Du00CDugEohmI0MwkdQcb5MBlphoE1feE3BHQs6xCBIvADZOvCj9ELOxCcxdjGeZNMol97Tk1kxyZTRDDh2iI3SCXHSGKugKVVEdEdRH92iCHq0n68V6td5mrRlrPnOA/sD6+AEGlqRo</latexit>
                                FPR
Figure 4.4: (a) ROC plot for classifier D2 , constructed from 1% of the Ne sequences in Se ,
and evaluated on 100,000 uniformly sampled non-functional single-point mutants of func-
tional sequences, as the threshold parameter τ is varied from τ = 45 to τ = −211. True
positives are functional. The diagonal line is the ROC of a random (unspecific) classifier. (b)
The distribution of functional sequences Se (blue) and of the non-functional single-mutant
sequences S1 , both evaluated by D2 . Note that unlike in Fig. 4.3, the distributions are not
scaled.
we have shown that defining energy and information functionals built from multi-variate
position weight matrices allow us to capture the correlations inherent in multiple sequence
alignments in a model-free manner. Standard approaches to fitting MSAs rely on expensive
inverse models to fit the bi-partite correlations in the data, but must reduce the alphabet
size considerably to reduce the number of parameters in the Potts or Ising models. It is clear
from this analysis that such a fit is entirely unnecessary, as the marginal probabilities can
be extracted from the data as long as the maximum entropy assumption is used on single
sites, pairs of sites, triples, and so on.
    We tested this approach on the largest computational genotype-phenotype map created
to date. Having a complete genotype-phenotype map allows us to calculate the informa-
tion content of functional sequences exactly, and study how this information is built up by
lower-order correlations. Further, a complete data set helps in creating ensembles that can
be used to test the classification accuracy of function prediction. In particular, we used
the model-free approach to construct classifiers that can distinguish functional from non-
                                               71


                       (a)                                              (b)
                                              (c)
Figure 4.5: Quality of classifiers (AUC of the ROC curves) when testing against different
test sets, as a function of the training set size (in percentage of the full set of size Ne ).
Note log scale. In all panels, red points and lines refer to the first-order classifiers D1 , green
dots and lines depict the performance of second-order classifiers D2 , while the third and
fourth-order classifiers are in blue, and purple, respectively. Error bars are standard error
over ten replicates (except for N = Ne ). (a) Q1 is the quality score of classifiers for a test
set made from 31,291 non-replicating one-mutants of the functional set S2 . (b) Q2 quantifies
the quality of the classifiers when tested on the two-mutant non-functional set S2 . (c) QR
measures the performance of the classifiers when tested on 31,291 randomly generated non-
functional of 9-mers.
                                               72


functional sequences even when the non-functional sequences are single-site polymorphisms
of the functional ones. Of course, this approach needs to be tested on biological data sets
with significantly longer sequences. However, since the computational cost of creating these
complex classifiers scales mainly with the level of multivariate correlations that are included,
we expect to be able to classify sequences of several hundreds of sites with full alphabet
size (for example, D = 20 or 21 for proteins) taking into account all tri-partite correlations.
For binary data such as neuronal spike trains, we expect to be able to handle sequences of
thousands of sites, possibly including even more than 4-part correlations.
    Sequence models of the Ising or Potts type perform markedly better than the stan-
dard Hidden Markov Models (HMMs) that have been developed and used since the early
1970s (Rabiner, 1989), because they can take into account arbitrary correlations between
pairs of symbols, rather than just those between adjacent symbols in HMMs. However,
going beyond pair-wise correlations is deemed impossible for Ising models because of the
associated explosion in the number of parameters. While more general pattern classification
task (of which sequence classification is but a subset) can nowadays be achieved by training
deep neural networks on annotated data, these methods have also displayed severe draw-
backs (Nguyen et al., 2015; Jo and Bengio, 2018), for example by being easily fooled by
detracting patterns. Furthermore, training such models on a large corpus of data is expen-
sive. The model-free classifiers that we have described here are different from deep networks
in important ways, as they do not require any training, and their vulnerability to overfitting
is easily controlled by adjusting the level of correlation n to be included to the size of the
MSA.
                                               73


                                         Chapter 5
           Predicting mutation effects using information scores
    The work presented in this chapter was done in collaboration with Christoph Adami.
The results described here are in the process of being written up for publication.
5.1      Introduction
    In chapter 4 I demonstrated how an information-theoretic approach could be used to
successfully classify sequences based on the information content of individual sequences in an
ensemble of sequences sharing a common property. This validation of the approach was done
on a synthetic data set, as this allowed me to comprehensively understand the dynamics of
the system, as well as allowing me to generate a complete landscape of sequences satisfying a
property. Neither of these advantages are applicable to biological data - generating complete
fitness landscapes is impossible for natural systems, and our collective understanding of the
complex dynamics of biological systems is incomplete to say the least. Nonetheless, being
able to apply the information-theoretic approaches to such data is the real test of the utility of
such an approach. In this chapter, I investigate whether the techniques developed in previous
chapters can be used to predict the effects of mutations on the ability of a biomolecular
sequence to perform its function. Being able to predict the phenotypic effects of mutations
is very valuable, e.g., being able to test drug binding efficacy, predicting the ability of a
protein to bind to an enzyme, etc (Romero et al., 2015; Currin et al., 2015; Roscoe et al.,
2013; Ostrov et al., 2016; Lek et al., 2016). In particular, I look at a biological dataset that
contains the effects of mutating the sites of the WW domain of the human YAP65 protein
(hYAP65) on the protein’s ability to bind to its polyproline peptide ligand (Araya et al.,
2012). I apply the information-theoretic tools developed in previous chapters on this data,
and show that it predicts the effects of mutations with an accuracy comparable to the state
of the art techniques, such as deep mutational scanning, models that incorporate pairwise
correlations, such as EVmutation (Hopf et al., 2017), as well as deep-learning approaches,
such as DeepSequence (Riesselman et al., 2018).
                                               74


5.2      Motivation and current approaches
    A very hard problem in biology is the prediction of the effects of mutations on the
functionality of sequences. These sequences could be RNA, DNA, etc, and their function
could be a myriad of different things: binding to a molecule, performing enzymatic reactions,
etc. There are a number of different approaches that have been taken to study this problem.
The most straightforward approach is to perform in vitro experiments by mutating individual
sites of the genotype, and assaying its resultant fitness, often referred to as Deep Mutational
scans (Fowler and Fields, 2014). While this is a legitimate approach, it suffers from being
extremely expensive to perform these experiments. In addition, the potential search space
is prohibitively large; nature has had many millennia to search through potential mutations
and preserve the ones that still maintain their fitness - requiring time and resources that
are quite simply unavailable to human researchers. This is certainly true with the current
technological capabilities in this field, but even with significant advances the search space
will likely remain intractable.
    Due to the limitations of the current approach, analyses of this kind has turned to com-
putational techniques. Computational simulations/experiments, while obviously not being a
perfect representation of the underlying biology, are nonetheless very helpful in understand-
ing the dynamics of systems. While computational techniques must necessarily abstract the
underlying system, those techniques can be used to substantially narrow down the search
space, which can then be explored more thoroughly by experiments in vitro. Attempts
have been made in this area to model the underlying chemistry of bio-molecules, and to
predict the functionality of specific sequences using these models (Kircher et al., 2014; Sim
et al., 2012). These models have had some limited success in certain areas, but are hindered
by the fact that we don’t yet fully understand the chemical properties, and the complex
multi-dimensional interactions of large biomolecules.
                                               75


    As the quality of the model improves, the computational resources scale up prohibitively
as well, making this an infeasible approach for many analyses.
    An early use of computational techniques has been to build large scale statistical models
of the system being investigated. This is done by training a model using known variations
of a particular sequence. This approach has had a moderate amount of success, but it
notably fails when there are epistatic effects between sites in the sequences, as is typically
the case with biomolecular sequences (Breen et al., 2012). In addition, this approach requires
enormous training data set sizes, often in the millions of sequences. This kind of data is not
always available, and without such training sets, such approaches do not have as high a
degree of accuracy as we would like.
    Work has been done to take into account epistatic effects between pairs of residues (Figli-
uzzi et al., 2016; Lapedes et al., 2012; Hopf et al., 2017). This approach has a higher degree
of predictive accuracy than models that do not take into account these higher order inter-
actions. However, there is clear evidence that shows epistatic effects of even higher orders
are common in how proteins/RNAs work, and these models are not able to account for
these effects. In particular, naive extensions of these approaches are not scalable - a linear
increase in the number of potentially interacting sites leads to an exponential number of
variables that need to be fit. The “pairwise” estimation models used in this way simply can-
not be extended to consider triplet, quadruple, or higher order interactions. As it happens,
these models are actually approximations to a much more general model (the information-
theoretic model described in Chapter 4), and in fact follow from the model used there. The
information-theoretic model does not suffer from the same intractability as these models
because it requires no fitting to the data.
    Recently, work has been done to try and capture higher order interactions in biomolec-
ular data by fitting the data to a non-linear multi-variable model (referred to as a deep
latent-variable model) (Riesselman et al., 2018). This approach is similar to the pairwise
model, i.e., the EVmutation (Hopf et al., 2017) approach that uses a log-ratio heuristic
                                               76


that considers pairwise residues to predict the effects of mutations. This model extends the
EVmutation approach by introducing a nonlinear latent variable model that is intended to
capture higher order constraints on the sequences in the training data. This work has shown
to be an improvement over the pairwise model, as is expected since more correlations are
being taken into account. This approach has a number of issues as well - training the model
is computationally very expensive and involves a very large number of training parameters,
the training size needed is relatively high, and as with other deep learning models, it suffers
from overfitting to the training data. Deep learning models often use a number of techniques
that incorporate additional data that is gathered from other data sources. This additional in-
formation, e.g., the likelihood of particular amino acid substitutions, is not directly available
in the training data set, and yet is crucial for the deep learning models to achieve reasonable
prediction accuracy. The issues that deep learning techniques suffer from arise from the
fact that the models being trained are unable to distinguish between signal and noise in the
training data, and many of the additional corrections to the model are needed precisely to
account for this inability to distinguish between noise and signal. In the following sections,
I show how the information-theoretic approach is a simpler, computationally cheaper, and
more understandable model, which needs only the information available in the training data
to achieve prediction accuracy on par with or better than the best existing approaches.
5.3        Using Information theory to predict mutation effects
      In this work, I apply the information-theoretic approach described in Chapter 4 to predict
effect of single substitution mutations on the ability of the YAP-1 (hYAP65) WW protein
           1
domain        to bind to their polyproline peptide ligand (Araya et al., 2012). The training
data for the mutation effects was generated as multiple sequence alignments of the protein
family using the default five search iterations of the profile HMM homology search tool
jackhammer against the UniRef100 database of non-redundant protein sequences. Length-
    1
      YAP-1 stands for “Yes Associated Protein” which is a transcriptional co-activator. The WW domain
is a modular protein domain that is repeated multiple times, and has a tendency to bind to proline-rich
proteins. (Yagi et al., 1999)
                                                  77


normalized bit scores were used to threshold sequence similarity instead of E-values, in order
to control for comparable evolutionary depth across different families. The precise details
of how the multiple sequence alignments were generated is described in the supplementary
information of (Riesselman et al., 2018). For this project I used the exact multiple sequence
alignments that were provided in the supplementary information of (Riesselman et al., 2018)
as FASTA files in the a2m format. To compare the quality of the information-theoretic
approach, I use the provided mutation data in (Riesselman et al., 2018) containing data for
363 total mutations at various sites. The data also include the prediction data for the existing
approaches, i.e., those taking into account the frequencies of occurrences of amino acids at
single sites (Araya et al., 2012), the approach in (Hopf et al., 2017) taking into account
epistatic effects of pairs of amino acids, and the deep learning approach (Riesselman et al.,
2018) that tries to take into account all correlations in the training data simultaneously.
    There are some significant advantages to using the information-theoretic approach to
predict the effect of mutations:
    This technique is computationally cheap, both in terms of time and memory. The time
complexity of this approach scales polynomially with an increase in the order of correlation.
The space (memory) complexity is more of a concern, as it scales polynomially in terms of
both the length of the sequences and the alphabet size. A number of implementation opti-
mizations, such as using more efficient data structures can potentially reduce the memory
usage. One of the other possible techniques is to store only a subset of the entries in the
position weight matrices that correspond to frequently seen symbol-site pairs, and recom-
puting the infrequently seen entries only when they are needed by the information scoring
function. Approaches that optimize for space will often result in additional time being spent
on computation, and the optimal space-time trade-off will depend on the particular data set
being analyzed.
    This approach is based in fundamental theories that have been developed a while ago,
and are fully understood, allowing a better understanding of the information that is actually
                                               78


being learned from the training data. This is especially an improvement over the deep-
learning approaches, which have notoriously been ”black-box” systems, where it is not clear
what information is being learned from the training data.
    I can precisely choose which level of correlations I want to use to predict mutation effects.
While this is certainly true of the approaches that precisely use single, and pairwise correla-
tions, I can use the same approach in all cases, by simply adjusting the level of correlation
that I am interested in. Current models take into account either single site information,
or pairwise correlations, and in the case of deep-learning models all correlations are taken
into account. In my method, I can adjust the order of correlations precisely, using the same
method in all cases.
    The information-theoretic approach extracts exactly the information about function that
is encoded in the sequences, unlike other approaches that will also extract the noise in
the sequence ensemble. Alternative approaches inherently are unable to distinguish between
signal and noise in the data, and additional effort has to be undertaken to compensate for this.
One common adjustment that is employed is to take into account data from other multiple
sequence alignments. There are various kinds of additional information that are incorporated,
such as the a-priori likelihood of a particular amino acid substitution, which are provided by
PAM (point-accepted-mutation) and BLOSUM (BLOcks SUbstitution Matrix) matrices. In
the information-theoretic approach I describe here, the only training data used is the multiple
sequence alignment for the protein family under consideration, and this approach is still able
to extract the information about mutation effects with an accuracy that outperforms the best
deep learning models that are currently known.
5.4      Methods
    To perform the prediction of mutation effects, I used the same tool that was used to
analyze the length-9 self-replicator landscape described in Chapter 4. The initial training
data set was provided in (Riesselman et al., 2018) as a multiple sequence alignment in a2m
FASTA format. The sequences are aligned relative to the wild-type sequence, with insertions
                                                79


represented with the ‘-’ symbol. Since the mutations are relative to the wild-type sequence,
it does not matter whether the insertions are considered when generating the position weight
matrices. The residues represented with lower-case alphabets were also ignored, and were
also ignored by the other techniques used to predict mutation effects. The experimentally
observed mutation effects were also provided, which I used to evaluate the efficacy of the
information-theoretic approach. In addition, the (Riesselman et al., 2018) paper also included
the predicted effects of each mutation for the single-site, pairwise, and deep learning models,
which I used as a comparison to my technique. The quality of the predictions was computed
using the Spearman Rank Correlation coefficient between the model predictions and the
experimentally observed effects, and I used the same correlation coefficient for my analysis
as a one-to-one comparison. For the deep-learning approach, the individual mutation effect
data was not provided, but the Spearman Rank Correlation was provided. The scatter plots
for the single-site and pairwise models are shown in Fig 5.1. The values of the experimental
data on the y-axis of the scatter plots is the function of the mutated sequence divided by
the function of the wild-type sequence, which indicates that values above 1 are beneficial
mutations while values below 1 are deleterious mutations. Similarly, on the x-axis the values
plotted are the predicted function of the mutated sequence subtracted by the wild-type
sequence, which indicates that values above 0 are predicted to be beneficial mutations while
values below 0 are predicted to be deleterious mutations. The Spearman Rank Correlation
coefficient is 0.582 for the independent model, and 0.565 for pairwise model, with a p-value
≤ 2.2e−16 in both cases. Note that the prediction accuracy of the independent model slightly
outperforms the prediction accuracy of the pairwise model in this data set, though in general
the independent model tends to have a lower prediction accuracy than the pairwise model.
5.4.1       Baseline scoring function
    The first approach I used was exactly as that described in Chapter 4. I scanned the mul-
tiple sequence alignment for the YAP-1 WW domain proteins, and generated an information
score matrix, where the entries in the matrix (for the first order correlations) are
                                              80


                     3                                                                3
 Experimental Data                                                Experimental Data
                     2                                                                2
                     1                                                                1
                     0
                                                                                      0
                     −12       −8        −4        0                                              −8           −4          0
                                    Independent                                                    Pairwise (EVmutation)
Figure 5.1: Scatter plot showing the correlation between the predicted mutation effects,
using only single site information (top panel), and using the correlations between pairs of
amino acids (using EVmutation) (bottom panel). x-axis is the predicted mutation effect,
and the y-axis is the experimentally observed mutation effect. The blue line is the best
linear fit to the data. The horizontal line intercepting the y-axis at 1 indicates the cutoff
above which the mutations are beneficial, and the vertical line intercepting the x-axis at
0 indicates the cutoff above which the mutation effect is predicted to be beneficial. The
Spearman Rank Correlation coefficient is 0.582, and 0.565 for the independent and pairwise
methods respectively, with a p-value ≤ 2.2e−16 for both.
                                                                                         
                                                     (1)         pi (sa )
                                                    Ria    = log                              ,                                (5.1)
                                                                  1/D
where pi (sa ) is the probability of seeing the symbol sa at site position i.The information
score matrices for higher orders of correlation are produced similarly.
                     This information score matrix was then used to score each individual mutation. For each
specified mutation, the wild-type was mutated to the amino acid at the given site, and I
scored the mutated sequence using the scoring functions for the first 3 orders of correlations.
                                                               81


                                               X L
                                   D1 (⃗s) =        R(si ) ,                              (5.2)
                                                i=1
                                               X L
                                   D2 (⃗s) =        R(si sj ) ,                           (5.3)
                                                i<j
                                                 XL
                                   D3 (⃗s) =         R(si sj sk )                         (5.4)
                                               i<j<k
    Unlike in the synthetic data set used in Chapter 4, here I only evaluate up to 3rd order,
due to memory constraints. Using these scoring functions, I generated scatter plots where
the predicted score is plotted against the experimental data, exactly as in Fig 5.1. The
scatter plots for correlation order of 1, 2, and 3 are shown in Fig 5.2. Note that the absolute
predicted value increases as more orders of correlation are taken into account due to the
increasing number of terms that are added when computing the information score. However,
the predicted function is relative to the function of the wild-type so the absolute values are
not relevant to the prediction accuracy. The Spearman Rank Correlation coefficient for 1st
order is 0.578, for 2nd order is 0.581 and for 3rd order is 0.584, with a p-value ≤ 2.2e−16 in
all cases. Clearly the prediction accuracy of the information-theoretic model improves as
additional orders of correlation are taken into account.
5.4.2      The effect of pseudocounts
    When computing the information score for a given sequence, a particular symbol at a
particular position in the sequence may have never been observed in the multiple sequence
alignment. Due to the definition of the position weight matrix, this implies a zero count
for that symbol-position pair. When computing the information score, I take the log of the
corresponding entry in the position weight matrix which is undefined when the entry is 0. To
account for this, I have to choose some pseudocount to be added to every entry. While the
choice of pseudocounts is somewhat arbitrary, the precise choice of pseudocount can affect
the quality of the predictions generated by the information scores. Adding the pseudocount
                                                 82


                     3                                                                                3
 Experimental Data                                                                Experimental Data
                     2                                                                                2
                     1                                                                                1
                     0                                                                        0
                         −4   −3   −2                           −1   0       1               −120         −80   −40   0
                                        D1                                                                      D2
                                                            3
                                        Experimental Data
                                                            2
                                                            1
                                          0
                                        −1500                        −1000        −500                     0
                                                                                  D3
Figure 5.2: Scatter plot showing the correlation between the predicted mutation effects, using
the information score for 1st order (top left panel), 2nd order (top right panel), and 3rd order
(bottom panel). x-axis is the predicted mutation effect, and the y-axis is the experimentally
observed mutation effect. The blue line is the best linear fit to the data. The horizontal line
intercepting the y-axis at 1 indicates the cutoff above which the mutations are beneficial, and
the vertical line intercepting the x-axis at 0 indicates the cutoff above which the mutation
effect is predicted to be beneficial. The Spearman Rank Correlation coefficient for 1st order
is 0.578, for 2nd order is 0.581 and for 3rd order is 0.584, with a p-value ≤ 2.2e−16 in all cases.
                                                                                 83


values to each entry in the position weight matrix, and normalizing it so that the sum in
each row is 1, gives
                                                    na + c
                                       pi (sa ) =                                        (5.5)
                                                  Ne + Dc
where c is the pseudocount, na is the count, Ne is the number of sequences in the alignment,
and D is the size of the alphabet. The information score matrix computation stays the same
as before.
    I did a sweep of different pseudocounts from 10−9 to 100 to see how it affected the infor-
mation scores for the different orders of correlations being used to generate the information
scores. The results are shown in Fig 5.3 and show that the Spearman Rank Correlation
coefficient is largest for 10−3 (at least for third order correlations), and decreases as the
pseudocount decreases. The best pseudocount value depends on the order of correlation
being considered, with different peaks for different orders.
5.4.3      Accounting for shared ancestry
    The sequences provided in the multiple sequence alignment could be biased if they share
a recent common ancestor (shared evolutionary descent). Common descent could lead to
the presence of motifs (sets of potentially non-adjacent symbols) that appear to be over-
represented and therefore incorrectly appear to be information. The information-theoretic
tool assumes that the sequences in the multiple sequence alignment are in mutation-selection
balance, which is an assumption that doesn’t hold when the sequences share a recent common
ancestor. The information-theoretic tool will interpret these motifs as being a result of
information encoding rather than the result of shared ancestry. To account for this, I weight
each of the sequences in the alignment according to their similarity to other sequences. This
means that motifs that are overly abundant due to shared ancestry, as opposed to conferring
functionality, are not allowed to dominate the entries in the position weight matrices. There
are a number of different techniques that can be used to adjust the weights. Here, I used
a technique that first computes the Hamming distance D between every pair of sequences
in the alignment. Then, I use a cutoff threshold ρ to count how many other sequences are
                                                 84


                          0.64
          Spearman Rank
                          0.62
                                                                                                      Order
                                                                                                          D1
                                                                                                          D2
                                                                                                          D3
                          0.60
                          0.58
                                 100   10−1   10−2   10−3   10−4   10−5   10−6   10−7   10−8   10−9
                                                            Pseudo−Count
Figure 5.3: A comparison of the effects of varying pseudocount values on predictive quality
(as given by the Spearman Rank Correlation coefficient calculated by comparing the pre-
dicted values against experimental data). For each value of the pseudocount (x-axis), the
prediction quality (y-axis) is shown for different orders of correlation (1st order is in red, 2nd
order in green, and 3rd order in blue). In all cases, the p-value ≤ 2.2e−16 for the correlation
coefficient.
                                                                   85


within ρ percentage Hamming similarity. The inverse of this number is then used as the
weight of that sequence. This gives each sequence a new weight w (as opposed to the unit
weight for each sequence used when computing the position weight matrix.)
                                                              !−1
                                           X
                                w(s) =         D(s, si ) < Lρ                                (5.6)
                                            si
where L is the length of the sequence. Now every entry in the position weight matrix is
adjusted as
                                                w(s)na + c
                                  pi (sa ) = P                                               (5.7)
                                               si (w(si )) + Dc
where the denominator is the sum over the weights of every sequence in the alignment, in
order to normalize the rows to 1. The information score matrix is computed as before.
The choice of the similarity threshold ρ can have a significant effect on the quality of the
predictions, although for this particular data set the effect is not as apparent. The Spearman
Rank Correlation coefficient for different values of the cutoff are shown in Fig 5.4 for different
values of ρ.
5.4.4      Dropout regularization
    A common approach in many of the existing models is regularization, where the observa-
tions of symbols in the multiple sequence alignment are adjusted to neglect the contributions
of very uncommon residues (Cortes et al., 2012). The effect of this adjustment is to not have
the model be significantly affected by noise in the data set, and has yielded good results for
learning the data. There are various ways this can be done, e.g., by squaring the individual
contributions of each site to the model, thereby diminishing the effect of residues that occur
infrequently, while maintaining the contributions of relatively common residues (commonly
referred to as L2 regularization). Here, I implement this idea by a straightforward cutoff
based on the probability of a residue occurring at a particular site, which essentially means
that residues occurring with less than θ proportion in the alignment is treated as a 0 entry
in the position weight matrix. Of course, the pseudocount described in the previous section
                                                86


                          0.610
                          0.605
          Spearman Rank
                          0.600                                                          Order
                                                                                             D1
                                                                                             D2
                                                                                             D3
                          0.595
                          0.590
                          0.585
                                  90   91   92   93    94    95    96    97    98   99
                                             Similarity Threshold Percentage
Figure 5.4: A comparison of the effects of varying the similarity threshold ρ on predictive
quality (as given by the Spearman Rank Correlation coefficient calculated by comparing
the predicted values against experimental data). For each value of the similarity threshold
(x-axis), the prediction quality (y-axis) is shown for different orders of correlation (1st order
is in red, 2nd order in green, and 3rd order in blue). In all cases, the p-value ≤ 2.2e−16 for
the correlation coefficient.
                                                            87


                          0.650
                          0.625
          Spearman Rank
                                                                                       Order
                                                                                           D1
                          0.600                                                            D2
                                                                                           D3
                          0.575
                                  0.01   0.05      0.1     0.5      1         5   10
                                                Cutoff Threshold Percentage
Figure 5.5: A comparison of the effects of varying the cutoff threshold ρ on predictive
quality (as given by the Spearman Rank Correlation coefficient calculated by comparing the
predicted values against experimental data). For each value of the cutoff threshold (x-axis)
, the prediction quality (y-axis) is shown for different orders of correlation (1st order is in
red, 2nd order in green, and 3rd order in blue). In all cases, the p-value ≤ 2.2e−16 for the
correlation coefficient.
is still used to allow for the log to be computed when calculating the information score. This
form of regularization ended up having a significant effect for certain values of θ, and in
fact, resulted in a prediction accuracy on par with the deep-learning approach employed in
(Riesselman et al., 2018). The effects of this regularization for different values of the cutoff
threshold on the different degrees of correlation are shown in Fig 5.5.
5.5     Results
   The different techniques described in the previous subsections were combined to generate
position weight matrices and predict the mutation effects for the binding ability of the
hYAP65 WW domain. I compared this against the state of the art results for single-site,
pairwise, and deep-learning approaches, shown in Fig 5.6.
   As can be seen, the information-theoretic approach developed in here performs extremely
                                                            88


                         0.68
         Spearman Rank
                         0.64
                         0.60
                         0.56
                                1   2      3             t            e             ng
                                D   D     D            nd         is
                                                         en      w                  ni
                                                                                   ar
                                                   pe         Pa               le
                                                In              ir−           p−
                                                  de                      ee
                                                                          D
                                               Method
Figure 5.6: A comparison of the predictive quality (as given by the Spearman Rank Corre-
lation coefficient calculated by comparing the predicted values against experimental data)
for the information score technique (0.599 for 1st order, 0.626 for 2nd order, and 0.644 for
3rd order), as well as the independent (Araya et al., 2012), pair-wise (Hopf et al., 2017),
and deep-learning (Riesselman et al., 2018) models. The pseudocount is 10−3 , the similarity
threshold is 99%, and the cutoff threshold is 0.5%. In all cases, the p-value ≤ 2.2e−16 for
the correlation coefficient values plotted.
                                              89


well in comparison to the current independent and pairwise approaches, and even outper-
forms the deep learning technique. This was achieved using a model that is relatively much
simpler, and much more understandable, at least relative to the deep learning approach, that
is notorious for being hard to comprehend, where what is being learned by the model is diffi-
cult to pin down. In the information-theoretic approach I use here, what is being “learned”
is pretty much by definition the information about function that is encoded by a sequence,
which after all is precisely what we want to know about any biomolecular sequence. As the
equation to compute the information score shows, there is no learning phase, other than the
construction of the position weight matrices and information score matrices. In addition,
there are many potential improvements that can be made to improve the predictive quality
of the information score approach, in particular ones that take into account the specifics of
the data set being learned. Obviously, simply increasing the correlation order is straightfor-
ward, but is not going to yield better predictive quality unless there are, in fact, higher order
correlations in the data. This again will depend on the specific data that is being analyzed,
as well as requiring the training data set to be large enough to capture these correlations to
a reasonable degree.
    Overall, the results in this chapter demonstrate clearly that the information-theoretic
approach of scoring sequences is extremely powerful, and can be used to solve problems
that are of vital importance. Within biology there are certainly many applications of this
tool, e.g., predicting the efficacy of drug treatments, predicting new variants of diseases, etc.
One extension of this technique would be to classify neuronal activity data. While neuronal
activity is not inherently linear, data such as neuronal firing patters can be represented as
linear sequences in a format that is suitable to the information-theoretic approach. Another
potential extension of this technique is the generation of new sequences with the desired
property, by generating sequences with high information content (where this information
is derived from some training data). While I have not investigated this avenue in much
depth, preliminary investigations suggest this is a viable strategy. At the very least, I have
                                                90


a cheap and efficient way of scoring potential sequences, which is the basic building block
of a generative algorithm. There may well be applications of this tool to problems outside
of biology, e.g. classifying images, developing recommender systems, to name just a few. In
these domains, there is some information encoded in the data, although the efficacy of the
information-theoretic technique to these domains will need to be validated.
                                               91


                                          Chapter 6
                                         Conclusion
    The work I have presented in this dissertation illustrates the idea that conducting ex-
periments in silico is an extremely powerful approach to understanding the structure and
function of different complex systems, as it allows for rapid investigation of hypotheses by
abstracting away the irrelevant details of the system being studied. Pruning the hypothesis
space in this fashion can yield insights into techniques that can then be tested on natural
systems. Every additional insight into the dynamics of the natural system can then be fed
back into the computational model, allowing for an iterative workflow for scientific research.
6.1      Contributions
    The work in this dissertation covers projects that investigate the structure of evolved cog-
nitive architectures, the structure of a primordial self-replicator landscape, the development
of an information-theoretic approach to classifying individual sequences, and a validation of
this technique on a real world data set.
    In chapter 2, I investigate how the structure and function of a cognitive architecture,
specifically Markov Brains, is affected by evolving in an environment that transitions between
different, but related environmental regimes. I develop metrics that can be used to classify
these evolutionary trajectories based solely on looking at the brains that have arisen as a
result of those trajectories. The work done in that project suggests avenues of research that
can be undertaken in biological brains.
    In chapter 3, I generate an exhaustive fitness landscape of self-replicators in the Avida
system. I look at the structure of this landscape, and investigate what sorts of predictions can
be made about how this landscape can affect evolution, and the role of historical contingency.
I also analyze the information content of this landscape, and look at how the information is
encoded across its components.
    In chapter 4, I develop a novel approach that extends classic information theory by
allowing the computation of information of individual sequences. I test this technique on
                                               92


the primordial self-replicator landscape, and show that the information scores of individual
sequences can be used to classify self-replicators with a very high degree of accuracy, and is
able to do so with very low computational overhead.
    In chapter 5, I validate the information-theoretic approach on a real world data set
containing information about the effects of mutations on the ability of a protein to bind to
its polyproline peptide ligand, and show that the information-theoretic approach not only
works, but has a prediction accuracy on par with the state-of-the-art techniques, including
deep learning techniques.
6.2       Future Directions
    Each of the projects described in this dissertation can be extended in multiple ways,
depending on which aspect of the systems one wants to investigate.
    In the case of the evolved cognitive architecture, one avenue of exploration is a charac-
terization of the foraging strategies evolved in different environmental regimes. This charac-
terization can take various forms, e.g., comparing the distances travelled in the environment
versus the total area of the environment that is captured in the agent’s sensors. Agents have
an incentive to minimize the distance travelled, as this requires additional updates, which is
a limited resource. At the same time, there is an incentive to explore as much of the environ-
ment as possible in order to find the food that agents need to consume. How this trade-off
is optimized will depend on both the sensory modalities available to the agent, as well as
the evolutionary history of the agent with respect to the different environmental regimes it
has evolved in. An analysis of the foraging strategies should yield a characterization of both
these factors.
    For the complete fitness landscape of primordial self-replicators, one obvious (though
computationally expensive) extension would be to generate a complete landscape of length 10
replicators. This additional data point would be much larger than the previous data sets, and
could itself reveal properties of the structure of these landscapes. Even the length 9 landscape
is sufficiently large that many graph theoretic analyses, such as network connectivity, can
                                                 93


be investigated, which would yield additional insight into the structure of these primordial
landscapes.
    The information-theoretic approach to function prediction can be improved in many ways.
As the results on predicting mutation effects clearly demonstrate, accounting for details
of the ensemble being investigated, such as shared ancestry could improve the prediction
accuracy of this approach. Obviously, the practical applications of this approach are nearly
unlimited, and applying this approach to different data sets could be complete projects
in and of themselves. Fundamentally, all that is required is for the training data to be
representable as sequences, where the sequences themselves encode the information necessary
for the function they perform. I am confident that this approach can be used successfully
in a wide variety of domains, and radically change how the information content of complex
systems is extracted, as well as how function is predicted from sequences.
                                             94


                                     BIBLIOGRAPHY
Adami, C. (1998). Introduction to Artificial Life. Springer Verlag, New York.
Adami, C. (2004). Information theory in molecular biology. Phys. Life Rev., 1:3–22.
Adami, C. (2006). Digital genetics: Unravelling the genetic basis of evolution. Nature
  Reviews Genetics, 7(2):109–118.
Adami, C. (2012). The use of information theory in evolutionary biology. Ann. N.Y. Acad.
  Sci., 1256:49–65.
Adami, C. (2015). Information-theoretic considerations concerning the origin of life. Origins
  of Life and Evolution of Biospheres, 45:9439.
Adami, C. and Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D,
  137:62–69.
Adami, C. and CG, N. (2022). Emergence of functional information from multivariate cor-
  relations. Philosophical Transactions of the Royal Society A, 380(2227):20210250.
Adami, C. and LaBar, T. (2017a). From Entropy to Information: Biased Typewriters and
  the Origin of Life. In Walker, S. I., Davies, P. C. W., and Ellis, G. F. R., editors, From
  Matter to Life: Information and Causality, pages 130–154. Cambridge University Press,
  Cambridge, MA. arXiv:1511.05548.
Adami, C. and LaBar, T. (2017b). From entropy to information: Biased typewriters and the
  origin of life. In Walker, S., Davies, P., and Ellis, G., editors, Information and Causality:
  From Matter to Life, pages 95–112. Cambridge University Press, Cambridge, MA.
Adami, C., Ofria, C., and Collier, T. C. (2000). Evolution of biological complexity. Proceed-
  ings of the National Academy of Sciences, 97(9):4463–4468.
Adami, C., Schossau, J., and Hintze, A. (2016). The reasonable effectiveness of agent-
  based simulations in evolutionary game theory: Reply to comments on” evolutionary
  game theory using agent-based methods”. Physics of Life Reviews, 19:38–42.
Aguirre, J., Buldú, J. M., Stich, M., and Manrubia, S. C. (2011). Topological structure of
  the space of phenotypes: the case of rna neutral networks. PLoS One, 6(10):e26324.
Araya, C. L., Fowler, D. M., Chen, W., Muniez, I., Kelly, J. W., and Fields, S. (2012). A
  fundamental protein property, thermodynamic stability, revealed solely from large-scale
  measurements of protein function. Proceedings of the National Academy of Sciences,
  109(42):16858–16863.
Berg, O. G. and von Hippel, P. H. (1987). Selection of DNA binding sites by regulatory
  proteins. Statistical-mechanical theory and application to operators and promoters. J.
  Mol. Biol., 193:723–750.
                                               95


Berg, O. G. and von Hippel, P. H. (1988). Selection of DNA binding sites by regulatory
  proteins II. The binding specificty of cyclic AMP receptor protein to recognition sites. J.
  Mol. Biol., 200:709–723.
Biswas, A., Haldane, A., and Levy, R. M. (2021). The role of epistasis in determining the
  fitness landscape of HIV proteins. BioRxiv 448646.
Blount, Z., Borland, C., and Lenski, R. (2008). Historical contingency and the evolution
  of a key innovation in an experimental population of escherichia coli. Proceedings of the
  National Academy of Sciences of the United States of America, 105(23):7899–7906.
Bohm, C., CG, N., and Hintze, A. (2017). MABE (modular agent based evolver): A frame-
  work for digital evolution research. Proceedings of the European Conference of Artificial
  Life.
Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C., and Kondrashov, F. A. (2012).
  Epistasis as the primary factor in molecular evolution. Nature, 490(7421):535–538.
Brown, C. T. and Callan Jr., C. G. (2004). Evolutionary comparisons suggest many novel
  cAMP response protein binding sites in Escherichia coli. Proc. Natl. Acad. Sci. USA,
  101:2404–2409.
C G, N. and Adami, C. (2021). Information-theoretic characterization of the complete
  genotype-phenotype map of a complex pre-biotic world. Phys Life Rev, 38:111–114.
C G, N., LaBar, T., Hintze, A., and Adami, C. (2017). Origin of life in a digital microcosm.
  Philos Trans R Soc Lond A, 375:20160350.
Carothers, J. M., Oestreich, S. C., Davis, J. H., and Szostak, J. W. (2004). Informa-
  tional complexity and functional activity of RNA structures. J. American Chem. Society,
  126:5130–5137.
CG, N., LaBar, T., Hintze, A., and Adami, C. (2017). Origin of life in a digital microcosm.
  Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineer-
  ing Sciences, 375(2109):20160350.
CG, N., Lundrigan, B., Smale, L., and Hintze, A. (2018). The effect of periodic changes in the
  fitness landscape on brain structure and function. In ALIFE 2018: The 2018 Conference
  on Artificial Life, pages 469–476. MIT Press.
Chow, S. S., Wilke, C. O., Ofria, C., Lenski, R. E., and Adami, C. (2004). Adaptive radiation
  from resource competition in digital organisms. Science, 305(5680):84–86.
Clifford, J. and Adami, C. (2015). Discovery and information-theoretic characterization of
  transcription factor binding sites that act cooperatively. Phys Biol, 12:056004.
Cortes, C., Mohri, M., and Rostamizadeh, A. (2012). L2 regularization for learning kernels.
  arXiv preprint arXiv:1205.2653.
                                               96


Covert, A. W., Lenski, R. E., Wilke, C. O., and Ofria, C. (2013). Experiments on the role of
  deleterious mutations as stepping stones in adaptive evolution. Proceedings of the National
  Academy of Sciences, 110(34):E3171–E3178.
Cronin, L. and Walker, S. I. (2016). Beyond prebiotic chemistry. Science, 352(6290):1174–
  1175.
Currin, A., Swainston, N., Day, P. J., and Kell, D. B. (2015). Synthetic biology for the di-
  rected evolution of protein biocatalysts: navigating sequence space intelligently. Chemical
  Society Reviews, 44(5):1172–1239.
Davies, P. C. and Walker, S. I. (2016). The hidden simplicity of biology. Reports on Progress
  in Physics, 79(10):102601.
Dorn, E. D., Nealson, K. H., and Adami, C. (2011). Monomer Abundance Distribution
  Patterns as a Universal Biosignature: Examples from Terrestrial and Digital Life. Journal
  of Molecular Evolution, 72(3):283–295.
Edlund, J. A., Chaumont, N., Hintze, A., Koch, C., Tononi, G., and Adami, C. (2011). Inte-
  grated information increases with fitness in the evolution of animats. PLoS computational
  biology, 7(10):e1002236.
Ellefsen, K. O., Mouret, J.-B., and Clune, J. (2015). Neural modularity helps organisms
  evolve to learn new skills without forgetting old skills. PLoS Comput Biol, 11(4):e1004128.
England, J. L. (2013). Statistical physics of self-replication. The Journal of Chemical Physics,
  139(12):121923.
Fano, R. M. (1961). Transmission of Information: A Statistical Theory of Communication.
  MIT Press and John Wiley, New York and London.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27:861–
  874.
Ferguson, A. L., Mann, J. K., Omarjee, S., Ndung’u, T., Walker, B. D., and Chakraborty,
  A. K. (2013). Translating HIV sequences into quantitative fitness landscapes predicts viral
  vulnerabilities for rational immunogen design. Immunity, 38(3):606–17.
Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O., and Weigt, M. (2016). Coevolutionary
  landscape inference and the context-dependence of mutations in beta-lactamase tem-1.
  Molecular biology and evolution, 33(1):268–280.
Fowler, D. M. and Fields, S. (2014). Deep mutational scanning: a new style of protein
  science. Nature methods, 11(8):801–807.
Gansner, E. R. and North, S. C. (2000). An open graph visualization system and its appli-
  cations to software engineering. Software Practice and Experience, 30(11):1203–1233.
Gavrilets, S. (1997). Evolution and speciation on holey adaptive landscapes. Trends in
  Ecology & Evolution, 12(8):307–312.
                                                97


Gerkema, M. P., Davies, W. I., Foster, R. G., Menaker, M., and Hut, R. A. (2013). The
  nocturnal bottleneck and the evolution of activity patterns in mammals. Proceedings of
  the Royal Society of London B: Biological Sciences, 280(1765):20130508.
Gerrish, P. J. and Lenski, R. E. (1998). The fate of competing beneficial mutations in an
  asexual population. Genetica, 102:127–144.
Gilbert, W. (1986). Origin of life: The RNA world. Nature, 319(6055).
Goldsby, H. J., Knoester, D. B., Ofria, C., and Kerr, B. (2014). The Evolutionary Origin of
  Somatic Cells under the Dirty Work Hypothesis. PLoS Biology, 12(5):e1001858.
Gould, S. J. (1990). Wonderful Life: the Burgess Shale and the Nature of History. WW
  Norton & Company.
Greenbaum, B. and Pargellis, A. (2017). Self-replicators emerge from a self-organizing pre-
  biotic computer world. Artificial Life, 23.
Gupta, A. and Adami, C. (2016). Strong selection significantly increases epistatic interactions
  in the long-term evolution of a protein. PLoS Genet, 12:e1005960.
Gupta, A., LaBar, T., Miyagi, M., and Adami, C. (2016). Evolution of Genome Size in
  Asexual Digital Organisms. Scientific Reports, 6:25786.
Hazen, R. M., Griffin, P. L., Carothers, J. M., and Szostak, J. W. (2007). Functional
  information and the emergence of biocomplexity. Proc. Natl. Acad. Sci. USA, 104:8574–
  8581.
Hinkley, T., Martins, J., Chappey, C., Haddad, M., Stawiski, E., Whitcomb, J. M., Petropou-
  los, C. J., and Bonhoeffer, S. (2011). A systems analysis of mutational effects in HIV-1
  protease and reverse transcriptase. Nat Genet, 43:487–9.
Hintze, A. and Adami, C. (2008). Evolution of complex modular biological networks. PLoS
  Comput Biol, 4(2):e23.
Hintze, A. and Adami, C. (2010). Modularity and anti-modularity in networks with arbitrary
  degree distribution. Biology direct, 5(1):32.
Hintze, A., Edlund, J. A., Olson, R. S., Knoester, D. B., Schossau, J., Albantakis, L.,
  Tehrani-Saleh, A., Kvam, P., Sheneman, L., Goldsby, H., et al. (2017). Markov brains: A
  technical introduction. arXiv preprint arXiv:1709.05601.
Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C.,
  and Marks, D. S. (2017). Mutation effects predicted from sequence co-variation. Nature
  biotechnology, 35(2):128–135.
Hunter, J. D. et al. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science
  and Engineering, 9(3):90–95.
                                              98


Hüsken, M., Igel, C., and Toussaint, M. (2002). Task-dependent evolution of modularity in
  neural networks. Connection Science, 14(3):219–229.
Hutton, T. J. (2002). Evolvable self-replicating molecules in an artificial chemistry. Artificial
  Life, 8:341–356.
Huynen, M. A., Stadler, P. F., and Fontana, W. (1996). Smoothness within ruggedness:
  the role of neutrality in adaptation. Proceedings of the National Academy of Sciences,
  93(1):397–401.
Jiménez, J. I., Xulvi-Brunet, R., Campbell, G. W., Turk-MacLeod, R., and Chen, I. A.
  (2013). Comprehensive experimental fitness landscape and evolutionary network for small
  RNA. Proceedings of the National Academy of Sciences, 110(37):14984–14989.
Jo, J. and Bengio, Y. (2018). Measuring the tendency of CNNs to learn surface stastistical
  regularities. arXiv:1711.11561.
Kashtan, N. and Alon, U. (2005). Spontaneous evolution of modularity and network mo-
  tifs. Proceedings of the National Academy of Sciences of the United States of America,
  102(39):13773–13778.
Kauffman, S. A. (1993). The Origins of Order: Self-Organization and Selection in Evolution.
  Oxford University Press US.
Kircher, M., Witten, D. M., Jain, P., O’roak, B. J., Cooper, G. M., and Shendure, J. (2014).
  A general framework for estimating the relative pathogenicity of human genetic variants.
  Nature genetics, 46(3):310–315.
Kouyos, R. D., von Wyl, V., Hinkley, T., Petropoulos, C. J., Haddad, M., Whitcomb, J. M.,
  Böni, J., Yerly, S., Cellerai, C., Klimkait, T., Günthard, H. F., Bonhoeffer, S., and Swiss
  HIV Cohort Study (2011). Assessing predicted HIV-1 replicative capacity in a clinical
  setting. PLoS Pathog, 7:e1002321.
Kvam, P., Cesario, J., Schossau, J., Eisthen, H., and Hintze, A. (2015). Computational
  evolution of decision-making strategies. arXiv preprint arXiv:1509.05646.
Laakso, J., Kaitala, V., and Ranta, E. (2001). How does environmental variation translate
  into biological processes? Oikos, 92(1):119–122.
LaBar, T. and Adami, C. (2016). Different Evolutionary Paths to Complexity for Small and
  Large Populations of Digital Organisms. PLoS Computational Biology, 12(12):e1005066.
LaBar, T., Adami, C., and Hintze, A. (2015). Does self-replication imply evolvability? In
  Proceedings of the European Conference on Artificial Life 2015, pages 595–602. MIT Press.
LaBar, T., Hintze, A., and Adami, C. (2016). Evolvability Tradeoffs in Emergent Digital
  Replicators. Artificial Life, 22(4):483–498.
Lapedes, A., Giraud, B., and Jarzynski, C. (2012). Using sequence alignments to predict
  protein structure and stability with high accuracy. arXiv preprint arXiv:1207.2484.
                                                99


Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T.,
  O’Donnell-Luria, A. H., Ware, J. S., Hill, A. J., Cummings, B. B., et al. (2016). Analysis
  of protein-coding genetic variation in 60,706 humans. Nature, 536(7616):285–291.
Lenski, R. E., Ofria, C., Collier, T. C., and Adami, C. (1999). Genome complexity, robustness
  and genetic interactions in digital organisms. Nature, 400(6745):661–664.
Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. (2003). The evolutionary origin of
  complex features. Nature, 423(6936):139–144.
Leslie E, O. (2004). Prebiotic Chemistry and the Origin of the RNA World. Critical Reviews
  in Biochemistry and Molecular Biology, 39(2):99–123.
Li, Y. and Wilke, C. O. (2004). Digital evolution in time-dependent fitness landscapes.
  Artificial Life, 10(2):123–134.
Maor, R., Dayan, T., Ferguson-Gow, H., and Jones, K. (2017). Temporal niche expansion
  in mammals from a nocturnal ancestor after dinosaur extinction. bioRxiv, page 123273.
Marstaller, L., Hintze, A., and Adami, C. (2013). The Evolution of Representation in Simple
  Cognitive Networks. Neural Computation, 25(8):2079–2107.
Mathis, C., Bhattacharya, T., and Walker, S. I. (2015). The Emergence of Life as a First
  Order Phase Transition. arXiv preprint arXiv:1503.02776.
Mills, D., Peterson, R., and Spiegelman, S. (1967). An Extracellular Darwinian Experiment
  with a Self-Duplicating Nucleic Acid Molecule. Proceedings of the National Academy of
  Sciences of the United States of America, 58(1):217.
Misevic, D., Ofria, C., and Lenski, R. E. (2006). Sexual reproduction reshapes the genetic
  architecture of digital organisms. Proceedings of the Royal Society of London B: Biological
  Sciences, 273(1585):457–464.
Mora, T. and Bialek, W. (2011). Are biological systems poised at criticality? J. Stat. Phys.,
  144:268–302.
Moran, P. A. P. (1958). Random processes in genetics. In Mathematical Proceedings of the
  Cambridge Philosophical Society, volume 54, pages 60–71. Cambridge Univ. Press.
Nemenman, I. (2011). Coincidences and estimation of entropies of random variables with
  large cardinalities. Entropy, 13:2013–2023.
Nemenman, I., Shafee, F., and Bialek, W. (2002). Entropy and inference, revisited. In
  Dietterich, G., Becker, S., and Ghahramani, Z., editors, Adv Neural Inf Process Syst. vol.
  14, pages 471–478.
Nguyen, A., Yosinski, J., and Clune, J. (2015). Deep neural networks are easily fooled:
  High confidence predictions for unrecognizable images. In Computer Vision and Pattern
  Recognition (CVPR ’15). IEEE.
                                               100


Nitash, C. and Adami, C. (2021). Information-theoretic characterization of the complete
  genotype-phenotype map of a complex pre-biotic world: Comment on “from genotypes to
  organisms: State-of-the-art and perspectives of a cornerstone in evolutionary dynamics”
  by susanna manrubia et al. Physics of Life Reviews, 38:111–114.
Nowogrodzki, A. (2018). The world’s strongest mri machines are pushing human imaging to
  new limits. Nature, 563(7732):24–27.
Ofria, C., Bryson, D. M., and Wilke, C. O. (2009a). Avida: a software platform for research
  in computational evolutionary biology. In Maciej Komosinski, A. A., editor, Artificial Life
  Models in Software, pages 3–35. Springer London.
Ofria, C., Bryson, D. M., and Wilke, C. O. (2009b). Avida: A software platform for research
  in computational evolutionary biology. In Komosinski, M. and Adamatzky, A., editors,
  Artificial Life Models in Software, pages 3–35. Springer London.
Ofria, C. and Wilke, C. O. (2004). Avida: A software platform for research in computational
  evolutionary biology. Artificial life, 10(2):191–229.
Olson, R. S., Hintze, A., Dyer, F. C., Knoester, D. B., and Adami, C. (2013). Predator con-
  fusion is sufficient to evolve swarming behaviour. Journal of The Royal Society Interface,
  10(85):20130305–20130305.
Østman, B., Hintze, A., and Adami, C. (2010). Critical properties of complex fitness land-
  scapes. In H. Fellerman et al., editor, Proc. 12th Intern. Conf. on Artificial Life, pages
  126–132, Cambridge, MA. MIT Press.
Ostrov, N., Landon, M., Guell, M., Kuznetsov, G., Teramoto, J., Cervantes, N., Zhou, M.,
  Singh, K., Napolitano, M. G., Moosburner, M., et al. (2016). Design, synthesis, and testing
  toward a 57-codon genome. Science, 353(6301):819–822.
Pargellis, A. (2003). Self-organizing genetic codes and the emergence of digital life. Com-
  plexity, 8(4):69–78.
Pargellis, A. and Greenbaum, B. (2016). Digital replicators emerge from a self-organizing
  prebiotic world. In Proceedings of the Artificial Life Conference 2016, pages 60–67. MIT
  Press.
Pargellis, A. N. (1996). The spontaneous generation of digital “Life”. Physica D: Nonlinear
  Phenomena, 91(1):86–96.
Pennock, R. T. (2007). Models, simulations, instantiations, and evidence: the case of digital
  evolution. Journal of Experimental & Theoretical Artificial Intelligence, 19(1):29–42.
Petrie, K. L. and Joyce, G. F. (2014). Limits of Neutral Drift: Lessons From the In Vitro
  Evolution of Two Ribozymes. Journal of Molecular Evolution, 79(3-4):75–90.
Rabiner, L. R. (1989). A tutorial on Hidden Markov models and selcted applications in
  speech recognition. Proceeding of the IEEE, 77:257–286.
                                               101


Riesselman, A. J., Ingraham, J. B., and Marks, D. S. (2018). Deep generative models of
  genetic variation capture the effects of mutations. Nature methods, 15(10):816–822.
Robertson, M. P. and Joyce, G. F. (2012). The Origins of the RNA World. Cold Spring
  Harbor perspectives in biology, 4(5):a003608.
Romero, P. A., Tran, T. M., and Abate, A. R. (2015). Dissecting enzyme function with
  microfluidic-based deep mutational scanning. Proceedings of the National Academy of
  Sciences, 112(23):7159–7164.
Roscoe, B. P., Thayer, K. M., Zeldovich, K. B., Fushman, D., and Bolon, D. N. (2013).
  Analyses of the effects of all ubiquitin point mutants on yeast growth rate. Journal of
  molecular biology, 425(8):1363–1377.
Santini, L., Rojas, D., and Donati, G. (2015). Evolving through day and night: origin and
  diversification of activity pattern in modern primates. Behavioral Ecology, 26(3):789–796.
Schlosser, G. and Wagner, G. P. (2004). Modularity in development and evolution. University
  of Chicago Press.
Schneider, T. D. (1997). Information content of individual genetic sequences. J. theor. Biol.,
  189:427–441.
Schneidman, E., Berry, 2nd, M. J., Segev, R., and Bialek, W. (2006). Weak pairwise correla-
  tions imply strongly correlated network states in a neural population. Nature, 440:1007–12.
Schossau, J., Adami, C., and Hintze, A. (2016). Information-Theoretic Neuro-Correlates
  Boost Evolution of Cognitive Systems. Entropy, 18(1):6–22.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical
  journal, 27(3):379–423.
Sim, N.-L., Kumar, P., Hu, J., Henikoff, S., Schneider, G., and Ng, P. C. (2012). Sift web
  server: predicting effects of amino acid substitutions on proteins. Nucleic acids research,
  40(W1):W452–W457.
Stormo, G. D. (2000). DNA binding sites: Representation and discovery. Bioinformatics,
  14:16–23.
Sylvester, J. (1883). Note sur le théorème de Legendre. Comptes Rendus Acad. Sci. Paris,
  96:463–465.
Szostak, J. W. (2003). Functional information: Molecular messages. Nature, 423:689.
Travis, J. and Travis, E. (2002). Mutator dynamics in fluctuating environments. Proceedings
  of the Royal Society of London B: Biological Sciences, 269(1491):591–597.
Van Der Walt, S., Colbert, S. C., and Varoquaux, G. (2011). The NumPy Array: a Structure
  for Efficient Numerical Computation. Computing in Science & Engineering, 13(2):22–30.
                                              102


Walker, S. I. (2014). Top-Down Causation and the Rise of Information in the Emergence of
  Life. Information, 5(3):424–439.
Walker, S. I. and Davies, P. C. (2013). The algorithmic origins of life. Journal of the Royal
  Society Interface, 10(79):20120869.
Werner, G. (2009). Viewing brain processes as critical state transitions across levels of orga-
  nization: neural events in cognition and consciousness, and general principles. BioSystems,
  96(1):114–119.
Wigner, E. P. (1990). The unreasonable effectiveness of mathematics in the natural sciences.
  In Mathematics and Science, pages 291–306. World Scientific.
Wilke, C. O., Ronnewinkel, C., and Martinetz, T. (2001a). Dynamic fitness landscapes in
  molecular evolution. Physics Reports, 349(5):395–446.
Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E., and Adami, C. (2001b). Evolution
  of digital organisms at high mutation rates leads to survival of the flattest. Nature,
  412(6844):331–333.
Yagi, R., Chen, L.-F., Shigesada, K., Murakami, Y., and Ito, Y. (1999). A ww domain-
  containing yes-associated protein (yap) is a novel transcriptional co-activator. The EMBO
  journal, 18(9):2551–2562.
Zaman, L., Meyer, J. R., Devangam, S., Bryson, D. M., Lenski, R. E., and Ofria, C. (2014).
  Coevolution Drives the Emergence of Complex Traits and Promotes Evolvability. PLoS
  Biology, 12(12):e1002023.
                                             103