ANALYZING BIOLOGICAL COMPLEXITY WITH DIGITAL ORGANISMS

By

Wei Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science and Engineering

2005

ABSTRACT

ANALYZING BIOLOGICAL COMPLEXITY WITH DIGITAL ORGANISMS

By Wei Huang

We define an organism's biological complexity to be the amount of information contained in that organism's genome about its environment. Not only is this an intuitive definition of complexity, but it also allows us to treat the subject in a rigorously mathematical fashion. As such, we have designed a method to measure biological complexity based on the principle of mutation-selection balance from population genetics. Our method approximates the total information in a genome as the sum of the information at each position. The information content of a given position is calculated by testing all of the possible mutations for that position and calculating the expected frequencies of potential genomes at the equilibrium state.

We use our definition of biological complexity to analyze the evolution of complex traits in digital organisms. To many, the seemingly sudden appearance of new traits seems to contradict the gradual evolutionary processes of mutation, selection, and drift, despite previous work that illustrates how complex organismal features can arise if simpler traits that can be used as building blocks are selected for. Our analysis shows that the underlying information associated with any trait evolves gradually and often results from a combination of reusing and extending information associated with simpler traits. Specifically, we demonstrate that the majority of the genomic information associated with a trait is primarily correlated with preexisting traits, or is co-opted from traits that were lost in conjunction with the appearance of the new trait.

Next, we extend the concept of complexity to the community level, where we quantitatively measure the distinct information stored anywhere in a whole community about its environment. Community complexity is a new concept that is different from the traditionally studied community diversity. We developed a measure that provides a useful approximation of community complexity, which we plan to further refine in the future. Our current measure accurately reflects that community complexity increases due to information gain, even when diversity is unaffected. It also shows that the community complexity of a multi-niche environment is only slightly higher than that of a single-niche environment when the organismal traits are identical. In such a case, the individual organisms in the single-niche environment have higher complexity, but many more species exist in the multi-niche environment. We systematically test this concept across many environment types and demonstrate its robustness.
Finally, knowing how information is stored in different organisms also tells us about the relationships among them. When new information enters a population, it is transmitted over time from parent to child. When information is shared among organisms in the final population, those organisms are likely related. Inspired by this fact, we designed a character weighting technique to improve phylogeny reconstruction accuracy. In this method, sites are weighted based on the portion of the tree being reconstructed. We target new information that is likely to have arisen at the branching point we are trying to reconstruct as the basis to weight characters. This approach lays the groundwork for a new class of top-down phylogeny reconstruction algorithms.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Charles Ofria, for his guidance and support throughout my graduate career at Michigan State University. I wish to thank him for letting me join his lab and for helping me make this thesis a reality. Charles has also been a good friend who helped me understand American culture and overcome the language barrier.

All of my committee members have provided me with helpful and thoughtful advice. I want to thank both Dr. Eric Torng and Dr. Thomas Schmidt for being my co-advisors. Dr. Torng had many detailed and insightful suggestions for each of my projects and this dissertation. Dr. Schmidt helped guide the biological direction of my interdisciplinary research. They both helped me to meet my project milestones. I would also like to thank Dr. Erik Goodman for asking many interesting questions that helped to refine my ideas.

Dehua Hang and Jason Stredwick have both been wonderful labmates and friends. They helped me significantly with my personal life and studies. Dehua provided efficient instruction and invaluable advice at the beginning of my graduate work, and Jason contributed the remarkable programming skills that made Chapter 4 of this thesis possible.

Many other people had a significant impact on my work. Dr. Richard Lenski, Dr. Elizabeth Ostrowski, Gabriel Yedid, Dr. Kristina Hillesland, and Dusan Misevic have been involved in many discussions of my projects, which helped fill in my biological background. Kaben Nanlohy and Brian Baer provided significant technical support. I also want to thank Sherri Goings, Jeffrey Clune, Matthew Rupp, Arthur Covert III, and David Bryson for their great friendship.

I would like to thank the MSU Computer Science and Engineering Department and the Quantitative Biology and Modeling Initiative for support and funding. Additionally, this project was supported by National Science Foundation grants DEB-9981397 and BIA-0219229.

My heartfelt thanks go to my family members. My parents, Ling Wang and Yunguo Huang, gave me unconditional love and a lifetime of education. My parents-in-law, Jiedong Yuan and Deming Wu, provided all kinds of support, especially helping to take care of my baby during my thesis writing. Last but not least, my deepest thanks go to my husband, Wei Wu, for all his love and support, and to our son Jason Yaheng Wu for having the sweetest smile a baby can have.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction
  1.1 Overview
  1.2 Background
    1.2.1 Biological Complexity
    1.2.2 Avida Digital Evolution Platform
    1.2.3 Shannon Information Theory
2 Measuring Complexity with Digital Organisms
  2.1 Population-Based Physical Complexity
  2.2 Limitations of Population-Based Complexity
  2.3 Our Approach
  2.4 The Technique
  2.5 Experiments and Results
  2.6 Discussion

3 On the Evolution of Complex Features
  3.1 Background
  3.2 Measuring Genomic Complexity
  3.3 Experimental System
  3.4 Experiments and Results
    3.4.1 Three sources of complexity
    3.4.2 The distribution of complexity changes from a random beneficial mutation
  3.5 Conclusions

4 Biological Complexity of Communities
  4.1 Biological Diversity
    4.1.1 Limitations of using diversity to measure complexity
  4.2 Biological Complexity of a Community
  4.3 Measuring Method
  4.4 Experimental Setup
  4.5 Results and Discussion
    4.5.1 Complexity of communities in a single niche: Environment I
    4.5.2 Complexity of communities with multiple niches: Environment II
    4.5.3 Environmental impact on community complexity: Environments III, IV and V
  4.6 Conclusions

5 Information-based Phylogeny Reconstruction
  5.1 Background
  5.2 Our Approach
    5.2.1 Identifying Informative Characters for Deep Bifurcations
    5.2.2 Reinforcing Informative Characters for Deep Bifurcations
    5.2.3 Methods
  5.3 Related Work
  5.4 Results and Discussion
    5.4.1 Results with Default Settings
    5.4.2 The Effect of Tree Size
    5.4.3 The Effect of Alphabet Size
    5.4.4 The Effect of Asymmetry
  5.5 Conclusions

6 Future Work

A Environment Setup
  A.1 Environment I
  A.2 Environment II
  A.3 Environment III

B Proof of a Nearly-Balanced Bifurcation

BIBLIOGRAPHY

LIST OF TABLES

2.1 Samples for genome sequences with all single-site mutations and the resulting fitness of each. The site being changed is marked in bold within each sequence. The original symbol is "m".

LIST OF FIGURES

1.1 Structure of virtual CPU in Avida. Images in this thesis/dissertation are presented in color.

1.2 A symmetric binary channel.

2.1 A comparison between the population-based method of calculating physical complexity (upper line) and our new method (lower line). The new method is applied to the lineage of the most abundant organism at the end of the experiment. The population size is 3600.

2.2 Mutation-selection balance at site 53 of a genome. Gray bars are mathematical predictions of instruction frequency, black bars are experimental results.
2.3 A comparison between the population-based method of calculating physical complexity (upper line) and our new method (lower line). The new method is applied to the lineage of the most abundant organism at the end of the experiment. The population sizes of the two figures are 900 and 14,400.

2.4 (a) Physical complexity over time for a lineage from an Avida experiment where genome length is allowed to change. (b) Fitness over time for the same lineage from the same experiment. (c) Fitness vs. complexity.

2.5 Organism complexity along lineage averaged over 50 trials. Upper and lower boundaries define 95% confidence interval.

3.1 Information content over the course of a typical Avida experiment. The most abundant genotype was chosen from the final population, and the information content was measured for all organisms along its line of descent.

3.2 Average information during the acquisition of EQU. 24 trials were centered on their acquisition of EQU and had the information values averaged to create this graph. The window size is 10k updates around the time when EQU is first acquired.

3.3 Histograms indicating the distribution of information in the 24 trials where organisms developed the EQU trait. (a) is the change in complexity for the whole organism when EQU first arises; (b) is the complexity unique to EQU, not shared by other traits; (c) is the change in complexity of all traits other than EQU at the time EQU arises; and (d) is the total complexity of the EQU trait, including information shared with other traits.

3.4 The distribution of complexity changes with beneficial mutations from 100,000 randomly sampled organisms in the populations (a) at 1,000 updates, (b) at 10,000 updates, (c) at 100,000 updates under a 9-task environment, and (d) at 10,000 updates of control runs. Red dashed line represents the mean complexity changes and red dotted lines are one standard deviation from the mean.

3.5 The distribution of complexity changes from a beneficial mutation along the lineages of 50 runs in the 9-task environment. Red dashed line represents the mean complexity changes and red dotted lines are one standard deviation from the mean.

4.1 Nucleotide probability distribution of two comparison sites.

4.2 Measurements in a typical single-niche community over time. Metrics used are (a) total complexity of all 50 sampled individuals, (b) community complexity, and (c) phylogenetic depth of individuals.

4.3 The results over 50 runs under Environment I. Upper and lower boundaries define 95% confidence interval. (a) average Shannon index, (b) average Simpson index, and (c) community complexity.

4.4 Measurements in a typical multi-niche community over time. Metrics used are (a) total complexity of all 50 sampled individuals, (b) community complexity, and (c) phylogenetic depth of individuals.

4.5 The results over 50 runs under Environment II. Upper and lower boundaries define 95% confidence interval. (a) average Shannon index, (b) average Simpson index, and (c) community complexity.

4.6 The results over 50 runs under Environment II. Upper and lower boundaries define 95% confidence interval. (a) average number of species and (b) average individual complexity.

4.7 The results over 50 runs under Environment III.
Upper and lower boundaries define 95% confidence interval. (a) average Shannon index and (b) community complexity.

4.8 The results over 50 runs under Environment IV. Upper and lower boundaries define 95% confidence interval. (a) average Shannon index and (b) community complexity.

4.9 The results over 50 runs under Environment V. Upper and lower boundaries define 95% confidence interval. (a) average Shannon index and (b) community complexity.

5.1 An example symmetric tree.

5.2 Our default 32-leaf symmetric tree topology.

5.3 Our asymmetric tree topology with 32 leaves.

5.4 Measurements on the default tree. The bar graph indicates the number of sequence positions out of 200 that are assigned each rating (scale on right side of figure); the solid line displays the average reconstruction accuracy of the most internal bifurcation (scale on left side of figure); the dashed lines define the 95% confidence interval around our accuracy tests.

5.5 Measurements on the tree with 16 leaves. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

5.6 Measurements on the tree with 64 leaves. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

5.7 Measurements on the tree constructed using sequences with a larger alphabet size of 20. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

5.8 Measurements on the asymmetric tree. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

Chapter 1

INTRODUCTION

1.1 Overview

Biological complexity has long been a contentious topic. Most people accept that complexity generally increases through the process of evolution, but few give an explicit definition and method of measurement. My interest is to understand the relationship between biological complexity and evolution. Is there a pervasive trend of complexity change through evolution? Is it always an uphill climb? How does new complexity arise? Is there a relationship between the complexity of an organism and its fitness in the environment? Can we measure the complexity of a whole community? Are there any applications to studying complexity? The answer to each of these questions depends on the nature of the complexity definition. Many definitions have been proposed in the past, but each of them has serious flaws, and few use a rigorously mathematical approach. This chapter provides a review of the biological complexity measures that others have used, and explores the limitations of each.
The discussion that follows details the concept of physical complexity previously developed by Adami and Cerf [2000] and refined by Adami, Ofria, and Collier [2000]. To facilitate this discussion, I will also review some concepts from Shannon Information Theory and briefly describe the Avida digital life platform, the system we will be using to examine the complexity measures discussed.

Our new approach to estimating complexity is described in Chapter 2. It is based on Adami et al.'s "physical complexity" [Adami and Cerf, 2000], which defines biological complexity as the genetic information that an organism has about its environment. We approximate the total information in a genome as the sum of the information at each position. The information content of a position is calculated by testing all the possible mutations for that position and calculating the expected frequencies of potential genomes at the equilibrium state. We discuss how this method reveals the way information is embedded in an organism during the evolutionary process, the advantages of our method, and the initial results from applying this approach.

In Chapter 3, I examine how new complexity first arises in a population. Evolutionary theory explains the origin of complex organismal features through a combination of reusing and extending information from less-complex traits. While the appearance of a new trait may seem sudden, the underlying information associated with that trait must evolve gradually. We study this process within evolving digital populations. We show that when a new complex trait first appears in the population, its proper function requires the coordinated operation of many genomic positions. However, the total information stored in the genome only increases marginally. We demonstrate that the majority of the information associated with an emerging trait is primarily correlated with pre-existing traits or is co-opted from traits that were lost in conjunction with the appearance of the new trait.

Next, in Chapter 4, we extend the concept of biological complexity to the community level. We define community complexity as the sum of all distinct information contained in the community about its environment; if multiple organisms all contain the same information, we only count that information once. We developed a measure that provides a useful approximation of community complexity, which we plan to further refine in the future. Community complexity is a new concept that is different from the traditionally studied community diversity. I reviewed two popular diversity measurements: the Shannon and Simpson diversity indices. We analyzed the complexity changes during evolution from the information perspective and compared them with diversity measurements. We designed five environments that allow us to explore environmental impact on community complexity.

Next we examine applications of understanding organism complexity. Measuring the information content of different organisms allows us to better understand the relationships between them. When new information enters a population, it is transmitted to subsequent generations and can help indicate relationships among organisms in that population. Inspired by this fact, we designed a character weighting technique to improve phylogeny reconstruction accuracy, which I present in Chapter 5.

Phylogeny reconstruction seeks to find evolutionary relationships among taxonomic groups.
We observe that sites where two distinct symbols are both highly represented are more likely to provide useful information for reconstructing deep bifurcations in the tree. Both symbols may originate from adaptive events at earlier stages of evolution. Giving high weight to these sites will decrease our uncertainty about the root branch. We demonstrate that the neighbor joining algorithm is significantly improved in reconstructing deep bifurcations if more weight is given to these sites. We further show the robustness of this technique with sustained reconstruction improvements as we vary a number of characteristics of the trees, including the number of leaves, the alphabet size used in the sequences, and the tree symmetry.

Finally, in Chapter 6 I discuss further plans for refining and extending this work. For the technique of measuring organismal complexity, I will consider epistasis and estimate the information in two or more sites together. For community complexity, I will show candidate methods that could help more accurately calculate the unique information in each organism. For the character-weighting technique, I will try to convert it to a fully functional algorithm to build a whole tree, not just deep bifurcations.

1.2 Background

1.2.1 Biological Complexity

KCS (Kolmogorov-Chaitin-Solomonoff) complexity is the most widely used complexity definition. It defines the complexity of a sequence as the shortest possible program that can generate that sequence [Li and Vitanyi, 1997]. This definition works in many intuitive cases, but it has some serious problems: some apparently complex structures, such as fractals and cellular automata, can be coded in short programs [Goertzel, 1993], while a long sequence with no pattern and no meaning (effectively random) would need a long program to generate it, one that just lists the entire sequence.

A related complexity definition is "logical depth". Bennett [1988] defines the logical depth of a sequence by the running time of the shortest program that computes it. Thus, while it overcomes some of the problems with KCS complexity, because more complex structures will typically take a while to generate, there are still problems when it comes to random sequences with no actual meaning behind them. Additionally, it cannot answer questions such as, "What is the shortest program to generate a DNA sequence?" or "What is the running time for that program?" and thus is still not an easily quantifiable definition for a biologist to use in measuring complexity.

A count of the number of "parts" in an organism is perhaps the simplest definition of complexity, as suggested by Hinegardner and Engelberg [1983]. This, of course, depends on what we recognize as parts. Hinegardner and Engelberg suggest that, at root, organisms are composed of molecules, but they do not take the differences in the complexity of those molecules into account. This definition may provide a useful approximation of complexity, but it neglects any complexity inherent in gene regulation or other connections between DNA, RNA, and proteins. Many different proteins may be synthesized from the same mRNA molecule due to manipulations of the mRNA and in the context of translation. In fact, over the last several hundred million years, most evolution has occurred only in the form of gene regulation, even though most biologists agree that there has been a huge increase in the overall complexity of organisms.
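The contrast drawn above between patterned and genuinely random sequences can be illustrated crudely in code. Compressed length is only a rough stand-in for KCS complexity (which is not actually computable), and the library and example strings below are illustrative choices, but the sketch shows why a highly patterned sequence admits a far shorter description than a random one of the same length.

```python
import random
import zlib

def description_length(sequence: str) -> int:
    # Compressed size in bytes: a crude, computable stand-in for the length of
    # the shortest program that regenerates the sequence.
    return len(zlib.compress(sequence.encode()))

patterned = "ACGT" * 250                                           # 1000 symbols, obvious pattern
random.seed(0)
scrambled = "".join(random.choice("ACGT") for _ in range(1000))    # 1000 symbols, no pattern

print(description_length(patterned))    # small: the repetition compresses away
print(description_length(scrambled))    # near the raw length: nothing to exploit
```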
Adami and Cerf [2000] developed physical complexity as a method to calculate the complexity of symbolic strings. Ofria and Collier then worked with Adami [2000] to translate this concept more directly to the study of the evolution of biological complexity. In their definition, the biological complexity of an organism is the information physically stored in the genome about the environment in which it lives. In Chapter 2, I discuss their initial method to measure physical complexity (biological complexity) before detailing my new, refined approach.

1.2.2 Avida Digital Evolution Platform

It is a commonly held belief that the complexity of species always increases during evolution. However, evolution by natural selection is a unique process in the biological world—we only have one example of it—and all known life has a genetic basis consisting of strings of nucleotides and is believed to share a single common ancestor. We do not know what evolutionary principles might govern other forms of life that scientists may discover later. The eminent biologist John Maynard Smith [1992] declared that the only way out of this quandary was to build a new form of life ourselves. "We badly need a comparative biology," he wrote. "So far, we have been able to study only one evolving system, and we cannot wait for interstellar flight to provide us with a second. If we want to discover generalizations about evolving systems, we will have to look at artificial ones."

Having a well controlled artificial system allows us to explore the importance of many historically contingent events in evolution. Random chance is believed to have played a large role in this evolutionary process, as Gould hypothesized with his thought experiment of "replaying life's tape" [Gould, 1989]. He states: "Any replay of the tape would lead evolution down a pathway radically different from the road actually taken. Each step proceeds for cause, but no finale can be specified at the start, and none would ever occur a second time in the same way, because any pathway proceeds through thousands of improbable stages. Alter any early event, ever so slightly and without apparent importance at the time, and evolution cascades into a radically different channel." This means that there might be many different ways a species could have adapted to its environment; we only see one end result. To answer whether complexity always increases down all of these pathways, we must have an experimental system in which to test it.

Avida [Ofria and Wilke, 2004] is a software platform used to perform experiments in evolutionary biology. The Avida system creates an artificial environment that maintains a population of self-replicating computer programs. These populations are subject to mutations and live in environments with limited space and resources (sources of energy) for which they must compete; therefore the organisms evolve by natural selection. A population in Avida adapts in a manner analogous to biological systems, both to maximize its replication rate and to beneficially interact with its environment. When an individual program attempts to replicate, it is subjected to random mutations that change instructions within its memory. Mutations are classified in a strictly Darwinian sense: any mutation that results in an increased ability to reproduce in a given environment is considered beneficial. Mutations causing the organism to fail to reproduce successfully are considered lethal. Neutral mutations cause no change in reproductive success.
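A minimal sketch of this classification, assuming a hypothetical fitness() helper that returns an organism's expected reproductive rate in the current environment (the helper and its name are illustrative, not part of Avida's actual interface):

```python
def classify_mutation(parent_genome, mutant_genome, fitness):
    """Classify a mutation in the strictly Darwinian sense described above.

    `fitness` is a hypothetical callable returning the expected reproductive
    rate of a genome in the current environment.
    """
    parent_w = fitness(parent_genome)
    mutant_w = fitness(mutant_genome)
    if mutant_w == 0:
        return "lethal"        # the mutant fails to reproduce at all
    if mutant_w > parent_w:
        return "beneficial"    # increased ability to reproduce
    if mutant_w < parent_w:
        return "detrimental"   # reduced, but non-zero, reproductive success
    return "neutral"           # no change in reproductive success
```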
Each organism (Figure 1.1) in an Avida population consists of a memory initialized to the genomic program, three 32-bit registers, two stacks, and input and output buffers for organisms to receive operands and return results to the environment. The genome of an organism is written in a Turing-complete programming language; that is, organisms can perform any computable mathematical function—no explicit limitations are imposed on what can be evolved. Indeed, we have witnessed a wide variety of unexpected and seemingly clever adaptations arise through evolution in Avida.

Figure 1.1: Structure of virtual CPU in Avida. Images in this thesis/dissertation are presented in color.

The phenotype of an organism corresponds to the set of computations an organism performs and related parameters such as how quickly it can perform each computation. Depending on the environment, an organism receives an energy bonus for performing specific computations. The fitness of the organism is then its total energy divided by its gestation time. Each update (the unit of time used in Avida) organisms receive a number of CPU cycles proportional to their energy. All organisms have their CPU cycles scheduled to execute their genomes in an order as close to parallel as possible.

Avida provides us with a system where the population dynamics can easily be explored and where we can trivially access the genome of any individual. Since the evolution in Avida is real, as opposed to a mere simulation, complex traits can arise on their own. Since the system also allows us to perform tests of genomes in isolation, it is a perfect choice for systematic studies of complexity, and we will revisit it throughout this thesis.

1.2.3 Shannon Information Theory

Information Theory [Shannon, 1948; Cover and Thomas, 1991] uses quantitative mathematics to formally define measures of disorder and uncertainty, which are then used, in turn, to define the information content of a message as the reduction of uncertainty attributed to another message. Information theory defines uncertainty (or entropy) as the number of bits needed to fully specify a situation, given a set of probabilities. Let X be a discrete random variable with alphabet \mathcal{X} and probability mass function p(x) = \Pr(X = x), x \in \mathcal{X}. The uncertainty of the random variable X is defined as:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)   (1.1)

Uncertainty is maximized when all probabilities are equal, that is, when we have no idea what the outcome will be. On the other hand, uncertainty is minimized when the probability of one symbol in alphabet \mathcal{X} is 1 and the probabilities of each of the other symbols are 0.

The conditional entropy of one random variable given another is defined as the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable. Specifically:

H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) H(X|Y = y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log_2 p(x|y)   (1.2)

In Information Theory, the reduction in uncertainty of one random variable due to the knowledge of another random variable is called the mutual information between the variables (also sometimes called mutual entropy):

I(X:Y) = H(X) - H(X|Y)   (1.3)

The mutual information I(X:Y) is a measure of the dependence between two random variables. If variables X and Y are independent from each other, knowledge of Y won't decrease our uncertainty about variable X.
In other words, the uncertainty about X given Y (H(X|Y)) remains the same as the uncertainty about X without knowing anything else, so the mutual information between X and Y is zero. Two variables can have a non-zero mutual information only if there is some correlation between them. The metric of mutual information is symmetric in X and Y, that is, I(X:Y) is equal to I(Y:X). Additionally, mutual information between a pair of variables is always non-negative, that is, knowledge of one variable can never, on average, increase our uncertainty about the other.

For the binary symmetric channel shown in Figure 1.2, the binary signal source inputs 0 or 1 with equal probability. The channel's output is equal to the input with probability 3/4. On the other hand, with probability 1/4, a '0' is received when '1' is the input, and vice versa. In this case, the entropy of the input X is:

H(X) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1   (1.4)

Figure 1.2: A symmetric binary channel.

Assuming we receive the output Y from the channel, the conditional entropy of the input X is calculated as:

H(X|Y) = \frac{1}{2}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{1}{2}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) = 0.8113   (1.5)

The reduction of entropy due to the output Y is the mutual information between input X and output Y, which is

I(X:Y) = H(X) - H(X|Y) = 1 - 0.8113 = 0.1887   (1.6)

Originally, information theory was used exclusively in telecommunications to maximize information transmission over a noisy channel. It is now used more frequently across many fields, including biology [Schneider, 2000; Adami et al., 2000]. In evolutionary biology, the replication from parent to offspring is thought of as an information transmission process. The information contained within a genome is about its environment and determines how the organism behaves in that environment; in particular, it determines whether or not the organism can replicate and how well it is able to survive. Mutations are responsible for the noise during replication, and the quality of the resulting message will determine if a mutation is detrimental, neutral, or occasionally even beneficial.

Chapter 2

MEASURING COMPLEXITY WITH DIGITAL ORGANISMS

Adami, Ofria, and Collier [2000] put forth an elegant definition of the biological complexity of an organism: the genetic information that an organism has about its environment. This builds upon Adami and Cerf's definition of the physical complexity of a symbolic string [2000] as well as Shannon Information Theory.

Adami, Ofria, and Collier developed a population-based method to measure biological complexity. They approximate the total information in a genome as the sum of the information at each locus. The information content of a locus is measured using information theoretic techniques on a population of organisms with the same phenotype. This population-based method for measuring the information content of a locus has inherent limitations, such as requiring a full population at equilibrium, genomes of fixed length, and an environment with only a single niche.

We have developed a significantly better method for measuring the biological complexity of an individual organism that overcomes many of these limitations. Our new method is still based on Shannon Information Theory, but we now harness the principle of mutation-selection balance from population genetics, which allows us to extrapolate from a single genome out to a full population.
Specifically, we determine the information content of each locus in the genome by testing all of the possible mutations at that position and measuring the expected frequencies of potential genes in the mutation-selection equilibrium state.

I will first introduce Adami et al.'s method in the next section and then further describe our new approach, its advantages, and initial results from applying this approach.

2.1 Population-Based Physical Complexity

Conceptually, the physical complexity of an organism is the amount of information that is stored in its genome about its environment. A genome stores information that is expressed in the functional capabilities of the organism in a given environment. Thus, the physical complexity of a genome or organism should mirror the functional capabilities or phenotype of the organism. Adami et al. use Shannon Information Theory in the following manner to relate phenotype to physical complexity.

If we know nothing about an organism, then we have maximal uncertainty about its genome (any genome is possible). On the other hand, if we know the organism's phenotype, we have less uncertainty about what the organism's genome is (only a small fraction of possible genomes corresponds to any specific phenotype). The difference between these uncertainties represents the information stored in the genome about its environment, and thus its physical complexity.

Unfortunately, it is difficult to exactly define the entropy or uncertainty of a genome given its phenotype. To approximate this uncertainty, Adami et al. proposed that most encodings of an identical phenotype would be similar to each other, typically only differing by a series of neutral mutations. Given this assumption, a large enough population where all individuals possess the same phenotype should contain the distribution of genomes necessary to calculate physical complexity. Unfortunately, it is difficult to get a "large enough" population, so Adami et al. showed that in most cases it is sufficient to calculate the entropy of a population of genomes site by site. If there is no epistasis (non-linear interactions between genome positions), this will give us the same result. This also has the beneficial side effect of identifying the sites within the genome that store the information.

To illustrate this technique, consider the case of approximating the physical complexity of a phenotype where the corresponding genome space is DNA sequences. We first need a population of genomes in an equilibrium state that all have this phenotype. We then calculate the per-site information of this population as follows: without any information about the genome, we can only assume that each of the four nucleotides has equal probability of occurring at any site i, leading to a maximum site entropy of H_{max} = \log_2 4 = 2. Let the frequencies of each nucleotide at site i within the actual population be p_C(i), p_G(i), p_A(i), p_T(i). The population entropy of this site is then

H_i = -\sum_{j}^{C,G,A,T} p_j(i) \log_2 p_j(i)   (2.1)

The information content at site i would then be:

I(i) = H_{max} - H_i   (2.2)

Finally, the physical complexity for this phenotype is approximated by applying this equation to each site and summing them together:

C = \sum_i I(i)   (2.3)

2.2 Limitations of Population-Based Complexity

In their previous study, Adami, Ofria, and Collier used the Avida platform to examine the evolution of physical complexity in digital organisms. Avida was set up in single-niche, mass-action mode.
The single-niche aspect means that the organisms are all in direct competition against each other and the species with the highest-fitness phenotype will dominate. The fact that the population is mass-action means that there is no local structure to it, so when a higher-fitness species evolves it can take over rapidly due to an exponential growth rate. Finally, the organisms were all forced to have the same genome length (100 sites) so that sequence alignment would not be necessary. During each experiment, they calculated the frequency of each instruction at each site by counting the number of organisms with that instruction at the site studied.

This population-based technique allowed for good estimates of physical complexity over time in many instances, but it suffers from a number of limitations. First, the technique only produces accurate measurements if the population has reached an equilibrium state. For example, if a beneficial mutation causes a new species (and thus new phenotype) to take over a population, all otherwise neutral sites in the genome hitchhike to fixation, and it will take time before an equilibrium is reached representing most genotypes of the new phenotype. It is often the case that a new beneficial mutation will arise before equilibrium is reached, preventing us from determining the true complexity of that phenotype. We can see these effects in Figure 2.1, the upper line of which displays the physical complexity over time using the population-based technique for a typical Avida experiment. Notice that each time complexity increases, it overshoots its mark and then gradually comes down again, typically to a higher resting level than it started at.

Figure 2.1: A comparison between the population-based method of calculating physical complexity (upper line) and our new method (lower line). The new method is applied to the lineage of the most abundant organism at the end of the experiment. The population size is 3600.

The second problem with population-based physical complexity is the constraints that must be placed upon the organisms due to computational concerns. The population size must be small (typically 10,000 organisms or fewer) and the length of their genomes is fixed to prevent alignment problems. The finite population size limits the possible range of genotypes, compounding the population diversity limitation noted earlier. If there are too many neutral sites, it would be unlikely for them all to be represented in such a limited population even at equilibrium. The fixed-length genome puts an inherent cap on complexity growth (there is only so much "blank tape" to write information into) and precludes many powerful forms of mutations such as gene duplications.

A final problem is ensuring that all organisms possess the same phenotype. This can only be achieved by using a single-niche environment, and even with a single niche, a single phenotype only occurs at equilibrium. This limits the range of interesting experiments that can be studied using population-based physical complexity, and excludes the possibility of studying ecosystem complexity. Ideally we want a process that will be able to calculate the complexity of a single organism given the rest of its environment as a constant.

2.3 Our Approach

Here, we demonstrate a new method for calculating physical complexity that has the ability to transcend the limitations listed above.
We calculate the complexity of a single genome from its local fitness landscape by testing the fitness effects of all possible single-step mutants and using the principles of mutation-selection balance to calculate the expected frequency of each mutant. Since we calculate the complexity one organism at a time, we never have to worry about overshooting the correct complexity due to a biased sample, we do not impose any genome size limitations, and we can allow the environment to vary as long as we always test an organism using the state of the environment during its lifetime. In essence, we shift the complexity measure from the phenotypic level to the genotypic level.

This method does, however, suffer from limitations of its own. First, we only consider single-step mutants. In the future we plan to refine this method by examining multiple sites at once in an attempt to decipher epistatic interactions and improve our complexity measure. Second, a significant amount of extra processing power is required to generate all possible single point mutations from a genome and to test the fitness of each. In a computer this may be feasible, but in a natural system it is nearly impossible given our current technology. We must therefore limit ourselves to applying this physical complexity measurement technique to computational systems for the moment. This is not as severe a problem as it may seem, since one of the main goals of quantifying complexity is to study its origin. In the natural world, evolution progresses too slowly to see significant changes to species in time spans shorter than centuries, so experimental macro-evolution already has this restriction placed on it. Furthermore, as we improve our techniques for calculating complexity in the digital world, we can use them to determine the quality of other complexity approximation algorithms that can more easily be applied to natural systems, in particular those that involve knockouts rather than all possible mutations.

2.4 The Technique

As our first step in calculating the physical complexity of Avida genomes, we developed a test environment that organisms can be inserted into. The test environment is initialized to the exact same conditions as the environment that the population is evolving in, but only one organism is tested at a time. That organism is processed until either it gives birth and we can measure its fitness, or else it dies of old age (indicating a zero fitness). Note that since the expected number of offspring per unit time will vary depending on local resources and competition, fitness can be highly dependent on the organism's environment. Once a test environment is constructed, any organism, including its mutants, can be placed in the test environment to determine its fitness in that exact state of the environment.

As in the previous method, we calculate the complexity of the whole genome by summing the complexities of the individual sites in that genome. To determine the complexity of site i, we start by mutating this site to all other possible states and then use the test environment to calculate the fitness of each. In the case of Avida, there are 26 instructions in the genetic alphabet, so we need to generate 25 new genomes to represent each possible mutation at site i. We then run each of the resulting 26 genomes through the test environment to determine their fitnesses. With these fitnesses and a mutation rate, we can predict the abundance of each instruction at this site, were a population at equilibrium.

Intuitively, it is clear that if a genome has equal fitness no matter which instruction is at site i, then we would expect all possible instructions to appear with about equal frequency. Further, this would translate to a maximal entropy for that site, and thus a zero complexity. On the other hand, if only the original instruction has a non-zero fitness, then we expect that instruction to dominate in an equilibrium population (the others would persist at a small frequency due to detrimental mutations creating them). In this case, the population would have a low entropy at this genomic position,
Intuitively, it is clear that if a genome has equal fitness no matter which instruction is at site 2', then we would expect all possible instructions to appear with about equal frequency. Further, this would translate to a maximal entropy for that site, and thus a zero complexity. On the other hand, if only the original instruction has a non-zero fitness, then we expect that instruction to dominate in an equilibrium population (the others would persist at a small frequency due to detrimental mutations creating them.) In this case, the population would have a low entropy at this genomic position, 17 and it would contribute maximally to complexity. It is slightly more complicated to calculate the expected abundance at sites with mixed fitness levels; our techniques are discussed below. As an example, we show a sample sub-sequence from a genome in Table 2.1 where a single site is mutated throughout. Its original state was ‘m’, but all others are tested as well and their fitnesses recorded. Sequence Fitness .. .akapbkawb j boacpbnaqblaqu. . . 0 . . .akapbkawb j bobcpbnaqblaqu. . . 6.46734 ...akapbkawbjboccpbnaqblaqu... 0 ...akapbkawbjbodcpbnaqblaqu... 5.94 . . .akapbkawb j bomcpbnaqblaqu. . . 6.46734 . . .akapbkawb j bochbnaqblaqu. . . 3.23367 Table 2.1: Samples for genome sequences with all single-site mutations and the re- sulting fitness of each. The site being changed is marked in bold within each sequence above. The original symbol is “m”. We use the mutation—selection balance principle from population genetics to take the fitness values and determine the portion of the population that we expect each genotype to fill at equilibrium. Fisher, Haldane, and Wright, pioneers of population genetics, developed mathematical models quantifying the relative importance of se- lection and mutation in maintaining genetic variation. We simplify and specialize these equations to Avida, which has populations that are asexual, haploid, and have overlapping generations, and where we only consider the possibility of site 2' mutating, since we are not considering interactions between sites. We need the following notation to define our mathematical model. Let pj denote the percentage of the population occupied by genotype 3' at equilibrium, wj the fitness of genotype j, D the size of the instruction set (in our case D = 26), and p. the per- site mutation rate. Furthermore, we assume all mutations are equally probable. The 18 average fitness is defined by D a = Z 10ka (2-4) k=1 At equilibrium, the following equation must hold: D Pj =(Pjo/w)( 1—lt)+ kzlfpkwk/WW (1/0) (2-5) In this equation, pjwj/JJ is the relative replication rate of genotype j, and 1 — ,u is the probability that genotype j replicates without mutation at site 2'. These two factors are multiplied together to give us the rate of perfect replication within genotype j. For the second part of the equation, n(1 / D) is the probability that any genotype (including j) mutates into genotype j. We then multiply this by the relative replication rate for each genotype to determine the rate that each genotype mutates to genotype j. These two factors summed together represent the rate at which genotype j enters the population. Since all organisms leave the population with equal probability, at equilibrium, the rate at which genotype j enters the population must be the same as pj, the percentage of the population occupied by genotype j at equilibrium. 
We use equation 2.4 to simplify equation 2.5 as:

p_j = (p_j w_j / \bar{w})(1 - \mu) + \mu (1/D)   (2.6)

To determine the final abundance of each of the 26 genotypes at equilibrium, we generate the 26 equations and solve them. This will always provide us with a unique solution if we bound all probabilities to be between zero and one. The solution will predict the abundance of each possible instruction at this site in an infinite population—exactly what we need. We can then calculate the physical complexity of this site and repeat this process for each other site in the genome, summing them up to determine the physical complexity of the genome as a whole.

There is only one modification that we need to make to this process to help us determine the correct physical complexity for genomes along a lineage that immediately precedes an adaptive event. It is possible that a single-step mutant will have a higher fitness and thus contain more information about the environment in its genome. We want to measure only the amount of information currently in the genome, so we treat any beneficial mutants as if they were neutral. That is, no single-step mutant is given a higher fitness than the original genome. In a population, we would expect this beneficial mutation to be selected for, but this information has not yet been incorporated into the test genome.

2.5 Experiments and Results

Our first experiments test how accurately our models predict the abundance of single-step mutants. We initiate Avida experiments with a population size of 3600 where only a single site is allowed to mutate. Given our instruction set of 26, there can only be 26 possible genotypes in the population. We then compare our predicted abundances to the observed abundances once an equilibrium is reached, as shown in Figure 2.2. We have performed over 30 such comparisons, and all performed similarly well.

Figure 2.2: Mutation-selection balance at site 53 of a genome. Gray bars are mathematical predictions of instruction frequency, black bars are experimental results.

Our second set of experiments highlights the improved accuracy of our new method when a population is not at equilibrium, compared with the previous population-based method for calculating complexity. In order to be able to calculate the population-based complexity, we perform Avida experiments with large (3600), single-niche populations with fixed-length genomes. Within each experiment, we first identify the lineage of the most abundant genotype at the end of the experiment. At each Avida update, we then compute both the population-based complexity (using all genotypes alive at that update) and our single-step mutant complexity of the current genotype on the isolated lineage. Figure 2.1 contains a plot of both complexity measures at each Avida update for one Avida experiment.

As we saw in Figure 2.1, when the population is not at equilibrium, the population-based complexity is inaccurate for many of the updates as it suffers from hitch-hiking effects. On the other hand, the proposed growth in complexity over time (with minor fluctuations) is clearly visible in the single-step mutant complexity (lower line in Figure 2.1). The minor fluctuations in complexity are expected; there are occasional decreases due to the loss of information, or the evolution of a more compressed way to code for a phenotypic trait.
Some fluctuations may also be caused by inaccurate approximations of the actual complexity due to the fact that we do not yet account for epistasis. At equilibrium, both methods provide a qualitatively similar result, though the population-based complexity measure is always higher than the single-step mutant complexity.

Our third set of experiments highlights the improved accuracy of our new method for calculating complexity when a population is at equilibrium, compared with the previous population-based method. We perform 30 sets of Avida experiments (identical to those performed in our second set of experiments) with three different population sizes: 900, 3600, and 14,400. We focus on the comparison between the computed complexities once the population reaches a final equilibrium. We sampled one trial with each population size. Figure 2.3 shows the comparison of the two methods for the populations of size 900 and 14,400, and Figure 2.1 shows the comparison for the population of size 3600. As the population size increases, we found that the population-based complexity comes closer to the single-step mutant complexity at equilibrium. The mean difference over 30 trials between complexity measures at equilibrium in the size 900 population experiments was 21.35 Complexity Units (CU), where 1 CU = 1 instruction containing maximal information. The mean difference between complexity measures at equilibrium in the size 3600 population experiments was 10.04 CU. Finally, the mean difference between complexity measures at equilibrium in the size 14,400 population experiments was a nearly negligible 1.61 CU. Clearly, as population size increases, these measures become dramatically closer.

Figure 2.3: A comparison between the population-based method of calculating physical complexity (upper line) and our new method (lower line). The new method is applied to the lineage of the most abundant organism at the end of the experiment. The population sizes of the two figures are 900 and 14,400.

These results can be easily explained. In our single-step mutant method, we simulate an infinite-sized population. The population-based method, on the other hand, uses a finite-sized population, where some single-step mutants may not be available, making entropy appear lower than it should; this, in turn, leads to an overestimate of complexity. As the population size increases, the accuracy of the population-based method should improve, and the results should approach those of the single-step mutant method. As our experiments show, this pattern holds, thus indicating our new method is more accurate than the population-based method even under optimal conditions for the population-based method.

Our final experiments demonstrate the increased flexibility of our new method for calculating physical complexity. In particular, we show that it can be applied to variable-length genomes. We perform 50 Avida experiments with a population size of 10,000 organisms where we allow the size of the genome to change. Figure 2.4(a) is the result from a sample trial. It displays our new physical complexity measure of a lineage from this experiment. There are many clear jumps in complexity over time. Figure 2.4(b) shows how fitness changes for the same lineage and Figure 2.4(c) shows how the complexity jumps correlate with fitness increases.
The correlation coefficient between the complexity and the log fitness is 0.979 for this case study; the correlation coefficient averaged over 50 trials is 0.920 with standard deviation 0.055. This suggests that we are accurately reflecting the true complexity, as at each fitness jump more information about the environment is encoded into the genome. We also observed that downward spikes occasionally occur in both graphs. Clearly the mutations corresponding to the loss of information are detrimental mutations that briefly exist along the lineage and can be an important step on the way to significant fitness improvements [Lenski et al., 2003].

Figure 2.4: (a) Physical complexity over time for a lineage from an Avida experiment where genome length is allowed to change. (b) Fitness over time for the same lineage from the same experiment. (c) Fitness vs. complexity.

We averaged physical complexity over 50 trials of population size 10,000. The complexity shows an increasing trend in Figure 2.5. According to the Natural Maxwell's Demon proposed by Adami et al. [2000], complexity should increase because natural selection "monitors" the mutations; whenever a mutation contains information about the environment, it is allowed to enter the genome. Hence, the total information in the genome increases during evolution, which also means the complexity increases.

Figure 2.5: Organism complexity along lineage averaged over 50 trials. Upper and lower boundaries define 95% confidence interval.

2.6 Discussion

We designed a point-mutation method to measure physical complexity and compared this method with the old population-based method. The new method is not limited by the population size; the population does not have to be in an equilibrium state to be measured for complexity. The method can be used to measure variable-length
It would be impossible to show that any single species must always gain in complexity, but there is much more that may be said about the community as a whole. In Chapter 4, I will discuss how we extend the technique to examine complexity at the community level.

Chapter 3
ON THE EVOLUTION OF COMPLEX FEATURES

Evolutionary theory explains the origin of complex organismal features through a combination of reusing and extending information from less-complex traits, and by only needing to exploit one of many possible pathways to a viable solution. While the appearance of a new trait may seem sudden, the underlying information associated with that trait must evolve gradually. To study this process we used digital organisms in Avida. Lenski et al. [2003] found that when a new complex trait first appears in a population of digital organisms, its proper function immediately requires the coordinated operation of many genomic positions. However, we show here that the total information stored in the genome only increases marginally. As the amount of information needed to perform a trait increases, the probability of its simultaneous introduction drops exponentially, so a truly complex trait appearing as a whole de novo is virtually impossible. We demonstrate that the majority of the information associated with a trait is primarily correlated with pre-existing traits or is co-opted from traits that were lost in conjunction with the appearance of the new trait. Thus, while total information in a genome only increases in small increments, traits that require much more information can still arise during the normal evolutionary process. In this chapter, I first introduce previous research on complex traits, then describe our method of measuring genomic complexity and the experimental system we use. Finally, I show analysis results and draw conclusions.

3.1 Background

In 1871, the zoologist St. George Mivart objected to Darwin's theory of evolution by natural selection on the grounds that, while it could optimize existing features of an organism, it could not explain the origin of entirely new traits [Mivart, 1871]. Specifically, Mivart felt that incipient forms on the way to producing useful structures served no purpose, and hence would not be supported by natural selection. Darwin had anticipated this difficulty, and in the first edition of On the Origin of Species [Darwin, 1859] noted that "it is so important to bear in mind the probability of conversion from one function to another" and, furthermore, that "different kinds of modification would serve for the same general purpose." In other words, Darwin proposed that new traits arise due to functional shifts from previously existing traits, and that even though any specific modification may be unlikely to evolve, many different pathways may all lead toward the same goal; even if the outcome we witness seems incredibly unlikely, it was only one of many possibilities. Substantial evidence has been collected indicating that complex traits can be produced through the evolutionary process, including such examples as the evolution of the eye [Salvini-Plawen and Mayr, 1977; Goldsmith, 1990; Nilsson and Pelger, 1994], the Krebs cycle [Meléndez-Hevia et al., 1996], insecticide resistance [Newcomb et al., 1997] and nutritive "milk" in the cockroach [Williford et al., 2004]. A detailed demonstration of the evolution of complex traits has even been previously performed in digital organisms [Lenski et al., 2003].
The question remains as to how information flows into the genome. If a new trait must arise entirely, or even mostly, de novo, then the probability of this trait appearing drops exponentially as the number of genomic positions it requires increases. Even considering that portions of the information required to express a new trait come from existing traits (that may be destroyed in the process), all of the unique information associated with the new trait must arise without the benefit of selection for the incipient forms. Thus a new complex trait can only arise when the majority of its information has already made it into the genome through the presence of preexisting traits, typically of lesser complexity.

3.2 Measuring Genomic Complexity

We use information theory to frame our analysis of biological complexity by measuring the amount of information associated with each trait, and how much information is shared between traits. In general, an organism can be thought of as an information channel [Ofria et al., 2003] that passes a message (its genome) to a recipient (its offspring's genome) in the presence of noise (mutations). The key difference here from traditional studies of information theory is a feedback loop: the message being passed will be used to build the next organism, which, in turn, will continue to pass the message on. Any flaws in the information transmitted may reduce the capacity of the subsequent channel. However, errors also have a small probability of being beneficial and improving the quality of the offspring. In a typical population, organisms will be subject to an approximately uniform mutation rate across their genomes. Positions that contain no information can mutate freely, while those that store information important to the organism's survival will typically be perfectly conserved. Here we use the technique from Chapter 2 to obtain the distribution of symbols at each position and estimate the amount of information stored. The sum of the information at each site determines the genomic information.

3.3 Experimental System

To study what happens to information in genomes during the evolutionary acquisition of a complex trait, we must use a system where we can isolate organisms associated with the adaptive event and then fully manipulate them to measure the information content of their genomes. This requires a level of knowledge about the state of the system that is unprecedented in research with natural organisms, but can be easily obtained with digital organisms in Avida. In all of the Avida experiments presented here, we provide unlimited resources to the organisms, and space is the only limiting factor that the organisms must compete over. Additional CPU cycles allow an organism to execute its genome more rapidly, and therefore increase its replication rate. In the default Avida environment used here, nine resources are available, each associated with a different Boolean-logic operation. The most complex of these is called EQU, which we will focus this study around. We can copy digital organisms into a separate, isolated environment to test them without influencing the course of evolution in the main population. In particular, in order to measure the amount of information contained at a genomic position, we perform every possible point mutation at that site and measure the relative fitness as compared to the unmutated wild-type. If all modifications at a position are lethal, that position clearly contains the maximum possible information.
This translates to a value of 1 Complexity Unit (CU). If all changes to the site are neutral, the site clearly contains no information (or 0 CU). For intermediate ranges, we can use the fitness values of each mutant to calculate their expected relative abundance as described in Chapter 2. We can also measure genomic information in an altered environment. For example, if we wish to measure the amount of unique information associated with the EQU trait, we calculate an organism's information content in the default environment, and then recalculate it in an environment that lacks the EQU resource. Losing the ability to perform EQU in the latter environment will not be disadvantageous, and hence mutations at sites associated with this trait will no longer be detrimental unless that site's information was shared with a secondary trait. The difference between these two measures is the amount of information uniquely associated with the EQU-metabolizing trait.

3.4 Experiments and Results

We performed a set of 50 Avida trials using a default configuration. Specifically, we used population sizes of 3600 organisms, a per-site mutation rate of 0.0025, and a genome-level insertion and deletion rate of 0.05. We used an ancestor with a length-100 genome, which was only capable of self-replication but no other functional traits. The organisms evolved in an environment with unlimited resources associated with all nine basic bitwise logic operations. We then measured the complexity of organisms in four different environments.

3.4.1 Three sources of complexity

We tested the complexity of the organisms in four different environments to differentiate the sources of the complexity. Environment I is the default environment with unlimited resources for all 9 logic operations. Measuring the information content in this environment gives us the total complexity for an organism. Environment II is identical to Environment I, but without the resource associated with the complex trait EQU. The difference between an organism's complexity in environments I and II provides us with the amount of information uniquely associated with performance of the EQU trait. Environment III contains only the EQU resource, but none of the others, and Environment IV contains no resources at all. The difference between complexity measures in environments III and IV results in the total amount of information used to perform EQU, even if this information is also associated with other traits. Figure 3.1 shows these information measures for a typical Avida run. In this example, the first organism in the population to possess the EQU trait appeared at update 24,891. While this organism did gain EQU (and, as such, possessed 4.9 CU of additional complexity over its parent), it simultaneously lost the ability to perform AND. When tested in an environment where EQU was not rewarded, the information content of the genome actually dropped by 9.9 CU that had been converted for use by EQU. In other words, the genome contained a total of 9.9 CU of complexity unique to EQU. This complexity, combined with another 23.0 CU shared with other traits, makes a total of 32.9 CU required for EQU.

Figure 3.1: Information content over the course of a typical Avida experiment.
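The four test environments above define the decomposition by simple differencing of the same per-site measurement taken under different reward structures. The sketch below illustrates that arithmetic; `complexity_in_env` is a hypothetical stand-in for the Chapter 2 per-site measurement (returning a value in CU) and is not part of Avida's actual interface.

```python
# Hedged sketch: decomposing one organism's information content using the four
# test environments of Section 3.4.1. `complexity_in_env(genome, env)` is a
# hypothetical placeholder for the Chapter 2 measurement, not an Avida API.

def decompose_equ_information(genome, complexity_in_env):
    """Return total, EQU-unique, EQU-total, and EQU-shared information (in CU)."""
    c_all   = complexity_in_env(genome, "Env I: all nine resources")
    c_noequ = complexity_in_env(genome, "Env II: all resources except EQU")
    c_equ   = complexity_in_env(genome, "Env III: only the EQU resource")
    c_none  = complexity_in_env(genome, "Env IV: no resources")

    unique_equ = c_all - c_noequ          # information useful only because EQU is rewarded
    total_equ  = c_equ - c_none           # all information used to perform EQU
    shared_equ = total_equ - unique_equ   # EQU information also supporting other traits
    return {"total": c_all, "unique_EQU": unique_equ,
            "total_EQU": total_equ, "shared_EQU": shared_equ}
```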
The most abundant genotype was chosen from the final population, and the information content was measured for all organisms along its line of descent. A total of 24 out of the 50 trials obtained the EQU trait. Figure 3.2 displays the average information content of genomes from these lineages, centered on update zero as the point where EQU is first acquired. From this figure, it is clear that the information associated with EQU comes from three different sources. The majority of the information (72.3%) is shared with other traits. The remainder is split between newly incorporated information (5.7%) and information that was once part of now defunct traits (22.0%). The total information for task EQU is 26.8 CU averaged over 24 trials. Thus, the mean information shared with other traits is 19.4 CU, the mean newly incorporated information is 1.5 CU, and the mean information coming from now defunct traits is 5.9 CU.

Figure 3.2: Average information during the acquisition of EQU. 24 trials were centered on their acquisition of EQU and had the information values averaged to create this graph. The window size is 10k updates around the time when EQU is first acquired.

The relative importance of these information sources varies widely from one trial to the next. To better understand the breakdown of where information comes from, we have created a histogram isolating the sources, as seen in Figure 3.3. Surprisingly, the change in organism complexity when EQU first arises was negative in three of our 24 cases (Figure 3.3a). This occurs when the traits lost during the acquisition of EQU actually had a greater combined complexity than the complexity of the EQU trait itself. In all three cases, the complexity was restored (through the reacquisition of lost traits) shortly after the rise of EQU. It is also unexpected that the complexity of the organisms can go up when EQU first appears in the environment where EQU is not rewarded (Figure 3.3c). This is due to other traits arising simultaneously with EQU, or to new material in EQU now being required for other traits to function properly. Figure 3.3b shows the unique information associated with the EQU task; the range of this unique information is relatively large, from 0 to 16.7 CU. The total information associated with EQU differs among organisms, ranging from 12.8 to 35.3 CU (Figure 3.3d).

Figure 3.3: Histograms indicating the distribution of information in the 24 trials where organisms developed the EQU trait. (a) is the change in complexity for the whole organism when EQU first arises; (b) is the complexity unique to EQU, not shared by other traits; (c) is the change in complexity of all traits other than EQU at the time EQU arises; and (d) is the total complexity of the EQU trait, including information shared with other traits.
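The three averages above add up to the total EQU information (19.4 + 1.5 + 5.9 = 26.8 CU). The sketch below shows one plausible bookkeeping that is consistent with those figures; the split of the EQU-unique information into "new" versus "co-opted" via the rise in whole-organism complexity is my reading of the decomposition, not a formula stated in the text.

```python
def equ_information_sources(total_equ, unique_equ, delta_total):
    """One plausible bookkeeping of the three information sources (all values in CU).

    total_equ   : all information used by EQU (Env III minus Env IV)
    unique_equ  : information useful only because EQU is rewarded (Env I minus Env II)
    delta_total : change in whole-organism complexity when EQU first arises

    Assumption (mine): the genuinely new portion of the EQU-unique information
    equals the rise in total complexity at the EQU step; the remainder is
    co-opted from traits lost alongside EQU.
    """
    shared  = total_equ - unique_equ   # reused from pre-existing traits
    new     = max(delta_total, 0.0)    # newly incorporated information
    coopted = unique_equ - new         # from traits lost in conjunction with EQU
    return {"shared": shared, "new": new, "co-opted": coopted}

# Reported averages over the 24 EQU trials: 26.8 CU total, 7.4 CU unique (1.5 + 5.9).
print(equ_information_sources(26.8, 7.4, 1.5))  # shared ~19.4, new 1.5, co-opted ~5.9
```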
3.4.2 The distribution of complexity changes from a random beneficial mutation

To further examine the amount of information that can appear de novo in association with a beneficial mutation, we performed a set of experiments at three different time points in our 50 evolved populations (1k updates, 10k updates and 100k updates). At each time point, we chose 2000 random organisms (with replacement) from each of the 50 populations to mutate. In each case we examined the complexity both before and after the mutation. Figure 3.4(a), (b), and (c) show the distribution of the complexity changes associated with beneficial mutations at each time point. While a total of 100,000 mutations were examined for each graph (50 runs x 2000 mutation tests per run), only a small fraction of them were beneficial. At 1000 updates there were 18,675 beneficial mutations; at 10k updates there were 5500 beneficial mutations; and at 100k updates there were 2473 beneficial mutations. Over time, populations became better adapted to the environment, reducing the amount of information left to incorporate, which explains the decline in the number of beneficial mutations. The distributions of complexity increases (i.e., where the complexity change is greater than zero) exhibit a clear exponential form at both of the earlier time points, as shown in Figures 3.4(a) and (b). Every instruction which adds to complexity should have the same probability of occurring, meaning that more complex structures are exponentially less likely. This leads to the exponential distribution observed, as long as a sufficiently rich environment exists. Figure 3.4(c) indicates a disproportionately high probability of obtaining a larger complexity increase at the later time point of evolution. This is largely due to the organisms becoming so well adapted to the environment that there is little room for improvement; the mean number of tasks evolved over 50 runs at 1000 updates is 0.1, at 10,000 updates it is 5.6, and at 100,000 updates it is 7.9 out of the possible 9 tasks. In other words, at 100,000 updates very few traits are still available to be acquired, leading to this effect. To account for the effects of simpler tasks on the continued evolution of complexity, we performed a set of control runs where the organisms evolve in an environment where the logic operations are not rewarded. We then analyzed the complexity changes from mutations (following the procedure above) when these organisms and their mutants were moved into the 9-resource environment. Figure 3.4(d) shows the distribution of these complexity effects at update 10,000. We again see a similar exponential distribution of positive complexity changes.

Figure 3.4: The distribution of complexity changes with beneficial mutations from 100,000 randomly sampled organisms in the populations (a) at 1000 updates, (b) at 10,000 updates, (c) at 100,000 updates under a 9-task environment, and (d) at 10,000 updates of control runs. The red dashed line represents the mean complexity change and the red dotted lines are one standard deviation from the mean.
This indicates that the absence of building blocks does not limit the amount of new complexity that comes into the genome; however, without building blocks it becomes nearly impossible to evolve more complex traits. To improve our understanding of how complexity enters the genome, we need to focus on those mutations that provide a selective advantage and persist over evolutionary time scales. Clearly mutations that lead to beneficial traits will be selected for. Thus we expect mutations that have fixed in the population to provide more complexity, on average, than random mutations (even random beneficial mutations). This is clearly shown in Figure 3.5, where we examine the distribution of complexity changes along lineages in the 9-resource environment.

Figure 3.5: The distribution of complexity changes from a beneficial mutation along the lineages of 50 runs in the 9-task environment. The red dashed line represents the mean complexity change and the red dotted lines are one standard deviation from the mean.

3.5 Conclusions

We have demonstrated that the introduction of complexity into a genome has three aspects: complexity shared with other traits; complexity from once functional but now defunct traits; and complexity belonging to newly incorporated information. We performed random mutation tests on populations at different time points. Based on the distribution of complexity changes due to beneficial mutations, we see that mutations leading to large complexity changes are exponentially rare. In particular, we see that the complex trait EQU requires on average 26.2 CU, but we rarely see more than half of this complexity (13.1 CU) appear in a single mutational step. If a complex trait such as EQU is to evolve, it must utilize preexisting complexity. The experiments in this chapter were performed on relatively simple digital organisms, as compared to life in the natural world where there are far more than just 9 resources for the organisms to interact with. Note that every new trait that appears provides new building blocks to work with, which, in turn, increase the probability of even more complex adaptation. Dramatically more complex traits can emerge in the natural world under such gradual increases in organismal complexity.

Chapter 4
BIOLOGICAL COMPLEXITY OF COMMUNITIES

Community complexity is an important topic in ecological biology. Traditionally, biologists conflate the biological diversity of a community with that community's complexity. Biological diversity, in a broad sense, refers to the number of species in a community and the evenness in abundance among those species [Heywood, 1995; Magurran, 2003]. Using these diversity methods to measure complexity, however, fails to associate the complexity of a community with the complexity of the individual organisms that live in it. For example, a community with a single complex individual may be more complex than a diverse community of very simple organisms. In this chapter, I present a novel definition of community complexity (the sum of the distinct information contained in the organisms of that community about the environment they live in) that is a natural generalization of Adami et al.'s biological complexity of an individual. This definition of community complexity, like traditional diversity measures, increases as the number of species increases.
However, rather than treating all species equally, each species' contribution to the community complexity is based upon the unique information that species has about its environment. We present a specific method for approximately measuring community complexity that uses Shannon Information Theory to quantify the interactions between individuals and the environment. The organization of this chapter is as follows: I introduce the background in section 4.1, define the biological complexity of communities in section 4.2, describe the measuring technique in section 4.3, show experimental tests in section 4.4, and compare it with traditional measurements and discuss why our definition is valuable in section 4.5.

4.1 Biological Diversity

There are many methods to measure population diversity; among the most common are the Shannon index and the Simpson index. Both provide a single statistic that incorporates both the variety of the species and their relative abundances. The Shannon index is based on Shannon Information Theory and measures the entropy of the population [Pielou, 1969]. The probability that a randomly selected individual will belong to the class S_i is p_i, where \sum_i p_i = 1. The diversity of the community is defined as:

H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i \qquad (4.1)

In Information Theory, this number represents the per-symbol entropy of a code composed of n different symbols. Put in the ecological context, H measures the uncertainty about the species of a random individual. In other words, if there are a large number of species in the community and the distribution among species is close to even, there will be high uncertainty about the species each individual belongs to. Although Shannon entropy is based on the assumption of infinite population size, when the population size is big enough it provides an accurate estimation. For example, Anand [1996] and Ray [1994] use this method to analyze the complexity of a plant community and a Tierra digital community.

The Simpson index (1949) [Pielou, 1969] is calculated as follows. If we draw two individuals at random and without replacement from an S-species community, where each species has N_i individuals and the total size of the community is N, the probability that both individuals belong to the same species is \sum_i N_i(N_i - 1) / [N(N - 1)]. If N_i is near N for some species, the diversity of the community would be low. So Simpson's index D is one minus this probability:

D = 1 - \sum_{i} \frac{N_i (N_i - 1)}{N (N - 1)} \qquad (4.2)

4.1.1 Limitations of using diversity to measure complexity

Both the Shannon index and the Simpson index combine two factors to calculate diversity. An increase in the index may arise due to a higher number of species, greater evenness in the abundance of species, or both. To disentangle these components, Buzas et al. observed that the Shannon index can be separated into two parts: the natural log of the measure of evenness, E = e^H / S, and the natural log of the number of species:

H = \ln E + \ln S \qquad (4.3)

While these two traits are both highly relevant to diversity, they are not sufficient to fully represent the complexity of a community. For example, while a bacterial community and an animal community may have the same number of species and the same abundance distribution among the species, can we say that the two communities are equally complex? The factor we are failing to consider is the complexity of the individual organisms in the community.
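To make the two indices concrete, here is a small self-contained sketch that computes both from a list of species abundances; the function names and example counts are mine, not drawn from any particular ecology package or from the experiments in this chapter.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over species proportions (eq. 4.1)."""
    total = sum(counts)
    props = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p) for p in props)

def simpson_index(counts):
    """Simpson diversity D = 1 - sum(N_i(N_i - 1) / (N(N - 1))) (eq. 4.2)."""
    total = sum(counts)
    return 1.0 - sum(n * (n - 1) for n in counts) / (total * (total - 1))

# Example: four species with uneven abundances.
abundances = [50, 30, 15, 5]
print(shannon_index(abundances))   # about 1.14 nats
print(simpson_index(abundances))   # about 0.64
```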
Our new definition of biological complexity for a community is based on the measurements of complexity for individual organisms that we introduced in Chapter 2.

4.2 Biological Complexity of a Community

Adami et al. [2000] define the biological complexity of an individual organism as the physical information saved in that organism's genome about the environment it is living in. They point out that "genomic complexity is mirrored in functional complexity and vice versa. Genomic complexity can be defined in a consistent information-theoretic manner. [This complexity,] roughly speaking, reflects the number of base pairs in a sequence that are functional." We extend the definition of biological complexity from the individual level to the community level. We define community complexity as the sum of the unique information contained in the organisms' genomes within a community about their environment. That is, if the same information is contained in two or more organisms' genomes, that information is included only once in the summation. If there are n organisms in the community, let us arbitrarily order them from 1 to n. We can use the notation from Information Theory to represent our definition of community complexity as I(org_1, org_2, \ldots, org_n : E). Given this ordering, the community complexity is the sum of the unique information of organism i relative to its predecessors, organism 1 through organism i-1, which we denote as I(org_i : E \mid org_1, \ldots, org_{i-1}). For example, the new information in the third organism not contained in organisms 1 or 2 is I(org_3 : E \mid org_1, org_2). Thus, the total distinct information in the community is:

I(\mathrm{community} : E) = I(org_1, org_2, \ldots, org_n : E)
  = I(org_1 : E) + I(org_2 : E \mid org_1) + I(org_3 : E \mid org_1, org_2) + \cdots + I(org_n : E \mid org_1, org_2, \ldots, org_{n-1}) \qquad (4.4)

There are two factors that are important for measuring community complexity: the number of species and the average unique information in each species. Our measure ignores the evenness of species; if information is present in a population, it is counted exactly once, no matter how many individuals possess that information. Organisms that belong to the same species contain similar information about the environment, which they inherited from the same common ancestor. Thus there tends to be minimal unique information in one organism compared to others in the same species. Organisms belonging to different species usually have a lot of independently evolved information. Hence, when we compare the information content of those organisms, there will be much more unique information. Community-level complexity increases due to that unique information, which explains why communities with more species will maintain a higher complexity.

4.3 Measuring Method

As with the method for measuring biological complexity for individual organisms, we do not yet possess the technological power to measure the mutual information between organisms perfectly, and therefore we cannot isolate the unique information that an individual possesses about its environment. We must resort to using simplifying assumptions and approximations. The first term in the expansion of formula 4.4 is the complexity of an individual organism in Adami's definition. Adami et al. [2000] proposed a method to measure it in a single-niche environment. Based on our new technique introduced in Chapter 2, we evaluate the information content of each site through point mutations and the mutation-selection balance principle.
Using this technique we are able to measure the complexity of individuals in both a single-niche environment and a multi-niche environment. To simplify the calculations at this early stage of our research, we will use only the organism that has the highest correlation with our test organism (let us call it Y) to determine the unique information in Y given all previously measured organisms. We will call this comparison organism X:

I(Y : E \mid org_1, org_2, \ldots, org_{Y-1}) \approx I(Y : E \mid X) \qquad (4.5)

The reasoning behind this approximation is that when parallel genomic evolution is rare (as is the case in most evolving systems), correlated organisms will be close to each other in the true phylogeny and thus share the most information. We calculate the correlations between two organisms by comparing the probability distribution of symbols site by site. Since alignment is not the key point in this thesis, we assume that all sequences have equal length and do not need to be aligned. We use capital letters for organisms and lower-case letters for sites in the organisms. We generate all possible point mutations for each site. We then put each mutant in the environment and test its fitness. Here, we make the assumption that for any site, all the mutations that give the mutant organism a fitness equal to or higher than the original organism have an equal probability of occurring. We further assume that all deleterious and lethal mutations will cause the organism to soon be purged from the population, and therefore have a zero probability of occurring at that site. For example, if nucleotide T is the original symbol at site i and a point mutation to nucleotide C maintains the organism at the same fitness, but point mutations to nucleotides A or G give the organism lower fitness, the probabilities of nucleotides C, T, A, G for site i are: 0.5, 0.5, 0.0, 0.0. Similarly, we can get the distribution of nucleotides at site i of a second organism. If it is a neutral site, every nucleotide has equal probability: 0.25, 0.25, 0.25, 0.25. We draw the bar graph shown in Figure 4.1 for both distributions; the overlapping area is defined as the "overlap fraction" r. If two distributions are exactly the same, the overlap fraction is 1. On the other hand, if two distributions have no overlap at all, the overlap fraction is 0. In the aforementioned case, the overlap fraction is 0.5. The total overlap fraction summed over all sites is used to represent the correlation level between two organisms. It is obvious that this sum is between 0 and the full genome length. We compare the new organism with each given organism and choose the one that is most correlated with the new organism. We use this chosen organism as the given condition in our calculation of the unique information in the new organism.

Figure 4.1: Nucleotide probability distribution of two comparison sites.

Following a similar approximation to the one used in Adami et al.'s paper [2000] and Chapter 2, the unique information is calculated site by site. For each site i in organism Y, the new information in Y as compared to a previously measured organism X is defined as:

I(Y_i : E \mid X_i) = H(Y_i \mid X_i) - H(Y_i \mid E, X_i) \qquad (4.6)

The first term, H(Y_i \mid X_i), is the conditional entropy of site i in organism Y given the possible symbol distribution of site i in organism X. In telecommunications, conditional entropy is used as the entropy of a source message given the output message from the transmission channel, and a similar concept can be used in this case.
In evolutionary biology, an organism can be thought of as a channel that transmits its genome to its offspring (where mutations are noise in that channel). These channels can be concatenated over many generations to form a single channel that was responsible for transmitting information from an ancestor to its descendant. If there is no ancestor/descendant relation between two organisms, a channel can be thought of as linking them through their most recent common ancestor. The key difference between traditional information channels and those found in living organisms is that the information in a genome is a blueprint for building another organism. In other words, if the information is corrupted (i.e., a lethal mutation occurs), the message will not be transferred for additional generations. It is also possible for a message to be improved (beneficial mutations), changed without altering the information content (neutral mutations), or to have a partial loss of information but still construct a mostly functional channel (deleterious mutations). Without knowing the specific evolutionary history of each individual organism, it is impossible for us to determine which differences between organisms are due to a beneficial mutation (i.e., are the source of new information), and which ones are neutral or deleterious. This makes it difficult to calculate the amount of information that full knowledge about the genome of one organism gives us about another organism's genome. To overcome this difficulty, we designed a reasonable approach to estimate the entropy. We utilize the overlap fraction r calculated above for each site to calculate the first term in equation 4.6. The entropy of site i in organism Y given the distribution of the corresponding site i in organism X is calculated as:

H(Y_i \mid X_i) = r \, H(X_i) + (1 - r) \, H_{\max} + H(r, 1 - r) \qquad (4.7)

The per-site entropy calculation depends on the degree of correlation between the two sites. When the two sites overlap, with probability r, the entropy of the first site is decided by the entropy of the second site, that is, H(X_i). When the two sites are independent of each other, with probability 1 - r, we have maximum uncertainty about the first site, which is 1. The last term, H(r, 1 - r), represents the uncertainty we have about whether the two sites overlap or not. The second term in equation 4.6 is the conditional entropy of site i in organism Y given both the environment and organism X. In most situations, the entropy given two conditions is less than the entropy given one condition, H(Y_i \mid E, X_i) \le H(Y_i \mid E), because more information can never, on average, increase our uncertainty. In our special case, organism X provides information about organism Y only in as much as X and Y both contain information about the same environment. Thus in equation 4.7 above, we needed X in H(Y_i \mid X_i) to indirectly provide information about the environment and lower our uncertainty about Y. However, now that we are also given the environment, there is no additional information in X about Y, so we can make the simplification

H(Y_i \mid E, X_i) = H(Y_i \mid E) \qquad (4.8)

The total new information in organism Y about the environment is calculated as the sum of the new information at each site:

I(Y : E \mid X) = \sum_{i=1}^{l} I(Y_i : E \mid X_i) \qquad (4.9)

where l is the genome length.

4.4 Experimental Setup

The absolute value of a complexity measurement is far less meaningful than its comparison to the same measurement performed on other communities, or when used to track the same community over time.
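Before turning to the experimental setup, here is a minimal sketch of the per-site calculation in Section 4.3 (equations 4.6 through 4.9). It assumes the simplified rule stated above (neutral-or-beneficial point mutations are equally likely, deleterious ones have probability zero), takes a caller-supplied fitness function, uses a four-letter alphabet for brevity (Avida would use its 26 instructions), and uses the alphabet size as the log base so that entropies fall between 0 and 1. All function names are mine, not Avida's.

```python
import math

ALPHABET = "ACGT"   # placeholder alphabet; the thesis uses 26 Avida instructions
D = len(ALPHABET)

def site_distribution(genome, i, fitness):
    """Symbol probabilities at site i under the Section 4.3 rule: neutral-or-better
    variants share probability equally; deleterious/lethal variants get zero."""
    base_fit = fitness(genome)
    ok = [s for s in ALPHABET
          if fitness(genome[:i] + s + genome[i + 1:]) >= base_fit]
    return {s: (1.0 / len(ok) if s in ok else 0.0) for s in ALPHABET}

def entropy(dist):
    """Shannon entropy with log base D, so a uniform site has entropy 1."""
    return -sum(p * math.log(p, D) for p in dist.values() if p > 0)

def overlap_fraction(dist_x, dist_y):
    """Overlap r between two per-site distributions (shared area of the bars)."""
    return sum(min(dist_x[s], dist_y[s]) for s in ALPHABET)

def new_information(genome_y, genome_x, fitness):
    """Unique information in Y about the environment given X (eqs. 4.6-4.9).
    Assumes equal-length, pre-aligned genomes, as in the text."""
    total = 0.0
    for i in range(len(genome_y)):
        dist_y = site_distribution(genome_y, i, fitness)
        dist_x = site_distribution(genome_x, i, fitness)
        r = overlap_fraction(dist_x, dist_y)
        h_r = entropy({"overlap": r, "independent": 1 - r})      # H(r, 1 - r)
        h_y_given_x = r * entropy(dist_x) + (1 - r) * 1.0 + h_r  # eq. 4.7
        h_y_given_e = entropy(dist_y)                            # H(Y_i | E), via eq. 4.8
        total += h_y_given_x - h_y_given_e                       # eq. 4.6
    return total
```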
My research interest is to analyze the impact of the environment on community complexity, particularly the changes in this measure during evolution. To perform these analyses, we designed five environments with Avida. In Environment I there are 9 tasks and no limitation on resources. It is a single-niche environment; that is, only one species can persist in the environment over long time scales. Environment II also contains 9 tasks, but there is a resource limitation for each task, allowing the persistent coexistence of multiple species. In both cases, the organisms must implement a computational task to metabolize the associated resource. The amount of resource an organism can get depends on the resource availability in the environment. The resource levels in the environment change due to the resource inflow into the environment, the resource decay rate, and the rate at which the other organisms are making use of the resource. In Avida, the more resources an organism receives, the more CPU cycles it can use, much of which go toward its replication. To fully understand the relationship between the environment the population is evolving in and its community complexity, we examined evolution in three more environments. Environment III is a control environment where there is only one simple resource (for the task NOT). Environment IV starts out identical to Environment III, but we inject a new resource into the environment every 10,000 updates. Thus, the information available to be exploited in the environment continually increases. At update 80,000, Environment IV stops changing, with the same nine resources as found in Environment II. While Environment IV allows us to examine the introduction of new environmental complexity to a community, Environment V focuses on its loss. It is identical to Environment II for the first 100,000 updates, with the same 9 resources, but at that point we remove eight resources and keep only the one resource associated with the task "NOT". The environment thus becomes the same as Environment III. For the detailed setups for environments I, II, and III, see Appendix A. We performed 50 independent runs for each environmental setup to collect statistically powerful results. For each run, we fix the genome length at 100 and turn off all insertion and deletion mutations, leaving only substitution mutations. Since testing all possible point mutations at each site of a genome is already very time-consuming, we simplify our analyses by not allowing mutations that cause position shifts, thereby skipping sequence alignment and focusing on calculating the correlation between each pair of organisms. The communities under all five environments start from a single, identical organism at the beginning of evolution, which contains no information about the specific environment.

4.5 Results and Discussion

We compare our measure of community complexity with traditional diversity measurements. We categorize a community into species based on the distinct phenotype of each organism. Here we define the phenotype of an individual as the set of tasks it has implemented (regardless of how quickly or how often the tasks are performed). If a species includes fewer than three organisms, the organisms belonging to that species are likely detrimental mutants and are therefore not considered in the calculation. The proportion of each species is the number of organisms belonging to that species divided by the total number of organisms.
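The species-classification step just described is straightforward to express in code. The sketch below groups organisms by phenotype (the set of tasks performed), discards species with fewer than three members, and computes species proportions; `tasks_of` is a caller-supplied placeholder, not an Avida function.

```python
from collections import defaultdict

def classify_species(organisms, tasks_of):
    """Group organisms into species by phenotype (the set of tasks performed),
    dropping species with fewer than three members (likely detrimental mutants).
    `tasks_of(org)` is assumed to return the set of tasks an organism performs."""
    groups = defaultdict(list)
    for org in organisms:
        groups[frozenset(tasks_of(org))].append(org)
    species = {pheno: members for pheno, members in groups.items()
               if len(members) >= 3}
    total = sum(len(members) for members in species.values())
    proportions = ({pheno: len(members) / total
                    for pheno, members in species.items()} if total else {})
    return species, proportions
```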
To minimize our computational cost, we sample organisms from the original community to compose a smaller, representative community and measure the information in it. We randomly sample one organism from each species to make sure the information content of the smaller community truly reflects the whole. To keep the sample size consistent, we pad the smaller community up to 50 organisms with randomly sampled organisms from the original community. This whole process is done without replacement, so no individual is ever sampled more than once. The number of species in a community in our experiments rarely exceeds 50. Since there are 26 possible instructions at each site in an Avida genome, instead of the 4 nucleotides in DNA sequences, we use log base 26 to calculate the information content in the community. As in previous chapters, we describe 1 instruction of complexity as 1 Complexity Unit (CU).

4.5.1 Complexity of communities in a single niche: Environment I

Environment I can sustain only one abundant species, since the organisms that obtain the most energy from any combination of resources will be the most successful; no species will ever have its growth limited due to depletion of resources. All of the other species are either derived from the dominant species or temporary survivors from history on their way to extinction. Since community complexity is defined as the sum of the distinct information in the community, if two organisms inherit the same information from a common ancestor, this information should be counted only once in the calculation of community complexity. Thus, the more overlapping information there is among organisms, the greater the difference will be between the total information summed over individual species and the unique information in the whole community. As a case study, we sample a typical run evolved in Environment I and compare its total individual information for 50 organisms with the community-level complexity. The results are shown in Figure 4.2. The total information summed over individual organisms reached around 3000 CU. However, the community information is only around 700 CU at the end of the evolution. The large difference is due to the large amount of common information among the organisms in the population. We plot the phylogenetic depth of the population in Figure 4.2(c). Phylogenetic depth is the cumulative number of generations in which an organism's genotype differs from its parent. The result shows that the community at any time point is clustered together; there is no branching event to split the community into multiple persistent subpopulations.

Figure 4.2: Measurements in a typical single-niche community over time. Metrics used are (a) total complexity of all 50 sampled individuals, (b) community complexity, and (c) phylogenetic depth of individuals.

We average the community complexity over 50 runs. The results (along with their 95% confidence intervals) are displayed in Figure 4.3(c). We also plot the average Shannon index and Simpson index as a comparison in Figures 4.3(a) and 4.3(b) respectively. Both diversity indices increase quickly at the beginning of the evolution and plateau around 10,000 updates. We cannot see what is changing in the community from them.
However, our measurement shows the community continued to evolve information about its environment.

Figure 4.3: The results over 50 runs under Environment I. Upper and lower boundaries define the 95% confidence interval. (a) average Shannon index, (b) average Simpson index, and (c) community complexity.

4.5.2 Complexity of communities with multiple niches: Environment II

Environmental heterogeneity is usually considered the primary cause of genetic diversity. Even in a spatially homogeneous environment, Avida experiments with fixed-size populations show that the productivity of resources may influence species richness. For example, it has been shown in Avida that the maximum species richness emerges at intermediate productivity [Chow et al., 2004]. Our result is consistent with Chow et al.'s result, although we use a different method to classify species. We observed multiple species coexisting under Environment II. As a comparison to the above case from the single-niche environment, we sampled a typical run from Environment II to analyze in detail. We plotted the phylogenetic depth in Figure 4.4(c). In contrast to Figure 4.2(c), the range of phylogenetic depth from the run under the multi-niche system is notably broader, which indicates that there are many species coexisting under the multi-niche environment. The dominant species in the single-niche environment contained almost all the information about the environment. However, in a multi-niche environment different species are specialized on metabolizing different resources, and thus the information is distributed throughout the community. The average information in each individual is comparatively smaller, as reflected in the total information in the sampled organisms, which is only around 2200 CU at the end of evolution. However, the community-level complexity is around 800 CU, which is higher than that in the run from the single-niche environment (around 700 CU). The main reason for this is that different species independently evolved similar information about the environment. Sometimes, rather than specializing on metabolizing a single resource, a species competes for the same resource with other species. All independently evolved information is counted in the community complexity. Another reason is that organisms from the single-niche environment contain more pleiotropic sites (where one gene influences more than one trait). Our current method counts the information contained in pleiotropic genes only once, which makes the information in the single-niche environment a little lower. We compared the average Shannon index and Simpson index with the average value of our complexity measure over 50 runs. Figure 4.5 demonstrates that both the Shannon index and the Simpson index are higher for the multi-niche environment than they were in the single-niche environment, which means a community is more diverse when evolved under the multi-niche environment. As with the result from the single-niche environment, the diversity indices stop changing after the very early stage of evolution. Conversely, our complexity measurement steadily increases until the end of evolution, which means that more information continually enters the community. The new information in the community could come from two sources.
One is when existing species evolve new information, and the other is when more species are formed in the community; each new species must carry some unique information to stably coexist with the other species in the community. The experimental data are consistent with our reasoning, as shown in Figure 4.6.

Figure 4.4: Measurements in a typical multi-niche community over time. Metrics used are (a) total complexity of all 50 sampled individuals, (b) community complexity, and (c) phylogenetic depth of individuals.

Figure 4.5: The results over 50 runs under Environment II. Upper and lower boundaries define the 95% confidence interval. (a) average Shannon index, (b) average Simpson index, and (c) community complexity.

Figure 4.6: The results over 50 runs under Environment II. Upper and lower boundaries define the 95% confidence interval. (a) average number of species and (b) average individual complexity.

4.5.3 Environmental impact on community complexity: Environments III, IV and V

To explicitly show the relationship between the environment and the community, we designed environments III and IV. Environment III contains only one resource, for the simplest computational task, "NOT". Environment IV initially also contains only the resource for task "NOT", but we add in a new resource every 10,000 updates until all 9 basic resources are present at 80,000 updates. The average result from Environment III (Figure 4.7) shows the Shannon index is around 0.5 nats and the unique information in the communities is around 620 CU at the end of evolution. Due to the simpler environment, these values are significantly lower than what we have seen previously.

Figure 4.7: The results over 50 runs under Environment III. Upper and lower boundaries define the 95% confidence interval. (a) average Shannon index and (b) community complexity.

From Environment IV (Figure 4.8) we observed that both the diversity index and the information content keep increasing as we add in new resources. At the end of the run, the Shannon index is around 2.2 nats and the information content is around 750 CU. As expected, all of these numbers are comparatively higher than those from Environment III.

Figure 4.8: The results over 50 runs under Environment IV. Upper and lower boundaries define the 95% confidence interval. (a) average Shannon index and (b) community complexity.

We compared the results from Environment IV with Environment II.
The information in the community under Environment II reaches 700 CU of complexity around 20,000 updates, while the information in the community under Environment IV does not reach the same complexity level until 60,000 updates. This shows that the information available in the environment limits the amount of information that can be contained in the community. We let the communities evolve another 70,000 updates after the resources in Environment IV are all present, and we found that the information content in these communities converges to the same levels as Environment II over time. Environment V is identical to Environment II until update 100,000, when we remove all of the resources except for the one associated with the task "NOT" (making it identical to Environment III). We let the evolution continue for another 50,000 updates to let the communities adapt to their new environment. Figure 4.9 shows how the information content of the community drops after we remove most of the resources from the environment. After the environment changes at 100,000 updates, we observe that the information content in the communities under Environment V converges to the levels seen in Environment III. This result, together with the results from environments III and IV, shows that the environment has a major impact on the evolution of community complexity.

Figure 4.9: The results over 50 runs under Environment V. Upper and lower boundaries define the 95% confidence interval. (a) average Shannon index and (b) community complexity.

4.6 Conclusions

We define community complexity as the sum of all distinct information in the community about the environment. Based on Shannon Information Theory, we developed a method to estimate the new information found in each organism. There are two important factors which influence the measured result: the number of species and the average distinct information in each species. We designed five environments in Avida and used our method to measure information in the community. We compared it with the Shannon index and the Simpson index for a single-niche and a multi-niche environment. The diversity indices increase quickly at the beginning of evolution and then stay almost constant over time. Compared to them, our measure of community complexity shows that the information content continues to increase in most communities as long as there is environmental complexity left to be exploited. We compared the information content under different environments and found that the community complexity in the single-niche environment is less than it is in the multi-niche environment; the community complexity in a low-information environment is less than it is in a high-information environment. The community complexity changes as the available information changes in the environment. As expected, the environment plays an important role in how community complexity evolves. In summary, our measurement of community complexity is more sophisticated and produces more intuitive results than previously used methods. We measure the degree of interaction between a community and its environment (including interactions between organisms in that environment). It is markedly different from traditional diversity measurements and provides a valuable complement to them in determining community complexity.
Chapter 5
INFORMATION-BASED PHYLOGENY RECONSTRUCTION

In all of the research thus far, I have focused on the design of complexity-measuring methods and have used them to analyze complexity changes over the course of evolution. Here I apply these concepts to an important biological problem. Measuring the information content of different organisms allows us to better understand the relationships among them. When new information enters a population, it is transmitted through the generations. If different genomes store distinct information at the same site, it can help us classify the organisms in that population into two groups. Inspired by this fact, we designed a character-weighting technique to improve the accuracy of phylogeny reconstruction. Character-weighting methods are often used to improve the accuracy of phylogeny reconstruction algorithms. Previous methods focus on removing highly variable sites (because they are likely to show false similarities between unrelated sequences) or sites that are too conserved (because they provide no information for reconstruction). However, little consideration has been given to the idea of using different weights based on the portion of the tree currently being reconstructed. Sites that provide information about the root of a tree may have little information about the leaves, and likewise sites that are informative about the specific relationships among neighboring leaves may appear random when examined across all available sequences. We observe that sites where two distinct symbols are both highly represented are most likely to provide useful information for reconstructing deep bifurcations in the tree. We demonstrate that the neighbor joining algorithm is significantly more likely to correctly reconstruct deep bifurcations if more weight is given to these sites. We further show the robustness of this technique with sustained reconstruction improvements as we vary a number of characteristics of the trees, including their size, the number of distinct symbols used in the sequences, and the tree symmetry. This chapter is organized as follows. First we introduce the background on phylogenetic reconstruction. Next, we describe our approach in more detail and introduce related work. Then we give our results and discussion. Finally we provide a conclusion.

5.1 Background

Designing algorithms to determine the evolutionary relationships between different species is an important research topic for computational biologists. Some researchers need to know these relationships to trace the transmission of viruses [Dumpis et al., 2003] and identify emerging diseases; others use them to assist in drug design by predicting the conserved residues on protein surfaces [Sen et al., 2004]. This is even an important question in agriculture, where accurate phylogenies can be used to aid in increasing the production of crops [Soltis and Soltis, 2003]. In general, knowing the correct evolutionary relationships is useful for classifying organisms into meaningful groupings, commonly called clades [Aguinaldo et al., 1997; Woese et al., 1990]. One of the most challenging aspects of phylogeny reconstruction is correctly identifying the deepest bifurcations within the phylogeny. For example, during the Cambrian Explosion, which occurred sometime between 570 and 530 million years ago, all of the known phyla of animals appeared.
However, existing techniques are unable to use the fossil record to determine almost anything about the process by which this occurred. Similarly, we currently are unable to resolve the evolutionary relationships between the phyla of Bacteria, and thus these phyla, which number well over 25, are treated as independent groups. The goal of this chapter is to build a better telescope into the past that will allow us to understand more about the fundamental innovations that make up the root systems of the phylogenetic trees that we study. The need for accurate phylogenies has driven the creation of a wide variety of algorithms that take in genetic data and attempt to reconstruct the original tree. We assume the genetic data is aligned nucleotide sequence data, though our techniques generalize to other inputs as well. We refer to each position in the aligned sequence as a site or character, and we refer to the actual nucleotide value at a site as a nucleotide or character state. What makes phylogeny reconstruction possible are "synapomorphies": inherited character states that are preserved in a set of species with a common ancestor but are not shared by more distant ancestors. Over time, synapomorphic signals are weakened as sites continue to mutate, meaning some descendants of a common ancestor will lose the shared character state. What makes phylogeny reconstruction difficult are "homoplasies": character states shared by a set of species that are not the result of inheritance from their common ancestor. Homoplasies can simply be coincidental, or potentially the result of convergent evolution. Homoplasies are problematic because they are false synapomorphies and therefore provide false evidence for grouping together a collection of species. If there were no homoplasies, then we would have what is referred to as the perfect phylogeny problem, which can be solved correctly in polynomial time [Gusfield, 1991]. In a typical phylogeny reconstruction problem, different sites or characters will incur different numbers of mutations, resulting in synapomorphies, lost synapomorphies, and homoplasies. Specific sites will provide useful information for reconstructing specific bifurcations but may provide no information or misleading information for other bifurcations. We propose a simple method for identifying characters that are likely to be useful for reconstructing deep bifurcations; specifically, characters with two well-represented character states. We also present a simple character weighting scheme to reinforce the information contained in these characters. We then validate our methods by showing that the Neighbor Joining algorithm is able to identify deep bifurcations more accurately when using our character weighting scheme instead of a uniform weighting scheme. While we have only applied our technique to Neighbor Joining, we believe it should enhance the recovery of deep bifurcations for all standard phylogeny reconstruction techniques.

5.2 Our Approach

The German entomologist Willi Hennig [1979] argued that only shared derived characters should be used as indicators of common descent. As noted earlier, a specific character may be a shared derived character for one set of organisms and thus provide evidence for the bifurcation associated with them, but may not provide evidence for bifurcations in other parts of the tree. For example, consider the symmetric tree structure shown in Figure 5.1.
Suppose there is a mutation in the leftmost internal branch at character i and that no other mutations occur in character i. Then the organisms represented by leaf nodes T1, T2, T3, and T4 will share a common character state in character i, while the organisms represented by leaf nodes T5, T6, T7, and T8 will share a different common character state in character i. Thus, character i will be extremely informative for reconstructing this deepest internal bifurcation. On the other hand, character i will provide less information for reconstructing any of the other bifurcations in the phylogeny. 67 T1 T2 T3 T4 T5 ' T6 T7 T8 Figure 5.1: An example symmetric tree. The problem with finding informative sites for deep bifurcations is that few synapo— morphies that formed from early mutations will manage to persist until the present time. If a synapomorphy does persist, it is likely due to an adaptive mutation where any further mutations at this character would typically be detrimental to the or- ganism. In many cases, some noise will be introduced as either some of the common descendants lose the shared character state through further mutations or other organ- isms gain the shared character state through further mutations creating homoplasies. However, even with noise, a combination of informative characters may still provide enough information to reconstruct a deep bifurcation. The problem is identifying sites where the information to noise signal is strong and reinforcing these sites so that they overcome the noise included in other sites. In the next two subsections, we present our methodology for identifying and reinforcing informative characters for reconstructing deep bifurcations. 68 5.2.1 Identifying Informative Characters for Deep Bifurca- tions How can we decide which characters have useful information for reconstructing a spe— cific bifurcation without knowing the true phylogeny? This turns out to be tractable for deep bifurcations, or perhaps more accurately “equal bifurcations”. Suppose a bifurcation divides the species into two groups of size X and N — X. The best pos- sible result is that the first group of X species will share one character state, while the second group of N — X species will share a different character state. For deep or equal bifurcations, X could be N / 2 while for shallow bifurcations, X will be much smaller. Thus, the key for identifying useful characters for deep bifurcations is finding char- acters that have two well-represented character states. In more detail, after aligning sequences, we measure the abundance of each character state (either nucleotides or amino acids, depending on the type of data we are working with) at each site. We then sort the characters by the abundance of the second most abundant character state for that character. For example, if there are 64 species and at the first site 32 of them have nucleotide A, 30 of them have nucleotide G, 2 have nucleotide C and none have nucleotide T; this site would be given a rating of “30”. 5.2.2 Reinforcing Informative Characters for Deep Bifurca- tions Given our results from the previous discussion, we now have a way to rate characters with regard to their ability to inform reconstruction of deep or equal bifurcations; namely the cardinality of the second most abundant character state. We now need a method for weighting characters based on this ranking. To simplify our investiga- tion, we restrict our attention to a two-parameter weighting scheme (t, w). 
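As an illustration of the rating procedure just described, the following minimal sketch (in Python; the function name and the toy data are illustrative only and are not the code used for this chapter's experiments) computes, for each alignment column, the abundance of the second most abundant character state:

from collections import Counter
from typing import List

def site_ratings(aligned_seqs: List[str]) -> List[int]:
    """Rate each column by the abundance of its second most abundant
    character state (0 if only one state occurs at that column)."""
    ratings = []
    for i in range(len(aligned_seqs[0])):
        counts = Counter(seq[i] for seq in aligned_seqs).most_common()
        ratings.append(counts[1][1] if len(counts) > 1 else 0)
    return ratings

# Example from the text: 64 species; at one site 32 have A, 30 have G,
# 2 have C and none have T, so the site is rated 30.
column = "A" * 32 + "G" * 30 + "C" * 2
print(site_ratings([c for c in column]))   # -> [30]

Sorting sites by this rating identifies the candidates for up-weighting; the two parameters of the weighting scheme itself are described next.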
The first 69 parameter t is a threshold where we increase the weight of all characters that have a rating at or above the threshold. The second parameter to is the weight given to the highly weighted sites. Even for this two-parameter weighting scheme, the Optimal choices for t and w are not clear. In this work, we consider all possible threshold values t but only consider w = 2; sites with ratings below the threshold are given weight 1. We will investigate more sophisticated weighting schemes in future work, for example, continuous weighting schemes where a higher rated site gets more weight than any lower rated site. 5.2.3 Methods To determine if our approach of identifying and reinforcing sites with two abundant character states improves reconstruction of deep bifurcations, we designed a set of experiments to compare Neighbor Joining (NJ) [Saitou and Net, 1987] where all sites are equally weighted with Neighbor Joining where sites are weighted using our (t, 2) weighting scheme. Neighbor Joining is one of the most popular distance—based phy- logeny reconstruction techniques because of its efficiency and relative effectiveness. As it is a distance-based approach, all sequence information is translated into a col- lection of distances between pairs of sequences. With a uniform weighting scheme, each site contributes equally to the distance between any pair of sequences. With our (t, 2) weighting scheme, highly rated sites contribute twice as much as other sites to the distance between any pair of sequences. While we only test our method with a distance—based reconstruction method, we believe the results would also hold for character-based methods as well. We test our method using a computer simulation approach similar to the seq-gen application [Rambaut and Grassly, 1997]. That is, we start with a random ancestor and a model tree topology. Based on this model tree, we use a Monte-Carlo simulation 70 procedure of molecular sequence evolution to generate the sequence for each internal node in the tree and for all of the leaves. In this simulation we assume that all symbols are equally likely to be mutated to and thus no additional information is available from some character substitutions being more likely than others. We also do not allow insertions or deletions so that sequence alignment is not an issue. One minor difference between our simulation and seq-gen is that our edge lengths represent the actual number of mutations incurred between the parent and child while the edge lengths in most applications such as seq-gen represent a probability that each site suffers a mutation. For each tree topology, we randomly generated 1000 groups of sequence data. We then apply the Neighbor Joining method to the resulting sequence data using a uniform character weighting scheme as well as our (t, 2) character weighting scheme for all values of t. For each topology, we compute the percentage of cases that Neighbor Joining is able to successfully reconstruct the deepest bifurcation with both weighting schemes. Since we have 1000 replicates for each topology, we are able to produce statistically significant results. Our default parameter settings are the following. The sequence length is 200, the character set is 4 (to represent nucleotides), the number of leaves or final sequences is 32, and the default tree topology is symmetric so the deepest bifurcation is an equal bifurcation. 
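The sequence data for these experiments come from a Monte-Carlo procedure of the kind described above. The following is a minimal sketch of that style of simulation, under stated assumptions: the nested-tuple tree encoding, the convention that a mutation always changes a symbol to a different one, and all names are our own illustrative choices rather than the dissertation's actual simulation code.

import random
from typing import Dict, Tuple, Union

ALPHABET = "ACGT"

# A model tree is either a leaf name (str) or a tuple
# (left_subtree, left_edge_mutations, right_subtree, right_edge_mutations).
Tree = Union[str, Tuple]

def mutate(seq: str, n_mutations: int) -> str:
    """Apply an exact number of point mutations; each picks a random site and
    replaces its symbol with a uniformly chosen different symbol."""
    s = list(seq)
    for _ in range(n_mutations):
        pos = random.randrange(len(s))
        s[pos] = random.choice([c for c in ALPHABET if c != s[pos]])
    return "".join(s)

def evolve(tree: Tree, ancestor: str, leaves: Dict[str, str]) -> None:
    """Recursively generate a sequence for every leaf of the model tree."""
    if isinstance(tree, str):
        leaves[tree] = ancestor
        return
    left, left_len, right, right_len = tree
    evolve(left, mutate(ancestor, left_len), leaves)
    evolve(right, mutate(ancestor, right_len), leaves)

# A toy four-leaf analogue of the symmetric design (not the 32-leaf default):
# short internal edges of 5 mutations, long terminal edges of 80 mutations.
root = "".join(random.choice(ALPHABET) for _ in range(200))
model = (("T1", 80, "T2", 80), 5, ("T3", 80, "T4", 80), 5)
leaf_seqs: Dict[str, str] = {}
evolve(model, root, leaf_seqs)

As in the experiments, an edge length here is the exact number of mutations applied along that edge rather than a per-site mutation probability.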
To ensure the information available to reconstruct this deepest bifurcation is limited, our default topology has short internal branch lengths of 5 mutations while branches near the leaves have 80 mutations. A graphical depiction of this tree topology is given in Figure 5.2.

Figure 5.2: Our default 32-leaf symmetric tree topology.

We study several variants of our default settings in order to understand how different factors affect our methodology. One of the factors we vary is the tree size; we also study trees with 16 leaves and 64 leaves. When we restrict to 16 leaves, we eliminate the first internal branch from Figure 5.2. When we expand to 64 leaves, we merge two of the 32-leaf trees by adding new branches with 5 mutations from the new root to each of the old roots. A second factor we vary is the number of symbols in the character set, as we also consider a size-20 character set to mimic phylogeny reconstruction from amino acid sequences. When we use this expanded character set, we examine both trees identical to the default tree and ones where the terminal branches have 140 mutations instead of 80. This extension is necessary because the larger alphabet size reduces the probability of homoplasy, so more mutations must occur to increase the difficulty of the reconstruction [Semple and Steel, 2002]. If a Neighbor Joining algorithm with uniform character weighting reconstructs a tree perfectly, it is impossible for us to show any improvement. A third factor that we vary is the symmetry of the tree. Our default tree is symmetric, which means the deepest branch is an equal bifurcation. We also study asymmetric trees to determine if our method works when there is no bifurcation that partitions the sequences into two equal-sized groups. Assuming that the correct topology is a binary tree, we note that any binary tree will contain a partition (though not necessarily the deepest partition) that divides the sequences into two groups where the smaller group contains at least one third of all the sequences. We test our method using an asymmetric tree topology of 32 sequences where the smaller partition of the deepest bifurcation has 12 (roughly 1/3) of all the sequences. This asymmetric tree is graphically depicted in Figure 5.3.

Figure 5.3: Our asymmetric tree topology with 32 leaves.

For each of the experiments described above, we measure the number of sites that have each rating and produce a bar graph to show their distribution. We then overlay each bar graph with a plot depicting the accuracy with which the most internal bifurcation can be reconstructed, assuming we set the rating threshold t to the value on the x-axis. This accuracy indicates the frequency with which this bifurcation was reconstructed perfectly (out of the 1000 sets of sequences tested per point). We also include a 95% confidence interval around each accuracy plot. We note that the performance of Neighbor Joining using a uniform weighting scheme can be observed for thresholds t where either no sites receive weight 2 or all sites receive weight 2. This second option is guaranteed to occur when t = 0, so we do not provide a separate plot of the performance of Neighbor Joining with a uniform weighting scheme in our figures.

5.3 Related Work

Many character weighting and character-state weighting techniques have been used in phylogeny reconstruction. For example, an extreme form of character weighting is to eliminate characters associated with introns and hyper-variable regions.
A close analogy is the method of successive weighting [Farris, 1969; Carpenter, 1994], which refines an initial tree estimate by reducing the weight of characters that appear to be homoplastic on that tree, so that the remaining signal stands out more strongly. For instance, mutations occur more frequently at the third position of a codon than at either of the first two, so nucleotides in the third codon position are often given a lower weight. Researchers assume that sites with higher mutation rates produce a high level of homoplasy, which reduces reconstruction accuracy. The problem with this assumption is that, while the highly variable sites could confuse the reconstruction of deeper relationships, they contain the most useful information for shallower branch points. Our approach of weighting characters to target a specific bifurcation seems unique.

The concept behind character-state weighting techniques is to assign weights based on state transformations within a site. Nucleotide transitions are frequently down-weighted relative to transversions in phylogenetic analysis. This is based on the assumption that transitions, by virtue of their greater evolutionary rate, exhibit relatively more homoplasy and are therefore less reliable as phylogenetic characters. Six-parameter parsimony is another method of character-state weighting, where a cost is assigned to the transformation from any character state to any other [Williams and Fitch, 1990] based on observed frequencies in the data set, as reconstructed on an initial tree (obtained from unweighted parsimony). The rationale for this weighting is that frequent changes are considered more likely than rare changes to have experienced homoplasy.

Milosavljević, Jurka, and colleagues [1989; 1993] generalized parsimony and compatibility approaches with a minimal length encoding principle to develop phylogeny reconstruction approaches that can be used to identify bifurcations. Specifically, if introducing a bifurcation reduces the number of bits needed to encode the phylogeny, then this provides support for the existence of that bifurcation. They applied this approach to identify new subfamilies of Alu sequences [Milosavljević and Jurka, 1993]. Our technique does not explicitly compare different phylogenies but simply assigns a rating to characters based on the distribution of their character states.

5.4 Results and Discussion

5.4.1 Results with Default Settings

As noted in our methods section, our default settings are 32 leaves, 4 character states, 200 characters, and a symmetric tree topology. We performed 1000 replicates for this setting. The distribution of sites with rating t is plotted in the gray bar graph in Figure 5.4. The distribution appears approximately normal with a slight right skew. Since we are working with 32 sequences, the theoretical range for this rating is from 0 (if the site is identical across all of the sequences) to 16 (if the top two abundances are both 16), with 8 being the expected value if all four possible symbols are found with identical frequency. In the distribution here, few sites have a rating greater than 10 or less than 3. For our method, we double the weight of all sites that have a rating at or above a certain threshold, but we still need to specify what that threshold should be. Ideally we want to choose a threshold that will maximally improve our ability to correctly reconstruct the tree.
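Before turning to the results, a small sketch may help make the weighting concrete. The code below is illustrative only: the toy alignment and all names are invented, and rather than rerunning the full Neighbor Joining experiment it simply reports how the (t, 2) scheme changes the average between-group versus within-group distance for a known deep split.

from collections import Counter
from itertools import combinations

def weighted_distance(a, b, ratings, t, w=2):
    """Mismatches at sites rated at or above t count w times; others count once."""
    return sum((w if r >= t else 1) for x, y, r in zip(a, b, ratings) if x != y)

# Toy alignment in which the true deepest split is {A1..A4} versus {B1..B4};
# columns 0-1 mark that split, columns 2-4 are low-rated homoplastic noise.
seqs = {"A1": "ACTAGC", "A2": "ACCAAC", "A3": "ACCGGC", "A4": "ACCAGC",
        "B1": "GTCAAC", "B2": "GTTAGC", "B3": "GTCAGC", "B4": "GTCGGC"}

ratings = []
for i in range(len(next(iter(seqs.values())))):
    counts = Counter(s[i] for s in seqs.values()).most_common()
    ratings.append(counts[1][1] if len(counts) > 1 else 0)
# ratings == [4, 4, 2, 2, 2, 0]

group = {name: name[0] for name in seqs}
for t in (5, 4, 2, 0):      # t = 5 doubles nothing, so it reproduces uniform weighting
    between, within = [], []
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        d = weighted_distance(s1, s2, ratings, t)
        (between if group[n1] != group[n2] else within).append(d)
    print(t, sum(between) / len(between), sum(within) / len(within))

In this toy data, doubling only the sites rated 4 (t = 4) stretches the between-group distances while leaving the within-group distances unchanged, whereas doubling every site (t = 0) merely rescales both and is equivalent to uniform weighting; the accuracy curves reported below show the analogous behavior.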
For our default tree (with 32 leaves, generated using length-200 sequences from an alphabet of size 4), the gray bar graph in Figure 5.4 displays the number of sites for each rating (i.e., this bar graph shows the number of sites that have a given abundance of the second most frequent symbol).

Figure 5.4: Measurements on the default tree. The bar graph indicates the number of sequence positions out of 200 that are assigned each rating (scale on right side of figure); the solid line displays the average reconstruction accuracy of the most internal bifurcation (scale on left side of figure); the dashed lines define the 95% confidence interval around our accuracy tests.

The resulting reconstruction accuracy of the most internal bifurcation when we utilize our (t, 2) weighting scheme is displayed as the black line in Figure 5.4. There are no positions that have exactly 16 of one symbol and 16 of another symbol, so the start point (57.2%) actually represents the reconstruction accuracy of the most internal bifurcation with a uniform weighting. The accuracy steadily improves as we lower our threshold t, hitting a maximum of 76.5% before falling back to the original accuracy when the weights of all sites are doubled. The largest improvement occurs when we set our threshold to 9. The fact that the accuracy of our scheme with (t, 2) weighting is always greater than the accuracy with uniform weighting, even when t is small, suggests that our rating of sites with respect to their information content for the deepest bifurcation is accurate. In particular, the weight given to the highest rated sites overcomes any penalty given to the low rated sites.

5.4.2 The Effect of Tree Size

To test the effect of tree size on our method, we compared our experimental results from the 16-leaf tree to the 32-leaf tree to the 64-leaf tree. The results for the 16-leaf tree are displayed in Figure 5.5, and the results for the 64-leaf tree are depicted in Figure 5.6. In the 16-leaf tree, the uniform weighted accuracy is 62.8% and the maximum accuracy occurs when t = 5 at 71.5%, for a maximum improvement of 8.7%. In the 64-leaf tree, the uniform weighted accuracy is 59.0% and the maximum accuracy occurs when t = 17 at 79.3%, for a maximum improvement of 20.3%.

Figure 5.5: Measurements on the tree with 16 leaves. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

As can be seen from these figures, the same general pattern holds for all tree sizes. The improvement, though, increases as we increase tree size. For example, the maximum improvement for the 16-leaf tree is only 8.7%, as compared to 19.3% for the 32-leaf tree and 20.3% for the 64-leaf tree. We also can infer some information about the optimal threshold t to use.
The optimal maximum accuracy in all three figures always occurs at a threshold t that is larger than the most common rating; that is, the accuracy always peaks before the highest point of the bar graph. For example, Figure 5.4 shows that the most common rating for sites is six, while the greatest accuracy is reached when we double the weights of all positions with a rating greater than or equal to nine. Similarly, in Figure 5.6, eleven is the most common rating, while the maximum accuracy occurs at a threshold of seventeen. In fact, in all three cases, the maximum reconstruction accuracy occurs when the threshold t = n/4 + 1, where n is the number of leaves in the tree, though we do not have a good explanation for this phenomenon.

Figure 5.6: Measurements on the tree with 64 leaves. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

5.4.3 The Effect of Alphabet Size

The nature of tree reconstruction changes when we use a different alphabet. When we have a large alphabet, the tree will typically have fewer homoplasies, since the probability of randomly mutating into the same state is reduced. To compensate for this effect, we extend the external branch lengths on our tree with 32 leaves from 80 to 140 mutations when using an alphabet size of 20, the number of distinct amino acids in protein sequences. The bar graph and the reconstruction accuracy of the most internal bifurcation after weighting are shown in Figure 5.7. For this alphabet size, the uniform weighted accuracy is 53.0% and the maximum accuracy occurs when t = 5 at 79.2%, for a maximum improvement of 26.2%. This improvement is quite striking because it starts from a reconstruction accuracy lower than that in our default setup, but rises to an even higher value after the weights are applied. Thus, this technique seems even more effective for larger alphabet sizes.

Figure 5.7: Measurements on the tree constructed using sequences with a larger alphabet size of 20. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

Comparing the results from the two groups of experiments, we see that the bar graph for the experiments with a 4-character alphabet (Figure 5.4) peaks at a higher rating than the bar graph for the experiments with an alphabet size of 20. This effect is due to the fact that a larger alphabet increases the chance that a site will be homoplasy-free. As in previous experiments, the accuracy always peaks at a higher threshold than the most common rating, so it is not surprising that the reconstruction accuracy plot for an alphabet size of four hits its peak earlier than the one for an alphabet size of 20.

5.4.4 The Effect of Asymmetry

In all of our experiments so far, we used a symmetric tree, which meant that the deepest bifurcation would always be an equal bifurcation.
We now consider the effect when no symmetric bifurcation exists in the tree. As noted earlier, there will always exist a bifurcation (though not necessarily the deepest one) that partitions the tree so that the smaller group contains at least 1/3 of the leaves; see Appendix B for a proof of this assertion. To study the reconstruction potential of a tree with this worst-case split, we evaluate our method on a 32-leaf tree where the deepest bifurcation partitions the leaves into a set of 12 and a set of 20. This tree is depicted in Figure 5.3. The internal branch lengths are proportional to the lengths shown in the graph, with the shortest equal to 5 mutations, the medium equal to 10, and the longest equal to 20. All of the external branch lengths are equal to 80. For alphabet size 4, the reconstruction accuracy of the deepest bifurcation is 57.3% with a uniform weighting and peaks with t = 8 at 73.5%, as shown in Figure 5.8.

Figure 5.8: Measurements on the asymmetric tree. The bar graph indicates the number of sequence positions that are assigned each rating; the solid line displays the average reconstruction accuracy of the most internal bifurcation; the dashed lines define the 95% confidence interval around our accuracy tests.

Our explanation for why our method still functions correctly is as follows. If all three subpopulations maintain the same character state as the ancestral state, this character is not informative at all. If there is a mutation along the leftmost internal branch and it persists all the way down to the leaf nodes of cluster A, but there is no such mutation at this position in clusters B or C, then doubling the weight at this position will elongate the distance between subpopulation A and all other leaves. In the ideal case, we would expect such a position to have a rating of 12 (with the most abundant symbol having a frequency of 20). Even with some noise added to this process, a rating of nine (which is the optimal rating threshold) is very likely for these informative sites. A parallel argument can be made for the branch that separates B from A and C, as well as the one that separates C from A and B. Although reinforcing the reconstruction of these branches does not directly elongate our test branch, the doubling of some sites increases the length of the branch to subtree B while the doubling of others increases the length of the branch to subtree C. The more accurate classification of subtrees B and C also decreases the likelihood that one of their leaves is directly grouped into subtree A.

5.5 Conclusions

Our experimental results demonstrate that increasing the weight of specific sites consistently improves the reconstruction accuracy of deep bifurcations under a variety of conditions. Furthermore, the technique proves to be more effective as tree size or alphabet size increases. It can be used as the basis of phylogeny reconstruction algorithms that apply different weights to sequence positions depending on the portion of the tree currently being reconstructed. In particular, we plan to adapt our technique of identifying deep bifurcations into a recursive top-down phylogeny reconstruction algorithm.
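To make this planned top-down strategy concrete, the following toy sketch (Python, illustrative only) shows the intended control flow. The split step here is a deliberately simplistic stand-in, splitting on the single highest-rated site, for the reweighted Neighbor Joining step that the full algorithm would use, and the data and names are invented.

from collections import Counter

def find_deep_split(names, seqs):
    """Stand-in for a reweighted Neighbor Joining step: split the current set of
    sequences into the two groups defined by its single highest-rated site."""
    length = len(next(iter(seqs.values())))
    best_site, best_rating = 0, -1
    for i in range(length):
        counts = Counter(seqs[n][i] for n in names).most_common()
        rating = counts[1][1] if len(counts) > 1 else 0
        if rating > best_rating:
            best_site, best_rating = i, rating
    top_state = Counter(seqs[n][best_site] for n in names).most_common(1)[0][0]
    left = [n for n in names if seqs[n][best_site] == top_state]
    right = [n for n in names if seqs[n][best_site] != top_state]
    return left, right

def top_down(names, seqs):
    """Recursively reconstruct a nested-tuple topology one bifurcation at a time,
    recomputing the informative sites for each subset of sequences."""
    if len(names) == 1:
        return names[0]
    left, right = find_deep_split(names, seqs)
    if not left or not right:      # no informative site remains; stop refining this clade
        return tuple(names)
    return (top_down(left, seqs), top_down(right, seqs))

seqs = {"A1": "ACTAGC", "A2": "ACCAAC", "A3": "ACCGGC", "A4": "ACCAGC",
        "B1": "GTCAAC", "B2": "GTTAGC", "B3": "GTCAGC", "B4": "GTCGGC"}
print(top_down(sorted(seqs), seqs))   # the first split recovers {A1..A4} versus {B1..B4}

The point the sketch captures is that the informative sites (and hence the weights) are recomputed for every subset of sequences, so each recursive call targets the bifurcation that is deepest for that subset.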
82 Chapter 6 FUTURE WORK In this thesis, I have focused on designing methods to analyze complexity in living system, using digital organisms as my test case. The next step of my research will involve the refinement of these methods. For organism complexity, I will examine multiple sites at once in an attempt to decipher epistatic interactions and identify redundant information or correlated sites in a genome. Genetically, the adjacent sites have the highest probability to be correlated with each other and generate linkage disequilibrium; not all combinations of symbols have equal probability. Calculating the information in adjacent sites should significantly improve the accuracy of this complexity measurement. With current computer speeds, we can calculate the distri- bution of the combinations of instructions for three sites, which would require testing 17,576 mutants to calculate the information in a single line of a genome. Given a genome of length 100, this procedure would take about one hour. In addition to devising methods of more accurately calculating the information content of a genome, we are also interested in designing good approximations that can be computed quickly. One such way is to make a binary decision on the infor- mation content of each site. We knock out the site (replacing it with an inert NULL instruction) and if the organism’s fitness is reduced, we count this line as 1 CU. If this knockout had no effect on the organism we assume that no information was being stored at this position. Such knockout tests provide a dramatic speed increase since they only require us to test a single mutant per position in the genome. Initial tests have indicated that they provide only a minor loss of accuracy. 83 Knockout tests also have an additional advantage. Sometimes an instruction is not functional for a genome, but when we generate all possible mutations, other instruction at that position may influence how the genome functions. This may create an illusion of information at that site, resulting in a positive skew to our complexity calculation. Knockout tests, however, do not suffer from this. Additionally, multiple knockouts are not expensive to perform and can be used to detect redundancy. For example, if we find two sites that can be knocked out independently without harmful effect to the organism, but in combination they result in the loss of a trait, then clearly these sites were involved in the redundant encoding of that trait. To improve our measurements of community complexity, I will preprocess each or- ganism to determine which sites contain information. When determining if organism A contains information about B, I will only need to test those sites that are indi- vidually informative in both to determine if this is the same information. Overall, this should improve both the speed of the algorithm and its accuracy—we’ll no longer have false correlations adding to our community complexity where no real information exists. This speed increase will also allow us to compare many organisms at once. Each time a new organism’s unique information is calculated, we can compare it to more than just one previously tested organism. Next, I plan to use the community complexity measure to better understand the evolution of complexity in natural systems. I hope to collaborate with the people from the Department of Microbiology and Molecular Genetics at Michigan State University and measure the complexity of the bacterial communities from different environments. 
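As a concrete illustration of the knockout approximation described above, the sketch below counts one CU for every site whose single knockout lowers fitness, and flags pairs of individually dispensable sites whose joint knockout destroys function. The genome representation, instruction names, and fitness function are invented stand-ins for this illustration; in practice the fitness of each knockout genome would come from executing the organism in Avida.

from itertools import combinations
from typing import Callable, List, Tuple

NULL = "null"   # inert placeholder instruction used for knockouts

def knockout(genome: List[str], *sites: int) -> List[str]:
    """Return a copy of the genome with the given sites replaced by NULL."""
    copy = list(genome)
    for s in sites:
        copy[s] = NULL
    return copy

def knockout_complexity(genome: List[str], fitness: Callable[[List[str]], float]) -> int:
    """Binary approximation: each site whose knockout lowers fitness counts as 1 CU."""
    base = fitness(genome)
    return sum(1 for i in range(len(genome)) if fitness(knockout(genome, i)) < base)

def redundant_pairs(genome: List[str], fitness: Callable[[List[str]], float]) -> List[Tuple[int, int]]:
    """Pairs of sites that are individually dispensable but jointly required."""
    base = fitness(genome)
    neutral = [i for i in range(len(genome)) if fitness(knockout(genome, i)) >= base]
    return [(i, j) for i, j in combinations(neutral, 2)
            if fitness(knockout(genome, i, j)) < base]

# Toy stand-in fitness: the organism needs "nand" at site 0 and at least one
# "nop" among sites 3 and 4 (so sites 3 and 4 redundantly encode that trait).
def toy_fitness(genome: List[str]) -> float:
    return float(genome[0] == "nand" and "nop" in (genome[3], genome[4]))

genome = ["nand", "inc", "add", "nop", "nop", "swap"]
print(knockout_complexity(genome, toy_fitness))   # -> 1 (only site 0 is individually required)
print(redundant_pairs(genome, toy_fitness))       # -> [(3, 4)]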
The goal of this project is to evaluate the environmental impact on the information content of the genomes. For the character-weighting technique in Chapter 5, several directions remain to be explored before the full power of our methodology can be realized. First, I will examine how well this methodology applies to other reconstruction techniques besides 84 Neighbor Joining. I see no reason why it should not, but this needs to be verified experimentally. Second, given our (t, 2) weighting scheme, I would like a reliable and automated method for determining the optimal threshold t. Alternatively, I also plan to consider continuous schemes where a higher rated site is given strictly higher weight than any lower rated site. I am planning to adapt our technique to turn any phylogeny reconstruction tech- nique such as Neighbor Joining into a top-down recursive phylogeny reconstruction algorithm. At the beginning of each recursive call, I use our technique to weight each site of the sequences being reconstructed. Given these weights, I then use a phylogeny reconstruction algorithm such as Neighbor Joining to identify a deep bi- furcation. Given this bifurcation, I partition the sequences into two distinct subpop- ulations and recursively apply our technique on each set of sequences. The unique aspect of this approach is that it focuses on reconstructing the tree one bifurcation at a time starting with deep bifurcations. Since the informative sites for each bifurcation are different, the weights generated by our algorithm will change with each recursive call. The main challenge for applying this technique is that when the subtrees are rebuilt, there is not an easy way to identify the attachment points for the branch that connects them back together. I plan to address this issue by exploring methods used in rooting unrooted trees. For example, I will consider using samples from the other subtree as an outgroup. 85 Appendix A ENVIRONMENT SETUP A. 
1 Environment I

REACTION NOT not process:value=1.0:type=pow requisite:max_count=1
REACTION NAND nand process:value=1.0:type=pow requisite:max_count=1
REACTION AND and process:value=2.0:type=pow requisite:max_count=1
REACTION ORN orn process:value=2.0:type=pow requisite:max_count=1
REACTION OR or process:value=3.0:type=pow requisite:max_count=1
REACTION ANDN andn process:value=3.0:type=pow requisite:max_count=1
REACTION NOR nor process:value=4.0:type=pow requisite:max_count=1
REACTION XOR xor process:value=4.0:type=pow requisite:max_count=1
REACTION EQU equ process:value=5.0:type=pow requisite:max_count=1

A.2 Environment II

RESOURCE resNOT:inflow=100:outflow=0.01
RESOURCE resNAND:inflow=100:outflow=0.01
RESOURCE resAND:inflow=100:outflow=0.01
RESOURCE resORN:inflow=100:outflow=0.01
RESOURCE resOR:inflow=100:outflow=0.01
RESOURCE resANDN:inflow=100:outflow=0.01
RESOURCE resNOR:inflow=100:outflow=0.01
RESOURCE resXOR:inflow=100:outflow=0.01
RESOURCE resEQU:inflow=100:outflow=0.01
REACTION NOT not process:resource=resNOT:value=1.0:frac=0.0025:max=25
REACTION NAND nand process:resource=resNAND:value=1.0:frac=0.0025:max=25
REACTION AND and process:resource=resAND:value=2.0:frac=0.0025:max=25
REACTION ORN orn process:resource=resORN:value=2.0:frac=0.0025:max=25
REACTION OR or process:resource=resOR:value=4.0:frac=0.0025:max=25
REACTION ANDN andn process:resource=resANDN:value=4.0:frac=0.0025:max=25
REACTION NOR nor process:resource=resNOR:value=8.0:frac=0.0025:max=25
REACTION XOR xor process:resource=resXOR:value=8.0:frac=0.0025:max=25
REACTION EQU equ process:resource=resEQU:value=16.0:frac=0.0025:max=25

A.3 Environment III

RESOURCE resNOT:inflow=100:outflow=0.01
REACTION NOT not process:resource=resNOT:value=1.0:frac=0.0025:max=25

Appendix B

PROOF OF A NEARLY-BALANCED BIFURCATION

We assert that in any binary tree there exists a bifurcation that splits the tree into two subtrees where the smaller of the subtrees has at least 1/3 of the leaves of the full tree. We prove this assertion by contradiction. Consider an arbitrary binary tree T with N leaves. Every bifurcation B partitions the tree into two subtrees T1 and T2 with N1 and N2 leaves respectively, where, without loss of generality, N1 ≤ N2. Assume our assertion is false. This means that for any bifurcation B in T, N1 < N/3. Consider the bifurcation Bmax that has the largest number of leaves in its smaller subtree, breaking ties arbitrarily. The root of the subtree T2 has two other bifurcations connected to it that divide subtree T2 into two smaller subtrees T3 and T4 with N3 and N4 leaves respectively, where 1 ≤ N3 ≤ N4. Let B' be the bifurcation that separates subtree T4 from the rest of the tree. It must be the case that N4 > 2N/3, or else B' would be such a bifurcation: one of its sides would contain N4 > N/3 leaves (since N4 ≥ N2/2 and N2 > 2N/3) and the other would contain N - N4 ≥ N/3 leaves. Given that N4 > 2N/3, the smaller side of B' contains N1 + N3 = N - N4 leaves (it is indeed the smaller side, since N - N4 < N/3 < N4), and N1 + N3 is strictly larger than N1 because N3 ≥ 1. This contradicts our assumption that Bmax is the bifurcation with the largest number of leaves in its smaller subtree, whose size is N1. Thus our assertion is true.

BIBLIOGRAPHY

Adami, C., and N. Cerf, Physical complexity of symbolic sequences, Physica D, 137, 62–69, 2000.

Adami, C., C. Ofria, and T. C. Collier, Evolution of biological complexity, Proc. Natl. Acad. Sci. USA, 97, 4463–4468, 2000.

Aguinaldo, A. M. A., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake, Evidence for a clade of nematodes, arthropods and other moulting animals, Nature, 387, 489–493, 1997.

Anand, M., and L.
Orléci, Complexity in plant communities: the notion and quan- tification, J. Theor. Biol, 179, 179—186, 1996. Bennett, C. H., Logical depth and physical complexity, in The Universal Turing Ma- chine, A Half- Century Survey, edited by R. Herken, pp. 227—257, Oxford University Press, Oxford, 1988. Carpenter, J. M., Successive weighting, reliability and evidence, Cladistics, 10, 215— 220, 1994. Chow, S. S., C. O. Wilke, C. Ofria, R. E. Lenski, and C. Adami, Adaptive radiation from resource competition in digital organisms, Science, 305, 84—86, 2004. Cooper, T. F., and C. Ofria, Evolution of stable ecosystems in populations of digital organisms, in Eighth International Conference on Artificial Life, edited by M. B. RK Standish and H. Abbass, vol. 119, pp. 227—232, MIT Press, Boston, MA, 2002. Cover, T. M., and J. A. Thomas, Elements of Information Theory, Sinauer Associates, 1991. Darwin, C., On the Origin of Species by Means of Natural Selection, Murray, London, 1859. Dumpis, U., et al., An outbreak of hbv and hcv infection in a paediatric oncology ward: Epidemiological investigations and prevention of further spread, J. Med. Virol., 69, 331—338, 2003. Farris, J. S., A successive approximations approach to character weighting, Syst. Zool., 18, 374—385, 1969. Goertzel, B., The Evolving Mind, Routledge, 1993. Goldsmith, T., Optimization, constraint, and history in the evolution of eyes, Quart. Rev. Biol, 65, 281—322, 1990. 90 Gould, S. J ., Wonderful Life: The Burgess Shale and the Nature of History, Penguin, 1989. Gusfield, D., Efficient algorithms for inferring evolutionary trees, Networks, 21, 19- 28, 1991. Hennig, W., D. D. Davis, and R. Zangeri, Phylogenetic Systematics, University of Illinois Press, Champaign, Illinois, 1979. Heywood, V. H. (Ed.), Global biodiversity assessment, p. 8, Cambridge University Press, Cambridge, UK, 1995. Hinegardner, R., and J. Engelberg, Biological complexity, Journal of Theoretical Bi- ology, 104, 7—20, 1983. Lenski, R. E., C. Ofria, R. T. Pennock, and C. Adami, The evolutionary origin of complex features, pp. 139—144, 2003. Li, M., and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applica- tions, Springer, 1997. Magurran, A. E., Measuring biological diversity, Blackwell Publishing, 2003. Meléndez-Hevia, E., T. G. Waddell, and M. Cascante, The puzzle of the krebs citric acid cycle: assembling the pieces of chemically feasible reactions, and opportunism in the design of metabolic pathways during evolution, J. Mol. Evol, 43, 293—303, 1996. Milosavljevic’, A., and J. J urka, Discovery by minimal length encoding: A case study in molecular evolution, Machine Learning, 12, 69—87, 1993. Milosavljevié, A., D. Haussler, and J. Jurka, Informed parsimonious inference of prototypical genetic sequences, in Proceedings of the Workshop on Computational Learning Theory (COLT), pp. 102—117, 1989. Mivart, S. G. J ., On the genesis of species, D. Appleton and Co, Macmillan, London, 1871. Newcomb, R. D., P. M. Campbell, D. L. Ollis, E. Cheah, R. J. Russell, and J. G. Oakeshott, A single amino acid substitution converts a carboxylesterase to an organophosphorus hydrolase and confers insecticide resistance on a blowfly, Proc. Natl. Acad. Sci. USA, 94, 7464—7468, 1997. Nilsson, D., and S. Pelger, A pessimistic estimate of the time required for an eye to evolve, Proc. Roy. Soc. London, 256, 53—58, 1994. Ofria, C., and C. O. Wilke, Avida: A software platform for research in computational evolutionary biology, Artificial Life, 10, 191—229, 2004. 91 Ofria, C., C. 
Adami, and T. Collier, Selective pressures on genomes in molecular evolution, Journal of Theoretical Biology, 222, 477–483, 2003.

Pielou, E. C., An Introduction to Mathematical Ecology, chap. Ecological diversity and its measurement, John Wiley & Sons Inc, 1969.

Rambaut, A., and N. C. Grassly, Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees, Comput. Appl. Biosci., 13, 235–238, 1997.

Ray, T. S., Evolution, complexity, entropy, and artificial reality, Physica D, 75, 239–263, 1994.

Saitou, N., and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol., 4, 406–425, 1987.

Salvini-Plawen, L., and E. Mayr, On the evolution of photoreceptors and eyes, Evol. Biol., 10, 207–263, 1977.

Schneider, T., Evolution of biological information, Nucleic Acids Research, 28, 2794–2799, 2000.

Semple, C., and M. Steel, Tree reconstruction from multi-state characters, Adv. Appl. Math., 28, 169–184, 2002.

Sen, T. Z., et al., Predicting binding sites of hydrolase-inhibitor complexes by combining several methods, BMC Bioinformatics, 5, 205, 2004.

Shannon, C. E., A mathematical theory of communication, The Bell System Technical Journal, 27, 379–423, 623–656, 1948.

Smith, J. M., Byte-sized evolution, Nature, 355, 772–773, 1992.

Soltis, D. E., and P. S. Soltis, The role of phylogenetics in comparative genetics, Plant Physiol., 132, 1790–1800, 2003.

Williams, P. L., and W. M. Fitch, Phylogeny determination using dynamically weighted parsimony method, Methods Enzymol., 183, 616–626, 1990.

Williford, A., B. Stay, and D. Bhattacharya, Evolution of a novel function: nutritive milk in the viviparous cockroach, Diploptera punctata, Evolution & Development, 6, 67–77, 2004.

Woese, C. R., O. Kandler, and M. L. Wheelis, Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya, Proc. Natl. Acad. Sci. USA, 87, 4576–4579, 1990.