STRUCTURE AND EVOLUTIONARY DYNAMICS IN FITNESS LANDSCAPES By Anuraag R. Pakanati A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy Ecology, Evolutionary Biology, and Behavior—Dual Major 2015 ABSTRACT STRUCTURE AND EVOLUTIONARY DYNAMICS IN FITNESS LANDSCAPES By Anuraag R. Pakanati Evolution can be conceptualized as an optimization algorithm that allows populations to search through genotypes for those that produce high fitness solutions. This search process is commonly depicted as exploring a fitness landscape, which combines similarity relationships among genotypes with the concept of a genotype-fitness map. As populations adapt to their fitness landscape, they accumulate information about the fitness landscape in which they live. A greater understanding of evolution on fitness landscapes will help elucidate fundamental evolutionary processes. I examine methods of estimating information acquisition in evolving populations and find that these techniques have largely ignored the effects of common descent. Since information is estimated by measuring conserved genomic regions across a population, common descent can create a severe bias by increasing similarities among unselected regions. I introduce a correction method to compensate for the effects of common descent on genomic information and empirically demonstrate its efficacy. Next, I explore three instantiations of NK, Avida, and RNA fitness landscapes to better understand structural properties such as the distribution of peaks and the size of basins of attraction. I find that the fitness of peaks is correlated with the fitness of peaks within their neighborhood, and that the size of peaks’ basins of attraction tends to be proportional to the heights of the peaks. Finally, I visualize local dynamics and perform a detailed comparison between the space of what evolutionary trajectories are technically possible from a single starting point and the results of actual evolving populations. Copyright by ANURAAG R. PAKANATI 2015 ACKNOWLEDGMENTS I would like to acknowledge many people for their assistance in my journey here. My advisor, Dr. Charles Ofria, was the reason I came to Michigan State. He has provided persceptive guidance, detailed feedback, and an absolutely incredible environment of freedom in which I was able to pursue my research directions in my own peculiar and stubborn way. I also am grateful for my co-advisor, Dr. Chris Adami, who arrived at Michigan State at a fortuitious time for me, and who proved invaluable in helping develop and guide my research. Furthermore, I would like to thank my committee members, Dr. Erik Goodman and Dr. Robert Pennock, who read through drafts and provided insightful feedback that helped shape my work for the better, most notably exhorting me to go beyond studying Avida fitness landscapes. I am grateful for the unique environment for interdisciplinary research that has existed at MSU embodied by both the EEBB program and the BEACON Center. I could not have performed this research without resources provided by the HPCC at MSU and a cluster funded by the NSF. I thank current and past Devolab members, and those in the Ofria, Adami, Lenski and SENS labs, many of whom provided valuable feedback along the way. I worked with many people over the past few years, but work with Christopher Strelioff and Bjorn Østman particularly influenced many ideas explored in this work. I am also indebted to those who provided companionship and who proved generous with their hard-earned experience: Chad Byers, Andres Ramirez, Luis Zaman, Jory Schossau, and Brian Connelly, among others. Finally, I would like to acknowledge my parents, Shraven Pakanati and Sreekala Pakanati, my sister, Sheena Pakanati, and Liz Duval for their unyielding support and encouragement without which I could never have completed my research. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 4 Chapter 2 Estimating Complexity in Finite Populations . . . . 2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Work & Mathematical Theory . . . . . . . . . 2.2.1 Information Theory & Complexity . . . . . . . . . 2.2.2 Biological Complexity . . . . . . . . . . . . . . . . . 2.2.3 Entropy Correction . . . . . . . . . . . . . . . . . . 2.2.4 Neutral Theory & Evolution . . . . . . . . . . . . 2.2.5 Coalescence . . . . . . . . . . . . . . . . . . . . . . . 2.2.6 Linkage Disequilibrium . . . . . . . . . . . . . . . . 2.2.7 Models of Evolution . . . . . . . . . . . . . . . . . . 2.2.7.1 Eigen Model . . . . . . . . . . . . . . . . . . . 2.2.7.2 Infinite Alleles Model . . . . . . . . . . . . . . 2.2.7.3 Infinite Sites Model . . . . . . . . . . . . . . . 2.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Common Descent Biases Complexity Estimates 2.3.2.1 Rank Order Analysis . . . . . . . . . . . . . . 2.3.2.2 Complexity Estimates . . . . . . . . . . . . . . 2.4 Finite Size Correction . . . . . . . . . . . . . . . . . . . . 2.5 Correcting for Common Descent . . . . . . . . . . . . . 2.5.0.3 Methods . . . . . . . . . . . . . . . . . . . . . 2.5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 8 8 11 12 14 14 15 17 18 19 19 22 23 24 24 26 28 32 33 33 Chapter 3 Fitness Landscapes– Peaks, Ridges, and Plateaus 3.1 Fitness Landscapes . . . . . . . . . . . . . . . . . . . . . 3.2 The Structure of Fitness Landscapes . . . . . . . . . . 3.3 Evolutionary Implications of Fitness Landscapes . . 3.4 Model Systems . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 NK . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Avida . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.3 RNA . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Landscape Summaries . . . . . . . . . . . . . . . . . . . 3.5.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . 3.5.2 NK . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.3 Avida . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 38 41 44 46 47 48 50 52 52 53 56 v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 62 64 66 68 70 72 72 75 77 77 79 80 81 Chapter 4 Basins of Attraction . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . 4.2 Background . . . . . . . . . . . . . . . . . . 4.3 Measuring Basins . . . . . . . . . . . . . . 4.3.1 NK Fitness Landscape Variants . 4.3.2 Basin Flow Algorithm . . . . . . . . 4.3.3 Discussion . . . . . . . . . . . . . . . 4.4 Basins in Landscapes . . . . . . . . . . . . 4.4.1 Trajectories Over Time . . . . . . . 4.4.1.1 NK . . . . . . . . . . . . . . . 4.4.1.2 Avida . . . . . . . . . . . . . . 4.4.1.3 RNA . . . . . . . . . . . . . . 4.4.2 Basin Size . . . . . . . . . . . . . . . 4.4.3 Basin Size Relation With Fitness 4.4.4 Fitness Over Time . . . . . . . . . . 4.4.5 Distance Traveled From Origin . . 4.4.6 Peaks & Basins . . . . . . . . . . . . 4.4.7 Discussion . . . . . . . . . . . . . . . 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 83 84 85 85 86 90 103 104 104 107 109 111 116 122 127 132 138 140 Chapter 5 Visualization of Transient Dynamics . . . . . . . . . . . . . . . . 5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Replaying the tape of life . . . . . . . . . . . . . . . . . . . . . 5.1.2 Fitness Landscapes & Epistasis . . . . . . . . . . . . . . . . . . 5.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Local Landscapes—Random Points . . . . . . . . . . . . . . . . . . . 5.4 Local Landscapes—Peaks . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Repeated Trajectories—Random Points Visualization . . . . . . . 5.6 Repeated Trajectories—Peaks Visualization . . . . . . . . . . . . . 5.7 Repeated Trajectories Random Points—The fate of trajectories . . . . . . . . . . . 143 143 143 144 145 147 157 167 173 178 3.6 3.7 3.8 3.9 3.5.4 RNA . . . . . . . . . . Peaks and Nearby Peaks . 3.6.1 NK . . . . . . . . . . . 3.6.2 Avida . . . . . . . . . . 3.6.3 RNA . . . . . . . . . . Percolation . . . . . . . . . . 3.7.1 NK . . . . . . . . . . . 3.7.2 Avida . . . . . . . . . . 3.7.3 RNA . . . . . . . . . . Autocorrelation of Peaks . 3.8.1 NK . . . . . . . . . . . 3.8.2 Avida . . . . . . . . . . 3.8.3 RNA . . . . . . . . . . Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.8 Repeated Trajectories Peaks—The fate of trajectories . . . . . . . . 184 5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 Chapter 6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 vii LIST OF TABLES Table 4.1: Table 4.2: Table 4.3: Table 4.4: Coefficients of log10 B = β0 + β1 (F ) + β2 (T ) + β3 (F x T ). p-values are in parentheses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Coefficients of log10 B = β0 + β1 (F ) + β2 (S) + β3 (F x S) for ProbGE treatment. p-values are in parentheses. . . . . . . . . . . . . . . . . . 92 Coefficients of log10 B = β0 + β1 (F ) + β2 (S) + β3 (F x S) for ProbGE treatment. p-values are in parentheses. . . . . . . . . . . . . . . . . . 92 This table shows the number of basins before and after the secondary analysis is performed, as well as the mean. Each cell in the table has two entries—the first represents the value for the initial treatment with flow only to mutational neighbors, and the second represents the value of the secondary treatment with flow to genotypes two mutational steps away. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 viii LIST OF FIGURES Figure 2.1: Figure 2.2: Figure 2.3: Figure 2.4: Figure 2.5: Figure 2.6: Figure 1 from [Shannon, 1948] demonstrating a conceptualization of a communication system. . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 1A from [Wang and Lee, 2007] demonstrating background linkage disequilibrium. The covariation of mutation A and R in the sequence alignment is caused by common descent. This can be seen in the phylogenetic tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Frequency rank order for population size of 100. On the x-axis for each plot is length. At each sample in each run, allele frequency per site is ranked and averaged. This is then averaged across all samples per run to condense into a single figure per run. This shows that on average, even in the absence of selection, an imbalance exists on sites due to common descent. Colors come from rank ordering, e.g. in these plots the frequency of the most common allele appears in lime green at the top. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Frequency rank order for population size of 1000. On the x-axis for each plot is length. At each sample in each run, allele frequency per site is ranked and averaged. This is then averaged across all samples per run to condense into a single figure per run. This shows that on average, even in the absence of selection, an imbalance exists on sites due to common descent. This effect is reduced, compared to the smaller population size of 100 in Figure 2.3 . . . . . . . . . . . . . . . . . . 25 Complexity Estimates for population size of 100 in the presence of no selection. On the x-axis is length and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The sample size increases with increasing length, leading to less uncertainty over the estimate. . . . . . . . . . . . . . 26 Complexity Estimates for population size of 1000 in the presence of no selection. On the x-axis is length and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The sample size increases with increasing length, leading to less uncertainty over the estimate. . . . . . . . . . . . . . 27 ix Figure 2.7: Comparison of complexity estimates from [Adami et al., 2000] and expected complexity in mers from neutrality and resulting rank order frequency distribution for corresponding neutral landscapes of length 100 with 28 bases per site. In 2.7a, the complexity is measured over time, from [Adami et al., 2000]. In 2.7b is the distribution of information estimates that occur in a size 3600 population with 28 alleles with no selection. In 2.7c is the distribution of rank ordered alleles, which 1 clearly deviates from the uniform 28 . . . . . . . . . . . . . . . . . . . 28 Figure 2.8: Per-site complexity estimates for neutral populations without selection of size 100 with 4 alleles per site for three different estimators. On the x-axis is mutation rate and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The line represents the ‘true’ entropy from the underlying distribution which should be 1 since there is no selection. The mutation rates vary from 0.0001 on the left to 1 on the right, with semilog scaling. 29 Figure 2.9: Per-site complexity estimates for neutral populations without selection of size 100 with 20 alleles per site for three different estimators. On the x-axis is mutation rate and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The line represents the ‘true’ entropy from the underlying distribution which should be 1 since there is no selection. The mutation rates vary from 0.0001 on the left to 1 on the right, with semilog scaling. 30 Figure 2.10: Relative error compared to true entropy for neutral populations for three different estimators. On the x-axis is length and on the y-axis is information estimates. Each point represents the distribution of information sample means over a single run. . . . . . . . . . . . . . . 31 Figure 2.11: Relative error compared to true entropy for neutral populations for three different estimators with 20 alleles. On the x-axis is length and on the y-axis is information estimates. Each point represents the distribution of information sample means over a single run. . . . . . . . 31 Figure 2.12: Entropy, Naive & Corrected, 4 Alleles . . . . . . . . . . . . . . . . . 34 Figure 2.13: Entropy, Naive & Corrected, 20 Alleles . . . . . . . . . . . . . . . . . 35 Figure 3.1: 39 Figure 2 from [Wright, 1932] with original caption. . . . . . . . . . . x Figure 3.2: Figure 4 from [Wright, 1932] with original caption. Here, 4NU refers to the genome wide mutation rate, whereas 4NS refers to the selective pressure. D and E are relevant for sexual populations and F corresponds to an island model. As Wright points out, differential selection on different islands in F may form the basis for or contribute to the process of speciation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure 3.3: Figure 3.1 from [Kauffman, 1993] . . . . . . . . . . . . . . . . . . . 43 Figure 3.4: Figure 4 from [Gavrilets, 2003] with original caption. . . . . . . . . . 45 Figure 3.5: Comparison of RNA pf fold and fold landscapes. In Figure 3.5a, raw data for each landscape is plotted, with each fitness, count pair plotted separately. In Figure 3.5b, I discretize fitnesses of pf fold by rounding fitness to the nearest tenth. The landscapes are highly similar as would be expected since they are both mappings of fold energy within the RNA landscape. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Phenotypic Summary, NK Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. The neutral peaks line in green is overlaid by the peaks in blue since because there is so little neutrality in the landscape, the two sets are essentially identical. 54 Box & violin plots of genotype fitness values on the NK landscape for all genotypes, peak genotypes. . . . . . . . . . . . . . . . . . . . . . 55 Phenotypic Summary, Avida Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. . . . . . . . . 57 Box & violin plots of genotype fitness values on the Avida landscape for all genotypes, peak genotypes. . . . . . . . . . . . . . . . . . . . 58 Figure 3.10: Phenotype of fitness values on the length 18, reduced instruction set landscape. Phenotypes are ordered by fitness. The blue line is log fitness of the phenotype. The green line is the gestation time. The red line is the number of genotypes. The general pattern observable is that for each level of merit, there are several optimizations using nop-C that reduce the execution time and thereby the gestation time, improving fitness. On the bottom, logic tasks performed by each ranked genotype are shown in the following order from bottom to top: NAND, NOT, AND, ORN, OR, ANDN, NOR, XOR, EQU. It can be seen that the most fit organisms perform several tasks rather than doing EQU. . . 59 Figure 3.11: Phenotypic Summary, RNA Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. . . . . . . . . . 60 Figure 3.6: Figure 3.7: Figure 3.8: Figure 3.9: xi Figure 3.12: Box & violin plots of genotype fitness values on the RNA landscape for all genotypes, peak genotypes. . . . . . . . . . . . . . . . . . . . 61 Figure 3.13: Figure 4 from [Østman et al., 2010] with original caption . . . . . . . 63 Figure 3.14: NK data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. 65 Figure 3.15: Avida data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. 67 Figure 3.16: RNA data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. 69 Figure 3.17: Figure 6 from [Østman et al., 2010] with original caption . . . . . . . 71 Figure 3.18: NK data, peak percolation, neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 3.19: Avida data, Peak Percolation with neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 3.20: RNA data, peak percolation, neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 3.21: NK peak autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . 78 Figure 3.22: Avida thresholded peak autocorrelation . . . . . . . . . . . . . . . . 79 Figure 3.23: RNA peak autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . 80 Figure 4.1: Log(basin proportion) as a function of fitness, p = 0.0, Single Mutations. Here density flows only to single mutant neighbors. . . . . . . xii 94 Figure 4.2: Figure 4.3: Figure 4.4: Figure 4.5: Figure 4.6: Log(basin proportion) as a function of fitness, p = 0.0, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. . . . . . . . . . . . . . . . . . . . . . . . 95 Log(basin proportion) as a function of fitness, p = 0.2, Single Mutations. Here density flows only to single mutant neighbors. . . . . . . 96 Log(basin proportion) as a function of fitness, p = 0.2, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. . . . . . . . . . . . . . . . . . . . . . . . 97 Log(Basin Proportion) as a function of fitness, p = 0.5, Single Mutations. Here density flows only to single mutant neighbors. . . . . . . 98 Log(Basin Proportion) as a function of fitness, p = 0.5, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. . . . . . . . . . . . . . . . . . . . . . . . 99 Figure 4.7: Log(basin proportion) as a function of fitness, p=0.8, Single Mutations. Here density flows only to single mutant neighbors. . . . . . . . . . . 100 Figure 4.8: Log(basin proportion) as a function of fitness, p=0.8, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. . . . . . . . . . . . . . . . . . . . . . . . 101 Figure 4.9: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the NK landscape at each of five exponential time points, as a measure of adaptation in basins over time. . . . . . 105 Figure 4.10: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the NK Exponential landscape at each of five exponential time points, as a measure of adaptation in basins over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Figure 4.11: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the Avida landscape at each of five exponential time points, as a measure of adaptation in basins over time. . . . 108 Figure 4.12: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the RNA landscape at each of five exponential time points, as a measure of adaptation in basins over time. . 109 xiii Figure 4.13: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the RNA Exponential landscape at each of five exponential time points, as a measure of adaptation in basins over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Figure 4.14: NK, 100,000 random points, basin size after 10,000 updates. This plot, as exciting as it appears, is not an accident. There are simply 100,000 unique basins as measured by the endpoints of the trajectories. . . . 111 Figure 4.15: NK Exponential, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. . 112 Figure 4.16: Avida, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. . . . . . . 113 Figure 4.17: RNA, 100,000 random points, basin size after 10,000 updates. Similar to the NK landscape, most points are not accumulating in basins; over 98,000 trajectories ended in unique genotypes, with the rest accumulating in basins of size no more than four. . . . . . . . . . . . . . . . 114 Figure 4.18: RNA Exponential, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. . 115 Figure 4.19: NK, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Due to weak selection strength, there is not much relationship between the fitness and basin size. . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Figure 4.20: NK Exponential, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. 118 Figure 4.21: Avida, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Avida seems to have a positive relationship with fitness but the fit is very poor.119 xiv Figure 4.22: RNA, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Due to weak selection strength, there is not much relationship between the fitness and basin size. . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Figure 4.23: RNA Exponential, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. 121 Figure 4.24: NK, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Figure 4.25: NK Exponential, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Figure 4.26: Avida, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 Figure 4.27: RNA, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Figure 4.28: RNA, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Figure 4.29: NK, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. . . . . 127 Figure 4.30: NK Exponential, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Figure 4.31: Avida, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. . 129 Figure 4.32: RNA, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. . 130 xv Figure 4.33: RNA Exponential, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Figure 4.34: NK, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. . . . . . . . . . . . . . . . 133 Figure 4.35: NK Exponential, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. . . . . . . . . 134 Figure 4.36: Avida, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. . . . . . . . . . . . . . . . 135 Figure 4.37: RNA, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. . . . . . . . . . . . . . . . 136 Figure 4.38: RNA Exponential, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. . . . . . . . 137 Figure 5.1: NK Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to random starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 Figure 5.2: NK Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. Since this is the NK landscape, there is very little neutrality in this particular landscape. . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Figure 5.3: NK Random Starting Points, Landscape Visualization. This shows 1,000 NK landscapes starting at random points. The normality of the fitness distribution of the NK landscape is clearly visible and rare mutations in both directions appear at five mutations out. . . . . . . 150 xvi Figure 5.4: Avida Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to randomly-selected starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Figure 5.5: Avida Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. . . . 152 Figure 5.6: Avida Random Starting Points, Landscape Visualization. This shows 1,000 Avida landscapes starting at random points. Rare high fitness states are revealed at a distance of five mutations out. . . . . . . . . 153 Figure 5.7: RNA Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to randomly-selected starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . . . . 154 Figure 5.8: RNA Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. . . . 155 Figure 5.9: RNA Random Starting Points, Landscape Visualization. This shows 1,000 RNA landscapes starting at random points. Rare high fitness states are revealed at a distance of five mutations out. . . . . . . . . 156 Figure 5.10: NK Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . . . 158 xvii Figure 5.11: NK Peaks, Path Counts Simplex. This is a summary of all the possible five-mutation trajectories starting at each of 1,000 sampled peaks. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. . . . . . . . . . . . . . . . . . . . . . . . . . . 159 Figure 5.12: NK Peaks, Landscape Visualization. This shows 1,000 NK landscapes starting at peaks. A very few paths leading to higher fitness peaks can be seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Figure 5.13: Avida Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . 161 Figure 5.14: Avida Peaks Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. . . . . . . . . . . . 162 Figure 5.15: Avida Peaks Landscape Visualization. This shows 1,000 Avida landscapes starting at peaks. . . . . . . . . . . . . . . . . . . . . . . . . . 163 Figure 5.16: RNA Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Figure 5.17: RNA Peaks Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. . . . . . . . . . . . 165 Figure 5.18: RNA Peaks Landscape Visualization. This shows 1,000 RNA landscapes starting at peaks. . . . . . . . . . . . . . . . . . . . . . . . . . 166 Figure 5.19: NK Random Points Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . 168 xviii Figure 5.20: NK Random Points Exponential Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Figure 5.21: Avida Random Points Evolutionary Trajectory Visualization. This shows 1,000 Avida collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . 170 Figure 5.22: RNA Random Points Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 171 Figure 5.23: RNA Random Points Exponential Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Figure 5.24: NK Peaks Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . . . . . 173 Figure 5.25: NK Peaks Exponential Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . 174 Figure 5.26: Avida Peaks Evolutionary Trajectory Visualization. This shows 1,000 Avida collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . . . 175 Figure 5.27: RNA Peaks Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . . . . . . . . . 176 Figure 5.28: RNA Peaks Exponential Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape . . 177 Figure 5.29: NK Random Points Landscape and Trajectory Fitness. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. . . . . . . . . . . . . . . . . . . . . . . . . . 179 xix Figure 5.30: NK Random Points Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. Here, selection strength results in a stronger improvement bias in the trajectories relative to the NK Points Trajectory Visualization. . . . . . . . . . . . . . . . . . . . . . . . . 180 Figure 5.31: Avida Random Points Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the medianminimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The medianmedian is solid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Figure 5.32: RNA Random Points Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the medianminimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The medianmedian is solid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 Figure 5.33: RNA Random Points Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. Here, the higher selection strength results in a stronger improvement bias in the trajectories relative to the RNA Points Trajectory Visualization. . . . . . . . . . . . . . . . . . . . . . 183 Figure 5.34: NK Peaks Landscape and Trajectory Fitness. The landscape is in blue and the trajectories are in green. The dotted lines are the medianminimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The medianmedian is solid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Figure 5.35: NK Peaks Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. . . . . . . . . . . . . . . . . . . . . . . . . . 186 Figure 5.36: Avida Peaks Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid.187 xx Figure 5.37: RNA Peaks Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid.188 Figure 5.38: RNA Peaks Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. . . . . . . . . . . . . . . . . . . . . . . . . . 189 xxi Chapter 1 Introduction The evolution of populations of organisms is a topic of great interest; long before Darwin described the finches of the Galapagos, scientists had observed how well organisms are adapted to their specific environments. Organisms incorporate information about the environment into their genetic blueprints to create adaptive physical and behavioral characteristics. A greater understanding of how populations of organisms explore their genetic landscapes and accumulate information can help us better understand the process of evolution in natural populations. A deeper understanding of fitness landscapes and the processes that lead to effective solutions will also benefit evolutionary computation. Fitness sharing, novelty search, and other similar techniques are all built around finding better solutions by avoiding the classical pitfall of climbing a fitness gradient until permanently stuck in a local optimum. Understanding how solutions cluster and populations traverse paths between optima can help lead to improved search techniques. 1 1.1 Chapter Overview Biological complexity has been described in [Adami, 2002] as a measure of how much information an organism stores about its environment in its genome. It is theorized that natural selection increases complexity in a static single niche environment. Sequence complexity is frequently measured by subtracting the summation of per-site entropies from the maximum possible entropy. Entropy is a measure of how ‘random’ a value is when drawn from a distribution as rigorously defined in [Shannon, 1948] and applied to genomes in [Schneider et al., 1986] and [Adami et al., 2000]. Sequence complexity tends to reflect other measures of complexity, such as structural and functional complexity, but has the advantage that it is straightforward to evaluate using accessible population measures. Previous work in measuring complexity commonly overlooks the biases imposed by common descent. Coalescent theory predicts that an asexual population will often have a relatively recent common ancestor. As a consequence of this fact, many estimates of biological complexity often attribute a lack of variation due to common descent as being due to selective pressure. In Chapter 2, I explore some of the limits of existing biological information estimates, examine commonly used correction methods, show how they fail to correct for the problem, propose a correction that accounts for the effect of common descent, and demonstrate the utility of this correction experimentally. Fitness landscapes have been widely used to conceptualize the relationship between the genotypes and their associated phenotypes. The structure of fitness landscapes in biological organisms is still largely unknown and varies immensely from environment to environment. Even so, these fitness landscapes carry important consequences for evolution. For example, in the context of protein folding, peaks tend to exhibit strong local clustering, suggesting that a stable configuration will often have other stable configurations in the neighborhood. This clustering would allow evolution to have an easier time to find new stable configurations than if these variants were randomly distributed throughout the genotype space. Thus, a 2 concrete understanding of peak distribution in the fitness landscape would be relevant to genetics and medicine since one might better predict when it is productive to restrict the search for fit sequences to the immediate area around already known configurations. Chapter 3 exhaustively examines peak distributions in three very different types of landscapes—NK, Avida, and RNA, each with length 18 genomes and four possible alleles at each site, which is a scale beyond any that have been systematically explored to date. Peaks can attract evolutionary trajectories from nearby regions of the fitness landscape. When an evolutionary trajectory has reached a peak, it can be challenging to leave that peak again since it requires unlikely valley crossings or simultaneous mutations to escape. This gravitational quality of peaks is often characterized as each peak exerting a ‘basin of attraction’ on the nearby genotypes. [Kauffman, 1993] and [Ochoa et al., 2008] have examined basin size experimentally and find that basin size increases exponentially with the fitness of a peak in NK landscapes. The strength of this relationship is naturally dependent upon the amount of epistasis in the landscape; epistatic environments tend to be more rugged and the number of local maxima tends to increase with increasing epistasis. Basins may also exist on different evolutionary time scales; trivially, given an infinite amount of time and the possibility of simultaneous mutations, every evolutionary trajectory will eventually find the global maximum. In Chapter 4, I use my three model landscapes to experimentally investigate the nature of basins of attraction and their relationship to peaks. When populations evolve, the fine structure of the landscape is important; the mutational neighborhood of a genotype naturally determines what innovations are easily reachable and which are not. Previous experimental work has often focused on a particular gene complex and the evolutionary trajectories to get there. For instance, [Poelwijk et al., 2007] looked at adaptation of the bacterial β-lactamase to cefotaxime, to which the bacteria had not previously had exposure. In this case, the five mutations conferring antibiotic resistance were already known, but they were able to measure intermediates and found that it was necessary for mutational combinations to traverse a valley to reach the resistant solution. 3 In Chapter 5, I focus on visualization techniques for fitness landscapes that allow us to look exhaustively at the neighborhood of points as well as repeated trajectories originating from these points. In each of the three study landscapes (NK, Avida, RNA), I select 2,000 neighborhoods to study, half around random points and half around peaks. I visualize each study neighborhood up to five mutations out and compare these results to the first five mutations that occur in real evolutionary trajectories. As such, I compare all possible routes with the realized evolutionary routes. In RNA and NK landscapes, I further compare these results to evolutionary trajectories where the strength of selection is increased dramatically, demonstrating a qualitatively different result. 1.2 Contributions The contributions of this work include: (1) A demonstration of the role that common descent plays in biasing genomic estimates of entropy and complexity, along with a correction method to compensate for common descent, which I validate experimentally. (2) An exhaustive exploration and structural analysis of instantiations of three very different landscapes: NK, Avida, and RNA, that are each are of genome-length 18 with 4 alleles for a total of 68 billion genotypes, allowing us to identify both generalizations about these landscapes as well as distinctions. In all three landscapes, I demonstrate that peaks of higher fitness tend to have higher fitness peaks in their neighborhood. I also found that peaks are connected in the Avida and NK landscapes, whereas in the RNA landscape, peaks tend to be isolated. Similarly, in the Avida and NK landscapes, autocorrelation between peaks with regard to distance is high at low distances and eventually becomes negative. The RNA landscape has the interesting result that the autocorrelation becomes positive again at distance 18, which may point to the complement of the sequence being stable. (3) A novel structural method for measuring basins of attraction based on the 4 Page Rank algorithm, allowing us to confirm prior results about NK landscapes in a more thorough way. I found that as in previous work, basin size correlated exponentially with fitness and that increasing K increased the size of the average basin while decreasing the size of the largest basins. I also applied my new method to NKp variants of NK landscapes with neutrality and detected that neutral networks can retain significant numbers of trajectories over long periods. Finally, I linked peaks with two-mutant valley crossings and showed that this increases the strength of the exponential relationship between fitness and basin size. (4) An identification of basin structure obtained by experimental evolution and analysis of the relationship between peaks and basins. There was an exponential relationship between basin size and fitness in the NK and RNA landscapes, but not in the Avida landscape. I further show that the final fate of an evolutionary trajectory had very little to do with its starting point—trajectories in all three landscapes ended at a distance from the origin similar to the average distance of all genotypes. Additionally, I found that the concept of peaks play an important role in evolutionary trajectories—in all three landscapes, a significant proportion of trajectories were in the vicinity of a peak. (5) A new visualization technique to show landscape structure within a neighborhood and a comparison of possible routes to realized routes in both random points and peaks. In order to achieve the contributions above, these analyses were performed at a scale exceeding that previously systematically undertaken. The data alone encompasses over 250 GB, even with compression, hashing, and binary encoding schemes. High performance clusters and 100,000 hours of computing time went into collecting and analyzing this data. 5 Chapter 2 Estimating Complexity in Finite Populations 2.1 Motivation Various methods have been proposed to measure biological complexity. Some complexity estimates focus on a mathematical definition of complexity, like Kolmogorov Complexity, which measures complexity by how much a sequence can be compressed. By this measure, a completely random sequence would have maximal complexity, as it could not be compressed at all. However, as in [Adami, 2002], from a biological perspective, I am interested primarily in the information the genome contains about the environment as a measure of complexity. Other methods have been proposed to measure biological complexity as a proxy for adaptation to an environment, as well as for understanding how many sites actively convey an adaptive benefit, versus those that are selectively neutral. I focus on methods that measure complexity without additional manipulations, as this is more tractable in real populations, but is subject to biases, described further below. For the purposes of this work, I will be focusing on ideas that treat the genome as 6 an information channel, with noise produced by mutation and reduced by selection. These ideas operate on the assumption that natural selection in a single niche environment will fix adaptive mutations, while permitting non-informative sites to drift randomly, which allows us to search for sites that contain information by examining genomic commonalities within a population. A flaw of the analysis is linkage—it is possible for neutral or even detrimental mutations to hitchhike along with other mutations that are sufficiently adaptive, that they cause a sweep. This is recognized for instance in [Adami et al., 2000] as adaptive sweeps cause temporary spikes in apparent complexity, but these spikes are clearly spurious. Asexual populations, ignoring mechanisms such as horizontal gene transfer, necessarily contain maximal linkage. Sweeps thus result in distortions during which the diversity of the population is reduced. Likewise, common descent would be expected to cause similar effects; any population will have a most recent common ancestor. Therefore, some of the variation in sites, and therefore entropy measured, can be attributed to this process. Estimates for biological information do not currently take into account neutral processes such as common descent. A large confounding factor in the use of such complexity estimates is that their theoretical assumption of infinite population does not hold for natural populations. If the populations are large enough, statistically, such an assumption might introduce relatively little error, but for those where the population size is greatly dwarfed by the genotypic space (as occurs in virtually every natural population), there will be some distortion driven by coalescence and neutral processes. At the core, this is my primary claim; that given an asexual population with a common fitness function, the population itself is biased relative to the fitness landscape. This means that not all of the traits expressed are necessarily of adaptive value. It is unknown, however, what that limit may be, and one of the goals of this work is to explore this phenomenon further. This phenomenon of complexity being overestimated because of common descent has not been generally examined in the literature. Primarily, complexity is not itself a directly useful metric to many practitioners, whereas mutual information can provide important cues as to 7 identifying functionally important or related sites, which can in turn yield useful targets for drugs or therapy. However, it is still important to those who want to quantify the information that an organism might hold about its environment. Because of common descent, I would expect to see a lack of variation on many sites for a population with a large enough genome. Commonly-used information measures overestimate the amount of information in the genome, where the information content or lack of entropy is assumed to be driven by natural selection. But in fact, it may be constrained primarily by √ neutral processes. For a population of size 10000, there may be variation on only 10000 = 100 sites at a time, following work by [Zhang et al., 1990] on the diffusion of random walkers in space. This does not imply that the remaining sites are being selected for or in any way contribute directly to the fitness or complexity of an organism. Naturally, this bias would be potentially problematic for complexity estimation. One of the goals of this work is to investigate the strength of this effect and to ultimately compensate for it. 2.2 Related Work & Mathematical Theory In this section, I discuss related work beginning with early work in information theory and then introducing biological complexity concepts, before delving into common descent. These techniques are sensitive to finite sample sizes, and I discuss common correction mechanisms. Finally, I discuss concepts in neutral theory, linkage, coalescence, and models of evolution. 2.2.1 Information Theory & Complexity Information theory was first developed by Shannon in [Shannon, 1948], in the context of communication theory. At the time, relatively little was known about the limits of communication channels with regards to transmitting messages across noisy channels, although there was some work by Nyquist and Hartley, both of whom heavily influenced Shannon. Figure 2.1 shows Shannon’s abstraction of a communication system. The information 8 Figure 2.1: Figure 1 from [Shannon, 1948] demonstrating a conceptualization of a communication system. source contains a message, whether it be a sequence of characters as in a written message or a function encoding such as a function description or parameters. The transmitter is responsible for wrapping the information source into a transmittable signal. On the other end, the receiver decodes the transmitted signal into something meaningful for the destination which is the ultimate recipient of the message. In between the transmitter and the receiver is the channel, which can be any communication medium, including wires and radio signals. The channel is subject to noise and may cause part of the message to become corrupted. Shannon represented discrete information sources as a Markov process, with the Markov property that each state is conditionally independent of all history except the immediate predecessor state. He looked at the uncertainty in predicting the next symbol from the point of view of the receiver. If the symbol is perfectly constrained by the current state, there should be no uncertainty; and therefore the transmission of the next symbol provides no ‘surprise’. The surprise is highest when the chance of observing each symbol in the alphabet is equiprobable. This basic concept was used by Shannon to define a formula for entropy as in equation 2.1. 9 N H(X) = − pi log2 pi (2.1) i=1 X is a random variable with N possible states in Equation 2.1. The use of base 2 assigns units of bits to the uncertainty; another common unit is mers which uses the same base as the alphabet size. This function also corresponds to the average number of yes-no questions that would be necessary to ascertain the state of a random variable. Shannon then defined the joint entropy of a system with two random variables X and Y, denoted H(X,Y). This measures the uncertainty over each of the N · M possible states for {X, Y }. This can be seen in equation 2.2. H(X, Y ) = − p{x,y} log p{x,y} (2.2) {x,y}∈{1,...,N }×{1,...,M } The conditional entropy of X given Y then is a measure of the uncertainty of Y given the knowledge of X. This is equal to the difference of the joint entropy of X and Y and the entropy of Y. This is defined in equation 2.3. H(X|Y ) = H(X, Y ) − H(Y ) (2.3) From this came the concept of Shannon Information, which describes the information gained from knowing random variable X about the random variable Y. This is often expressed I(X;Y) and may be seen in equation 2.4. I(X; Y ) = H(X) − H(X|Y ) = H(X) + H(Y ) − H(X, Y ) 10 (2.4) 2.2.2 Biological Complexity Schneider in [Schneider et al., 1986] and others have suggested that the information content of a site can be quantified in aligned sequences as in equation 2.5. A I = −L log A − pi log pi (2.5) i=1 The basic idea underlying this equation is that in an infinitely large population, a site with no selective advantage would be predicted to have frequencies of 1 A if there are A candidates for that site. Therefore any difference between the site’s entropy and one determined by a completely uniform and random assumption should reflect a measure about the information contained in the population about the genetic landscape. [Adami and Cerf, 2000] characterize this assumption as ‘invoking the principle of insufficient reason’. This approach is used by [Adami et al., 2000] to obtain estimates of complexity over whole genomes by summing the per-site information. It is also the approach I will take for this proposal. It is also worth noting that the complexity measure provided by [Adami et al., 2000] is usually an underestimate of the true complexity. This understimation is caused because estimate subtracts off the entropy between each site individually without regard to the mutual information between sites. One correction, as provided by [Strelioff et al., 2010] and in [Adami, 2004] is to add the mutual information between sites back for an improved estimate. This 2nd-order correction of course still cannot compensate for the interactions between three or more sites, and may still contain error in either direction. There also exist many alternative ways of estimating biological complexity; for instance, [Huang et al., 2004] examines a single genome and, by classifying the one- and two-point mutants around it as beneficial or deleterious, comes up with an estimate of the complexity of a single genome. This is robust to phenomena such as neutral hitchhiking which emerge during sweeps in population-based measures, but by its nature only involves a single genotype 11 and a potentially expensive analysis of the local landscape. This may be intractable in natural organisms, since even in computational systems this can become challenging. For instance, for an organism of length 100 with 28 possible instructions per site, there are 100 C2 272 = 3, 608, 550 double mutants. There are also other measures of complexity—the functional information measure proposed by [Hazen et al., 2007], or the usage of the ev model to study the accumulation of information and estimate rates of evolution in the context of coevolving binding sites in [Schneider, 2000] or what [Weinberger, 2002] calls pragmatic information. For this work, I focus on the simple definition in [Adami et al., 2000]. 2.2.3 Entropy Correction It is well known that when N << LA , where N is the finite population size, L, the length of the genome, and A the number of alleles, that the entropy is poorly estimated since the population size cannot adequately cover the genotypic space. This limitation has spawned a number of estimators that seek to correct for the issue of biased sampling. The most basic of these is the Maximum Likelihood Estimator (MLE) and can be seen in Equation 2.6. ˆ H(θ) =− θp log θp (2.6) p∈P where θp = pn N The MLE Estimator does poorly in the N << LA case since it will estimate many probabilities at zero as a population can cover only a tiny fraction of possible genotypes. Another common family of approaches is Bayesian Estimators. [Minka, 2000] The basic idea is to estimate the probability of the entropy using a Bayesian prior, often the Dirichlet distribution. As with Bayesian approaches in general, the choice of an unbiased prior distri12 bution is nontrivial and a considerable number of different priors have been proposed, with varying results. This can be seen in Equation 2.7 ˆ H(θ) =− θp log θp (2.7) p∈P where θp = pn +α , N +α|n| where n is the number of categories and α is a parameter determining the prior. One frequently used form is the Laplace Estimator: θp = pn +1 . N +n More advanced estimators have also been developed; for instance, [Nemenman et al., 2002] improves on methods that use prior Dirichlet uniforms and instead tries to estimate the characteristics of the distribution from looking at coincidental samples. [Hausser and Strimmer, 2009] also discusses the efficacies of various methods of entropy estimation, including the NSB estimator from [Nemenman et al., 2002], maximum likelihood, the Chao-Shen estimator proposed by [Chao and Shen, 2003], and the James-Stein estimator under a variety of distributions and sampling counts. [Paninski, 2003] also compiled several solutions to correct bias in entropy inherent in using MLE estimation methods. There have also been attempts to correct for bias in sample selection, which assume there is some systematic differences between the sample to be studied and the population it is drawn from. Examples include [Dudk et al., 2005] who propose three approaches to correct bias, motivated partly by the problem of habitat estimation. [Dudk et al., 2005] points out that sampling distribution is driven by human factors driving ease of access such as roads, proximity of cities, and rivers, whereas species distribution is likely to be at least somewhat independent of those same factors. As such, these biases may cause species to be systematically oversampled in the case of species that prefer such environments or undersampled in the case of species that do not, which may easily lead to estimation errors for statistics such as population size. 13 I seek to show in the course of this work that the entropy distortion caused by common descent can, under many conditions, dwarf the error caused by finite sampling bias. This bias from common descent thus has implications for estimation of information in genomic contexts. Further, I demonstrate that this can be a proxy for estimating neutrality. 2.2.4 Neutral Theory & Evolution In his seminal work on neutral theory, [Kimura, 1968] observed that the rate of molecular evolution seemed high compared to what would be predicted from the deleterious effects of most gene mutations. This led Kimura to propose that most molecular mutations may actually be neutral in nature and that the force of random genetic drift is more important than previously thought. The theory generated significant interest from researchers over following years and provided an influential null-model of evolution; focus was directed to whether neutral evolution could explain observed outcomes, with the goal of distinguishing functional parts of the genome and those under active selection from those under no selection, and understanding what types of selection may be at play. The relevance of this theory to complexity estimates is that neutral evolution allows multiple mutations to persist at a site, thereby increasing entropy at that site and decreasing information estimates. The null hypothesis for information estimates used by [Adami and Cerf, 2000] utilizes the assumption that given an infinite population, neutral evolution would cause a non-contributing site’s fitness to be maximally divergent. For a given neutral site, one would expect 0 information, since Hmax = log A = H and therefore the complexity estimate C = Hmax − H = 0. I further investigate what this means for finite cases later in this work. 2.2.5 Coalescence If one looks at the population as a whole and traces the phylogenetic tree back in time, eventually the population ‘coalesces’ at a common ancestral organism in asexual populations. This important fact underlies the foundation of coalescent theory. There are some well-known 14 results about the statistical properties of the distribution and timeline of how individuals are related. Understanding and incorporating these results will allow us to reexamine our complexity estimates, providing clues to how many sites are fixed by chance and relatedness as compared to sites that are fixed by selection. [Derrida and Peliti, 1991] examined the process of evolution in flat fitness landscapes. Evolution on flat fitness landscapes is important because it helps shed light on variation caused by drift and coalescent processes. This better informs our understanding about the process of evolution in the presence of selection. Following their derivations, when the population size stays constant, each individual leaves one offspring on average, which implies a Poisson distribution of mean 1, at least where the population size M >> 1. The probability that an individual leaves behind m offspring is shown in Equation 2.8. pm = e−1 m! (2.8) [Derrida and Bessis, 1988] showed that, given the definition that two individuals belong to the same family if they have a common ancestor within a specified time window, the distribution of the sizes of the families is equiprobable given the number of families that currently exist in flat fitness landscapes So, if there are K families, all possible ways to allocate the individuals of the population into those K families are equally likely. 2.2.6 Linkage Disequilibrium Linkage disequilibrium (LD) refers to a non-independent association of alleles at different loci. LD may be driven by selection; for instance, a maladaptive mutation at a particular site may be consistently remedied by a compensatory change at another site. LD may also be driven by shared inheritance. It is frequently used in terms of studying sexual populations, since alleles that are physically close to each other on a genome are less likely to be broken up by recombination. In asexual populations, there is maximal linkage disequilibrium, since 15 there is no recombination to break up associations between alleles, which can result in the accumulation of neutral or even deleterious mutations via hitchhiking. Following [Balding, 2006], there is no direct quantitative way to calculate LD, but three well-known proxy measures are D, D and r2 . The D statistic is a measure of the deviation of the observed frequencies of a genotype from the expected quantity. An example of such a calculation for a simple two-locus two-allele model can be found in equation 2.9, where Xi1 ,j1 is the count of genomes with alleles i1 and j1 at loci i and j and pi1 and pj1 are the probabilities of the respective alleles appearing at those sites. The D’ statistic rescales D by dividing it by its theoretical maximum. D is again shown for the two locus, two allele case in equation 2.10. However, as [Balding, 2006] notes, these statistics are more likely to take on extreme values when allele frequency is low and are often not directly comparable between loci unless the frequency distribution is similar. The r2 statistic is the correlation coefficient between the pair of loci and can be seen for instance in Equation 2.11. These measures all use two loci; they can be summarized for a region through simple averaging, or through more complex techniques such as the LD map proposed by [Maniatis et al., 2002] which uses an exponential decay function of the local distance to fit a summary LD estimate. D = Xi1 ,j1 − pi1 pj1 D =      D min pi1 pj1 ,pi2 pj2 if D > 0 D min pi1 pj2 ,pi2 pj1 if D < 0 D r2 = √ p i1 pi2 pj1 pj2 16 (2.9) (2.10) (2.11) Figure 2.2: Figure 1A from [Wang and Lee, 2007] demonstrating background linkage disequilibrium. The covariation of mutation A and R in the sequence alignment is caused by common descent. This can be seen in the phylogenetic tree. A related concept is background linkage disequilibrium (BLD), which can be thought of as linkage disequilibrium that is driven primarily by inheritance. [Wang and Lee, 2007] outlined this problem, describing it as the problem of distinguishing covariation due to selection from covariation due to common descent. Their formulation can be seen in Figure 2.2. They investigated synonymous and nonsynonymous mutations and measured a low rate of linkage disequilibrium in the HIV virus that decays with site distance due to recombination events. [Korber et al., 1993] also examined mutual information between sites in the V3 loop of HIV and hypothesized that the covariance may be explained in part by a shared evolutionary heritage. [Bickel et al., 1996] continued the analysis of the data of [Korber et al., 1993] and proposed that it may be possible to compensate for the effects of shared ancestry by examining the similarity of sequences within suspected clades. [Service et al., 2001] estimated BLD in human genomes from a small subpopulation and found that although genetic distance strongly modulates BLD, its distribution is nonetheless nonuniform. This kind of analysis is potentially significant for researchers since it can help to identify functional sites that could be targets for drug development. 2.2.7 Models of Evolution There are several models that have been developed in order to study evolutionary dynamics, including the Wright-Fisher model [Wright, 1931, Fisher, 1930] and the Moran model, among 17 others. However, for the purposes of this work, I focus on sequence evolution in asexual organisms. As such, I utilize the quasispecies model, also known as the Eigen model. I also discuss common assumptions used in mathematical simplifications and their use in deriving mathematical results about evolution in these domains. 2.2.7.1 Eigen Model The discrete-time Eigen model, developed in [Eigen, 1971] and [Eigen and Schuster, 1977] and further applied in [Eigen et al., 1988] and [Eigen et al., 1989] is a discrete formulation of the quasispecies model devised in order to study sequence evolution. The quasispecies concept relates to the idea that given that a population is in equilibrium in the presence of mutation and selection, there may be stable molecules of lower fitness maintained by mutation, particularly if they are close to a local molecule of high fitness. The maintenance of such quasispecies would have an effect on information estimates utilizing sequence information. This equation is reproduced in Equation 2.12. Here Qij is the probability of mutating between genotype j and genotype i, Ai is the fitness-driven replication rate of genotype i, and xi is the frequency of genotype i in the population. There has been some debate about whether the quasispecies models and more traditional population genetics models are distinct, but some, such as [Wilke, 2005] believe that they are in fact studying the same phenomena and are not contradictory. The Eigen model is the primary model I use for experimental evolution. Previous work such as [Adami et al., 1995] and [Bedau and Brown, 1999] have examined frequency distributions in instantiations of the quasispecies model. [van Nimwegen et al., 1999] explored the neutral evolution of mutational robustness in neutral plateaus. x˙ i (t) = [Ai Qii − E(t)]xi (t) + Aj Qij xj (t) j=i 18 (2.12) 2.2.7.2 Infinite Alleles Model The infinite alleles model relies on the idea that genes tend to be composed of large segments of nucleotides [Hartl and Clark, 2006]. Thus, any new mutation is likely to create a new allele that does not exist in the population. [Ewens, 1972] derived the result that the expected number of alleles k is related to the sample size, n. This can be seen in Equation 2.13. E[k] = 1 + θ θ θ + + ... θ+1 θ+2 θ+n−1 (2.13) The relevance of this particular model to information estimates may be in analyzing real data where the maximum number of alleles is not known. This expected number of alleles then can be used as a proxy for the amount of information contained within a gene, since this provides us with an estimate of how many alleles there ought to be in the complete absence of selection. The main mathematical advantage in this assumption is simply that every mutation is necessarily novel, so accounting for repeated mutations or back mutations can be disregarded. 2.2.7.3 Infinite Sites Model The infinite sites model considers an infinite sequence of loci. With an infinite-length genome, any novel mutation necessarily alters a different site. The infinite sites model is a popular choice for studying DNA sequence evolution at least partially because it offers certain mathematically attractive qualities. For instance, the probability of the same site being mutated twice is zero with infinite sites from which to pick, which means that any mutation may be treated as novel. There has been some relevant work focused on trying to find analytic formulations for the numbers of neutral sites. [Watterson, 1975] was one of the many early attempts to distinguish between alleles experiencing neutrality and those undergoing active selection. The Watterson estimator 19 estimates the amount of segregating sites that would be expected given a certain mutation rate in the infinite sites model. One of the applications of the Watterson estimator is to estimate the mutation rate given real data, as given in Equation 2.14. Here Sn is the number of segregating sites, θ is the population mutation rate, and hn is the harmonic series. For instance, in [Durrett, 2008], an example is given with data from from Ward et al., 1991, in which 63 human mitochondrial sequences were sequenced, which yielded a count of 26 segregating sites. It was then possible to solve for θ in the equation, which works out to be 26 h6 3 = 5.5173. Dividing this by 360 sequences gives a population mutation rate per base of 0.0153. n−1 E[Sn ] = θhn , where hn = i=1 1 i (2.14) [Watterson, 1977] also examined distributions with the goal of trying to distinguish neutrality from heterosis. Watterson uses the order-statistics distribution. This defines the r most numerous allele frequencies, ranked in decreasing order. Specifically, Equation 2.1.12 from [Watterson, 1977] can be seen in Equation 2.15. He also uses the truncated distribution, which is the same as above, except with only frequencies above a certain threshold considered. r x1 ≥ x2 ≥ x3 ≥ ... ≥ xr ≥ 0, with (xi ) ≥ 1 (2.15) i=1 Estimators such as the Watterson Estimator provide a compelling theoretical framework for assessing the expected number and variance of segregating sites in a population under the presence of no selection. These estimators provide a useful baseline for investigating fixation and distinguishing between neutral processes and the influence of natural selection. For example, suppose using the estimator, I predict that there should be X segregating sites in the population. Instead, there are Y < X sites. It may be possible to conclude then within some degree of certainty that several sites are maintained by active pressure of selection. 20 However there are potential pitfalls here as well—for instance, the maintenance of two or more coexisting subtypes may cause systematic divergence across an array of sites. This divergence would have the effect of increasing the observed number of segregating sites. This scenario similarly confounds Shannon-Information-based complexity estimates, but may be detectable with an analysis of site frequency distributions. [Tajima, 1989] built on the work by Watterson and suggested that the difference between the estimate of mutation rate derived from segregating sites and an estimate of the mutation rate, Θ, derived from the average number of pairwise mismatches, Π. At steady state, the average number of pairwise mismatches follows the direct formula: Equation 2.16 E[Π] = Θ (2.16) This provides a D statistic in Equation 2.17 D= Π− Sn hn Var(Π − (2.17) Sn ) hn [Fu and Li, 1993] also introduced a test statistic, G, that compares the difference of the two estimates of the mutation rate in order to measure the fit of a segment of genetic material to a neutral infinite-sites model at steady state. Here they utilize the structure of a genealogy without recombination, and divide the phylogenetic tree into internal and external branches, where external branches emerge from an internal node and terminate at a sequence, and internal branches connect internal nodes, which correspond to reconstructed ancestors. The expected time for the sum of the external branches is 4Ne where Ne is the effective population size. The significance of the external branches, following [Hartl and Clark, 2006] is that mutations along them result in a singleton nucleotide that is most likely not replicated anywhere else (by the infinite sites assumption). They produced the G statistic, which again 21 examines the difference of two estimates of Θ seen in Equation 2.18, where ηe is the number of mutations along external branches and ηi is the number of mutations along internal branches. [Fu, 1995] further investigated statistical properties of segregating sites by expanding on this idea of using the internal and external branches. He found, using the Wright-Fisher model, and that the expected number of mutations of size i,ζi can be given by Equation 2.19. This expectation holds true for the infinite alleles model and holds true asymptotically for the infinite sites model. In addition, he derived values for the variance of the size of the expected number of mutations and the covariance between expected numbers of mutations of different sizes. νe − ηi hn −1 ηe − ηi hn −1 1 E[ζi ] = θ i (2.18) (2.19) [Zhang et al., 1990] examines populations as a collection of N random walkers (organisms) in a d-dimensional space (genetic lattice). They showed that over time, with a low mutation rate, the populations tended to congregate in small regions of genotypic space, even given an initial uniform distribution. They found that the population evolving on a d-dimensional √ lattice in equilibrium would have an actual dimension proportional to N where N is the size of the population. If N << D, then this is in fact a significant divergence from a uniform distribution; in other words, one expects to find convergence within a relatively small area, even with a complete absence of selection. 2.3 Approach In order to illustrate the influence of coalescent processes on complexity estimates in finite populations, I conducted a series of experiments. I did not make any mathematical assump22 tions like the infinite sites / infinite alleles approaches outlined above, and instead rely on experimental data. 2.3.1 Methods The primary experimental model I am using to evaluate the effects of neutral processes on information and complexity estimates is a string model based on the discrete Eigen model. In this model, a population with N individuals is maintained, each with L loci, each of which can take on one of A values. Every generation, N new individuals are sampled with replacement as parents from the population, proportionally to fitness, and mutation is applied to their offspring. This model can be used to study both genes with alleles abstracted at a single locus and more complex phenomena such as genes with fitness driven by contributions from multiple sites. To isolate the baseline effects of population dynamics on calculated information, it is necessary to understand how these dynamics work in the absence of selection. Thus, for this work, I studied first flat fitness landscapes where mutations provide neither fitness benefit nor penalty, and all individuals have the same fitness and the same probability of producing offspring— N1 where N is the population size. This landscape would be expected to produce the most extreme complexity estimate distortion due to shared descent. I used Gibbs sampling here, with an interval of 100 generations between samples for independence. The populations start from a common ancestor and 10,000 generations executed to allow the starting population to fan out from a single starting genotype with all zeros. The burn-in time is sufficient time to allow the population to no longer have any discernible bias towards zero, over other alleles. I sample the H1 estimate of complexity as in [Adami and Cerf, 2000] at every time step. I did this for A = 4 and A = 20 alleles, consistent with DNA/RNA and amino acids in proteins. 23 2.3.2 Common Descent Biases Complexity Estimates 2.3.2.1 Rank Order Analysis Similar to the analysis done in [Watterson, 1977], it is possible at any given time point at a site to rank the count of loci in numerical order of appearance. There appears to be a fairly stable distribution that is roughly length-independent, but that does vary with mutation rate and population size. For length 100, the rank-order distribution can be seen in Figure 2.3. The same plots can be seen for N = 1000 in Figure 2.4. These figures of rank-order distribution show what the averaged site frequency order statistics look like through time, although it is crucial to note that individual samples do in fact deviate from this distribution. Figure 2.3: Frequency rank order for population size of 100. On the x-axis for each plot is length. At each sample in each run, allele frequency per site is ranked and averaged. This is then averaged across all samples per run to condense into a single figure per run. This shows that on average, even in the absence of selection, an imbalance exists on sites due to common descent. Colors come from rank ordering, e.g. in these plots the frequency of the most common allele appears in lime green at the top. (a) Mutation rate 0.001 (b) Mutation rate 0.01 (c) Mutation rate 0.1 Both increasing population size and increasing mutation rate will, in turn, increase the diversity of the visible alleles at each site and this, in turn, increases the entropy. Although this aggregate distribution is stable, this is a summary result; sites are not actually in equilibrium on average as one would expect in a regime without selection. Measurements can and do vary from time slice to time slice and sometimes drastically. 24 Figure 2.4: Frequency rank order for population size of 1000. On the x-axis for each plot is length. At each sample in each run, allele frequency per site is ranked and averaged. This is then averaged across all samples per run to condense into a single figure per run. This shows that on average, even in the absence of selection, an imbalance exists on sites due to common descent. This effect is reduced, compared to the smaller population size of 100 in Figure 2.3 (a) Mutation rate 0.001 (b) Mutation rate 0.01 25 (c) Mutation rate 0.1 2.3.2.2 Complexity Estimates My results show that in the absence of selection, this estimate still suggests that there is a considerable amount of complexity depending upon the experimental parameters, as might be expected from the divergence from uniform visible in the rank-order distributions. Figure 2.5 contains complexity distributions for three experimental conditions with increasing mutation rate from left to right. These experiments were conducted with four possible alleles per locus and fixed length, for many different values of length. Length plays a directly linear role in the complexity estimates, leading to a virtually constant per-site estimate, likely due to the fully asexual descent in these populations; in other words linkage is complete between any pair of sites. Figure 2.5: Complexity Estimates for population size of 100 in the presence of no selection. On the x-axis is length and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The sample size increases with increasing length, leading to less uncertainty over the estimate. (a) Mutation rate 0.001 (b) Mutation rate 0.01 (c) Mutation rate 0.1 The complexity varies by mutation rate. This result is not surprising, since with a lower mutation rate, there is less of an opportunity for sites to diverge after the most recent coalescent event. The result is that many sites are fixed. As the mutation rate increases, there is more diversity at each site, which reduces the entropy, which increases the complexity estimate. The complexity estimate can be sizable; for instance, the maximum entropy for 26 length L is log2 4 = 2 bits per site, or log4 4 = 1 mer per site. Thus, with a mutation rate 0.001, nearly 80% of the maximum complexity estimate would appear, on average, even in the absence of any selective pressure. This ‘spurious complexity’ decreases as the mutation rate increases, with a mutation rate of 0.1 per site yielding a less than 5% total estimate of maximum complexity. The effect also appears to be linear with respect to the L parameter within each condition. This is consistent with the theoretical literature, since the number of segregating sites is tied to θ, the whole genome mutation rate. Since there is a per-site contribution, the whole genome mutation rate is driven by the length of the genome. It is also interesting to contrast the results shown in Figure 2.5 and Figure 2.6. The information estimates in the population size of 1000 are noticeably smaller compared with those with the population size of 100. In a larger population, the time from the most recent coalescence will be longer on average, and the population has thus had more time to diverge relative to this event. This increases the entropy and thereby decreases the complexity estimate, making it closer to the true complexity of 0 when no selection is present. Figure 2.6: Complexity Estimates for population size of 1000 in the presence of no selection. On the x-axis is length and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The sample size increases with increasing length, leading to less uncertainty over the estimate. (a) Mutation rate 0.001 (b) Mutation rate 0.01 27 (c) Mutation rate 0.1 The complexity graph from evolving a population of 3600 organisms [Adami et al., 2000] of length 100 can be seen in Figure 2.7. There are some differences between the models; in this work, simultaneous replacement per generation is used, whereas the Avida model used by [Adami et al., 2000] tends to replace individuals continuously. Figure 2.7: Comparison of complexity estimates from [Adami et al., 2000] and expected complexity in mers from neutrality and resulting rank order frequency distribution for corresponding neutral landscapes of length 100 with 28 bases per site. In 2.7a, the complexity is measured over time, from [Adami et al., 2000]. In 2.7b is the distribution of information estimates that occur in a size 3600 population with 28 alleles with no selection. In 2.7c is the distribution of rank ordered alleles, which 1 clearly deviates from the uniform 28 . (a) Complexity over [Adami et al., 2000] time, (b) complexity, neutral (c) rank order, neutral This result implies that a completely neutral distribution would register roughly 6.6 mers of information. The complexity estimates in [Adami et al., 2000] are well in excess of that figure, but nonetheless this could be as much as 10% of the final complexity estimate in an experimental model with a fairly robust population size. Thus, it does suggest that there is room for improvement in refining some of these complexity measures, even in relatively large populations. 2.4 Finite Size Correction As discussed previously in this chapter, there are a variety of entropy estimators that attempt to control for finite population size. I demonstrate in this section that the effect observed is 28 not simply due to choice of the naive estimator. The naive estimator is always an underestimate of the total amount of entropy present at each site [Sch¨ urmann and Grassberger, 2002]. Therefore, I look at the Laplace estimator (which is likely to be an overestimate) and the Chao-Shen estimator. The key takeaway here is that using a better estimator is not going to substantially change the necessity for correcting for common descent. In order to demonstrate this problem, I first look at the complexity estimates for both four and twenty alleles to enable comparisons against RNA and Proteins respectively. Figure 2.8 and 2.9 show a cross section of the complexity estimates along with the true entropy (known a priori to be exactly 1 mer by the simple fact that there is no selection in the system). Figure 2.8: Per-site complexity estimates for neutral populations without selection of size 100 with 4 alleles per site for three different estimators. On the x-axis is mutation rate and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The line represents the ‘true’ entropy from the underlying distribution which should be 1 since there is no selection. The mutation rates vary from 0.0001 on the left to 1 on the right, with semilog scaling. (a) Naive Estimator (b) Laplace Estimator 29 (c) Chao-Shen Estimator Figure 2.9: Per-site complexity estimates for neutral populations without selection of size 100 with 20 alleles per site for three different estimators. On the x-axis is mutation rate and on the y-axis is information estimates. Each plot represents the distribution of information samples over a single run. The line represents the ‘true’ entropy from the underlying distribution which should be 1 since there is no selection. The mutation rates vary from 0.0001 on the left to 1 on the right, with semilog scaling. (a) Naive Estimator (b) Laplace Estimator 30 (c) Chao-Shen Estimator This pattern stays consistent across the entire sample set. Figures 2.10 and 2.11 demonstrate this. The error is less with the number of alleles Figure 2.10: Relative error compared to true entropy for neutral populations for three different estimators. On the x-axis is length and on the y-axis is information estimates. Each point represents the distribution of information sample means over a single run. (a) Naive Estimator (b) Laplace Estimator (c) Chao-Shen Estimator Figure 2.11: Relative error compared to true entropy for neutral populations for three different estimators with 20 alleles. On the x-axis is length and on the y-axis is information estimates. Each point represents the distribution of information sample means over a single run. (a) Naive Estimator (b) Laplace Estimator 31 (c) Chao-Shen Estimator The takeaway is that current entropy correction schemes do not correct for the issue of common descent. All three, unsurprisingly, do better as population size and mutation rate both increase. This means that the choice of entropy bias correction method is itself unlikely to help significantly in correcting for spurious complexity deriving from common descent and does not matter that much in this particular case as the vast majority of the error is not coming from entropy estimation error or simple biased sampling. The sample is always biased by selection. 2.5 Correcting for Common Descent In the previous section, I raise the question of how common descent affects results, examine entropy correction methods, and demonstrate that existing methods do not correct for common descent. In this section, I seek to correct calculations of entropy and information to control for common descent given a priori knowledge about the fitness landscape. For any population-based calculations, I must assume a single niche environment and a population at equilibrium. In order to determine the idealized dynamics for maximum and minimum entropy at a site, I define two types of sites, informative sites, at which selection is acting, and neutral sites, which are not directly subject to selection. I define p as the proportion of the genome containing neutral sites. I further designate Hn as cumulative entropy from all non-informative (neutral) sites and Hs as entropy from informative (selected) sites. Earlier in this chapter, I have seen that Hn will be non-maximal because of coalescence. And in finite populations, Hs will be nonzero, due to mutation-selection balance. I can then write total entropy as: Hm = p(Hn ) + (1 − p)(Hs ) (2.20) For my correction, I first must estimate p and then use the formula to estimate the amount 32 of entropy Hc attributable solely to neutral sites. Hc = p(Hn ) 2.5.0.3 (2.21) Methods To test this correction, I used a royal road landscape where every allele provides a selective factor of twice the fitness. Parameter sweeps provided us with the Hs estimate for use in the formula above. I analyzed both 4-allele and 20-allele test conditions. Next, I examined a landscape where precisely half of the sites were selected and half were completely neutral. This leads us to an expectation of L 2 log2 A bits of complexity for each genome, which in turn leads to an average per-site expected complexity of log2 A . 2 For each estimate, I used Gibbs sampling, again measuring the entropy every 100 generations. For all these results, I focused on the naive entropy estimate, since I previously established that other corrections do not correct for common descent. The actual value gives us Hm which is our measured estimate for the half-selected landscape. Finally, from my results before with completely flat landscapes, I have estimates for Hn . For each mutation rate & population size for both the 4-allele and 20-allele cases, there are three quantities: Hm , Hn , and Hs . I then apply Equation 2.20 to obtain an estimate for p and Equation 2.21 to obtain a new entropy. 2.5.1 Discussion The results of our correction are summarized in Figure 2.12 and Figure 2.13. 33 Figure 2.12: Entropy, Naive & Corrected, 4 Alleles 34 Figure 2.13: Entropy, Naive & Corrected, 20 Alleles 35 For both allele counts that I tested, some common trends emerge. The correction performs very well at low to moderate mutation rate <= 0.01 and corrects for the effect of common descent. Beyond that threshold, however, the correction begins to perform poorly—even doing worse than the naive estimate. It is likely that the core assumption behind this correction, namely that sites that are actively selected do not hold entropy, is being violated. Selected sites begin to hold entropy because it is difficult for evolution to stay on a fitness peak as mutation rate increases past the error threshold, thereby maintaining both variance and entropy. It is likely possible to correct for this phenomenon, but we do not attempt to do so in this work. In the real world, however, certain assumptions are not likely to hold—for instance a selected RNA site that may contain either a G or a C would constitute a selected site with nonzero entropy. Additionally, and perhaps more seriously, estimating Hs could be challenging since, of course, perfect fitness landscape information would not be available. It may also be possible to obtain an estimate for Hs using a known subset of the genome that is highly conserved. Hn , by contrast, can be obtained by virtue of simulation rather readily given allele count, mutation rate, and population size. Second, this particular analysis is specifically done in asexual populations—dynamics and numbers will be different in sexual populations, although a similar analytic approach should hold. In conclusion, the contributions of this work are a demonstration that complexity from shared descent exists and can be substantial, a technique to compensate for the effects of common descent when calculating entropy or complexity, and a demonstration of its usage on an evolving population. 36 Chapter 3 Fitness Landscapes– Peaks, Ridges, and Plateaus Fitness landscapes represent the underlying genetic structure on which populations evolve. As such, understanding this structure can help us better understand the process of evolution. More broadly, there is continued debate on the structure of fitness landscapes in biology, with consequences for how and if populations cross valleys. In this chapter, I present some of the related theory of fitness landscapes and demonstrate some preliminary data using the fitness landscapes from a theoretical NK landscape, a genetic programming Avida landscape, and an RNA folding landscape. Before a meaningful discussion of fitness landscapes can take place, the term fitness must first be defined. Unsurprisingly, fitness can mean different things in different contexts, and to different people. I adopt the following definition from [Kauffman, 1993] for Chapter 3. In future chapters, I tie my fitness concept to the replication rate of organisms for the purpose of studying evolutionary dynamics. Before continuing, I must clarify what I mean by a fitness landscape. For an evolutionary biologist, “fitness” applies principally to an entire organism. It has 37 components of fecundity, fertility, and other factors, leading to reproductive success (Crow and Kimura 1965, 1970; Ewens 1973). These include complex issues such as the frequency of each genotype variant in a region, and even the entire ecosystem with which each organism interacts (S. A. Levin 1978). Therefore, in the general context, it is difficult to assign a fitness to a gene or even to a genotype, since all these factors depend upon the other organisms in the population. For the purposes of the present chapter, I shall use the term “fitness landscape” in a much more restricted sense to refer to any well-defined property and its distribution across an ensemble. For example, the capacity of each protein in protein space to catalyze a specific reaction under specified conditions is, in principle, a well-specified property. The velocity of the reaction catalyzed by each protein can then be defined as the fitness of that protein. Then the distribution of velocities across the space of proteins constitutes the fitness landscape with respect to that defined function... 3.1 Fitness Landscapes The concept of a fitness landscape was first elucidated by Wright in [Wright, 1931] and [Wright, 1932]. It refers to the idea that genetic adaptations may confer a fitness benefit (or penalty) in a reliable way, most often due to adjustment to the environment, and that the possible neighbors that a given population may explore are constrained by the current location in genotypic space that the population occupies. This limitation is for the simple reason that offspring are more likely to have mutations into neighboring regions of the landscape, at least for reasonably low mutation rates where evolution could be studied. Fitness landscapes can be vast. As Wright points out in his seminal work, if there are 1000 genes, each with 10 possible alleles, there would be 101000 possible combinations of genes. If each of these combinations solely determined fitness, the fitness landscape would 38 Figure 3.1: Figure 2 from [Wright, 1932] with original caption. then also consist of 101000 mappings between genotypes and fitness. This number is well in excess of the 1080 estimated number of atoms in the universe. Estimates of the number of genes in the human genome further exceed this number; for instance [Roest et al., 2000] estimates the number at 28,000-34,000, although this exact number has been the focus of intense interest and revisions and more recently been estimated at 19,000 protein coding genes [Ezkurdia et al., 2014]. Regardless, the fitness landscape, even if it were purely determined by the genome, which it most assuredly is not, would be immense. Furthermore, even at very large population sizes, only a small number of possible genotypes can be sampled, when confronted with such vast landscapes. Wright visualized these combinations as representing a gene field in two dimensions. A 2D contour plot visualization of a fitness landscape from [Wright, 1932] is shown in Figure 3.1. On the subject of the relationship between evolution and fitness landscapes, Wright wrote: ... With something like 101000 possibilities, it may be taken as certain that there will be an enormous number of widely separated harmonious combinations. The chance that a random combination is as adaptive as those characteristic of the species may be as low as 10−100 and still leave room for 10800 separate peaks, each 39 Figure 3.2: Figure 4 from [Wright, 1932] with original caption. Here, 4NU refers to the genome wide mutation rate, whereas 4NS refers to the selective pressure. D and E are relevant for sexual populations and F corresponds to an island model. As Wright points out, differential selection on different islands in F may form the basis for or contribute to the process of speciation. surrounded by 10100 more or less similar combinations. In a rugged field of this character, selection will easily carry the species to the nearest peak, but there may be innumerable other peaks which are higher but which are separated by “valleys”. The problem of evolution as I see it is that of a mechanism by which the species may continually find its way from lower to higher peaks in such a field. [Gavrilets, 2004] further distinguished between two conceptions of fitness landscape that often appear in the literature. The first is that the fitness landscape is determined by the fitness of gene combinations. This is the view described by Wright above as he explicitly defines fitness as determined by gene combinations. An alternative formulation that Gavrilets outlines and which often appears in the population genetics literature is that the fitness landscape is the relationship between gene frequencies and the mean fitness of a population. This formulation begins to suffer in multilocus systems where genes may have complex epistatic interactions that are not simply additive. There are other conceptions of fitness landscapes that Gavrilets points out. For instance, 40 fitness landscapes may be visualized in terms of continuous quantitative traits. This can be useful in understanding the selection pressures on these traits, and is commonly used in quantitative biology to visualize selection on traits and investigate evolutionary tradeoffs. Another area of broad interest in fitness landscapes is that of frequency dependence in which the fitness of an organism is not dependent mainly on the fitness lattice, but also on the distribution of other individuals within the population. Negative frequency dependent selection is often driven by predators or disease. An example would be a virus that becomes common and causes a host population to develop immunity–new variations to which the host was not immune would have higher fitness and thereby spread rapidly. Positive frequency dependent selection can be observed in the study of cooperators and cheaters, among other domains–a certain level of cooperation is required for cooperation to become a viable strategy. For the purposes of this work, I focus on fitness landscapes as gene combinations. Epistasis plays an important role and is crucial in determining the landscape structure. However, I will ignore cross-organism interactions such as frequency dependence, which are much harder to represent in traditional landscapes, and move the fitness into an even higher dimension making it a function of both genotype and environment. 3.2 The Structure of Fitness Landscapes The distribution of peaks in fitness landscapes is an open problem, with consequences including the likely distribution of viable protein sequences and the distribution of viable RNA/DNA sequences for viruses. Peak distribution is also important in the evolutionary computation space; the goal is to find the best solutions possible, which correspond to local optima or peaks, so improved understanding of evolutionary processes will also be useful. Kauffman in [Kauffman, 1993] stated: ... Depending upon the distribution of the fitness values, the fitness landscape 41 can be more or less mountainous. It may have many peaks of high fitness flanked by steep ridges and precipitous cliffs falling to profound valleys of very low fitness. Or it may be, like the gentle Normandy countryside, smoothly rolling with low hills and gentle valleys. Kauffman introduced his NK model of fitness landscapes, which is explored in more detail in the next chapter. A significant finding from analyses of these landscapes was the realization that peaks were not distributed randomly in genotype space, but rather more closely packed than should be expected by chance. That is to say, there was global structure within the fitness landscape. Kauffman wrote: ... Like the alps, our landscape here possesses a kind of Massif Central, or high region, of genotype space where all the good optima are located... Kaufmann analyzed the structure of the fitness landscape empirically by measuring random walks starting from random initial genotypes in the landscape. The correlation structure was measured by using a technique adapted from [Weinberger, 1991], which appears in equation 3.1. Here, ft is the fitness at step t in the random walk, and ft+s is the fitness s steps later. R(t, s) = E(ft · ft+s ) − E(ft ) · E(ft+s ) var (f ) (3.1) As a result of this work, the claim that peaks tend to be clustered globally in the landscape is often referred to as the Massif Central Hypothesis. However, Kauffman describes what he calls a ‘complexity catastrophe’ that occurs in increasingly rugged landscapes. A depiction of the complexity catastrophe can be seen in Figure 3.3. However, this view of fitness landscapes and their underlying structure is not undisputed. Gavrilets in [Gavrilets, 2004] had this to say at the beginning of his chapter on nearly neutral networks: 42 Figure 3.3: Figure 3.1 from [Kauffman, 1993] ... Indeed, everybody knows from his or her own hiking experience that it is impossible to get from the top of one hill to another without having to descend to a kind of valley or depression between them. Our intuition tells us that things will stay the same in landscapes with many more dimensions than the three we are so well familiar with, and that extended ridges or chains of very shallow valleys connecting high-fitness genotypes are improbable. If the fitness landscape is constant (in space and time), then stochastic fluctuations in the population genetic structure must be a major mechanism for crossing the valleys. However, the main conclusion of the previous chapter is that there is no satisfactory solution to the problem of stochastic transitions across even moderately deep valleys in static landscapes. Only if the valleys are very shallow, can stochastic transitions happen on a time scale short enough to be of biological significance. But both speciation 43 and diversification seem to require crossing deep valleys in static landscapes describing RI. A logical conclusion is that speciation and diversification on rugged landscapes are impossible. There must be some kind of an error or a weakness in the chain of arguments presented in the previous paragraph... Gavrilets presented an alternative view of the structure between peaks; namely that peaks are connected by ‘ridges’ that are composed of neutral and nearly neutral genotypes. He called these ‘holey landscapes’. His arguments for this view of fitness landscapes are motivated by percolation arguments; namely that at a certain threshold of connectedness, there is a tendency for all elements to be connected. A visual example of this from [Gavrilets, 2003] can be seen in Figure 3.4. He carries this same basic core argument into higher dimensions as justification for the idea of neutral networks. Among the evidence [Gavrilets, 2004] cited for the existence of neutral networks is the NK family of models, although he relies primarily on variants like the NKp model by [Barnett, 1998] for formulating his arguments. 3.3 Evolutionary Implications of Fitness Landscapes So far, two models of fitness landscapes have presented quite differing views on how fitness landscapes may be arranged. These have strong biological implications that affect how populations might be able to evolve on them. In Kauffman’s conception of rugged fitness landscapes, the complexity catastrophe implies that as an adaptive walk progresses, further improvements become harder and harder to obtain; in fact these innovations become exponentially harder to cross as fitness peaks are achieved since larger and larger valleys need to be crossed. In the competing view outlined by Gavrilets, continuous improvement is permitted via these incredibly long neutral networks, making it unnecessary for the process of evolution to cross deep valleys. 44 Figure 3.4: Figure 4 from [Gavrilets, 2003] with original caption. 45 There has been some work on evaluating the structure of fitness landscapes in a biological context. For instance, [Fontana et al., 1993] examines correlation and local structure in an RNA-folding landscape where the objective is to minimize free energy on a variety of alphabets and found that RNA landscapes correspond roughly to a K of 8 in the NK landscape, and also that there exist neutral networks in the vicinity of peaks, in contrast to NK landscapes. [Stadler et al., 2001] investigated the topology and evolutionary implications thereof in RNA landscapes. [Kryazhimskiy et al., 2009] classifies fitness landscapes by looking at the rate of adaptation along fitness trajectories and tying this to real microbial data from long term evolutionary studies. There have also been limited fitness landscape mappings, such as the analysis done by [St Onge et al., 2007] and [Poelwijk et al., 2007] which have helped to provide real data on the nature of fitness landscapes for gene clusters and enzymes, respectively. 3.4 Model Systems For my evaluation of fitness landscapes, I examine three different model systems in order to study structural properties of the landscapes. I perform exhaustive genotype-to-phenotype mappings for each of these environments. Each landscape examined was of length 18 with four possible alleles per site, constituting a fitness landscape with 418 = 68, 719, 476, 736 genotypes. To my knowledge, no systematic and exhaustive analysis of landscapes this large has been conducted prior to this work. I choose a common length and number of alleles to eliminate confounding factors on evolution from structure and make the landscapes more easily comparable. The three model systems are NK, which is a theoretical family of landscapes commonly used in the discussion of fitness landscapes, Avida, which provides us with a genetic programming landscape where genotypes map on to functional computer programs, and an RNA landscape where the fitness is of a genotype is determined by its free energy after folding. This 46 provides us with a theoretical landscape widely used to study evolution and optimization, a genetic programming language, and a biologically-derived landscape. The purpose of this work is to study some of the structural properties of these landscapes. It is also important to note that these are instantiations of each type of landscape: there are many variants of each, and it may not be possible to generalize. Nevertheless, they provide a useful starting point for the systematic investigation of landscapes, and I introduce them in more detail below. 3.4.1 NK The NK fitness landscape, originally introduced in [Kauffman and Levin, 1987], is a biologicallyinspired model of a fitness landscape, in which an organism’s fitness is determined by N genes, each with K other interacting loci. This model has been used to study properties of adaptive walks and the influence of epistasis on rugged landscapes. This K parameter tunes the amount of epistasis in the landscape; as K increases, the amount of epistasis increases as does the number of local optima. At K=0, there is a single peak in the landscape, and it is a trivial problem for evolution to solve since each site may be optimized independently. At K=N-1, the landscape is completely uncorrelated and any mutation effectively randomizes the organism’s fitness. As in the original NK model, I calculate fitness by using the arithmetic mean of the fitness contributions from each of the N sites as in Equation 3.2. Furthermore, I also use the default nearest adjacency model of the NK landscape, where each gene is epistatic with K adjacent genes. 1 F (G) = N N f (Gi ) i where f (Gi ) = f (gi,0 , ..., gi,K ) ∼ U (0, 1) Here, gi,j denotes the j-th neighbor of site i. 47 (3.2) For consistency with other landscapes studied, I use four alleles per site and I set the parameter of K = 4 to achieve a nontrivial amount of epistasis. 3.4.2 Avida Avida is a software system that has been used to study biological evolution. [Ofria, 2004] Each “digital organism” in Avida is a program that is executed in a virtual computer, complete with its own registers and stacks. The programs may, among other things, execute logic tasks, for which they receive a reward that translates to an execution speed boost relative to their competitors. The relative execution speed is called ‘merit’ and corresponds to a metabolic rate. Fitness is calculated as merit divided by the gestation time. Fitter organisms are those that are able to reproduce themselves fastest, which equates to selective pressures to increase execution speed or improve the efficiency of replication. Although no exhaustive mapping of this size has been done to study the fitness landscape of the Avida genome, there has been considerable previous work that informs about the nature of Avida fitness landscapes. [Lenski et al., 1999] sampled the landscape up to ten mutations away from simple and complex digital organisms and found that not only are complex organisms more robust to the generally deleterious effects of point mutations, but that there is significant epistasis in the mutational landscape as measured by comparing the fitness of mutational combinations with single point mutational fitness effects. There were also differences in the fragility of organisms and the fitness effects of mutations when comparing simple and complex organisms. [Lenski et al., 2003] suggested that rewarding intermediate logic functions is important to the evolution of EQU (the bitwise equals logic operator) with the implication that complex features often coopt simpler features in the process of evolution. This result is partly because higher-order logic tasks can be composed of lower order ones. The Avida landscape I study is a reduced version of the default instruction set with only four instructions: IO, nand, nop-C, and swap. Organisms automatically reproduce at the end, 48 so all genotypes are technically viable. This model is a highly simplified compared to the full Avida model in many ways. The default Avida instruction set has 26 instructions and organisms must deal with the complexity of maintaining a copy loop in order to reproduce (the organism must copy over each instruction). This requirement of maintaining the copy loop has several consequences for the resulting fitness landscape. Primarily, the vast majority of random sequences are dead since they cannot produce viable offspring, whereas in the reduced instruction set, this is not the case. However, even in the reduced instruction set, the landscape is similar in that the vast majority of random sequences still are on the lowest fitness levels. Furthermore, even small genome sizes are difficult to exhaustively analyze with 26 possible instructions per site. A length 18 organism, which would still be a small organism by Avida standards, would have 2618 possible genotypes, which is roughly a landscape of size 2 · 1025 . Clearly, a landscape this immense would defy exhaustive analysis. And this is still a small genome at a much reduced scale compared to biological entities or proteins which often have sequence lengths well above the hundreds. Exhaustive enumeration of such entities may never be completely possible given the sheer exponential (with length) scale of the size of the fitness landscape. Indeed, the fitness landscapes that I am considering here with 68 billion states (four possible instructions at each of 18 sites) is already pushing the current limits of hardware in terms of both storage and computing power. In this work, I use Avida not as the experimental evolution platform that it undoubtedly is, but specifically to evaluate the fitness of an input genomic string by executing the string and determining what logic tasks, if any, it can perform and what its gestation time would be. This genotype-to-phenotype mapping is what provides us with the fitness landscape. For the purposes of this work, I will be examining the repeated tasks environment, in which organisms may be rewarded for performing the same tasks multiple times. Avida uses a concept of ‘merit’ to determine relative CPU speed. Merit is awarded by organisms successfully performing tasks; I use the default Avida logic-9 environment. In this environment, nine logic tasks are rewarded exponentially in order of rough complexity. 49 Specifically, NAND and NOT are rewarded by a factor of 2, AND and ORN (or-not) by a factor of 4, OR and ANDN (and-not) by a factor of 8, NOR and XOR by a factor of 16, and EQU (XNOR) by a factor of 32. Organisms also have variable gestation times, since nop-C can be a modifying argument to an instruction. If this occurs, the nop-C itself is not executed by Avida and thus does not count towards the total execution length. 3.4.3 RNA The third and final landscape I examine is the result of RNA folding of length 18 sequences using the Vienna RNA Package created by [Lorenz et al., 2011]. I examine two different RNA fitness landscapes generated by two functions: fold, which computes the minimum free energy of an RNA sequence, and pf fold, which returns the Gibbs free energy of the folding ensemble. Figure 3.5 shows a comparison of the two landscapes. Since pf fold is based on the energy of the ensemble of possible folds, there is more variation in the fitness values than fold which only looks at the most likely candidate. Nonetheless, the discretization in Figure 3.5 implies that the landscapes are broadly similar. For the rest of this work, I use the results of pf fold to denote the RNA landscape. This also has the side effect of decreasing the amount of neutrality in the landscape, as well as reducing the amount of extremely low fitness genomes while spreading the mass along the rest of the landscape, compared to the fold landscape. Previous work has found neutrality in RNA landscapes, but in this instantiation there is little–likely because of the decision to use the ensemble rather than the most likely sequence when calculating free energy. [Fontana et al., 1993] 50 Figure 3.5: Comparison of RNA pf fold and fold landscapes. In Figure 3.5a, raw data for each landscape is plotted, with each fitness, count pair plotted separately. In Figure 3.5b, I discretize fitnesses of pf fold by rounding fitness to the nearest tenth. The landscapes are highly similar as would be expected since they are both mappings of fold energy within the RNA landscape. (a) Raw data (b) Discretizing pf fold 51 3.5 3.5.1 Landscape Summaries Methods In order to understand the structural properties of each landscape, I started by examining the distribution of fitnesses. With length 18 and four alleles per site, this works out to almost 69 billion possible genotypes. For each of the genotypes in each of the three landscapes, I look at the corresponding phenotype, which is tied simply to fitness in each landscape: an instantiation of an NK landscape with K=4, logical tasks performed in Avida, and the free energy of the ensemble of folding sequences in the RNA landscape. As part of each landscape summary, I also examine properties of peaks within the landscape. Here, I present two definitions, that of a ‘true peak’ and a ‘neutral peak’. A ‘true peak’ is a genotype with a fitness that is strictly greater than the fitness of each of its neighbors. This is a local optimum on the fitness landscape. I also define a ‘neutral peak’ as a genotype with an associated fitness that is greater than or equal to the fitness of each of its neighbors. Neutral peaks can be on plateaus or ridges in genotype space. Counterintuitive behavior can occur when the edge of a ridge or plateau adjoins a higher peak—the edge point would be considered neither a peak nor a true peak since there would be at least one genotype with higher fitness next to it, although it may neighbor other points with identical fitness that are still considered neutral peaks. The neutral peak dataset is a superset of the true peak dataset. This concept proves useful when looking at landscapes with large amounts of neutrality, such as the Avida landscape, since the highest points are rarely true peaks. I exhaustively looked at each genotype to obtain the data in this chapter. I accomplished this by splitting the genotype space into several (usually 1024 or 4096) smaller chunks. Each chunk was then evaluated, and summary information generated for the chunk, as well as candidate ‘true peaks’ and ‘neutral peaks’. These individual chunks were later combined to provide us with a list of all the peaks in as well as complete summary information for the 52 landscape. I used several techniques including binary encoding, hashing, and compression in order to deal with the dataset sizes. 3.5.2 NK The NK landscape pictured in Figure 3.6 features a normal distribution, centered at a fitness of 9. This distribution is the consequence of the fact that each of the eighteen genes has a uniform distribution with an expectation of 0.5, and the result of this summation creates a normal distribution. The theoretical interval that fitnesses can take is [0,18], but I do not see values at either extreme, again due to the normal nature of the underlying distribution. The peaks also appear to follow a normal distribution, centered a little higher than 12 for this particular landscape instantiation. There is some, but very little, neutrality, and therefore few neutral peaks that were not peaks in this landscape. The neutrality itself is caused by the fact that I have only 32 bits of precision. Figure 3.7 shows the distribution of fitnesses for all genotypes and peaks in the landscape. 53 Figure 3.6: Phenotypic Summary, NK Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. The neutral peaks line in green is overlaid by the peaks in blue since because there is so little neutrality in the landscape, the two sets are essentially identical. 54 Figure 3.7: Box & violin plots of genotype fitness values on the NK landscape for all genotypes, peak genotypes. 55 3.5.3 Avida Figure 3.8 is the summary of the reduced-instruction Avida landscape. The fitness structure of this landscape is exponential, so I take the logarithm of fitness when interacting with the Avida landscapes. There are only 1108 possible fitness levels in the Avida landscape, which is significantly fewer than the numbers observed in the RNA and NK landscapes which are more or less continuous. Figure 3.9 shows a summary of the Avida landscape. Another pattern that can be clearly seen is a wave structure. This appears to be connected to the structure of the landscape itself—in Figure 3.10, the phenotypic structure of the landscape is visible. Notable is the gestation structure—different amounts of optimization with the nop-C create a step-like structure. There are 98,120 true peaks in total and 86,100,995 peaks in the Avida data. I thresholded the peak data set by fitness to make it more tractable for the analyses used in the rest of the chapter, which resulted in 4.5 million peaks. 56 Figure 3.8: Phenotypic Summary, Avida Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. 57 Figure 3.9: Box & violin plots of genotype fitness values on the Avida landscape for all genotypes, peak genotypes. 58 Figure 3.10: Phenotype of fitness values on the length 18, reduced instruction set landscape. Phenotypes are ordered by fitness. The blue line is log fitness of the phenotype. The green line is the gestation time. The red line is the number of genotypes. The general pattern observable is that for each level of merit, there are several optimizations using nop-C that reduce the execution time and thereby the gestation time, improving fitness. On the bottom, logic tasks performed by each ranked genotype are shown in the following order from bottom to top: NAND, NOT, AND, ORN, OR, ANDN, NOR, XOR, EQU. It can be seen that the most fit organisms perform several tasks rather than doing EQU. 59 3.5.4 RNA The RNA landscape can be seen at a high level in Figures 3.11 and 3.12. As in the Avida landscape, there is a large preponderance of genotypes that do not fold particularly well. Also, similarly to the NK landscape, the fitness values range from 0 to 18. There are 184,020 neutral peaks and 58,024 peaks in this landscape, which indicates some clustering of the high regions in genotypic space, which will be explored further in following sections. Figure 3.11: Phenotypic Summary, RNA Landscape. X-axis is fitness, Y-axis is the count of the number of genotypes with that fitness. 60 Figure 3.12: Box & violin plots of genotype fitness values on the RNA landscape for all genotypes, peak genotypes. 61 3.6 Peaks and Nearby Peaks An analysis by [Østman et al., 2010] looked at the critical properties of true peaks in the NK landscape. One of the properties they measured was true peak fitness vs the average fitness of true peaks at a distance of two, which they call ‘cluster mean fitness’. The key question here is whether more highly fit peaks correlate with more highly fit neighborhoods. [Østman et al., 2010] used a binary NK landscape for length 20 genomes and different Ks. They also performed a random assignment between genotypes and fitnesses as a control to ensure the relationship was not spurious. They found a correlation between higher fitness of a peak and higher cluster mean fitness as seen in Figure 3.13. I perform a similar analysis here in each of the three model landscapes. I use the neutral peaks dataset to perform this analysis. One distinction from the methodology in [Østman et al., 2010] is that since my data set includes ‘neutral peaks’, I must count the neighborhood as being within a distance of two or less to include neighboring ‘neutral peaks’. 62 Figure 3.13: Figure 4 from [Østman et al., 2010] with original caption 63 3.6.1 NK My NK landscape is qualitatively different from that in [Østman et al., 2010], it has length 18, but four alleles per site, rather than two alleles per site. The results can be seen in Figure 3.14 and show a strongly correlated peak fitness with cluster mean fitness. 64 Figure 3.14: NK data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. (a) Fitness vs Cluster mean fitness (b) Control, with fitnesses randomly assigned to genotypes. 65 3.6.2 Avida The results of the analysis on the Avida data can be found in Figure 3.15. Although a bit more visually confusing than the NK landscape due to the tiered structure of the Avida landscape, there is nonetheless, a strongly correlated positive relationship between the fitness of a peak and the height of any peaks in the neighborhood, and this relationship disappears completely when the genotypes are randomly assigned fitness. 66 Figure 3.15: Avida data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. (a) Fitness vs Cluster mean fitness (b) Control, with fitnesses randomly assigned to genotypes. 67 3.6.3 RNA The results of the analysis on the RNA data can be found in Figure 3.16 and the same relationship holds; there is a similar positive and strong relationship between peak fitness and the fitness of peaks in their neighborhood. 68 Figure 3.16: RNA data, Fitness vs Cluster mean fitness. Each neutral peak fitness along with its neighborhood fitness is shown as a dot. The colored hexagons overlay show density (count of points) and the red line shows the fit. The correlation and line fit can be seen at the top of the graph. (a) Fitness vs Cluster mean fitness (b) Control, with fitnesses randomly assigned to genotypes. 69 3.7 Percolation Another structural property of interest is percolation. Percolation theory is often used to model fluid flow and is particularly concerned with the connectedness of a graph. This is the key property that Gavrilets postulated might allow evolution to perform walks across a fitness landscape as in 3.4. The core argument is that as you add dimensions, the probability of there being an interconnected pathway may increase. If peaks are close to each other, even by small jumps, then it should be easier for evolution to walk from peak to peak, compared with the alternative where peaks might be interspersed by wide valleys. Again, [Østman et al., 2010] explored percolation in the NK landscape, which can be seen in Figure 3.17. The basic approach is to add a fitness threshold on peaks and examine the spatial relationships of the remaining genomes. Peaks of distance no more than two apart are connected and belong to the same cluster. The same cluster may contain peaks of more than 2 distance apart due to a chain of connected peaks. For this work, I follow a similar regime. I measure clusters and connectedness for increasing fitness thresholds for peaks. Here, I measure the size of the largest connected cluster. For the RNA and NK landscapes, which are continuous, I set threshold increments of 0.1. For Avida, since there are a finite number of phenotypes, I used the possible fitness levels as thresholds. 70 Figure 3.17: Figure 6 from [Østman et al., 2010] with original caption 71 3.7.1 NK The analysis for the NK data set can be seen in Figure 3.18. As in the [Østman et al., 2010] work, there appears to be an orderly phase transition from where most of the peaks are connected in the largest cluster, to where the largest cluster is significantly smaller. 3.7.2 Avida The percolation graph for Avida can be seen in Figure 3.19. In the Avida landscape, the peaks start out almost completely connected. The horizontal leveling is tied to the phenotypic structure, since there are only 1108 possible fitness levels in the Avida landscape and these are distributed in a step-like fashion. 72 Figure 3.18: NK data, peak percolation, neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. 73 Figure 3.19: Avida data, Peak Percolation with neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. 74 3.7.3 RNA The percolation for the RNA dataset can be found in Figure 3.20. Most neutral peaks are not connected; even at a fitness threshold of zero (allowing all peaks to be considered), I see that only a few hundred out of 184,020 neutral peaks are actually connected to begin with. The lack of connectedness in the graph suggests that, in the RNA data set, peaks are often isolated. The percolation pattern and connectedness is in sharp contrast to the pattern observed in the Avida and RNA datasets. 75 Figure 3.20: RNA data, peak percolation, neighbors joined within distance 2. Visible are all points above the fitness threshold and the size of the largest connected cluster. 76 3.8 Autocorrelation of Peaks Finally, one last analysis I performed was similar to the autocorrelation analysis proposed by [Weinberger, 1991], except rather than measuring random walks, I measured a structural property of the landscape pertaining to peaks: namely peak fitness correlates with distance between peaks. In Equation 3.3, I reproduced equation 4 from [Barnett, 1998] and originally formulated by [Eigen et al., 1989]. Here, the autocorrelation at distance d is represented by ρ(d), g and g are two points within the landscape. QN (d) represents points d distance apart, f¯ is the average fitness, and σf 2 is the variance of the fitnesses of the points in the collection. In order to collect this data, I needed to look at all-pairs distances, which proved especially challenging when there were millions of data points as in the Avida set. ρ(d) = 3.8.1 1 1 2 N σf |Q (d)| (f (g) − f¯)(f (g ) − f¯) g,g (3.3) ∈QN (d) NK The autocorrelation in the NK landscape can be seen in Figure 3.21. There appears to be an exponential response of autocorrelation to distance. 77 Figure 3.21: NK peak autocorrelation 78 Figure 3.22: Avida thresholded peak autocorrelation 3.8.2 Avida The resulting autocorrelation in Avida can be seen in Figure 3.22. At distance zero, there is full autocorrelation, since each peak’s fitness correlates fully with itself. Similarly, at distance one, only neutral peaks neighbors may exist in the peak data set, which by definition have the same fitness, so the autocorrelation is one. After this, the autocorrelation drops off in what appears to be a very smooth exponential or power distribution. 79 3.8.3 RNA Figure 3.23 shows the autocorrelation of peaks. Once again, there is full autocorrelation at distance one from the neutral peaks. As in the other landscapes, when peaks are close, the correlation in fitness is high. The correlation goes negative as distance increases and then finally becomes weakly positive at distance near 18. This last phenomenon may be due to the fact that if a given sequence is stable, its complement is also likely to be stable. Figure 3.23: RNA peak autocorrelation 80 3.9 Discussion and Conclusions I have presented a comprehensive investigation of several structural properties in length 18 genomes spanning NK, Avida, and RNA landscapes comprising 68,719,476,736 genotypes each. Landscapes of this size have not been previously studied so systematically. For each landscape instantiation, I did an exhaustive genotype to phenotype analysis. This investigation provides us with insight into theoretical landscapes often used to study evolution, genetic programming, and RNA sequences. The fitnesses found within the NK landscape follow a normal distribution, as do the peak fitnesses within the NK landscape. The RNA and Avida landscapes in contrast are mostly dead—that is most genotypes have minimal fitness due to the phenotypic nature of these landscapes. On the contrary, the expectation for a random NK landscape sequence is a fitness of 9—exactly at the 50th percentile. This is one important difference between NK landscapes and the landscapes I see in nature. The peaks in Avida and RNA landscapes skew towards the tail of higher fitnesses, but in RNA it is even more pronounced than in Avida due to the large prevalence of neutrality in Avida landscapes. For all three instantiations of landscapes, peaks tend to cluster near other peaks; in Section 3.6, I found that there is a clear positive relationship between the fitness of a peak and the fitness of peaks within a neighborhood of two. This relationship held across all three types of landscapes. For NK, this effect has already been demonstrated in previous work, but the broader generality of this finding is novel with its expansion to larger NK landscapes as well as for both Avida and RNA landscapes. These results provide strong quantitative evidence that peaks are not distributed randomly throughout the landscape, but do indeed tend to cluster. For both the NK and Avida landscapes, the autocorrelation starts high and drops rapidly, eventually going below zero as the distance approaches 18. The RNA landscape has a higher autocorrelation for longer, but also eventually declines and then, surprisingly, goes positive for large distances. This effect at high distances is potentially because my fitness function is 81 tied to the folding energy, and the complement of a sequence that is stable is also likely to be stable. Such a complement could only appear in regions at the maximum distance of 18. This result may indicate a certain amount of symmetry in the RNA landscape that does not exist in the other two model landscapes—namely if you locate a peak in RNA space, it is profitable to search regions around the complement. Percolation (an indication of the ability for evolution to navigate to wide ranging peaks through intermediate peaks) is only present in the Avida and NK landscapes—the RNA landscape, in contrast, starts with only a small fraction of peaks connected. This result implies that nearly neutral pathways that connect clusters of peaks may not exist on RNA landscapes. Conversely, these pathways may exist on Avida and NK landscapes. This difference implies that evolution may work differently in these landscapes, a hypothesis I investigate further in future chapters. 82 Chapter 4 Basins of Attraction 4.1 Introduction High fitness peaks dominate their local neighborhoods in genotypic space by allowing organisms to outcompete their neighbors, thus attracting and trapping evolutionary trajectories. These peaks are are often referred to as ‘basins of attraction’ because it is easier for evolution to permit a transition from a relatively unfit genotype to a relatively fit genotype, whereas the opposite transition is actively opposed by the process of natural selection. In general, the longer upward trajectories are that lead towards a peak, the larger the associated basin is likely to be. In this chapter, I explore questions of predictability and evolvability within rugged adaptive landscapes by examining the size and distribution of basins of attraction. These properties of basins have implications for biological evolution since natural selection drives populations towards higher levels of fitness, so an improved understanding of how trajectories evolve on fixed fitness landscapes may help us understand how they evolve in nature. 83 4.2 Background [Kauffman, 1993] was the first to examine basins of attraction in the context of NK landscapes, and he measured basin size by counting the number of random walks ending up at each optimum. Similarly, [Ochoa et al., 2008] examined NK fitness landscapes by exhaustively enumerating instantiations of these landscapes across a variety of K values for N = 16 and N = 18. Basins of attraction were determined for each point by starting a walk at each genotype, and picking the neighbor with the highest fitness until a local optimum was reached. If a tie for neighbor of highest fitness occurred (which may occur due to rounding error or when NKp landscapes with neutrality are used), the next step was chosen between them at random. Trajectories were then binned into genotypes which permitted them to make many conclusions about basins of attraction in NK landscapes. Among their conclusions, they found a positive exponential correlation between the fitness values of maxima and basin size. They also found that as K increased, the size of the largest fitness basin shrunk as a proportion of the fitness landscape. [Ochoa et al., 2010] used the same technique, but instead of each walk favoring the most fit neighbor, it would choose any neighboring point with greater fitness. This technique resulted in a similar relationship between fitness and proportion as in [Ochoa et al., 2008], but randomly walking across neighbors tended to increase the size of smaller basins while decreasing the size of the largest basins. Additionally, they found that a genotype at N=16,K=4 may reach as much as 70% of the available basins, but at K=14, a solution may reach as little as 30%. [Tomassini et al., 2008] measured transition probabilities between basins by examining the ‘edges’ of basins, defined as those points that belong to a basin that are one mutational step away from points that belong to other basins. The authors assembled a network upon the basins by examining the ratio of outgoing links to within-basin links. [V´erel et al., 2011] extends the methodology of [Ochoa et al., 2008] to look at NKp neutral 84 networks. The hill-climbing algorithm for determining basins was altered by performing a random walk to points greater than or equal to the max fitness bordering any given point. The final points were then used to construct neutral networks by examining edges, with single peaks corresponding to a neutral network of size one. 4.3 Measuring Basins In this section, I present a new method for measuring the size of basins, the Basin Flow Algorithm. 4.3.1 NK Fitness Landscape Variants In the previous chapter, the standard form of the NK landscape family, with a genome of length N with K interacting genes, was introduced. A modified version of the NK landscape was proposed by [Østman et al., 2010], where instead of using the arithmetic mean, the geometric mean is preferred. The biological justification is that it simulates the effects of lethality—the expression of a single bad gene is enough to drop fitness substantially. 1 N N f (G) = fi (G) 1 (4.1) where f (Gi ) = f (gi,1 , ..., gi,k ) ∼ U (0, 1) The NK landscape contains almost no neutrality since the probability of obtaining two random numbers of the same exact fitness is near zero since neutrality is possible due to finite precision. [Barnett, 1998] introduced the NKp landscape where there is a p probability that a given allele contribution will produce a gene with zero fitness, introducing neutrality. Similarly, I introduce a probability that a given allele contribution will produce a gene with a one fitness contribution. This probability has the secondary effect of increasing fitness values of the distribution due to the geometric mean. 85 4.3.2 Basin Flow Algorithm I seek to measure basins of attraction by looking at where population centroids would ‘flow towards’, given a uniform initial distribution across the entire fitness landscape. I take broad inspiration for this idea from the page rank algorithm, [Page et al., 1999] originally employed by Google, and which seeks to establish the relative importance of a web page by examining the properties of its links. Internet surfers are modeled as random walkers who make decisions about where to navigate next based on outgoing links from the current page they are visiting. The page rank seeks to identify nodes where surfers spend the most time. Similarly, here, I model the centroids of populations as random walkers on a fitness landscape and seek to identify which nodes they are drawn towards by examining one point mutational neighbors. By centroid, I mean the approximate center of the population on the genotypic lattice. Neighbors that are greater than or equal to are analogous to ‘outgoing links’ in the page rank algorithm. The accumulation of random walkers ultimately identifies nodes that are either local maxima or part of locally maximum neutral networks, or ‘sinks’. Unlike the page rank algorithm, these are treated as true sinks. This methodology differs from mutation-selection-balance population genetic models since rather than competing genotypes among populations or competing populations against each other, I am instead measuring random walks of population centroids. This method is a fixed structural measure of the fitness landscape. Previous work in this domain has focused on sampling trajectories, whereas my approach represents a belief of basin density over all genotypes in the fitness landscape. A genotype is then potentially simultaneously a member of multiple basins by distributing flow over all qualifying neighboring genotypes. I assume a uniform starting distribution, which admittedly is not the only prior belief one can have about populations upon the landscape, but is a straightforward choice. In reality, of course, finite populations necessarily cannot start with such a distribution and are already trapped in one small area of the landscape, because of both common descent (see Chapter 2) and 86 selection. With this method, I then explore the entire landscape without bias, which differs from sampling methods which are necessarily biased from the initial sample points. Additionally, I seek to examine the effect that resulting network structure has on the process of evolution. To do this, I apply a secondary analysis to measure transience of basins by allowing jumps between genotypes two mutations apart. This secondary analysis provides insight into what the basin structure might look like on an evolutionary time scale when double mutations may be regularly explored. In this work, I use an instantiation of the NK landscape described in the previous chapter, with length = 20 and two alleles per site. I use a modified version of the NKp landscape using the geometric mean as identified earlier in Equation 4.1 and where p, rather than a gene having a probability of adding 0 fitness, has a probability of multiplying by 1. As mentioned earlier, this NKP variant causes further skew to the right in NKp landscapes, as the geometric mean will be larger when a 1 is substituted rather than a random number between 0 and 1. To measure basins, I assign a density value to each genotype. The density represents the proportion of population trajectories at each genotype in the landscape. I start with a uniform initial density of 1 , aL where a is the number of alleles and L is the length of the genome. I establish directional links between genotypes that may distribute density to neighboring genotypes, based on relative fitness and the local structure of the neighborhood. This transition probability represents the chance that a population centered at a given site may evolve to a neighboring site. There is some literature pertaining to competition in the context of clonal interference between beneficial mutations as in [Gerrish and Lenski, 1998], [Desai et al., 2007], and [Wilke, 2004], but I opt to use two simple treatments instead described below. There is no coherent view of what timescale an update corresponds to in this work, since selection strength would be varied, so I am more interested in the convergent dynamics than in the transient dynamics; after 1,000 updates, I measure the density in genotypes as an estimate of the size of basin of attraction. 87 In the ProbGE treatment, I constructed an outgoing edge e(A, B) from genotype A to genotype B with weight w(A, B) as follows: w(A, B) =    0    if F (A) < F (B) (4.2) n∈NA f (B) ∪{a} I(F (n)≥F (A))F (n) if F (A) ≥ F (B) In this equation, F (A) represents the fitness of genotype A, NA is the neighborhood of one point mutations surrounding A, and I is the indicator function that enforces that these points have greater than or equal fitness relative to point A. The biological assumption underlying this treatment is that neighboring points have a chance of receiving a trajectory equivalent to their fitness divided by the total neighboring fitness of all neighboring candidates. The second treatment is the FlatGE treatment, in which I constructed an outgoing edge from genotype A to genotype to B with weight w(A, B) as follows: w(A, B) =    0    if F (A) < F (B) (4.3) n∈NA 1 ∪{a} I(F (n)≥F (A)) if F (A) ≥ F (B) In other words, all outgoing links have equal weight in the second treatment, whereas they are weighted by relative fitness in the first. After 1,000 updates, I examine the distribution of the density in points that are above a threshold of 0.00001. The key biological assumption under this treatment is that if mutations are rare enough, the population is small enough, or the selection coefficient is strong enough, sweeps may occur sufficiently fast relative to the appearance of new competitive types that the first beneficial or neutral mutation may sweep. In both models, I make the following assumptions: (1) mutations within the neighborhood are equiprobable, (2) double mutations do not occur, (3) likewise, the population centroids, or centers of mass on the genotype space, may not take more than one step at once, and (4) natural selection prevents populations from taking downward steps. These assumptions 88 of course do not necessarily hold in natural populations, but they do make the model much more tractable, and I explore relaxation of the double mutation assumption next. As I saw in the previous chapter, [Kauffman, 1993] described a feature common to the NK family of landscapes that he termed the ‘Massif Central’, which referred to the observation that peaks tend to cluster close to each other, as opposed to being randomly distributed across the landscape. [Østman et al., 2010] provides evidence that peaks tend to be clustered in the particular NK landscape I use. This idea, that peaks are not necessarily far from each other, lends itself to a secondary analysis, whereby I evaluate the idea that trajectories can and do escape local peaks during the process of evolution. This treatment relaxes the assumptions that double mutations do not occur and that natural selection prevents populations from taking downward steps. The basic assumption is that most peaks are themselves transient phenomena in an evolutionary trajectory, albeit those likely to have a longer waiting time to escape. In fact, since the density I measure represents population centroids, it would not be that difficult to cross a double mutation valley. Simultaneous mutations, compensatory mutations through a valley crossing, and even drift provide mechanisms whereby populations may evacuate regions of the fitness landscape altogether and accumulate in more highly fit regions. In my secondary treatment, after running the first 1,000 steps as above, I run another 1,000 steps where in addition to percolation every update, I perform an additional calculation: I allow points with a density over threshold δ = 0.00001 to lose mass to genotypes that are two steps out, again obeying the above equations, with the caveat that NA instead represents the 2-step mutational neighborhood of genotype A. The density threshold permits us to ignore the vast majority of points underneath the threshold, which is a relatively useful optimization since there are L(L−1) 2 2-step mutants for each genotype, and the vast majority of genotypes already contain negligible or no mass. This threshold cuts down on the processing requirement at the potential cost of sacrificing some accuracy. 89 Finally, I examine the effects of neutrality on basin size. The exact role of neutral networks is still far from understood; such regions may serve as conduits to higher regions of the landscape or as sinks where trajectories get stuck or both. The distribution of neutral regions also may play a significant role; it might be imagined that a large multidimensional neutral region surrounded primarily by deleterious genotypes may retain a lot of mass simply because trajectories are randomly walking around in this neutral region. There is some evidence for ‘survival of the flattest’ or the notion that even in the absence of fitness benefits, populations may evolve towards genotypes with fewer neutral neighbors as in [Wilke, 2001] and [Wilke et al., 2001]. 4.3.3 Discussion I perform both the ProbGE and FlatGE treatments in a length N=20 landscape for K = 2, 5, 8, 11, 14, 17, and 20 with p=0, p=0.2, p=0.5, and p=0.8. For each treatment, I generate five random landscapes. I store only genotypes above a threshold of 0.00001. Figure 4.1 shows these results for the no neutrality case (p=0.0). In order to examine the effects of fitness and treatment condition (ProbGE = 0, FlatGE = 1) on the basin size, I fit the model log10 B = β0 + β1 (F ) + β2 (T ) + β3 (F xT ) (4.4) Here, B is basin density, F is fitness, and T is treatment condition. I fit this model across each pair of fitness landscapes; i.e. for each K, the same five landscapes are used in both the ProbGE and FlatGE treatments. The reason I prefer this type of pairwise analysis is that the landscapes in question change significantly as K increases. In particular, increased epistasis permits more combinations, which increases the maximum fitness possible. The results of this can be seen in Table 4.1. To summarize, the log of the basin size is predicted by fitness (p < 4.85 ∗ 10−31 ) for all cases. Additionally, this relationship decreases with increasing 90 Table 4.1: Coefficients of log10 B = β0 + β1 (F ) + β2 (T ) + β3 (F x T ). p-values are in parentheses. K 2 5 8 11 14 17 20 Intercept -2.482 ( 1.56e-252 ) -4.629 ( 0 ) -5.819 ( 0 ) -6.464 ( 0 ) -6.608 ( 0 ) -6.265 ( 0 ) -6.004 ( 0 ) Fitness 9.284 ( 4.85e-31 ) 6.841 ( 0 ) 6.215 ( 0 ) 5.62 ( 0 ) 4.634 ( 0 ) 3.201 ( 0 ) 2.365 ( 0 ) Type 0.06139 ( 0.301 ) 0.07802 ( 8.39e-06 ) 0.08586 ( 1.86e-15 ) 0.09555 ( 2.01e-26 ) 0.1113 ( 2.77e-33 ) 0.08976 ( 1.61e-21 ) 0.05122 ( 1.73e-07 ) Fitness x Type -0.9195 ( 0.391 ) -0.3776 ( 3.74e-05 ) -0.2649 ( 9.05e-14 ) -0.2349 ( 9.46e-25 ) -0.2375 ( 1.5e-32 ) -0.1687 ( 1.31e-20 ) -0.08538 ( 8.43e-07 ) K, which is unsurprising since the additional epistasis caused by a higher K increases the number of optima in the landscape. Both the association with fitness size and basin size and the effect of increasing K on decreasing this association are consistent with the results of [Ochoa et al., 2008]. Secondly, I find that the FlatGE condition produces larger basin sizes than the ProbGE treatment. Furthermore, there is a negative interaction between fitness and treatment; which is to say that the maximum density of peaks in the FlatGE case was less than that in the ProbGE case. This interaction also seems analogous to previous work done by [Ochoa et al., 2010], who found a difference between taking the best neighboring point and taking the first neighboring point increased average basin size but decreased the largest basin sizes. The ProbGE treatment, by weighting fitness proportionately, does not exactly correspond to the best point, but it is likely a less biased version of the same effect; with equal weights, more trajectories are drawn towards local maxima. This effect also seems to decrease with increasing K, likely as a consequence of increasing ruggedness in the landscape. Next, I examine the effects of the secondary network analysis on the basin size. Allowing valley crossings which permit basins to empty into higher nearby points results in a reduction of between 74% and 82% of all basin points above the threshold for the non-neutral case. This can be seen in Table 4.4. Table 4.2 and Table 4.3 show regression data for the ProbGE and FlatGE treatments, respectively. In both of these treatments, the slope of the fitness increases after the second order manipulation. The effects of the second order network can be seen in that it has an increasingly negative effect on basin size as K increases (when controlled for the range of fitnesses), but there is a large fitness x second treatment interaction term. In other words, higher fitness basins tend to get bigger and lower fitness basins are more likely to disappear via drainage. 91 Table 4.2: Coefficients of log10 B = β0 +β1 (F )+β2 (S)+β3 (F x S) for ProbGE treatment. p-values are in parentheses. K 2 5 8 11 14 17 20 Intercept -2.482 ( 7.21e-182 ) -4.629 ( 0 ) -5.819 ( 0 ) -6.464 ( 0 ) -6.608 ( 0 ) -6.265 ( 0 ) -6.004 ( 0 ) Fitness 9.284 ( 3.84e-28 ) 6.841 ( 0 ) 6.215 ( 0 ) 5.62 ( 0 ) 4.634 ( 0 ) 3.201 ( 0 ) 2.365 ( 0 ) Second 0.9298 ( 4.67e-15 ) 0.3553 ( 9.47e-20 ) 0.05732 ( 0.015 ) -0.1768 ( 3.7e-21 ) -0.3357 ( 2.96e-74 ) -0.5051 ( 5.79e-157 ) -0.7414 ( 2.18e-290 ) Fitness x Second -4.544 ( 0.00205 ) 0.04766 ( 0.775 ) 0.6512 ( 4.28e-22 ) 0.9685 ( 1.08e-112 ) 1.138 ( 2.32e-211 ) 1.375 ( 0 ) 1.691 ( 0 ) Table 4.3: Coefficients of log10 B = β0 +β1 (F )+β2 (S)+β3 (F x S) for ProbGE treatment. p-values are in parentheses. K 2 5 8 11 14 17 20 Intercept -2.42 ( 3.48e-179 ) -4.551 ( 0 ) -5.733 ( 0 ) -6.368 ( 0 ) -6.496 ( 0 ) -6.175 ( 0 ) -5.952 ( 0 ) Fitness 8.365 ( 5.12e-24 ) 6.463 ( 0 ) 5.95 ( 0 ) 5.385 ( 0 ) 4.397 ( 0 ) 3.033 ( 0 ) 2.28 ( 0 ) Second 0.8822 ( 6.17e-14 ) 0.2978 ( 1.43e-13 ) -0.007231 ( 0.765 ) -0.2472 ( 1.33e-37 ) -0.4013 ( 6.49e-100 ) -0.532 ( 4.51e-169 ) -0.7332 ( 5.29e-279 ) Fitness x Second -3.773 ( 0.00993 ) 0.3277 ( 0.0572 ) 0.8528 ( 7.43e-35 ) 1.143 ( 3.35e-147 ) 1.282 ( 1.51e-253 ) 1.43 ( 0 ) 1.678 ( 0 ) Table 4.4: This table shows the number of basins before and after the secondary analysis is performed, as well as the mean. Each cell in the table has two entries—the first represents the value for the initial treatment with flow only to mutational neighbors, and the second represents the value of the secondary treatment with flow to genotypes two mutational steps away. K 2 5 8 11 14 17 20 ProbGE-Points 302 / 53 6581 / 1190 28290 / 5995 64672 / 15422 110800 / 29011 157246 / 41099 179852 / 44384 ProbGE-Means 0.047 / 0.072 0.183 / 0.234 0.299 / 0.354 0.390 / 0.442 0.461 / 0.508 0.518 / 0.563 0.565 / 0.610 FlatGEPoints 302 / 53 6581 / 1190 28298 / 5995 64811 / 15422 111482 / 28986 157671 / 40989 179377 / 44217 FlatGEMeans 0.047 / 0.072 0.184 / 0.234 0.299 / 0.354 0.389 / 0.442 0.460 / 0.508 0.517 / 0.563 0.565 / 0.610 92 Figure 4.3 and Figure 4.4 show the single mutation and double mutation results for an NKp landscape of p = 0.2. Figure 4.5 and Figure 4.6 show the single mutation and double mutation results for a NKp landscape of p = 0.5. Figure 4.7 and Figure 4.8 show the single mutation and double mutation results for a NKp landscape of p = 0.8. For the p = 0.5 neutrality landscape, for the K=2 and the K=5 cases, there appear to have been several cases where the neutrality in the landscape did not successfully drain all the basins. This is visually apparent in the first four boxes of Figures 4.5 and 4.6. Several basins in the first analysis drop in proportion but do not disappear. The likely conclusion is simply that 1000 updates is not enough to allow trajectories to escape these regions. An alternative hypothesis might be that some of these are new points that did not appear previously due to being beneath the threshold. This phenomenon does appear to occur at a low level, reaching up to 15% of the points in the second order analysis in the K=20 case, where thresholding plays a larger role due to a higher incidence of peaks. However, at low K, this incidence was negligible, with, for example, two points for K=2 and no points for K=5 in the p=0.2 case, and two points each for K=2 and K=5 for the p = 0.5 case. The slow drainage of mass from certain basins thus appears to be a consequence of the neutrality in the landscape. 93 Figure 4.1: Log(basin proportion) as a function of fitness, p = 0.0, Single Mutations. Here density flows only to single mutant neighbors. 94 Figure 4.2: Log(basin proportion) as a function of fitness, p = 0.0, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. 95 Figure 4.3: Log(basin proportion) as a function of fitness, p = 0.2, Single Mutations. Here density flows only to single mutant neighbors. 96 Figure 4.4: Log(basin proportion) as a function of fitness, p = 0.2, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. 97 Figure 4.5: Log(Basin Proportion) as a function of fitness, p = 0.5, Single Mutations. Here density flows only to single mutant neighbors. 98 Figure 4.6: Log(Basin Proportion) as a function of fitness, p = 0.5, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. 99 Figure 4.7: Log(basin proportion) as a function of fitness, p=0.8, Single Mutations. Here density flows only to single mutant neighbors. 100 Figure 4.8: Log(basin proportion) as a function of fitness, p=0.8, Double Mutations. Here density flows to single mutant neighbors as well as double mutants after 1000 updates. 101 I have shown, using a method similar to page rank, that basin size in NK landscapes is exponentially correlated with fitness. This finding is consistent with previous work such as [Ochoa et al., 2008]. I set up two treatments, ProbGE in which genotypes are linked to neighboring genotypes of equal or greater fitness and distribute mass proportionately to fitness, and FlatGE in which genotypes are linked to neighboring genotypes of equal or greater fitness and distribute mass evenly without regard to fitness. I demonstrated that these treatments have an effect on the size of basins; the FlatGE treatment increases the average size of basins while decreasing the maximum size of basins relative to the ProbGE treatment. This result was reminiscent of that obtained in [Ochoa et al., 2010], comparing the accumulation of best vs. first improvement walks. I also apply a secondary analysis where trajectories may escape local peaks by permitting them to move two mutational steps. The allowance of two mutant traversals has the effect of eliminating between 74% and 80% of peaks and increases the overall relationship between fitness and density of those that remained. This result shows that over evolutionary time, higher fitness basins tend to become even larger by cannibalizing nearby lower fitness basins. Finally, I examined the effects of neutrality. I detected the presence of neutral networks that retain mass over large periods of evolutionary time, which functionally may act as sinks; even though there are exits from the network, they leak only slowly. One confounding factor with this work would be the effect of the threshold on the interpretation of the data. Even in the K=20 case, only about 3.5% of the total genotypes appeared in the peaks data. It may be that the threshold was not low enough to capture all of the peaks in the landscape. This thresholding could result in very small basins being ignored for this analysis. This would likely skew the data reported in two ways: first, these basins would be of lower fitness and would not drain properly into nearby basins since the double mutant step was applied only above the threshold. If this skew does occur, when corrected, would only increase the reported effect of the secondary analysis of concentrating mass in larger 102 basins. In this section, I have presented a novel method for evaluating fitness landscapes using the Basin Flow Algorithm and I demonstrated its usage and applicability on a simple NK landscape. This method produces comparable results to other methods of assessing basin size and is in line with previous work on NK landscapes. It also has the advantage of not being biased by starting location and provides a snapshot of ’true’ relationships that might occur via transient dynamics. 4.4 Basins in Landscapes In the previous chapter, I had explored complete landscapes of up to length 18 for the NK, Avida, and RNA landscapes. I sought to investigate the transient dynamics in these landscapes to see how trajectories accumulate in progressively higher peaks over time. Additionally, I was interested in measuring the amount of overlap between trajectories to get a sense for the size of basins in these landscapes. I measured the size of fitness basins obtained by evolution by using a genetic algorithm to sample evolutionary trajectories I used a population size of 25 and 0.5 expected mutations on every division and sampled 100,000 runs, each originating from a different random genotype. Genotypes were sampled proportionately to fitness. Every update, in addition to mutation and selection steps, I record the genotype and fitness of the most common genome in the population, or ‘dominant’. I found that a population size of 25 was not sufficient to reliably overpower selection strength in the NK and RNA landscapes, which resulted in trajectories not accumulating in basins, as it was relatively difficult for them to stay in a peak. Therefore, I increased the strength of selection by applying the exponential transformation f (x) = 2x for the fitnesses in the RNA and NK landscapes to make these landscapes more readily comparable with the exponential Avida landscape and examine the basin distribution. This treatment preserves 103 the relative relationships of points—namely, peaks remain peaks, but the strength of selection is drastically increased. In this section, I will present both the untransformed and exponential for the NK and RNA landscapes. 4.4.1 Trajectories Over Time I looked at the 100,000 collected trajectories at five update intervals: 1, 10, 100, 1000, and 10000 to see how the fitness of trajectories shifts over time in each of the three landscapes. 4.4.1.1 NK NK basins over time can be observed in Figure 4.9. All of the distributions have a long tail—some trajectories continue to be trapped in local minima even 10,000 generations in. The resulting distributions appear to be normal or nearly-normal at any time, and as time elapses, the mean shifts upwards. 104 Figure 4.9: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the NK landscape at each of five exponential time points, as a measure of adaptation in basins over time. 105 Figure 4.10: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the NK Exponential landscape at each of five exponential time points, as a measure of adaptation in basins over time. 106 4.4.1.2 Avida In order to thoroughly test evolutionary trajectories in Avida I used two distinct evolutionary mechanisms: the same GA that I used for NK landscapes and the Avida digital evolution platform itself. The Avida system features continuous reproduction similar to a Moran model with overlapping generations, as opposed to the synchronized generations used in the GA. The two methods both produced strikingly similar results, and so for the purpose of this text, I present the Genetic Algorithm implementation for consistency with the other landscapes. The results can be seen in Figure 4.11. Again, the distribution shifts clearly towards peaks with higher fitness as more updates elapse. The tiered availability of phenotypes is also highly visible, in contrast to the continuous nature of the other landscapes. This is also visible in the Avida landscape summary in Figure 3.9 in Chapter 3. 107 Figure 4.11: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the Avida landscape at each of five exponential time points, as a measure of adaptation in basins over time. 108 Figure 4.12: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the RNA landscape at each of five exponential time points, as a measure of adaptation in basins over time. 4.4.1.3 RNA RNA basins over time can be observed in Figure 4.12. The basin fitness seems to stop improving between updates 1,000 and 10,000. 109 Figure 4.13: Distribution of fitnesses in evolutionary trajectories starting from 100,000 random points in the RNA Exponential landscape at each of five exponential time points, as a measure of adaptation in basins over time. 110 Figure 4.14: NK, 100,000 random points, basin size after 10,000 updates. This plot, as exciting as it appears, is not an accident. There are simply 100,000 unique basins as measured by the endpoints of the trajectories. 4.4.2 Basin Size The next question I investigated was the basin size after 10,000 updates. The final genotype of the 100,000 trajectories after 10,000 updates serves as a proxy for basin size. This analysis gives us an idea about the distribution of the sizes of the basins of long term evolutionary attractors. 111 Figure 4.15: NK Exponential, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. 112 Figure 4.16: Avida, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. 113 Figure 4.17: RNA, 100,000 random points, basin size after 10,000 updates. Similar to the NK landscape, most points are not accumulating in basins; over 98,000 trajectories ended in unique genotypes, with the rest accumulating in basins of size no more than four. 114 Figure 4.18: RNA Exponential, 100,000 random points, basin size after 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. 115 4.4.3 Basin Size Relation With Fitness Previous work on the NK landscape including that in 4.3 has shown an exponential relationship between basin fitness and size. We seek to investigate whether this relationship holds in larger landscapes when evolution is the main determinant of basins. Here I use the trajectory count as a proxy for basin size. 116 Figure 4.19: NK, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Due to weak selection strength, there is not much relationship between the fitness and basin size. 117 Figure 4.20: NK Exponential, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. 118 Figure 4.21: Avida, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Avida seems to have a positive relationship with fitness but the fit is very poor. 119 Figure 4.22: RNA, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. Due to weak selection strength, there is not much relationship between the fitness and basin size. 120 Figure 4.23: RNA Exponential, 100,000 random points, basin size vs basin fitness at 10,000 updates. The colored hexes represent binnings of points and are used to delineate areas of high density, where many points may be stacked. The red line and equation represent the exponential curve fit. 121 Figure 4.24: NK, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. 4.4.4 Fitness Over Time This section shows the fitness over time of 100,000 trajectories starting at random points in each of the environments. These are useful mostly to visualize the adaptation of the trajectories and understand on what time scales the trajectories experience improvement. Not surprisingly, fitness increases over time, but in the RNA and NK environments, it levels off and stops improving noticeably, whereas in the Avida, RNA Exponential, and NK Exponential environments, the period during which fitness increases in aggregate is larger. 122 Figure 4.25: NK Exponential, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. 123 Figure 4.26: Avida, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. 124 Figure 4.27: RNA, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. 125 Figure 4.28: RNA, fitness for trajectories from 100,000 starting random points, fitness over 10,000 updates. In gray is the area between the 5th and 95th quantiles. 126 Figure 4.29: NK, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. 4.4.5 Distance Traveled From Origin This section shows distance traveled from the original genotype as time elapses. I measure this in order to understand how far points travel on average to their final destinations relative to their origin in the landscape. This is a proxy method for measuring the size and reach of basins. 127 Figure 4.30: NK Exponential, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. 128 Figure 4.31: Avida, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. 129 Figure 4.32: RNA, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. 130 Figure 4.33: RNA Exponential, distance from origin for trajectories from 100,000 starting random points. In gray is the area between the 5th and 95th quantiles. 131 4.4.6 Peaks & Basins In this section, I look at the proportion of each trajectory at each timestep that is a ‘peak’ as identified in the fitness landscapes in the last section. Note that for the exponential treatments, the peaks remain the same since the exponential treatment alters the strength of relative signal, but not the fitness valence relationship between neighboring points. 132 Figure 4.34: NK, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. 133 Figure 4.35: NK Exponential, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. 134 Figure 4.36: Avida, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. 135 Figure 4.37: RNA, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. 136 Figure 4.38: RNA Exponential, proportion of trajectories from 100,000 starting random points currently in genotypes identified as peaks. 137 4.4.7 Discussion I have presented an investigation of basin size in the three model landscapes introduced in Chapter 3—the NK landscape, the Avida landscape, and the RNA landscape. For the RNA and the NK landscapes, the selection strength was insufficient to obtain a meaningful understanding of the landscape with a population size of 25. I therefore chose to increase the selection strength by remapping them with the fitness transformation x = 2x . This simple transformation preserves the peak structure in the landscape and the relationship between nearby points and makes it easier for small populations to climb gradients and stay on peaks. I present both this as well as 100,000 trajectories on the original untransformed landscape. I did not perform this treatment for the Avida landscape since it already has an exponential form, and a snapshot of basin structure successfully emerged from my sampling of evolutionary trajectories. All three landscapes exhibited increased fitness over time in the distribution of samples, as might be expected. Early improvements along the line of descent allowed for rapid increases in fitness, but eventually the rate of improvement slows down, although it never stops over the measured interval. The distribution of the endpoints of each run is landscape specific—the sparsity of phenotypes is visible in the Avida landscape, the normality of the fitnesses of peaks in the NK landscape is highly apparent and seems to be preserved through time. The exponential RNA and NK landscapes show continued improvement in marked contrast to the stagnation visible after roughly Update 100 in their non-exponential RNA and NK cousins. This difference in behavior is due to the strength of selection. I focus on the exponential landscapes for the rest of the discussion, but it is worth noting that transforming the landscape exponentially improved the solution quality for this particular parameter set quite drastically. In evolutionary computation applications, small populations and low mutation rates may benefit strongly from increased selection strength via transformation, such as my exponential treatment, in that there may be better final solution quality, but this may also lead to adverse 138 effects such as being trapped in local minima. Next, I examined the frequency of each basin size. The NK and RNA landscapes both did poorly here—Figure 4.14 for instance shows the result that there were 100,000 basins of size one—i.e. every trajectory went its own way. The RNA landscape was very similar with over 98,000 basins of size one, and a handful of size two, three, and four basins. The exponential landscapes told a different story, however. In Avida, there were fewer than 10,000 unique basins of size one, which meant that the vast majority of measured basins had at least two trajectories and there was at least one with 1000 trajectories. Since these were unbiased random samples, these results indirectly point to genotypes in the landscape that have basins that can attract 1% of the genotypes on the landscape. Likewise in RNA and NK, the exponential treatments revealed similarly large attractors, with RNA having a point with about 1.5% of the trajectories and NK having a maximum point attracting roughly 6%. I measured the interaction of basin fitness and basin size. Previous related work has found an exponential relationship in NK landscapes between basin fitness and basin size. While it is debatable whether the exponential relationship is the most appropriate one—it does not seem to apply strongly to the Avida landscape, but does do reasonably well in the exponential NK and RNA landscapes. Nevertheless, the more general assertion that higher peaks have bigger basins seems to be true. Distance in all three landscapes was also of note. The average Hamming distance of a genotype to all other genotypes is 13.5, and the numbers in all three landscapes quickly approach that. In Avida, the mean distance from the origin at 10,000 updates was 12.565, with the 5% quantile being 3 and the 95% quantile at 19. In RNA Exponential, the mean distance was 13.135 after 10,000 updates, with the 5% quantile being 5 and the 95% quantile at 19. And finally in the NK Exponential, the mean distance was 12.805, with the 5% quantile at 3 and the 95% quantile at 18. These results seem to imply that, on average, the starting point in these landscapes has little relationship with the final destination—approaching average 139 distance over evolutionary time. Basins over evolutionary time are thus not the closest local optimum, as might be expected. One interpretation of these data is that a single genotype may go into several basins, with historical contingency determining which one is the final destination. This is the hypothesis I prefer since it does not impose other requirements on the structure of the environment. A competing hypothesis is that the basins themselves have long and narrow tendrils as opposed to being regular depressions with a certain radius. I also looked at the fraction of trajectories currently in neutral peaks over time, where neutral peaks are defined as points without one mutant neighbors that have higher fitness. Throughout the 10,000 updates, the evolutionary trajectories in aggregate spent a large proportion of time in peaks, regardless of landscape. All three landscapes experienced a large boost in the first hundred updates, while presumably the process of evolution was locating easily accessible innovations. After 100 updates, however, things tended to level off, and there were qualitative differences between the landscapes in long term percentages in peaks; in the NK Exponential landscape, the proportion in peaks slowly increased over time, in Avida, it stayed more or less static, and finally in the RNA Exponential landscape, it decreased. The RNA landscape’s behavior on this metric in particular is slightly puzzling—there may be other mechanisms at play such as survival of the flattest or it may be our definition of peaks that causes this unexpected slow decline over time. Nevertheless, it is clear that peaks as I defined them play an important role throughout evolutionary time as trajectories naturally find them and stay close to them when the selection strength is sufficient. 4.5 Conclusions In this chapter I presented two broad investigations into the nature of basins of attraction in fitness landscapes. In the first investigation, in Section 4.3, I presented a novel method, the Basin Flow Algorithm, for studying basins of attraction which was inspired by the Page Rank algorithm 140 used originally by Google for web page ranking. Previous attempts at measuring basin size had featured the use of trajectories and sampling to estimate the features of basins in NK landscapes with N=20 with varying choice of K. In line with previous work, I found that basin size was linked exponentially with fitness in NK landscapes, and that this relationship decreases with increasing K. I also tested increasing levels of neutrality using the NKp family of landscapes. I found large neutral networks that occasionally were slow to drain, but it was difficult to compare directly with the NK landscapes due to the differences in fitness range naturally induced by these landscapes. I also investigated the effects of double mutations and demonstrated how allowing basins to drain via double mutations—in other words, basins with lower fitness do disappear and are absorbed into those of higher fitness, as might be expected. In the second part of the chapter, I followed the three system approach introduced in Chapter 3 and sampled 100,000 evolutionary trajectories starting at random points. I measured properties of basins of attraction for each of the NK, Avida, and RNA landscapes. I treated the NK and RNA landscapes with exponentiation in order to increase selection strength. This treatment improved the solution quality in both landscapes and leads us to believe that similar sorts of treatments could be generally applicable in evolutionary optimization—namely, small population and a high mutation rate can be overcome by manipulating selection strength. Tournament selection, for instance, may work well precisely because it is not sensitive to the selection strength differential. I also found that basins are strongly linked to our conception of peaks in Chapter 3, and that over evolutionary time, trajectories starting from random points end up at a distance from the origin that is, on average, close to the 13.5 average genotype Hamming distance in all three landscapes. The high distance implies that basins for evolutionary purposes are not at all localized, even though it is often a much shorter distance to the nearest peak. Historical contingency may thus play the most important role in determining the ultimate fate of a 141 population trajectory. In the following chapter, I investigate the fate of trajectories starting at the same origin and shift focus from global structure to local structure. 142 Chapter 5 Visualization of Transient Dynamics In the preceding chapters, I have examined evolutionary dynamics across three distinct and widely used landscapes. Here, I will use visualization techniques to further compare these landscapes, both in terms of local structure and the short-term evolutionary dynamics induced by their structures. 5.1 5.1.1 Related Work Replaying the tape of life Steven Jay Gould in [Gould, 1989] famously posited the thought experiment of ‘replaying the tape of life’, arguing that historical contingency and chance played a large role in evolution potentially even surpassing that of adaptation. This classical thought experiment asks that if it were somehow possible to rewind time and start over, how would things be different? The tension between adaptationist, i.e. that selection would produce the same result, views and those similar to Gould, favoring random chance, marked one of the important philosophical questions in evolutionary biology. [Travisano et al., 1995] sought to better understand the relationship between historical 143 contingency, chance, and adaptation in a set of cleverly designed experiments. They took clones from a historical ancestor to seed experimental populations across a set of differing environments, before putting them back into the ancestral environment. At the conclusion, they were able to estimate using an ANOVA how much of the variance derived from history (the choice of ancestor), chance (differences given the same ancestor), and adaptation (the influence of a specific environment). Fitness was highly adaptive, as might be expected, but, another phenotypic metric, cell size, demonstrated strong effects of history and chance, since cell size is itself not adaptive. Related biological experiments following this experimental approach can be found as in dinoflagellates in [Flores-Moya et al., 2012]. [Wagenaar and Adami, 2004] performed similar experiments in Avida organisms, transplanting organisms from an environment they evolved in into a novel environment. In similar environments, they found a greater importance of shared history, whereas in more dissimilar environments, the effects of adaptation dominated the variance. [Kryazhimskiy et al., 2014] also conducted a hierarchical experiment in which they studied the effects of epistasis and historical contingency. They found broad fitness-level convergence while at the same time encountering a high level of randomness at the sequence level, with the surprising result that clones descended from the same founders were not more likely to share mutations than those not sharing the same founders. 5.1.2 Fitness Landscapes & Epistasis There have been a number of investigations on fitness landscapes with regard to epistasis; [Tufts et al., 2014] examines the mutational neighborhood of hemoglobin via site-directed mutation and found sign epistasis in the mutations required for increased oxygen affinity in high-altitude pikas. Similarly, [Podgornaia and Laub, 2015] mapped 160,000 variants of PhoQ in E. Coli to identify 1659 functional variants, demonstrating significant epistasis in the resulting landscape and postulating that this epistasis has severely limited the number of orthologs observed. Similarly, [Poelwijk et al., 2007] examined adaptation on the bacterial β144 lactamase to cefotaxime, to which they had not previously had exposure. In this case, the five mutations conferring antibiotic resistance were already known, but they were able to measure intermediates and found that it was necessary for mutational combinations to traverse a valley to reach the resistant solution. Similarly, the fitness landscape in E. Coli has been analyzed in [Beerenwinkel et al., 2007], which found a strong correlation between epistasis and the amount of fitness loss inflicted by deleterious mutations. Similarly, [Covert et al., 2013] demonstrated the importance of deleterious mutations to adaptive evolution in digital Avida landscapes by comparing the achieved fitness of populations with normal adaptive evolution against those where deleterious mutants were reverted. [Szendro et al., 2013] performed a computational analysis of trajectories on the experimentallymeasured 8-locus fitness landscape from the fungus Aspergillus niger. They defined entropy measures both on endpoints and on trajectories and found that the entropy of both of these decrease with mutational supply and that larger populations become more greedy. Similarly, [Handel and Rozen, 2009] found that larger populations evolve more deterministically and can become trapped on local fitness peaks, whereas smaller populations may be able to reach higher peaks because of their ability to traverse the landscape more randomly. [Lobkovsky et al., 2011] proposed several landscape characteristics and linked them to trajectories in 39 protein folding landscapes of size up to 12,969. [Otwinowski and Plotkin, 2014] illustrates the biases often present in methodologies involving inferring fitness landscape structure by sampling populations in both NK and RNA populations. These biases are especially prevalent in living organisms due to selection pressure. 5.2 Approach My work seeks to investigate evolutionary trajectories against the backdrop of the fitness landscapes explored in previous chapters. However, unlike much previous work, where often 145 such an undertaking would be prohibitively expensive or impossible, I seek to exhaustively enumerate the possibility space before examining real evolutionary trajectories. It has been pointed out as in [Kogenaru et al., 2009] that mutational analyses restricted to only one area of the genome may be compensated for in other areas and may be therefore irrelevant—an exhaustive exploration avoids this potential issue. In this work, I examine the mutational neighborhood of points out to five mutations out. There are 18 5 35 = 2, 082, 024 points at exactly a mutational distance of five, and each of these has 5! ways to achieve it, which yields 249,842,880 possible trajectories. Much previous work has focused on a particular gene complex and analyzed the fitness of paths by measuring each mutational combination along the way. My work differs in that I look at an unconstrained mutational neighborhood, exploring all possible combinations of steps, not just limited to specific target sequences. The first segment of this work features an exploration of the local structure of several sampled points, and the second part is results of repeated evolutionary trials to see what actually happens as a result of evolution. It is worth noting that the evolutionary runs are highly parameter dependent—the same fitness landscape may experience radically different trajectories based on population size, mutation rate, and selection mechanism. I visualize the potential trajectory space of 1,000 points for each of the three model landscapes introduced in Chapter 2—the NK landscape, the Avida landscape, and the RNA landscape. In order to accomplish this, for each sampled point, I calculated the fitness for all points within a mutational distance of five. I keep track of rings, defined as all points of the same Hamming distance from the source genotype—so for instance, all one-mutant neighbors is a ring, all two-mutant neighbors another, and so on. I also keep track of links between the origin genotype and one-step mutants, between one-step mutant and two-step mutants, and continuing up to the links between four-step mutants and five-step mutants. My data was far too complex to easily plot; each point, each generating 250 million 146 trajectories is too much data for a traditional plotting program to visualize. Therefore, I preprocessed the data to accomplish this visualization. I started with a numeric matrix, with mutational distance on the x axis and fitness on the y axis. I used 100 pixels between each mutational step for the x axis. For each starting point, I assign a starting point on the left of the graph, with the y position determined by the fitness. I then find the possible endpoints; there are 54 one-mutation steps to one-mutant neighbors. These are assigned an x position of 200, and a fitness according to the fitness of the corresponding genotype. I use Bresenham’s Line Algorithm, originally outlined in [Bresenham, 1987], to determine the pixels along a line and increase the density in the matrix for each of these. I then repeat the process linking one mutation neighbors with each two mutation neighbor it may link to, and so on until I finally connect the four and five-mutant points. After this process is completed, I renormalize between each step before visualizing the data via a density heatmap. This technique is used to produce the visualizations of both the local landscapes and trajectories out to five mutational steps. 5.3 Local Landscapes—Random Points I show the visualization for 1,000 random points selected from each of these landscapes in this section, using the MLV method described above. 147 Figure 5.1: NK Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to random starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. 148 Figure 5.2: NK Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. Since this is the NK landscape, there is very little neutrality in this particular landscape. 149 Figure 5.3: NK Random Starting Points, Landscape Visualization. This shows 1,000 NK landscapes starting at random points. The normality of the fitness distribution of the NK landscape is clearly visible and rare mutations in both directions appear at five mutations out. 150 Figure 5.4: Avida Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to randomly-selected starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. 151 Figure 5.5: Avida Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. 152 Figure 5.6: Avida Random Starting Points, Landscape Visualization. This shows 1,000 Avida landscapes starting at random points. Rare high fitness states are revealed at a distance of five mutations out. 153 Figure 5.7: RNA Random Starting Points, counts of beneficial/neutral/deleterious mutations relative to randomly-selected starting genotype. Note that beneficial and deleterious mutations nearly overlap so only two lines appear. This overlap is because choosing a large enough sample of random points in the landscape should make beneficial and deleterious mutations nearly perfectly symmetric. At zero distance, the point itself is classified as neutral. 154 Figure 5.8: RNA Random Starting Points, Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. 155 Figure 5.9: RNA Random Starting Points, Landscape Visualization. This shows 1,000 RNA landscapes starting at random points. Rare high fitness states are revealed at a distance of five mutations out. 156 5.4 Local Landscapes—Peaks Next, I apply the same treatment to peaks. I select my subset of peaks to study by first sampling a random point, and then hill climb to a peak. This hill climbing is accomplished by iteratively examining the one-mutation neighbors and taking the most fit among them. Up to one hundred neutral steps are allowed, after which only positive steps are allowed, until none are available. By the end of this process, the current genotype is guaranteed to be a ‘neutral peak’ as defined earlier in Chapter 3. 157 Figure 5.10: NK Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. 158 Figure 5.11: NK Peaks, Path Counts Simplex. This is a summary of all the possible five-mutation trajectories starting at each of 1,000 sampled peaks. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. 159 Figure 5.12: NK Peaks, Landscape Visualization. This shows 1,000 NK landscapes starting at peaks. A very few paths leading to higher fitness peaks can be seen. 160 Figure 5.13: Avida Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. 161 Figure 5.14: Avida Peaks Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. 162 Figure 5.15: Avida Peaks Landscape Visualization. This shows 1,000 Avida landscapes starting at peaks. 163 Figure 5.16: RNA Peaks, counts of beneficial/neutral/deleterious mutations relative to starting genotype for each distance. At zero distance, the point itself is classified as neutral. 164 Figure 5.17: RNA Peaks Path Counts Simplex. This diagram is a summary of all of the possible five-mutation trajectories. To interpret the meaning of a point on this graph, go left to find the number of neutral mutations, up and right to find the number of beneficial mutations, and down and right to get the number of deleterious mutations. 165 Figure 5.18: RNA Peaks Landscape Visualization. This shows 1,000 RNA landscapes starting at peaks. 166 5.5 Repeated Trajectories—Random Points Visualization For each of the 1,000 points previously landscaped, I now began a further experiment to investigate what course evolution actually takes. To this end, I sample, 10,000 trajectories from each starting point for 100 generations and look at where the trajectories leave the five mutation radius. I examine the line of descent of the dominant genotype after 100 generations, and trace the unique genotypes along the line. I stop at five mutations out, for effective comparison against the complete landscapes. To enforce this five-mutation limit, I discard trajectories that skip a five distance mutant in the lineage through the simultaneous acquisition of two or more mutations. The matrices are arranged with mutational distance on the x axis and fitness on the y axis. For each mutational step along the line of descent, I draw a line connecting the origin fitness and the origin distance with the destination line and distance. I start with a numeric matrix of zeros, and once again I use Bresenham’s Line Algorithm to determine the pixels along the line and increase the density in the matrix. I normalize the entire data set at once by the number of trajectories, unlike in the landscape analysis where it was necessary to normalize between steps. I show the visualization of trajectories for 1000 points for each of the landscapes. 167 Figure 5.19: NK Random Points Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 168 Figure 5.20: NK Random Points Exponential Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 169 Figure 5.21: Avida Random Points Evolutionary Trajectory Visualization. This shows 1,000 Avida collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 170 Figure 5.22: RNA Random Points Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 171 Figure 5.23: RNA Random Points Exponential Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 172 5.6 Repeated Trajectories—Peaks Visualization I sample trajectories in a similar method to the previous section, with the difference being that these trajectories begin at the same peak points also used earlier in the landscaping. Figure 5.24: NK Peaks Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 173 Figure 5.25: NK Peaks Exponential Evolutionary Trajectory Visualization. This shows 1,000 NK collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 174 Figure 5.26: Avida Peaks Evolutionary Trajectory Visualization. This shows 1,000 Avida collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 175 Figure 5.27: RNA Peaks Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 176 Figure 5.28: RNA Peaks Exponential Evolutionary Trajectory Visualization. This shows 1,000 RNA collections of up to 10,000 evolutionary trajectories each, starting at the same points as the corresponding landscape 177 5.7 Repeated Trajectories Random Points—The fate of trajectories I examine the differences between the evolutionary trajectories obtained and the possibilities in the landscape, within a five-mutation threshold. To do this, I examined for each mutational step, the fitness of all points in the landscape at that mutational step. For each step, I record the median, minimum, maximum, 5th quantile, and the 95th quantile in the landscape. I combine the data from the 1,000 points by taking the median of each of these data points, yielding the median-median, median-minimum, median-maximum, median-5th quantile, and median-95th quantile. I similarly classify trajectories into points by distance, up to five, and do a similar collection—first recording the same five statistics for individual trajectories, and then combining the results by taking medians. 178 Figure 5.29: NK Random Points Landscape and Trajectory Fitness. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 179 Figure 5.30: NK Random Points Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. Here, selection strength results in a stronger improvement bias in the trajectories relative to the NK Points Trajectory Visualization. 180 Figure 5.31: Avida Random Points Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 181 Figure 5.32: RNA Random Points Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 182 Figure 5.33: RNA Random Points Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. Here, the higher selection strength results in a stronger improvement bias in the trajectories relative to the RNA Points Trajectory Visualization. 183 5.8 Repeated Trajectories Peaks—The fate of trajectories I repeat the trajectory analysis, but this time for the analyses starting at peaks. Once again, these graphs depict the possibility space compared to what the evolutionary trajectories actually achieve. 184 Figure 5.34: NK Peaks Landscape and Trajectory Fitness. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 185 Figure 5.35: NK Peaks Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 186 Figure 5.36: Avida Peaks Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 187 Figure 5.37: RNA Peaks Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 188 Figure 5.38: RNA Peaks Exponential Trajectory Visualization. The landscape is in blue and the trajectories are in green. The dotted lines are the median-minimum and median-maximum and the shaded area for each is the region between the median-5th and median-95th quantiles. The median-median is solid. 189 5.9 Discussion In this chapter, I examined the local dynamics of two datasets—random points and peaks for each of the three model landscapes and two exponential variants with stronger selection. I provide a technique for exhaustive landscape and trajectory visualization via the use of Bresenham’s Line algorithm, which allows us to visualize these complex multidimensional landscapes in a new way, which may be useful for building intuition about complex landscapes. Features such as the normal distribution of fitnesses of the NK landscape in Figure 5.1 stand in stark relief to the Avida landscape in Figure 5.4 and the RNA landscape in Figure 5.7, both of which feature a preponderance of low-fitness genotypes in the randomly selected points. This result is not surprising—as seen in Chapter 3 and previous work that the most common fitnesses in both Avida and RNA are extremely low, whereas in the NK landscape, the most common fitnesses are also the median fitness (extremely low fitnesses in the NK landscape are just as rare as extremely high fitnesses). These visualizations also show that most potential trajectories emanating from random points in the Avida and RNA landscapes lead to equally low fitness, whereas in the NK landscape, the most common potential trajectories lead from median fitness to median fitness. The main differences among these random-startingpoint potential trajectories are obvious when the more extreme cases are examined; the NK landscape contains equally rare potential trajectories to higher paths and lower paths. Despite its prevalent use in studying epistasis due to the easy tuning parameter in K, this result suggests that the NK landscape may not be the best choice to represent biological systems. This problem is most severe when trying to represent organisms with phenotypic complexity of any sort, simply because in such landscapes, most symbol sequences are of low fitness by chance. The degree of neutrality also distinguished the potential trajectory classifications among the three landscapes. The most common potential trajectory classification in the Avida landscape was three neutral, one beneficial, and one deleterious step. In contrast, as neutrality 190 was rare in the NK and RNA landscapes, three beneficial and two deleterious and two beneficial and three deleterious step pathways were about equally likely. Since these are random points, and we would expect, a priori, no particular bias in either direction. This prediction is borne out by the virtually identical beneficial and deleterious counts in Figure 5.1, Figure 5.4, and Figure 5.7. Starting from peaks the potential trajectories in all three landscapes revert to the mean. For the NK landscape, this reversion is to the expected average fitness of 9, and for the RNA and Avida landscapes, this reversion is towards the lowest fitness values. There are an extremely small number of pathways to equal or higher fitness in each of the three studied landscapes. From peaks, the most common potential trajectory type in the NK landscape was five deleterious mutations, followed by four deleterious and one beneficial (again, neutral mutations are unlikely). In Avida, the most common characterization was three deleterious, one neutral, and one beneficial, followed by four deleterious and one beneficial. Finally, in RNA, the most common potential trajectory configuration was four deleterious and one beneficial, followed by five deleterious. Next, I examined 10,000 evolutionary trajectories from random starting points in each of the three aforementioned landscapes, as well as the exponential variants for the NK and RNA landscapes as in Chapter 4. The results are visually striking and often distinguishable from the full range of potential trajectories possible in each landscape. For the Avida and RNA trajectories, the paths upward driven by evolution are clearly visible. This result contrasts with the strong horizontal bias in the landscape visualizations. In evolutionary trajectories on NK landscapes, the selective pressures skews the normality of the potential trajectories in a distinctly upwards trajectory. This pattern is more strongly biased in the NK Exponential and RNA Exponential as compared to the NK and RNA regimes, respectively. Next, I generated an equivalent set of 10,000 evolutionary trajectories, this time starting from peaks instead of random points. The weak selection in the NK and RNA landscapes 191 is clearly evident as the fitness actually appears to decline over time in Figure 5.19 and Figure 5.22 in the trajectory graphs. This effect largely disappears in the exponential variants (Figure 5.25 and Figure 5.28). The Avida landscape, in contrast, has strong horizontal lines with a slight bias upwards. As should be expected, the trajectory visualizations are consistent with the combined landscape-and-trajectory summaries; for instance in Figure 5.34; the NK low selection regime trajectories do decline in fitness, whereas the NK exponential in Figure 5.35 does significantly better—performing above the 95th quantile of the landscape. In general, also consistent with expectations, evolution was favoring the top half of trajectories. Even with weak selection, evolution seems to find improvements for random points, outperforming the landscape average. However, in the case of peaks, strong selection was needed to avoid degradation of trajectories along with the landscape. In summary, I have presented new visualization techniques which provide insight into the nature and structure of different landscapes and how evolutionary trajectories actually navigate them. 192 Chapter 6 Conclusions Populations evolve in predictable ways because of common descent and the nature of fitness landscapes. As populations adapt, they accumulate information about the fitness landscape which they inhabit. In this work, I have used computational techniques to investigate the process of evolution by studying the acquisition of information, the structure of fitness landscapes, and the transient dynamics of evolving populations on these landscapes. In Chapter 2, I established that error from common descent in complexity estimates can be sizable, present a correction to compensate for its effects, and test the correction empirically. I showed that the correction performed well in the presence of low mutation rates, but once mutation rates increased past a per-site rate of 0.01 it started to do poorly, likely because of the underlying assumption that informative sites have no entropy. This assumption breaks as the mutation rate increases and it becomes increasingly hard for a population to stay on a peak. A natural future direction for this work is to further refine the model so that it is less sensitive to this condition, but the real prize would be in a closed form characterization of the magnitude of this effect. A further important step would be to test this technique in other systems as a prelude to demonstrating its utility in real biological systems. In Chapter 3, I investigated instantiations of large NK, reduced-instruction Avida, and 193 RNA fitness landscapes of 68 billion genotypes each and studied the landscape structure, specifically peak structure and distribution. In all three landscapes, I found that peaks of higher fitness tended to have higher peaks in their neighborhood. I also found that peaks are connected in the NK and Avida landscapes but are not in the RNA landscape. Finally, I found that while peak fitness is negatively autocorrelated at high distance in NK and Avida landscapes, peaks in the RNA landscape at maximum distance are positively correlated. This result may point to the fact that the complement of some stable RNA sequences may be disproportionately likely to be stable themselves. In the future, I would like to extend my analyses to more than just peak structure—other structural properties such as local roughness have been proposed [Lobkovsky et al., 2011]. Also, a non-binary definition of peak may be useful for structural understanding especially when there is substantial neutrality or near-neutrality. In Chapter 4, I studied the concept of basins of attraction. First, I presented a novel structural method of measuring basins based on the Page Rank algorithm, which I applied to NK landscapes. With this method, I confirmed previous findings that basin size increases exponentially with fitness, and that as the K parameter increases, the average basin size increases while the largest basins decrease. I also examined NKp variants of NK landscapes with neutrality and detected neutral networks that can retain mass over time, and found that linking double mutants with my method increased the strength of the exponential relationship between fitness and basin size. In the second section, I experimentally derived basins by using 100,000 evolutionary runs starting at random points on each landscape. I again found an exponential correlation between the peak fitness and basin size in both the NK and RNA landscapes, but not in the Avida landscape. I also demonstrated that in all three landscapes, the distance from the origin point approached the average distance of 13.5 between all pairs of points in the landscape, which suggests either overlapping basins or that basins may have long and narrow tendrils. Further investigation is needed to tease this particular distinction apart. Finally, I showed that in all three model landscapes, evolutionary trajectories spend 194 a disproportionate amount of their time in peaks. For this work, it is also natural to want to adjust the evolutionary parameters of population size and mutation rate and understand how that affects landscape traversal, the accumulation in basins over time, and final fitness, which would have important consequences for both theoretical and applied work. In Chapter 5, I looked at the local dynamics of fitness landscapes. Starting from both random points and peaks, I exhaustively studied mutants up to five steps out from each source genotype and then examined the results of 10,000 evolutionary runs originating from each studied genotype. I provided a novel way to visualize both the roughly 250 million potential trajectories out to length five, for each point in each treatment, and also applied this visualization to evolutionary trajectories. There are many future directions for this particular work; in the work presented here, I have focused on techniques and visualization, but the next logical step would be to investigate epistasis and its relationship to the trajectories realized by evolution. A further goal is to gain a more in-depth understanding about how these different landscapes are both similar and different, and how sampling and full-structural analyses might agree and disagree. There are likely some landscapes that defy exhaustive analysis, or which are prohibitively expensive to test; so a broader understanding of how these might contribute to mapping the global landscape would be useful. Finally, I hope to understand more clearly how evolutionary pathways traverse from peak to peak. Peaks play an important and consistent role in evolution through time: in Chapter 4, I saw that fitness increases over time, which means that evolution must be finding pathways between successively higher peaks. To accomplish this work, I have relied heavily on computational techniques to produce exhaustive analyses of the structure and consequences of fitness landscapes. As computational power grows it should be possible to map increasingly large and realistic fitness landscapes and provide more insight into the nature and workings of these landscapes and their relevance for evolution. 195 BIBLIOGRAPHY 196 BIBLIOGRAPHY [Adami, 2002] Adami, C. (2002). What is complexity? Bioessays, 24(12):1085–1094. [Adami, 2004] Adami, C. (2004). Information theory in molecular biology. Physics of Life Reviews, 1:3. [Adami et al., 1995] Adami, C., Brown, C. T., and Haggerty, M. R. (1995). Abundancedistributions in artificial life and stochastic models: ”age and area” revisited. In Morn, F., Moreno, A., Guervs, J. J. M., and Chacn, P., editors, ECAL, volume 929 of Lecture Notes in Computer Science, pages 503–514. Springer. [Adami and Cerf, 2000] Adami, C. and Cerf, N. (2000). Physical complexity of symbolic sequences. Physica D: Nonlinear Phenomena, 137(12):62 – 69. [Adami et al., 2000] Adami, C., Ofria, C., and Collier, T. C. (2000). Evolution of biological complexity. In Proceedings of the National Academy of Sciences, pages 4463–4468. [Balding, 2006] Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nat Rev Genet, 7(10):781–791. [Barnett, 1998] Barnett, L. (1998). Ruggedness and neutrality - the nkp family of fitness landscapes. In Alive VI: Sixth International Conference on Articial Life, pages 18–27. MIT Press. [Bedau and Brown, 1999] Bedau, M. A. and Brown, C. T. (1999). Visualizing evolutionary activity of genotypes. Artificial Life, 5(1):17–35. [Beerenwinkel et al., 2007] Beerenwinkel, N., Pachter, L., Sturmfels, B., Elena, S., and Lenski, R. (2007). Analysis of epistatic interactions and fitness landscapes using a new geometric approach. BMC Evolutionary Biology, 7(1):60. [Bickel et al., 1996] Bickel, P. J., Cosman, P. C., Olshen, R. A., Spector, P. C., Rodrigo, A. G., and Mullins, J. I. (1996). Covariability of V3 Loop Amino Acids. AIDS Research and Human Retroviruses, 12(15):1401–1411. [Bresenham, 1987] Bresenham, J. (1987). Ambiguities in incremental line rastering. Computer Graphics and Applications, IEEE, 7(5):31–43. [Chao and Shen, 2003] Chao, A. and Shen, T.-J. (2003). Nonparametric estimation of shannon’s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10:429–443. [Covert et al., 2013] Covert, A. W., Lenski, R. E., Wilke, C. O., and Ofria, C. (2013). Ex197 periments on the role of deleterious mutations as stepping stones in adaptive evolution. Proceedings of the National Academy of Sciences, 110(34):E3171–E3178. [Derrida and Bessis, 1988] Derrida, B. and Bessis, D. (1988). Statistical properties of valleys in the annealed random map model. Journal of Physics A: Mathematical and General, 21(9):L509. [Derrida and Peliti, 1991] Derrida, B. and Peliti, L. (1991). Evolution in a flat fitness landscape. Bulletin of Mathematical Biology, 53:355–382. 10.1007/BF02460723. [Desai et al., 2007] Desai, M., Fisher, D., and Murray, A. (2007). The speed of evolution and maintenance of variation in asexual populations. Current Biology, 17:385–394. [Dudk et al., 2005] Dudk, M., Schapire, R. E., and Phillips, S. J. (2005). Correcting sample selection bias in maximum entropy density estimation. In In Advances in Neural Information Processing Systems. [Durrett, 2008] Durrett, R. (2008). Probability Models for DNA Sequence Evolution. Probability and Its Applications. Springer. [Eigen, 1971] Eigen, M. (1971). Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58:465–523. 10.1007/BF00623322. [Eigen et al., 1988] Eigen, M., McCaskill, J., and Schuster, P. (1988). Molecular quasi-species. The Journal of Physical Chemistry, 92(24):6881–6891. [Eigen et al., 1989] Eigen, M., McCaskill, J., and Schuster, P. (1989). The Molecular QuasiSpecies, pages 149–263. John Wiley & Sons, Inc. [Eigen and Schuster, 1977] Eigen, M. and Schuster, P. (1977). The Hypercycle: a Principle of Natural Self-Organisation, Part A. Naturwissenschaften, 64(11):541–565. [Ewens, 1972] Ewens, W. (1972). The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3(1):87 – 112. [Ezkurdia et al., 2014] Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., Vazquez, J., Valencia, A., and Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics. [Fisher, 1930] Fisher, R. (1930). The genetical theory of natural selection. Clarendon Press, Oxford. [Flores-Moya et al., 2012] Flores-Moya, A., Rouco, M., Garca-Snchez, M. J., Garca-Balboa, C., Gonzlez, R., Costas, E., and Lpez-Rodas, V. (2012). Effects of adaptation, chance, and 198 history on the evolution of the toxic dinoflagellate alexandrium minutum under selection of increased temperature and acidification. Ecology and Evolution, 2(6):1251–1259. [Fontana et al., 1993] Fontana, W., Stadler, P. F., Bornberg-Bauer, E. G., Griesmacher, T., Hofacker, I. L., Tacker, M., Tarazona, P., Weinberger, E. D., and Schuster, P. (1993). Rna folding and combinatory landscapes. Phys. Rev. E, 47:2083–2099. [Fu, 1995] Fu, Y. X. (1995). Statistical Properties of Segregating Sites. Theoretical Population Biology, 48(2):172–197. [Fu and Li, 1993] Fu, Y. X. and Li, W. H. (1993). Statistical tests of neutrality of mutations. Genetics, 133(3):693–709. [Gavrilets, 2003] Gavrilets, S. (2003). Perspective: Models of speciation: What have we learned in 40 years? Evolution, 57(10):pp. 2197–2215. [Gavrilets, 2004] Gavrilets, S. (2004). Fitness Landscapes and the Origin of Species. Princeton University Press. [Gerrish and Lenski, 1998] Gerrish, P. and Lenski, R. (1998). The fate of competing beneficial mutations in an asexual population. Genetica, 102-103:127–144. [Gould, 1989] Gould, S. J. (1989). Wonderful Life: The Burgess Shale and the Nature of History. W. W. Norton and Company, New York. [Handel and Rozen, 2009] Handel, A. and Rozen, D. (2009). The impact of population size on the evolution of asexual microbes on smooth versus rugged fitness landscapes. BMC Evolutionary Biology, 9(1):236. [Hartl and Clark, 2006] Hartl, D. L. and Clark, A. G. (2006). Principles of Population Genetics, Fourth Edition. Sinauer Associates, Inc., 4th edition. [Hausser and Strimmer, 2009] Hausser, J. and Strimmer, K. (2009). Entropy inference and the james-stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res., 10:1469–1484. [Hazen et al., 2007] Hazen, R. M., Griffin, P. L., Carothers, J. M., and Szostak, J. W. (2007). Functional information and the emergence of biocomplexity. Proceedings of the National Academy of Sciences, 104(Suppl 1):8574–8581. [Huang et al., 2004] Huang, W., C., O., and Torng, E. (2004). Measuring biological complexity in digital organisms. In Artificial Life IX. [Kauffman, 1993] Kauffman, S. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, Oxford. 199 [Kauffman and Levin, 1987] Kauffman, S. and Levin, S. (1987). Towards a general theory of adaptive walks on rugged landscapes. Journal of Theoretical Biology, 128(1):11 – 45. [Kimura, 1968] Kimura, M. (1968). 217(129):624–6. Evolutionary rate at the molecular level. Nature, [Kogenaru et al., 2009] Kogenaru, M., de Vos, M. G. J., and Tans, S. J. (2009). Revealing evolutionary pathways by fitness landscape reconstruction. Critical Reviews in Biochemistry and Molecular Biology, 44(4):169–174. PMID: 19552615. [Korber et al., 1993] Korber, B. T., Farber, R. M., Wolpert, D. H., and Lapedes, A. S. (1993). Covariation of mutations in the v3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proceedings of the National Academy of Sciences, 90(15):7176–7180. [Kryazhimskiy et al., 2014] Kryazhimskiy, S., Rice, D. P., Jerison, E. R., and Desai, M. M. (2014). Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science, 344(6191):1519–1522. [Kryazhimskiy et al., 2009] Kryazhimskiy, S., Tkacik, G., and Plotkin, J. B. (2009). The dynamics of adaptation on correlated fitness landscapes. Proceedings of the National Academy of Sciences, 106(44):18638–18643. [Lenski et al., 2003] Lenski, R., Ofria, C., Pennock, R., and Adami, C. (2003). The evolutionary origin of complex features. Nature, 423(6936):139–144. [Lenski et al., 1999] Lenski, R. E., Ofria, C., Collier, T. C., and Adami, C. (1999). Genome complexity, robustness and genetic interactions in digital organisms. Nature, 400(6745):661– 4. [Lobkovsky et al., 2011] Lobkovsky, A. E., Wolf, Y. I., and Koonin, E. V. (2011). Predictability of evolutionary trajectories in fitness landscapes. PLoS Comput Biol, 7(12):e1002302. [Lorenz et al., 2011] Lorenz, R., Bernhart, S., Honer zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P., and Hofacker, I. (2011). Viennarna package 2.0. Algorithms for Molecular Biology, 6(1):26. [Maniatis et al., 2002] Maniatis, N., Collins, A., Xu, C. F., Mccarthy, L. C., Hewett, D. R., Tapper, W., Ennis, S., Ke, X., and Morton, N. E. (2002). The first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci U S A, 99(4):2228–33. [Minka, 2000] Minka, T. P. (2000). Bayesian inference, entropy, and the multinomial distribution. 200 [Nemenman et al., 2002] Nemenman, I., Shafee, F., and Bialek, W. (2002). Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14. MIT Press. [Ochoa et al., 2008] Ochoa, G., Tomassini, M., V´erel, S., and Darabos, C. (2008). A study of nk landscapes’ basins and local optima networks. CoRR, abs/0810.3484. [Ochoa et al., 2010] Ochoa, G., V´erel, S., and Tomassini, M. (2010). First-improvement vs. best-improvement local optima networks of nk landscapes. In PPSN (1), pages 104–113. [Ofria, 2004] Ofria, C. (2004). Avida: A software platform for research in computational evolutionary biology. Artificial Life, 10:10–191. [Østman et al., 2010] Østman, B., Hintze, A., and Adami, C. (2010). Critical properties of complex fitness landscapes. In Proc. 12th Intern. Conference on Artificial Life. [Otwinowski and Plotkin, 2014] Otwinowski, J. and Plotkin, J. B. (2014). Inferring fitness landscapes by regression produces biased estimates of epistasis. Proceedings of the National Academy of Sciences, 111(22):E2301–E2309. [Page et al., 1999] Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120. [Paninski, 2003] Paninski, L. (2003). Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253. [Podgornaia and Laub, 2015] Podgornaia, A. I. and Laub, M. T. (2015). Pervasive degeneracy and epistasis in a protein-protein interface. Science, 347(6222):673–677. [Poelwijk et al., 2007] Poelwijk, F. J., Kiviet, D. J., Weinreich, D. M., and Tans, S. J. (2007). Empirical fitness landscapes reveal accessible evolutionary paths. Nature, 445(7126):383– 386. [Roest et al., 2000] Roest, C. H., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F., Saurin, W., and Weissenbach, J. (2000). Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet., 25(2):235–238. [Schneider, 2000] Schneider, T. D. (2000). Evolution of biological information. Nucleic Acids Research, 28(14):2794–2799. [Schneider et al., 1986] Schneider, T. D., Stormo, G. D., Gold, L., and Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188:415–431. [Sch¨ urmann and Grassberger, 2002] Sch¨ urmann, T. and Grassberger, P. (2002). Entropy 201 estimation of symbol sequences. Arxiv preprint cond-mat/0203436. [Service et al., 2001] Service, S. K., Ophoff, R. A., and Freimer, N. B. (2001). The genomewide distribution of background linkage disequilibrium in a population isolate. Hum Mol Genet, 10(5):545–51. [Shannon, 1948] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423. [St Onge et al., 2007] St Onge, R., Mani, R., Oh, J., Proctor, M., Fung, E., Davis, R., Nislow, C., Roth, F., and Giaever, G. (2007). Systematic pathway analysis using high-resolution fitness profiling of combinatorial gene deletions. Nat Genet, 39(2):199–206. [Stadler et al., 2001] Stadler, B. M., Stadler, P. F., Wagner, G. P., and Fontana, W. (2001). The topology of the possible: Formal spaces underlying patterns of evolutionary change. Journal of Theoretical Biology, 213(2):241 – 274. [Strelioff et al., 2010] Strelioff, C. C., Lenski, R. E., and Ofria, C. (2010). Evolutionary dynamics, epistatic interactions, and biological information. Journal of Theoretical Biology, 266(4):584 – 594. [Szendro et al., 2013] Szendro, I. G., Franke, J., de Visser, J. A. G. M., and Krug, J. (2013). Predictability of evolution depends nonmonotonically on population size. Proceedings of the National Academy of Sciences, 110(2):571–576. [Tajima, 1989] Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA Polymorphism. Genetics Society of America, 123:585–595. [Tomassini et al., 2008] Tomassini, M., V´erel, S., and Ochoa, G. (2008). Complex-network analysis of combinatorial spaces: The nk landscape case. Phys. Rev. E, 78:066114. [Travisano et al., 1995] Travisano, M., Mongold, J., Bennett, A., and Lenski, R. (1995). Experimental tests of the roles of adaptation, chance, and history in evolution. Science, 267(5194):87–90. [Tufts et al., 2014] Tufts, D. M., Natarajan, C., Revsbech, I. G., Projecto-Garcia, J., Hoffmann, F. G., Weber, R. E., Fago, A., Moriyama, H., and Storz, J. F. (2014). Epistasis constrains mutational pathways of hemoglobin adaptation in high-altitude pikas. Molecular Biology and Evolution. [van Nimwegen et al., 1999] van Nimwegen, E., Crutchfield, J. P., and Huynen, M. (1999). Neutral evolution of mutational robustness. Proceedings of the National Academy of Sciences, 96(17):9716–9720. [V´erel et al., 2011] V´erel, S., Ochoa, G., and Tomassini, M. (2011). Local optima networks 202 of nk landscapes with neutrality. CoRR, abs/1107.4162. [Wagenaar and Adami, 2004] Wagenaar, D. A. and Adami, C. (2004). Influence of chance, history, and adaptation on digital evolution. Artificial Life. [Wang and Lee, 2007] Wang, Q. and Lee, C. (2007). Distinguishing functional amino acid covariation from background linkage disequilibrium in hiv protease and reverse transcriptase. PLoS ONE, 2(8):e814. [Watterson, 1975] Watterson, G. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256 – 276. [Watterson, 1977] Watterson, G. A. (1977). Heterosis or neutrality? Genetics, 85:789–814. [Weinberger, 1991] Weinberger, E. D. (1991). Local properties of kauffman’s N - k model: A tunably rugged energy landscape. Phys. Rev. A, 44:6399–6413. [Weinberger, 2002] Weinberger, E. D. (2002). A theory of pragmatic information and its application to the quasi-species model of biological evolution. Biosystems, 66(3):105 – 119. [Wilke, 2001] Wilke, C. (2001). Adaptive evolution on neutral networks. Bulletin of Mathematical Biology, 63:715–730. 10.1006/bulm.2001.0244. [Wilke, 2005] Wilke, C. (2005). Quasispecies theory in the context of population genetics. BMC Evolutionary Biology, 5(1):44. [Wilke, 2004] Wilke, C. O. (2004). The speed of adaptation in large asexual populations. Genetics, 167(4):2045–2053. [Wilke et al., 2001] Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E., and Adami, C. (2001). Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature, 412:331–333. [Wright, 1931] Wright, S. (1931). Evolution in mendelian populations. Genetics, 16:97–159. [Wright, 1932] Wright, S. (1932). The roles of mutation, inbreeding, crossbreeding and selection in evolution. Proc.Int.Cong.Gen., 1:356–366. [Zhang et al., 1990] Zhang, Y. C., Serva, M., and Polikarpov, M. (1990). Diffusion reproduction processes. Journal of Statistical Physics, 58:849–861. 10.1007/BF01026554. 203