DIGITAL EVOLUTION IN EXPERIMENTAL PHYLOGENETICS AND EVOLUTION EDUCATION By Cory Kohn A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Integrative Biology Doctor of Philosophy Ecology, Evolution, and Behavior Dual Major 2021 ABSTRACT DIGITAL EVOLUTION IN EXPERIMENTAL PHYLOGENETICS AND EVOLUTION EDUCATION By Cory Kohn T he creation and evaluation of known evolutionary histories and the implementa tion of student investigatory experience s on evolution are difficult endeavors that have only recently been feasible . The research presented in this dissertation is related in the ir shared use of digital evolution with Avidians as a model study system, bot h to conduct science research in experimental phylogenetics and to conduct education research in curricular intervention to aid student understanding. I first present background discussions on the Avidian digital evolution study system as implemented in Av ida and Avida - ED and its favorable use in experimental phylogenetics and biology education owing to its greater biological realism than computational simulatio ns, and greater utility and generality than biological system s. Prior work on conducting experime ntal evolution for use in phylogenetics and work on developing undergraduate lab curricula using experimental evolution are also reviewed. I establish digital evolution as an effective method for phylogenetic inference validation by demonstrating that resu lts from a known Avidian evolutionary history are concordant , under similar conditions , to established biological experimental phylogenetics work. I then furth er demonstrate the greater utility and generality of digital evolution over biological systems by experimentally testing how phylogenetic accuracy may be reduced by complex evolutionary processes operating singly or in combination, including absolute and r elative degrees of evolutionary change between lineages (i.e., inferred branch lengths), recombin ation, and natural selection. These results include that directional selection aids phylogenetic inference, while stabilizing selection impedes it. By evaluati ng clade accuracy and clade resolvability across treatments, I evaluate measures of tree support and its presentation in the form of consensus topologies and I offer several general recommendations for systematists. Using a larger and more biologically rea listic experimental design, I systematically examine a few of the complex processes that are hypo thesized to affect phylogenetic accuracy natural selection, recombination, and deviations from the model of evolution . By analyzing the substitutions that occu rred and calculating selection coefficients for derived alleles throughout their evolutionary tra jectories to fixation , I show that molecular evolution in these experiments is complex and proceeding largely as would be expected for biological populations. U sing these data to construct empirical substitution models, I demonstrate that phylogenetic infe rence is incredibly robust to significant molecular evolution model deviations. I show that neutral evolution in the presence of always - occurring population pr ocesses, such as clonal or Hill - Robertson interference and lineage sorting, result in reduced cla de support, and that selection and especially recombination, including their joint occurrence, restore this otherwise - reduced phylogenetic accuracy. Finally, t his work demonstrates that inferred branch lengths are often quite inaccurate despite clade suppo rt being accurate. While phylogenetic inference methods perform ed relatively well in both theoretically facile and challenging molecular evolution scenarios, t heir accuracy in clade support might be a remarkable case of being right for misguided reasons, s ince branch length inference were largely inaccurate , and drastically different models of evolution made little difference . This work highlights the need for f urther research that evaluates phylogenetic methods under experimental conditions and suggests th at digital evolution has a role here . Finally , I examine student understanding of the importance of biological variation in the context of a course featuring a digital evolution lab. I first describe the Avida - ED lab curriculum and its fulfillment of calls for reform in education. Then I describe the specific education context and other course features that aim to address student conceptualization of variation. I present a modified published assessment on transformational and variational understanding and f indings regarding student understanding of variation within an evolution education progression. Finally, I offer suggestions on incorporating course material t o engage student understanding of variation. Copyright by CO RY KOHN 202 1 v To all the people that did or will affect the course of my light cone. vi A CKNOWLEDGMENTS So many people helped me reach this point. I am tremendously grateful for their caring, guidance, assistance, and steady pres ence. Foremost thanks are owed to my two primary advisors Barry Williams and Jim Smith. Barry plucked my grad school application out of a pile and he quickly wooed my evolutionary biology interests and passions. He was a highly personable mentor during t he first half of my grad school years and always encourag ed me to chase my research questions. Jim started as a committee member, but my wo rk with him especially grew through our efforts on the Avida - ED team. He agreed to become my advisor, advocate, and g ood friend across my remaining years my degree. I will be forever grateful fo r all the time and energy he has invested in my cause. Thank you, Jim and Barry! The initial research design and data collection for the work in Chapter 1 was conducted by Barry and a summer undergraduate research student in the lab, Kyle Safran . Carlos An derson also advised us and his knowledge of Avida laid the foundation for everything I would learn about the system. Others taught me a great more about Avida especially various members of and to all those folks, thank you! Charles was also a valuable committee member, as was Shin - Han Shiu , who each gave actionable feedback especiall y in regard to communicating my research to various scientific communities. The initial research design and survey distribution for the wor k in Chapter 4 was conducted by the Avida - ED curriculum development and assessment team, including Jim, Rob Pennock, Louise Mead, Mike Wiser . Th ese folks, in addition to others that have joined us at various points, and our work togeth er was my entry into biology education research , which fundamentally shaped by career trajectory . The Avida - ED team also sponsored m y invo lvement with BioQUEST and SABER, communities that have been supportive and pedagogical ly inv igorating . Rob and Diane Blackwood , as the Avida - ED PI and lead programmer, respectfully, have been integral to the success of Avida - ED and all work stemming from its use. Mike worked vii closely with me on the transformational - variational scoring rubric, but moreover since my early days in graduate school I have cherished his wit and have always welcomed his conversation. I owe a lot to Louise, especially. She was the primary curriculum developer for Integrative Biology and my mentor for our FAST Fellowship work. Special tha nks to our undergraduate teaching assistants, my graduate teaching assistant, and our IBIO150 students! Louise, along with Barry and Terri McElhinny , was also my primary undergrad educator role model. Numerous other individuals and organizations made my y ears at MSU run smoother. Among others, Tom Getty, David Foran and of course, Jim made special arrangem ents for my continuation at MSU after my advising handover; I am fortunate to have had such support. Many administrative folks in IBIO, EEBB, and BEACON expertly handled my queries or business concerns over the years, and their assistance cannot go unremar ked, so thank you, again! CNS, BEACON, EEBB, IBIO, COGS, and the Grad School via the FAST Fellowship provided research, travel, and/or living funds; with out these monies, I never would have been able to afford this education. I also made many friends throu gh these organizations and our time together was enlightening and entertaining! Finally, Kevin Hall and our friend Y vette G reen will always hold esteemed place s in my memories of Giltner Hall. Before my time at MSU, my education was especially impacted by my undergraduate advisor, Susan Kalisz , and my t eaching mentors of her , Tony Bledsoe, and Laurel Roberts at Pitt. I also want to thank all my other teac hers across thirty years of education, most notably Eric Baltz who first formally introduced me to evol utionary biology at Freedom High School. with love and support. My Mom was especially integral in imbuing me with a love of learning, my Dad, Craig, always encouraged me to b e adventurous and try new things, and my Dad, Mike, - asking. Last but far from least come my you, Katie , and our joy, Hen ry , to whom I say this: Now that Daddy has finished his dissertation edits, which were years in the ma king, he and Mommy will joyously celebrate your first birthday. The two of you, and the minds of my students, are my meaning - making in the face of the Ab surd. viii T ABLE OF CONTENTS LIST OF TABLES ................................ ................................ ................................ ................................ .............. x LIST OF FIGURES ................................ ................................ ................................ ................................ ........... xi CHAPTER 1: The Ca se for Using Digital Evolution in Experimental Phylogenetics and Evolution Education ................................ ................................ ................................ ................................ ................................ ...... 1 Experimental Evolution and Digital Evolution ................................ ................................ ......................... 1 Digital evolution ................................ ................................ ................................ ................................ .. 2 Avidians in Avid a and Avida - ED ................................ ................................ ................................ ........... 3 Avid ian evolution as an instantiation of evolution ................................ ................................ ............. 4 Experimental Evolution in Phylogenetics ................................ ................................ ................................ . 6 Evaluating accuracy in phylogenetics ................................ ................................ ................................ .. 6 History of e xperimental phylogenetics studies ................................ ................................ ................... 9 Research summary ................................ ................................ ................................ ............................ 15 Experimental Evolution in Undergraduate Education ................................ ................................ ........... 16 Balancing Utility, Realism, and Generality in Experimental Evolution ................................ ................... 18 Experimental utility versus biological reali sm ................................ ................................ ................... 19 Experimental utility versus generality ................................ ................................ ............................... 20 Biological realism versus generality ................................ ................................ ................................ .. 21 Digital evolution offers a complementary balance ................................ ................................ ........... 22 CHAPTER 2: Digital Evolution Provides Direct Tests of Phylogenetic Accuracy ................................ ......... 26 Introduction ................................ ................................ ................................ ................................ ........... 26 Phylogenetic analysi s methods ................................ ................................ ................................ ......... 28 Clade support evaluation ................................ ................................ ................................ .................. 32 Overview of T7 phage experimental phylogenetics research ................................ ........................... 35 Digital ev olution for experimental phylogenetics research ................................ .............................. 38 Methods ................................ ................................ ................................ ................................ ................. 40 Experimental design ................................ ................................ ................................ .......................... 40 Analyses ................................ ................................ ................................ ................................ ............. 48 Results ................................ ................................ ................................ ................................ .................... 53 Treatment comparisons ................................ ................................ ................................ .................... 53 Accuracy and consensus thresholds ................................ ................................ ................................ . 66 Discussion ................................ ................................ ................................ ................................ ............... 73 Treatment comparisons ................................ ................................ ................................ .................... 73 Accuracy a nd consensus thresholds ................................ ................................ ................................ . 78 Conclusions ................................ ................................ ................................ ................................ ............ 80 ix CHAPTER 3: Digital Evolution Addresses Intractable Research Questions in Phylogenetics ..................... 83 Introduction ................................ ................................ ................................ ................................ ........... 83 Methods ................................ ................................ ................................ ................................ ................. 86 Experimental design ................................ ................................ ................................ .......................... 86 Analyses ................................ ................................ ................................ ................................ ............. 92 Results ................................ ................................ ................................ ................................ .................... 98 Empirical models of evolution ................................ ................................ ................................ .......... 98 Distribution of mutational effects ................................ ................................ ................................ ... 100 Characterization of observed substitutions ................................ ................................ .................... 102 Phylogenetic accuracy ................................ ................................ ................................ ..................... 118 Discussion ................................ ................................ ................................ ................................ ............. 126 Distribution of mutational effects ................................ ................................ ................................ ... 127 Characterization of ob served substitutions ................................ ................................ .................... 128 Phylogenetic accuracy ................................ ................................ ................................ ..................... 133 Conclusions ................................ ................................ ................................ ................................ .......... 138 CHAPTE R 4: Shifting Student Understanding of the Importance of Variation for Evolution in a Course Featuring Digital Evolution ................................ ................................ ................................ ........................ 144 Introduction ................................ ................................ ................................ ................................ ......... 144 A few of the difficulties in understanding and teaching evolution ................................ ................. 146 Transformationalism and variationalism ................................ ................................ ........................ 148 Trouble disl odging transformationalism ................................ ................................ ......................... 150 Experimentation in Avida - ED emphasizes individual variation ................................ ....................... 152 Motivation for curriculum ................................ ................................ ................................ ............... 155 Methods ................................ ................................ ................................ ................................ ............... 157 University and course population context ................................ ................................ ...................... 157 Course design ................................ ................................ ................................ ................................ .. 158 Assessment ................................ ................................ ................................ ................................ ...... 167 Responses ................................ ................................ ................................ ................................ ........ 169 Scoring ................................ ................................ ................................ ................................ ............. 170 Statistics ................................ ................................ ................................ ................................ .......... 174 Results and Discussion ................................ ................................ ................................ ......................... 175 Conclusions ................................ ................................ ................................ ................................ ......... 181 REFERENCES ................................ ................................ ................................ ................................ .............. 186 x LIST OF T ABLES Table 3.01. Summary statistics for empirical amino acid frequencies for ancestor Avidian instruction frequencies a nd sets of biological taxa, with values scaled to 0 1,000 for ease of comparison to a 1,000 - loci genome. For example, a minimum of 8 indicates that the lowest frequency for any of the twenty amino acid frequencies in the dataset was 8 out of 1000 (i.e., 0. 8%). Some empirical frequencies were accessed via the ExPASy resource portal (2012). ....... 91 Ta ble 3.02. Classification of substitutions by the minimum and maximum relative fitness recorded across mea surable timepoints throughout their frequency trajectory. Threshold values were determined based on nearly neutral theory (Ohta and Kimura, 1971) for a population size of 1,000, and fitness types are mutually exclusive. See the text for a description of a seventh type, quick sweep. ................................ ................................ ................................ ... 94 Table 3.03. Two examples of empirical fixed - rate models of Avidian instruction substitution out of the eight models constructed, one per experimental treatment, by p ooling substitutions observed across all ten replicates. Character - to - character substitut ion rates for the neutral, asexual evolution treatment starting with the naïve ancestor are above the diagonal and character state frequencies are the *first line of v alues below. Substitution rates for the selection, asexual evolution treatment from the p re - evolved ancestor are below the diagonal and state frequencies are the **second line. Note that the instructions are arrayed by the alphabetical order for full amino acid names, as is the convention for phylogenetic programs. Substitution rates are speci fied relative to a total of 500 per treatment, and state frequencies are out of 100 per treatment. ................................ ................................ ................................ ..... 100 Tabl e 4.01. Enrollment in Integrative Biology: From DNA to Populations in six consecutive semesters (N = 354). ................................ ................................ ................................ ................... 158 Table 4.02. Exampl e scored responses of transformational (T), variational (V), and o ther (O) for panel chosen as most likely in item 3, with item 1 or 2 reviewed depending on panel chosen. 173 Table 4.03. Mean word and character counts for paired student responses with repeated - measures t - tests (N = 254, df = 253) showing no significant differences. Relevant word and character counts include only the sub set of items used for scoring, see Figure 4.03. .............. 175 Table 4.04. Det ailed percentage breakdowns for paired - response students (N = 254 ) demonstrating understanding of biological variation pre - and post - course. ............................ 178 xi LIST OF FIGURES Figure 2.01. Representative evolutionary history topologies with branch lengths denotating the number of generations lineages evolved. The base design experienced equivalent generations of evol ution per branch, either 100 (not shown), 300 (a), or 3,00 0 generations per branch (not shown). Four designs (b - e) had differing numbers of generations per tree level, with either rations across all branches among a level. These designs a re named by tree level length from external to internal tree levels, and include SLL (b), LSL (c), LLS (d), and LSS (e). An additional treatment, LSS B (f), used the LSS branching pattern with additi - branch, resu lting in a 32 - taxon asymmetrical history. The scale bar in subplot a is 300 generations and the scale bar in subplots b - f is 3,000 generations. Internal branches are labeled at their terminal node, and all evolutionary histories were true polytomies at the ir origin. ...... 42 Figure 2.02. The four natural selection regimes visualized w ith respect to the base evolutionary history topology. Colored shading indicates shared tasks rewarded among one or more branches within a tree. Diversifying and directional selection are inversely related in these selective environment across all branches as well as selective environment on each branch but the weakest directional selection compared to the other regimes. ................. 47 Figure 2.03. Example uses for the clade support metrics of Clade Accuracy (CA), Clade Resolvability (CR), and Average Topological Accuracy (Top. Acc.) for com paring a known cladogram to four inaccurate cladograms. This four - ingroup - taxon rooted tree has two true clades to be inferred (II + IO and OI + OO, dark green circle) and therefore as many as two false negative (light green circle) and two false positive (red circle) clades per cladogram. Note that the actual topologies of the Avidian evolution experiments constitute 8 - taxon or 32 - taxon trees, with many more potential combinations of f alse negative and false positive clades. ................ 52 Figure 2.04. Number of variable sites (blue pentagons) and parsimony informative sites (purple stars) for all experimental treatme nts. Open symbols are for individual treatment replicates, and closed symbols for the treatment median. Experimental treatments are denoted by their selective condition (green labels), recombination condition (orange labels), and number of generations per branch (cyan labels); see text for further information on condition notation. . 55 Figure 2.05. Inferred median internal branch lengths (yellow) and median external branch lengths (orange) for the single best tree resu lting from NJ (square), ML (triangle), and BI (plus) analyses. Open symbols are for individual replicates, and closed symbols for the median across treatment replicates. Not e that MP is not included since it does not report trees with branch lengths expres sed as the number of substitutions per site. ................................ ........................ 56 Fi gure 2.06. Clade accuracy, the percentage of clades correctly inferred, for the set of treatments with stabilizing selection and lineages evolved for equivalent generations per branch. Best and consensus trees are included for each analysis, including NJ ( square, best trees only), MP (circle), ML (triangle), and BI (plus), with open symbols for individual replicates and closed for the median across replicates. Within each analysis, the best tree then consensus trees in the order of increasing threshold stri ctness are arrayed with tick marks at the top indicate these distinctions; see the text for further information on tree type notation. ............ 58 xii Figure 2.07. Clade resolvability, the percentage of clades correctly resolved, for the set of treatments with stabilizing selection and lineages evolved for equivalent generations per branch. Best and c onsensus trees of increasing threshold strictnes s are included for each analysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. ................................ .............................. 59 Figure 2.08. Clade accuracy for the set of treatments with stabilizing selection and lineages evolved for differing generation per tree level. Best and consensus trees of increasing threshold strictness are included for each ana lysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. ................. 60 Figure 2.09. Clade resolvability for the set of treatments with stabilizing selection and lineages evolved for differing generation per tree level. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which only includes the best tree, with open symb ols for individual replicates and closed for the median across replicates. ................. 60 Figure 2.10. Clade accuracy for the set of treatments with directional selection and lineages evolved for equivalent generations per branch. Best and consensus trees of increasing threshold strictness are included for each ana lysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. 61 Figure 2.11. Clade resolvability for the set of treatments with directional selection and lineages evolved for equivalent generations per branch. Best and consensus trees of increasing threshold strictness are included for each analysis, ex cept for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. 62 Figure 2.1 2. Clade accuracy for treatments with directional selection regimes 2 or 3 and/or the LSS pattern of generations per branch. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which onl y includes the b est tree, with open symbols for individual replicates and closed for the median across replicates. Note that the first five treatments were shown in prior figures and are included here for comparison. ... 63 Figure 2.13. Clade resolvability for treatments with directional selection regimes 2 or 3 and/or the LSS pattern of generations per branch. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ w hich only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. Note that the first five treatments were show n in prior figures and are included here for comparison. ... 63 or each treatment, with open symbols for individual replicates and closed for the median across replicates. tments demonstrating considerable variation. ... 64 Figure 2.15. BI posterior support for true four - taxon clades (orange), true two - taxon clades (blue), and false clades (red outline) for all eight t roublesome treatments. Each spoke per targe t plot is a replicate and each eight - ingroup taxon tree replicate had 2 four - taxon clades and 4 two - taxon clades, therefore 6 true clades are arrayed linearly from the outer to inner ring for each spoke. The thick target rings are in intervals of 30% and t hin rings by 10%, with the absolute center of the target plot as 100% support. Only false clades with 50% or greater support are shown. Experimental treatment labeling is as indicated in the text. ....................... 66 xiii Figure 2.16. Frequency of all true and all false clades per treatment set by relative bootstrap support for ML analyses and by posterior probability support for BI analyses. Note the x - axi s is not to scale, i ncluding the 95% support category. Troublesome treatments are the eight treatments demonstrating considerable variation among best trees across phylogenetic analyses; and non - troublesome treatments are the fifteen treatments demonstrati ng near - perfect clad e accuracy among best trees across phylogenetic analyses. Treatment LSS B is excluded so that only eight - ingroup taxon trees are compared. ................................ ................. 67 Figure 2.17. Relationship between clade accuracy as th e percent of correct clades for values of bootstrap support for ML analyses and posterior probability support for BI analyses. Results include data from Hillis and Bull (19 93) as estimated from their Figure 4a (black), data from Avidian evolution non - troub lesome treatments for both ML and BI analyses (blue, identical relationship), and data from Avidian evolution troublesome treatments for ML analyses (orange) and BI analyse s (red dashed). The thin grey line is the one - to - one accuracy - to - support relationshi p; values above the line are conservative as being an underestimation of accuracy and values below are liberal as being an overestimation of accuracy. Troublesome treatment s are the eight treatments demonstrating considerable variation among best trees acr oss phylogenetic analyses; and non - troublesome treatments are the fifteen treatments demonstrating near - perfect clade accuracy among best trees across phylogenetic analyses . Treatment LSS B is excluded so that only eight - ingroup taxon trees are compared. .................... 68 Figure 2.18. Proportion of trees across all non - troublesome treatments that met or exceeded clade accuracy (a), clade resolvability (b), and average topological accuracy (c) thresholds for each analysis and consensus method evaluated. Non - troublesome treatments are the fifteen treatments demonstrating n ear - perfect clade accuracy among best trees across phylogenetic analyses. ................................ ................................ ................................ ................................ ........ 70 Figure 2 .19. Proportion of trees across all troublesome treatments that met or exceeded clade accuracy (a), clade resolvability (b), and average topological accuracy (c) thresholds for each analysis and consensus method evaluated. Troublesome treatments are the ei ght treatments demonstrating considerable variat ion among best trees across phylogenetic analyses. ............ 72 Figure 3.01. The distribution of mutational effects for five genotypes with th e relative fitn ess proportions of all genotypes differing from a randomly chosen genotype by one mutation. Relative fitness bin size varies between 0.1, 0.5, and a logarithmic scale of base 10, with differences in scaling demarcated by red dashed lines bel ow the x - axis. The genotypes evaluated included the pre - evolved ancestor in its ancestral selection environment (black), the same genotype in the selective environment at the root of the evolutionary history (orange), the naïve ancestor in the same root en vironment (yell ow), a final descendant of the pre - evolved ancestor at the tip of the evolutionary history (blue), and a final descendant of the naïve ancestor in the same tip environment (green). ................................ ................................ 102 Figure 3.0 2. Number of recorded substitutions per treatment (mean ± 95% CI) relative to the expectation from neutral theory (dashed line). Experimental treatments are denoted by their selective condition (green labels), starting ancestor genotype (pink labels), and recombination condition (orange labels). ................................ ................................ ................................ ........... 103 Figure 3.03. Number of observed substitutions by fitness type per selection experiment. Experiments began with either the naïve ancestor under asexual (a, N = 338,936) or sexual (b, N = 341,842) reproduction, or with the pre - evolved ancestor under asexual (c, N = 286,780) or sexual (d, N = 301,292) reproduction. For each treatment, data from all replicates are pooled, and substitutions are colored by fi tness type as per Table 3.02. ................................ ............... 106 xiv Figure 3.04. Representative treatment patterns for the number of fixations by Avidian locus. Selected patterns include single treatment replicates (a and c) or p ooled treatment replicates (b and d), under neutr al evolution (a) or selection (b d), and starting with the naïve ancestor (a c) or pre - evolved ancestor (d). Substitutions are colored by fitness type as per Table 3.02. ....... 108 Figure 3.05. Representative patterns for asexual (a) and sexual (b) treatments under selection for the number of observed substitutions fixed per 100 - generation population timepoint across all lineages (i.e., branc hes) in the ev olutionary history. For each treatment, data from all replicates are pooled, and substitutions are colored by fitness type as per Table 3.02. ........... 110 Figure 3.06. Number of observed substitutions by how many generations elapsed between entry in the population until fixation for selection treatments. Under sexual reproduction (b and c) the tail for beneficial - deleterious sub stitutions extends until 12,000 generations. For each treatment, data from all replicates are pooled, and substitutions are colored by fitness type as per Table 3.02. ................................ ................................ ................................ ............................ 112 Figure 3.07. Representative treatment patterns for the number of observed substitutions fixed per 100 g enerations. Selected patterns include asexual (a and c) or sexual reproduction (b and d), and under neutral evoluti on (a and b) or selection (c and d) . For each treatment, data from all replicates are pooled, and substitutions are colored by fitness type a s per Table 3.02. ...... 113 Figure 3.08. Representative patterns for asexual (a) and sexual (b) treatments under neutral evolution for how alleles that became substitutions changed in frequency between entry in a single population (i.e., one branch from one treatment replicate) on a middle tree level in the evolutionary history, and all substitutions that were s egregating before, during, or after this lineage betw een cladogenic events. Observed substitution frequencies are shown as circles, and straight lines connect measurements every 100 generations. Line thickness indicates the relative number of substitutions on the same fixation trajectory. ................................ ............ 115 Figure 3.09. Representative patterns for asexual (a and b) and sexual (c and d) treatments under natural selection for how alleles that became subst itutions changed in population frequency (a and c) and fitness (a d), and their effects on average population fitness (b and d). Panels a and b show a single population and panels c and d show a single population, each on a middle tree level in the evoluti onary his tory, and include all substitutions that were segregating before, during, or after this lineage between cladogenic events. Panels a and c: Observed substitution frequencies are shown as circles, and straight lines connect measurements every 100 g eneration s. Line thickness indicates the relative number of substitutions on the same fixation trajectory. Relative fitness is shown by coloration with values as indicated in the color bar, which includes the nearly neutral theory thresholds for neutrality (Table 3 .02) as dashed lines. Orange stars within circles at 100% frequency indicate that fitness was indeterminable. Panels b and d: Relative fitness of substitutions are shown as solid black lines, with thickness indicating the relative number of subst itutions on the same fixation trajectory. Dashed black lines indicate the nearly neutral threshold and dashed colored lines indicate a 1% selection advantage (blue) and disadvantage (red). Average population fitness is shown by the dotted green line and wi th values on the secondary y - axis. ................................ ................................ .. 117 Figure 3.10. Topological accuracy, the average percentage of clades correctly inferred and t. Analyses include neighbor joining (NJ), maximum parsimony (MP), and maximum likelihood (ML) and Bayesian inference (BI) with the Poisson or empirical model of evolution; open symb ols for individual replicates and closed for the median across replicates. Experimental treatments are denoted by their selective condition (green labels), starting ancestor genotype (pink labels), and recombination condition (orange labels). ................................ ................................ .................. 119 xv Figure 3.11. Relations hip between clade accuracy as the percent of correct clades for values of bootstrap support for ML analyses and posterio r probability support for BI analyses of asexual reproduction treatments pooled across naïve and pre evolved ancestor conditions. Result s include comparisons to bootstrap support for ML analyses (solid lines) or posterior probability support for BI analyse s (dashed), using the empirical model of evolution (blue) or Poisson model (orange), and under selection (filled lines) or neutral evolu tion (unfilled lines). The grey line is the one - to - one accuracy - to - support relationship; values above are conservative a s being an underestimation of accuracy and values below are liberal as an overestimation of accuracy. Note only support greater than 10% was assessed due to the large number of very low - supported clades. ................................ ................................ ................................ ........................ 121 Figure 3.12. Empirical, theoretical, and inferred median internal (yellow) and external (orange) branch lengths per treatme nt. The empirically observed rates of substitutions per s ite (dashed lines) and the expectation under neutral theory (solid line) are included for comparison to the branch lengths inferred for the single best tree resulting from each analysis and model of evolution. Analyses include NJ (square), ML with Poisson model (triangle pointing up), ML with empirical model (triangle pointing down), BI with Poisson model (plus), and BI with empirical model (cross); open symbols for individual replicates and closed fo r median across replicates. ................................ ................................ ................................ ................................ ..................... 122 Figure 3.13. Number of variable sites (blue pentagons) and parsimony informative sites (purple stars) for all experimental treatment taxon - character datasets, with open symbols for individual replicates and closed for the median across replicates. ................................ ............................ 126 Figure 4.01. Models of research authen ticity compiled from the literature on modes of investigation in lab oratory environments (Ballen et al., 2017; Buck et al., 2008; Corwin et al., 2015; Domin, 1999; Goodwin et al., 2021; Seymour et al., 2004). ................................ ............ 160 Figure 4.02. The assessment regarding melanic moth populations with idealized variational (Panel A) and transformational (Panel B) evolutionary trends, as modified from Shtulman (2006). ................................ ................................ ................................ ................................ ......... 168 Figure 4.03. Melanic moth assessment rubric including t he process for choosing which items to score and the criteria for awarding scores of variationalism (V), transformationalism (T), and other (O). Underlining indicates a scoring distinction between panels A and B, and colored font further indicates the same response as be ing scored differently depending on the chosen panel. ................................ ................................ ................................ ................................ ..................... 171 Figure 4.04. Percent of all students demonstrating understanding of biolog ical variation pre - course (dark, N = 340) and po st - course (light, N = 271), with asterisks indicating significance. 177 Figure 4.05. Percent of paired - response students (N = 254) demonstrating understanding of biological variation pre - and pos t - course, with asterisks indicating significance between pre/post states of understanding (black) and d irectionality of shifts between states (pink). ... 178 1 CHAPTER 1 : The Case for Using Digital Evolution in Experimental Phy logenetics and Evolution Education Experimental Evolution and Digital Evolution For most of its history, e volutionary biology research was impeded by a perceived inability to observe, measure, and experiment over evolutionarily relevant str etches of time (Garland and Rose, 2009) . Even Darwin, the originator of innumerable insightful research avenues and methods of study, failed to understand that evolution may be investigated experimentally. Although long - term experiments were occasional ly proposed (de Varigny, 1892) , this inability largely persisted for three quarters of the history of biology research to date , from On th e Origin of Species until experimental evolution studies began in earnest ( see Rose et al. 2004) . Experimental evolution is the study of populations across generations and under defined conditions imposed by the researcher, and its pr imary goal is to directly test evolutionary theory (Kawecki et al., 2012) . Using this approach, a few populations have been studied in nature, although laboratory investigation i s generally preferable due to greater control over environmental conditions and the ancestral population . Aspects of the evolutionary proces s can be observed in a laboratory setting over relatively limited amounts of time, and with particular study systems , most often phage, bacteria, yeast, and Drosophila . Even then, it is very difficult to record observations at the level of detail one might (Garland and Rose, 2009) , has been a very productive endeavor (Kawecki et a l., 2012) . was suggested by Lenski (2001) . The quality of that which is not of nature. Although definit ions or theories of life have been notoriously difficult to construct and defend ( e.g., Ruiz - Mirazo et al. 2004) , Lenski (2001) describes life as that which can evolve via na tural selection ; so , not only 2 the property of self - reproduct ion but also heritable variation and the propensity for variation in a population to change due to the benefit or detriment it confers. Digital e volution Digital evolution was i nspired by early c omputer viruses, code that can reproduce although not evolve . O nce computer scientists wrote code that could evolve they (Wilke and Adami, 2002) , creating a form of artificial life . These self - reproducing computer programs, or digital organisms, differ from genetic algorit hms and computational (also called numerical) simulations in that the organisms must, by themselves, reproduce and that no genotype is designated as the optimal, sought target by the experimenter. Natural selection occurs because the environment is computa tionally resource - limited and the reproduction process is designed to have random inaccuracies, or mutations, some of which may allow digital organisms to reproduce more efficiently, outcompeting other genotypes for resources . Complex computational metabol ic processes in addition to self - reproduction may arise due to mutation, with organisms being able to perform computation using environmentally encountered numbers. These phenotypes can then evolve via selection if a suitable selective environment is provi ded one that rewards such computation (Adami, 2006) . It seems plausible that digital evolution systems e xhibit the trait of open - ended evolution, as do biological systems (Lenski et al., 2015 ; Ruiz - Mirazo et al., 2004) . Open - ended evolution is the capability of an evolving population to continual ly produce novel orga nism s rather than reaching a stable state . Al though is an ongoing discussion among artificial life researchers (Taylor et al., 2016) . Digital organisms have the attributes sought for in experimental evolution model systems. Generation time is measured in seconds or les s, population sizes can be massive, measurements can be taken with heretofore unprecedented ease and precision, and digital orga nisms readily tolerate human - influenced environments . Further, e xperiments can be highly controlled, easily replicated, and even identically repeated. Of course, the necessary and 3 sufficient computational hardware is needed , although these requirements are minimal with due to an experimenter ha ving full control over the genetic and environmental conditions , e.g. disallowing all neutral and deleterious mutations, allowin g biologists the ability to evaluate otherwise untestable ideas . Avidians in Avida and Avida - ED Avida is an artificial life platform designed to study broad questions in evolutionary biology via the evolution of digital organisms , called Avidians (Ofria et al., 2009; Ofria and Wilke, 2004) . This highly manageable model system allows quick replicate experimentation and copious data output, leading to high - impact resea rch regar ding the nature of evolutionary processes ( e.g., Lenski et al. 1999, 2003; Wilke et al. 2001; Chow et al. 2004; Goldsby et al. 2012; Covert et al. 2013) . Avida - ED is the educational version of this research platform (Pennock, 2007a) . Through its approachable graphical interface, simplifi ed set of configurable experimental variables and output, and associated curriculum , Avida - ED allows students to draw connections between evolutionary processes operating in biological and digital systems, ask questions and conduct research involv ing biolo gical theory, and engage in science and engineering practices in a similar manner to biologists using Avida or other digital or biological model study systems (Kohn et al., 2018) . Avida - ED has garnered a n award from The International Society for Artificial Life (2017) and i s itself the subject of ongoing education research (Speth et al., 2009; Smith et al., 2016; Lark et al., 201 8) . Each program is freely available: Avida at http://avida.devosoft.org/ & https://github.com/devosoft/avida and Avida - ED at https://avida - ed.msu.edu/ . Avidians undergo computational metabolic processes to self - reproduce. An Avidian is a computer program consisting of a seq uence of simple, modular computer instructions, the set of which constitutes its genome. The instruc tion set consists of 26 instructions and is Turing complete, which means that in principle any computer program could be encoded within the Avidian genomic language. An genome encodes its ability to self - reproduce and 4 perhaps perform other comput ational tasks. Digital evolution experiments often begin with an organism capable of reproduction and nothing else. During reproduction, each time a parent al instruction is copied there is a chance the offspring will incorporate a different instruction at that position of its genome. This change occurs via random probabilistic means and is analogous to substitution point mutations in biological life. Althoug h genome size is fixed in Avida - ED, the research version additionally allows other configurable gene tic variables such as other types of mutation including insertions and deletions, as well as other instruction sets and even recombination. Mutations and, w ith sexual organisms, recombination result in the accumulation of genetic variation in a population genome to code for other features. The digital environment in which a population of Avidians exists is a grid - like lattice in w hich a single organism occup ies a single space in the grid . By configuring this environment grid the researcher sets the maximum population size. Environments can be configured such that the performance of specific computational functions, most commonly bi twise Boolean logic tasks, are rewarded. This reward is in the form of additional computational reso urces such that the Avidian can execute its code quicker relative to others in the population, resulting in faster offspring production. An ness is measured as a function of its reproduction efficiency and ability to perform tasks. Organism al, population, and environmental data can be saved to track the evolutionary course of a population, and past experiments can be identically replicated thr ough random number seed specification. Avidian evolution a s an instantiation o f e volution Avidian ev olution results in evolutionary mechanisms and thus outcomes that are analogous to biological reality. Because the necessary and sufficient conditions for evolution inheritance, variation, and differential reproduction (Dennet, 1995) are inherent to the system, evolution is not simulated but rather actually occurs. Albeit digital, Avi dian po pulation change is an instantiation of evolution and neither a summary of established evolutionary patterns nor a simulation thereof . Users do not program what 5 will happen, but they can adjust initial genetic and environmental condition s and, a fter initiating the experiment, they can then observe and record what happens. Rather than being a model of evolution, Avidians undergo evolution; though , Avidian genetic and environmental complexity can be considered a model of that which exists i n nature. Within their complex computational environments, Avidian adaptation often proceeds in creative ways the experimenter never would have predicted. Because Avidians have a genomic sequence and exist in populations under going evolutionary change, man y evolutionary processes or outcomes can be studied through experimentation. For molecular evolution, these have included epistasis and complexity (Adami et al., 2000; LaBar and Adami, 2016; Lenski et al., 1999a, 2003; Ofria et al., 2008; Ostrowski et al., 201 5; Strelioff et al., 2010) ; genotype - phenotype mapping (Fortuna et al., 2017) ; phenotypic plasticity (Clune et al., 2007) ; genome size evolution (Gupta et al., 2016; Ofria et al., 2003) ; mutation rate evolution (Clune et al., 2008) ; mutational and drift robustness (de Visser et al., 2003; Elena et al., 2007; LaBar and Adami, 2017; Lenski et al., 2006; Wilke et al., 2001) ; clonal and Hill - Robertson interference (Adami, 2006; Covert et al., 2013; O strowski et al., 2007; Wilke and Adami, 2002) ; and , mutational meltdown , and the effects of recombination (Misevic et al., 2010, 2006, 2004) . For ecology and macroevolution, th ese have included group selection (Beckmann et al., 2008; Clune et al., 2011; Goings et al., 2004; Goldsby et al., 2012, 2014a, 2014b; Knoester et al., 2007b) ; coevolution (Zaman et al., 2014) ; behavior, communication, and cooperation (Elsb erry et al., 2009; Goldsby et al., 2008; Goldsby and Cheng, 2008; Grabowski et al., 2013; Knoester et al., 2013, 2007a; McKinley et al., 2008) ; ecological specialization, maintenance, and extinction (Chow et al., 2004; Connelly et al., 2010; Cooper and Ofria, 2003; Fortuna et al., 2013; Ostrowski et al., 2007; Yedid et al., 2009) ; historical contingency (Yedid et al., 2 008) ; and even the origin of life (CG et al., 2017) . Th us , there exist simila rities between digital and biological organisms with respect to a remarkable array of evolutionary p henomena. their evolutionary dynamics, digital organisms can be compared with biochemical vi ruses and 6 (Wilke and Adami, 2002) . Accordingly, with Avida - ED students can perform experiments and collect actual research data amenable to hypothesis testing , learning about biology and practicing as scientists throughout their experience. Expert and novice scientists can study the power of evolution not just within the digital realm of Avida - ED, but also by analogy to the chemical and physical reality of biological life . Experimental E volution in P hylogenetics The inference of historical re lationships, i.e. , phylogenetics, is a central goal in biology (Hillis, 1995) . Phylogenies, representations of evolutionary relatedness among organisms or taxa generally, are usually inferred from molecular sequence data , although behavioral, morphological, and other characters can also be used . Th is inference process is crucially i mportant because phylogenies are created for use across all of biology to support myriad research efforts. Our ability to infer phylogenies has consistently improved , or so we think, through the advancement in molecular evolution theory and modeling, the acquisition of molecular sequence data, and the implementation of sophisticated algorithms and computational tools. Evaluating a ccuracy in phylogenetics Evolutionary histories of appreciable degr ees of evolutionary change cannot generally be observed, with few notable exceptions ; t hus, phylogenetic inference cannot be definitively tested, and the evaluation of phylogenetics methodo logies and tools has largely relied on computational simulations (Hillis, 1995; Huang et al., 2017) . Computer simulations are tools for algorithmically modeling molecular evolution under a set of assumptions . B y necessi ty, simulations rely on relatively simple models of evolution (Arenas, 2012; Huelsenbeck, 1995) . Basic simulations of molecular evolution generate probabilistic changes in characters such as nucleotides or amino acids according to a mo del of substitution rates between residues (Bull et al., 1993) . More adva nced simulations may incorporate the effects of the genetic code vi a codon evolution (Rambaut and Grass, 1997) or non - independence among chara cters 7 (Huelsenbeck and Nielsen, 1999) . Even more a dvanced methods include coalescent approaches (Huang et al., 2010) or attempt to incorporate selection and linkage (Messer, 2013) . Molecular evolution models can also aid, for example, in the simulation and analysis of tree topologies (Graybeal, 1998) and in relative and absolute rates of evolution (Kolaczkowski and Thornton, 2008) and spec iation or extinction (Rabosky and Lovette, 2008) . Simulations do provide valuable insig ht regarding a range o f conditions that theory predicts are relevant to phylogenetic inference. They can be used to generate large amounts of data with relative ease, even providing exhaustive information within their defined parameters (Huelsenbeck, 1995) . Overall, simulations are ideal for investigating the speci fic dynamic for which they are programmed. While simulations have become increasingly sophisticated, the idealized conditions they model might, alone or in part, never truly exist in nature. Simulations are less useful in the analys is of combinations of co mplex factors , and whe n emergent or unknown properties are present in complex systems (Arenas, 2012) . Simulations are limited in that they do not incorporate the full range of conditions operating in evolving populations, including those that we know to have potential in disrupting phylogenetic inference, those which we suspect might, and those which we have not yet discovered. For example, even the molecular evolution of coevolving sites remains very difficult to simulate (Arenas, 2012; Sousa et al., 2008) . Simulations, as with all models, incorporate untested assumptions and fail to incorporate many other factors (Hillis, 1995) . This gap in the credibility of simulations will always remain (Miyamoto and C racraft, 1991) and it is unknown whether the assumptions and necessary simplifications in simulations reduce their relevance (Hillis et al. 1993) . Thus, our understanding of the accuracy of phylogen etic methodologies is limited by a reliance on the evolutionary models implemented in simulations (Hillis et al., 1994) . When evaluating phylogenetic methods using a simulation that makes explicit assumptions identical, or nearly so, to the assumptions in one of the methods (e.g. , Jin and Nei 1990) , then that method will prove superior an observation that cannot be universalized to the broader range of conditions ex isting in nature 8 (Hillis, 1995) . This makes it especially challenging to evalua te the robustness of inference methods using simula tions. The only way to evaluate whether phylogenetic inference methodologies are sufficiently robust to these complexities is to evaluate their predictions using empirical, known evolutionary histories (Hillis, 1995) . This a lternative and complementary but rarely used approach experi mental phylogenetics is the analysis of data from n atural or experimental populations with a known evolutionary history. The objective of experimental phylogenetics research is to use living systems to generate known evolutionary histories with which syst ematists can use to directly test phylogenetic meth ods (Bull et al., 1993) . In contrast to approaches using computational simulations, expe rimental phylogenetics studies make substantially f ewer untested assumptions regarding the evolutionary process. The result is an expectation that the evolutionary system incorporates a degree of complexity and reality otherwise unobtainable in computation al simulations (Bull et al., 1993; Oakley, 2009) . In fact, an aim of experimental phylogenetics research, and of experimental molecular evolution b roadly, is the iterative creation of increasingly s ophisticated models based on empirical data which can then be incorporated into theory and practice through simulations (Bull et al., 1993) . Experimental e mentary to their use (Hillis et al., 1992; Huelsenbeck, 1995) . Researchers following the paradigm of experimental phylogenetics stress that th e particulars of their study system or precise conclusions are not necessarily universally applicable in n ature. Instead, an experimental study provides information about the evolution that actually occurred instead of proposing what should happen with n atural populations, it establishes what did happen with a set of evolving populations. All in all, experim (Hillis and Bull, 1993) . This notion is illustrated by Hillis et al. (1993) using an analogy between weapons testing a nd phylogenetic inference testing: difference between experimental and sim ulated bombs: the explosion of an 9 experimental bomb does not indicate what will happen every time a bomb e History of experimental phylogenetics studies The earliest studies directly evalua ting phylogenetic histories were conducted with common research species, including animals ( mice , Fitch and Atchley 1985) , plants ( oats , Baum et al. 1984) , and viruses ( T7 bacteriophage , Hillis et al. 1992) . The first two concerned the res ults of artificial selection and as such were limited in that the populations experienced relat ively minimal evolutionary change even over decades or centuries, timespans of great length to humans but hardly of note with respect to each rate of evo lutionary change . Further , the histories were incompletely known . The work of Hillis et al. (1992) was revolutionary for the field of experimental phylogenetics in that theirs was the fir , having been produced through careful laboratory experimentatio n for th e purpose of testing phylogenetic methodologies . Research using bacteriophage T7 The goals of Hillis et al. (1992) were twofold an exp erimental phylogeny using a system that could undergo considerable evolutionary change and to use the resulting phylogeny to test various methods of phylogenetic inference (Hillis et al., 199 3) . Because they knew the true evolutionary history in the lab , they could evaluate the accuracy of phylogenetic inference. Further, they argued that the perf ormance of an inference method with an experimental phylogeny provides support on its performance with (Hillis et al., 1993) . The primary results of Hillis et al. (1992) were that their rooted eight - taxon symmetric and approximately ultrametric tree topology was correctly inferred fro m restriction site data by parsimony in addition to neighbor - joining (NJ) and three other distance methods . On the other hand, branch lengths were incorrectly inferred by all methods , although parsimony was the least in accurate of the five methods . Parsimo ny was the only method of those tested that can infer ancestr al character states for internal nodes, and it did so with very high accuracy (98.6%). In a subsequent publication , Hillis et al. 10 (1994) reported analyses using sequenced DNA regio ns of the viral genomes. The nucleotide dataset had approxima tely one - third as many variable sites than that of the restriction site dataset. In a comparison , of maximum likelihood, parsimony, NJ, and two other distance methods, only parsimony inferred the correct tree topology. The researchers then controlled for t he proportion of variant sites by creat ing bootstrapped samples of restriction site and nucleotide datasets. Overall, the restriction site data still outperformed the nucleotide seque nce data for each analysis method except maximum likelihood, with parsimo ny being the most accurate among nucleotide - inferred trees and NJ performing best among restriction site analyses. The researchers attributed the better performance of restriction sit e data to its presumed greater independence among characters than in nucl eotide sequence s . This phage study (Hillis et al., 1992) has received various criticism. Sober (1993) complained that it lacked a suffic ient discussion regarding how the model of the evolutionary process used in the laboratory was similar to that found in nature. In response, Hillis et al. (1993) lecular evolution, while not representative of most systems in nature, re mained within the range of known processes occurring in nature. They single taxon chosen fo over the course of this experiment was later expanded upon by the original researchers (Bull et al., 1993) and others (Oakley and Cunningham, 2000) . Sober (1993) also suggested that the primary results of parsimony and distance methods agreeing on the true topology, might have arisen from the combination of uniform rates of evolution and equal branch durations in the evolutionar y history, which would be sufficient to guarantee statistical consistency between these methods. He even suggested that their result of agreement between models might be evidence that the evolutionary processes operating in the laboratory environment are s ignificantly different than in nature, since datasets from nature rarely produce agreement among phylogenetic inference methods. Hillis et al. (1993) responded that it is neither necessary nor suff icient for establish ing the accuracy of an inference by showing agreement 11 among methods, as had been notably shown with the case of long branch attraction (Felsenstein, 1978) . Hillis e t al. (1994) advocate d that a range of topologies and experimental conditions should b e explored in future experimental phylogenetics studies, since no single set of conditions is representative of nature . Sober (1993) concluded that experimental phylogenetics studies ma y suggest the types of evolutionary processes that allow or inhibit methods from inferring the true h istory Since simulations cannot demonstrate that nature obeys its assumptions, a combination of simulation and empirical approaches a re necessary to improve phylogenetics (Hill is, 1995; Hillis et al., 1994) . In follow - up stud ies in which th ese viral data (Hi llis et al., 1992) w ere directly compared with simulated data , the general concordance between the se investigatory approaches was demonstrated (Bull et al., 1993; Hillis and Bull, 1993 ; Oakley and Cunningham, 2000) . By creating simulated datasets varying in topology, branch lengths, mutation rates, and number of characters, Hillis and Bull (1993) investigated the approxima te range of conditions within which their vir al populations evolved. This work provided the still cited benchmark of 70% for bootstrap support values as indicating true clade relationships, and therefore demonstrating that bootstrap values are conservative measures of phylogenetic accuracy (Sleator, 2011) . Similarly, parametr ic bootstrapping was used to create simulated data modeled using detailed conversion and reversion rate estimates from their viral system (Bul l et al., 1993) . The primary result s were that parsimony, NJ, and a second distance method inferred the correct tree with consistent success , though NJ did so the most often ; and, as with their empirical data , no method produced accurate branch lengths, though all were close, and parsimony performed the best. Oakley and Cunningham (2000) further analyzed the evolved viral taxa to evaluate ancestor reconstruction methods under models of cont inuous phenotypic characters. The researchers quantified different measures for growth rate for terminal taxa and ancestral taxa at each bifurcation , and t he terminal taxa were used to reconstruct the ancestral character values for each node. They found th at inferred ancestral states were grossly ina ccurate, even 12 when the known ancestor sequence was used to root the tree. This was due to egregious homoplasy in virulence, with convergent decreases due to selection. Computer simulations of continuous characte rs were consistent with these results. Sousa et al. (2008) used the same vira l system and similar methodolo gy as Hillis et al. (1992) to evaluate an asymmetric topology with considerable branch length variation. A fourteen - ingroup taxon evolutionary history was cre ated, with between 3 to 29 lyt ic cycles ( a measure of generation lapse ) occurring between bifurcations. Phylogenetic inference was conducted using Bayesian inference, minimum evolution, maximum likelihood, and the five methods used by Hillis et al. (1992) , and with their same set of restriction enzymes in addition to DNA sequences constituting greater genomic coverage than Hillis et al. (1994) . For the nucleotide data, method s assuming or enforcing a mole cular clock model inferred the correct tree , and other methods produced trees with topological accuracy (i.e., the average of clade accuracy and clade resolvability) of 82% . This superiority of clock - based methods was attribut ed to a strict experimental bo ttlenecking regime thought to produc e a constant rate of change due to genetic drift. For restriction site data, the distance methods inferred the correct tree, while criterion - based methods were at best 91% accurate. For both types of data, the presence of polytomies was the primary culprit for many, although not all, instances of inaccur acy. Unlike in Hillis et al. (1994) , the superiority of restriction site data was not due to greater numbers of variable sites in the datasets, and Sousa et al. (2008) agree that the greater independence among restriction site characters may have contributed to its improvement over DNA sequence data . They additionally attribute the poor performance of their DNA data to the still - low genomic coverage of 12%. Research on the effects of natural selection in o ther systems Molecular convergence, its prevalence due to natural selection and its effect on phylogenetic inferenc e, was evaluated by several research groups. This work was conducted with organisms evolving in nature (Leitner et al., 1996) , in P etri dishes (Bull et al., 1997; 13 Cu nningham et al., 1997; Fares et al., 1998) , and in digital environments (Ha gstrom et al., 2004; Hang et al., 2007, 2003) . Leitner et al. (1996) used two HIV genes sequenced from nine individuals with well - documented epidemiological rel ationships. In HIV evolution, o ne of the genes sequenced is generally under strong positive selection for missense mutations, while the other generally has purifying selection against changes. Seven inference methods were tested as well as several models o f evolution , and datasets were constructed using each gene separately and co ncatenated . Considered a lone, t he gene under greater positive selection produced more accurate clade inference than the gene under purifying selection , and convergent molecular evolution rarely occurred and therefore was inconsequential . T he concatenated d ataset performed even better still , yet the most accurate trees differed by at least one set of clade s . While no combination of phylogenetic method, m odel, and dataset produced the correct tree, the larger the character set (i.e. , using concatenated genes) the more accurate the inferred tree proved to be for many methodological combinations. This demonstrated that differences in these methods abilities were due to algorithmic efficiency rather than consistency (Hillis, 1995) , and thus with sufficient data NJ, maximum likelihood (ML), and parsimony would each perform well. Branch leng th estimates varied and were not very accurate, with short branches overestimated and long branches underestimated, although it is not clear how the a uthors compared the known chronogram to the phylogram inferences, as the rate of evolution was presumably variable among viral populations and over time. Bull et al. (1997) adapted a bacte riophage, X 174, to infect two different hosts in a high temperature environment. Using the ancestor as the outgroup and a seven - taxon ingroup evoluti onary histor y, two different maximum likelihood analyses were conducted, one with a five - taxon evolved lineage ingro up and the other with a nine - taxon evolved lineage ingroup, including two isolates embedded within the history. Both analyses failed to reso lve the true history for these sets of taxa, because convergent evolution, consisting of both parallelisms and rever sals, resulted in over half of the observed substitutions being phylogenetically misleading. 14 Using the laboratory protocol of Hillis et al. (1992) , Cunningham et al. (1997) produced twelve separate lineages of T7 phage evolved from bifurcations of six lin eages that had the same ancestor. The bifurcations were performed after either 10, 20, or 30 l ytic cycles and each final lineage evolved through three series of bottlenecks separated by 50 lytic cycles each, with isolates stored at every bottleneck. Theref ore, the twelve - taxon star chronogram had six variously short internal branches and twelve muc h longer external branches. They observed multiple instances of parallel evolution, including of deletions and nonsense mutations, which the researchers attribute d to the selective environment. From these viral sequences, both terminal and embedded isolate s, Cunningham et al. (1998) modularly ass emble d various four - taxon phylogenies that varied in branch lengths . Taxa were chosen such tha t non - sister lineages show ed greater convergent evolution and had long external branches , with the sister lin e age of each of these having a shortened external bra nch . This would be a severe test of long - branch attraction. Using maximum likelihood, they eva luated the effect the model of evolution had on phylogenetic inference by evaluating six progressively more complex models. With short - to - long branch ratios of 1: 3 or less, model choice had little influence on phylogenetic accuracy, whereas with more extreme ratios, model difference was significant and best - fit models were more successful in resolving the long - branch attractio n effect. Fares et al. (1998) used yet another viral system, foot - and - mouth disease virus, to cr eate a known evolutionary history. Usin g parsimony, maximum likelihood, and distance method analyses, they found that no method produced the true tree, which they too attribute d to convergence due to selection . Digital evolution using the Avida platform w as used to test the effect of natural s election on parsimony , and occasionally NJ analyses , with symmetrical unrooted four - taxon tree topologies . Varying the extent of evolution occurring along internal branch es ( i.e., a single branch when unrooted) and th e external branches, Hang et al. (2003) evaluated different experimentally - evolved branch length combinations in addition to simulated data . Branches were of equivalent evolved duration per level, e.g. , with a ll external branch lineages allowed to 15 evolve for an equivalent length of time. The researchers hypothesized that phylogenetic inference would be improved through the production of synapomorphic variation due to selection during internal branch evolution , and that this benefit would be especially pronounced for longer branches, with greater evolution causing more such variation to occur . Their hypothesis was supported for parsimony analyses, with shorter in ternal branch topologies being no better inferred t han data simulated under genetic drift alone, and longer branches effect on sequence evolution, finding three s ets of loci: fixed, slightly variant, and highly va riant sites. By including this three - tiered proportional model in their genetic drift simulation, they demonstrated that the perceived phylogenetic inference benefit of selection was restored in the simula ted data. How these findings pertain to NJ or with trees with a rang e of branch length combinations is unclear. Hagstrom et al. (2004) used five different selective regimes to evaluate the relative benefit natural selection had for phylogenetic inference. The branch lengths were such that neutral evolution was expected to swamp all phylogenetically informative signal, and that only selection could cause accurate inference. Th eir results included that NJ performed well for man y different selective regimes and that parsimony only performed well when natural selection occurred along the internal branches. Hang et al. (2007) det ailed how natural selection rescues phylogenetic inference, reproducing that significant adaptation along internal branches can cause this, especially if selection is strong and maintained. T his effect was caused by the production of synapomorphic variatio n rather than non - uniform character substitution, as expected. Research summary The overall conclusion to be drawn from the entire body of published experimental phylogenetics research (summarized above) is that various inference methods generally perform well, though with considerable and inconsistent variation among methods and across studies. Further, m any methods seem to be robust to deviations from the assumptions underlying their evolutionary model s although not necessarily character independence and 16 long - branch attraction can have a strong effect . The benefit or detriment of using characters evolved under natural selection remains unclear; natural selection in biological systems tend s to interfere wi th phylogenetic inference by produc ing more homoplas y (although see Leitner et al. 1996) , and natural selection in digital s ystems tends to aid phylogenetic inference by producing more synapomorphy. Most studies remain confined to a limited range of taxa and phylogenetic - difficulty sample space s (e.g. , topologies and branch le ngths), are near ly completely deficient in replicati on, and have rarely evaluated modern systematics methods and tools . Hillis et al. (1992) important void in the science of p fter an initial flurry of research, experime (Oakley, 2009) . To explain why the field has remained small after more than twenty - five years , having offered (Oakley, 20 09) , we must examine the difficulties intrinsic to this methodological approach. E xperimental Evolution i n Undergraduate Education Undergraduate biology courses rarely address the processes of evolution (Alters and Nelson, 2002) , and even mor e rarely focus on examples of these processes in action via experimentation. Rather , courses have tended to focus on the results of evolutionary processes, especially adaptation , and the various evidence s for evolution , for example morphological, molecular , or developmental similarity in characters among taxa (Mead and Mates, 2009) . Although e xperimental evolution has - opener to (Kawecki et al., 2012) , it has only recently been i ncluded in some introductory biology t extbooks and laboratory curricula, with many still lack ing this topic coverage (Burmeister and Smith, 2016; Hillis, 2007) . Biological evolution can be difficult if not impossible to demonstrate or explore experimentally in the classroom. Providing opportunities for students themselv es to participate in experimental evol ution lab exercises is challenging 17 and finding a system in which they can conduct student - centered inquiry - based experimental evolution research is much more challenging still. A few activities that allow students to e ngage with experimental evolution labo ratory exercises have recently been produced. As with experimental evolution research generally, these activities are limited to a few model systems amenable to such work, namely microorganisms and, rarely, insects. Un fortunately, most of t hese activities are even further limited in that only one or a select few biology topics can be presented with each system . Labs with bacteria have been created to study mutation and adaptation, specifically antibiotic resistance (Krist and Showsh, 2007; Petrie et al., 2005) , or adaptive radiation, niche colonization, and relative fitness measurement (Green et al., 2011) ; yeast have been used to stud y social evolution and inclusive fitness theory (Agren et al., 2017) , or the origin of multicellul arity either over many generations of experimental evolution (Ratcliff et al., 2014) , or its limited initial evolution via rotifer predation (Pentz et al., 2015) ; and insects have allowed the study of allele frequency change in Drosophila (Plunkett and Yampolsky, 2010) and sexual selection and operational sex ratios in bean beetles (Cotner and Hebert, 2016) . Finally, se veral activities explore the timing of mutation with altered Luria - Delbruck fluctuation tests or similar protocols with either bacteria or yeast (Green and Bozzone, 2001; Handelsman et al., 1997, p. 19997; Robson and Burns, 2011; Smith et al., 2015) . Nearly all rely on inauthentic biological investigations with limited opportunities to learn and engage in the practice o f science, from asking questions and proposing hypotheses, through gathering and analyzing data, and ultimately synthesizing and presenting research conclusions. A sole exception, the bean beetle system presented by Cotner et al. (2016) , allows student inquiry and independent investigation, although the range of experiments possible are barely classifiable as experimental evolution since only one or very few generations can be observed over many weeks. T hese activitie s have several disadvantages, some of which are common with most laboratory curricula. These drawbacks include detailed laboratory protocols that must be closely followed with little room for error , expensive equipment or profuse disposable 18 r esources, and the maintenance of precise environmental conditions. These concerns can be of particular importance when contamination or in equivalent growth environments would impede fitness calculations (Green et al., 2011) . Other drawbacks owe to the dif ficulties in navigating utility, realism, and generality with study systems used for classroom experimental evolution research, as discussed previously. Among the most consequential is the lack of suitability these activities and model system s have for eng aging in student - centered inquiry - based experimental evolution research. T hese concerns are ameliorated with the Avida - ED system and its curriculum that intentionally integrate s science and engineering practices . Balancing Utility, Realism, a nd Generality in Experimental Evolution For either the evaluation of phylogenetic methodologies or the observation of evolution within a classroom , the complexity of the task requires a considered balancing between the pairwise interplay of utility, realis m, and general ity . Although my specificity in describing these tradeoffs is novel, researchers employing or critiquing experimental phylogenetics, especially Oakley (2009) , have ar gued similarly. Using my formulation follows: W hile experimentally gen erated histories maintain greater biological realism than simulations, this comes at the significant expense of reduced utility in the form of time req uired to produce an evolutionary history. He also stipulates that with both simulations and experimental phylogenetics it is necessary to assume that the operating evolutionary processes apply with generality to other systems. In the analysis presented her e , utility encompasses experimental feasibility, resource cost (e.g. , time and money ), and other diffic ulties such as technical expertise , r ealism entails the known and unknown complexities of biological evolution , a nd g enerality , also called univers ality, is the ability for inferences drawn from scientific data to be applicable, through induction, more broadly . I also include within generality the ability of a system to display open - ended evolution and be useful to explore many avenues of research . T hese three factors produce tradeoffs that must be navigated within both experimental phylogenetics resear ch and classroom experimental evolution labs . 19 Experimental utility versus biological realism Th e primary critique of experimental phylogenetics focuses on this tradeoff (Oakley, 2009) . For example, Hillis et al. (1992) sought a system in which they could manipulate and observe long - term evolution with expected population genetic dynamics, e.g. relatively low mutation rates and molecular sequence divergence due to both natural selection and genetic drift. They opted for a system with a high mutation rate and employed stringent, repeated bottlenecks (Bull et al., 1993) . Through this aim of increasin g the proportion of phylogenetically informative genetic variation, while maintaining selection for viability, the researchers approximated the molecular evolution that they desired but , reasonably, did not wish to expend the resources , e.g., time and cost , to produce in a more biologically realistic manner. Even so, this or similar approaches can t ake months or years of laboratory effort to generate a single instance of a known phylogeny (Hillis, 1995) . At the extreme, an example of reducing realism in an experimental system for the sake of utility is the harnessing of hypermutageni c polymerase chain reaction to evolve DNA sequences entirely separate from their organismal con text (Randall et al., 2016; Sanson et al., 2002; Var tanian et al., 2001) . As have others before me , I consider this to be in vitro simulation and not experimental phylogenetics, per se. Arguably the most awe - inspiring example of classroom experimental evolution is the laboratory system presented by Ratcli ff et al. (2014) in which yeast cultures are repeatedly transferred after allowing time for cells to settle at the bottom of a test tube; a protocol that reliably produc es clusters of cells that can be analogized to the origin of multicellularity. This is a fanta stic means by which students may observe evolution in action. Yet, as argued by Pentz et al. (2015) , this expe riment suffers from two significant flaws related to the utility - realism tradeoff: The select ive agent is contrived a similar agent is thought to have played no role in the repeated evolution of multicellularity. And, even so, the experiment requires su bstantial investment in time and resources, requiring two weeks of daily transfers and hundred s of sterile media tubes. Pentz et al. (2015) produced an alteration that navigates the utility - realism tradeo ff differently by using a rotifer predator to present a more plausible selective agent for 20 mul ticellularity, and thus greater realism. This protocol does reduce time and resource use , although rotifers require more technical expertise to manage. Overall, t he balance between utility and realism is extremely tough to manage in experimental evolution. Experimental utility versus generality An ideal that has not yet been widely adopted in experimental phylogenetics is the production of replicate datasets. Doing so could provide crucial insight regarding the likelihood of phylogenetic methods to produce correct inferences more generally. Additionally, doing so (Bull et al., 1 993) . As argued by Bull et al . (1993) notion that when replication is lack i ng , the reliance on a few datasets reduces the statistical utility of empirical observations. Unfortunately, replication has been prohibitively difficult to accom plish when a single replicate is costly to produce. Out of eighteen publications regarding exp erimental phylogenetics , only five unique experimental phylogenies using biological organisms have been produced for the purposes of this research (Bull et al., 1997; Cunningham et al., 1997; Fares et al., 1998; Hillis et al., 1992; Sousa et al., 2008) , plus one additional study using a convenience dataset of naturally evolving taxa (Leitner et al., 1996) , three publications employing digital evolution to evolve known evolutionary histories (Hagstrom et a l., 2004; Hang et al., 2007, 2003) , and the remainder either reexamining these datasets or commenting on them . Perhaps an even more stark instance of the utility - ge nerality tradeoff is the lack of model systems amenable to experimental phylogenetics or classroom experimental evolution . Oakley (2009) observes that experimental systems will only have utility if they can produce evolutionary histories within months or less; and such a system would require life history traits outside the norm, including short generation t imes and rapid rates of evolution. T he taxonomic divers ity between s tudies in which known non - experimentally evolved phylogenies are evaluated is high, including mice (Fitch and Atchley, 1985; Sage et al., 1993) , oa ts (Baum et al., 1984) , and HIV (Hillis et al., 1994 ; Leitner et al., 1996) . Yet the range of taxa used in 21 experimental phylogenetics is ex tremely limited v iruses and Avidians. As would be expected, the systems used in biology classrooms to observe evolution over many generations are of similar ly limited taxonomic diversity, with bacteria (Green et al., 2011; Krist and Showsh, 2007; Petrie et al., 2005) , yeast (Ratcliff et al., 2014) , and Avidians (Sp eth et al., 2009) . Of course, other systems have been used to observe one or a few generations of evolution in the classroom, e.g. bean beetles (Cotner and Hebert, 2016) , or experimental evolution res earch other than experimental phylog enetics, e.g. Drosophila (Burke et al., 2010) and Arabidopsis (Scarcelli and Kover, 2009) . This suggests that the number of systems suitable to experimental phylogenetics research or extended classroom experimental evolution is greatly limited for reasons of utility . Biological realism versus generality As noted by Bull et al (1993) special interest by itself, so the value of the study must rest on its generality to other s The researchers continue by the specific model of molecular increasing the level of molecular r esolution, we have discovered features that render the experiment unique, hence less (Bull et al. , 1993) . Depending on the inference methodologies under examination, increased modeling of molecular evolution complexity may necessitate increased taxonomic specificity and thus reduced general applicability to other systems. This realism - generality tra deoff is highlighted by classroom experimental evolution protocols focusing on individual systems with extremely limited diversity in content exploration. In fact, model lab systems in which students can engage in student - centere d inquiry - based experimenta l evolution research using biological organisms is entirely lacking. Thus, these classroom labs maintain realism for specific biology content while near - completely losing the generality of content exploration. In the modified evo lution of multicellularity lab (Pentz et al., 2015) , the incorporation of a plausible selection story increases the biological 22 realism of this major transition in evolution. However, neither the in itial variation in the tend ency to form multicellular clumps nor generations of change occur within this exercise. In this case, not only can students not experiment using a fully open - ended evolution system capable of facilitating multiple avenues of rese arch, but they cannot even observe evolution in action. Digital evolution offers a complementary balance By using digital evolution systems, these utility - realism - generality tradeoffs can be navigated differently . Doing so fill s a void between computation al simulations and experime ntal evolution with biological organisms (Ofria, 2015) . The use of digital evolution in experimental phylogenetics research or in classroom experimental evolution entails greater bi ological realism than simul ations, and greater utility and generality than biological systems. Importantly, none of these approaches serve as a substitute to the others. Experimentation with digital evolution entails much greater utility compared to biolog ical systems but less so th an simulations. Otherwise impossible experiments can be conducted in digital systems; for example, in the work of Hagstrom et al. (2004) , a condition of two of their selecti on regimes was only feasible in a digital evolution system offspring with non - neutral mutations were artificially sterilized by the environment, such that only neutral evolution could occur in the population. Generati ng an evolutionary history is relativ ely quick and easy using a system like Avida. Experiments can take minutes (e.g., Chapter 4), hours ( e.g., Chapter 2) , or weeks ( e.g., Chapter 3) compared to much less and much greater time, respectively, for simulation s and biological life, with approxima tely proportional cost in terms of computational resources with respect to simulations and overall costing much less than biological experiments. With respect to Avida - ED, r esource cost is minimal since the program, cur rently version 3. 2 , is free to use an d adequately runs on nearly all web - enabled devices, including computers, tablets, and smart phones, although its operation is quicker depending on available computing resources. This makes the observation of curious ph enomena possible within a single clas s session (e.g., Chapter 4) . As is apparent when new instructors first 23 implement Avida - ED in their classes or when scientists conduct their first experiment in Avida, t he degree of human expertise in the form of technic al knowledge and skill required for d igital evolution experiments is more akin to biological experiments than simulations. The latter requires toggling established parameters, while the former each require specialized knowledge of the organismal system and manipulation of the organism and its environment. (No comparisons are intended regarding the initial creation/ discovery/ modification of each approach.) Bull et al. (1993) conclud - However, these authors were unaware of digital evolution systems th at make this possible. Digital organi sms, such as Avidians, experience complex evolutionary and population genetic processes found in nature, thus achieving a degree of biological realism greater than that of simulations. The evolution that takes place with digital organisms is complex ( e.g., Lenski et al. 1999, 2003; Wilke et al. 2001; Chow et al. 2004; G oldsby et al. 2012; Covert et al. 2013) , and more so than with simulations of molecular evolution. For example, bounds on the presence and relative extent of mutation, recombination, and selection can be imposed upon the system by adjusting its genetic and environmental conditions. Yet the distribution of mutation effects, the character of mutations and substitutions, and the prevalence of linkage (under recombination) will organically change during population ev olution. As with biological organisms, epi stasis and genetic drift will always occur for Avidians. The researcher has control over aspects regarding organismal population size, yet less so regarding effective population size, which is a byproduct of the sy stem. Other evolutionary factors such as m igration and mate choice can be experimentally manipulated as well . Yet Avidian genetic and environmental substrates are much less complex than with biological life. For example, there is a lack of genetic modulari ty (e.g. , chromosomes and well - defined gen es), genetic expression and levels of information storage and use (e.g. , transcription and translation), and intricately complicated environments (e.g., nuanced chemical and physical interactions ) . Aspects related to all the above can be readily measured a nd compared to biological ranges with the goal of attaining 24 greater biological realism (e.g., Chapter 3) . Because the population of Avidians undergoes evolutionary change, experiments performed using Avida - ED resul t in actual research data amenable to hypo thesis testing , allowing students to learn and engage in science practices (e.g., Chapter 4) . For example, students can model their experiments after noteworthy biological counterparts , and compare their observatio ns and evidence - based reasoning to that of practicing scientists . While true with the study of any model system, the generality of research involving digital organisms must be approached carefully when designing experiments and drawing conclusions. The app licability of research produced using digital evolution systems can be case - specific and in large part owes to the degree of bi ological realism required. Fortunately, this can be measured with great utility because digital evolution approaches can generate vast quantities of data regarding the process and history of molecular evolution ( e.g., Chapter 3 ) . Similar data can be very d ifficult to produce in biological systems, if possible at all (Hang et al., 2003) . For example, one can readily observe millions of generations of evolution while ac curately recording copious measurements and maintaining the entire series of genealogical relationships and ancestral organisms, enough disk space withstanding . Students using Avida - ED have an opportunity to participate in the generation of large datasets from which they must extract salient information, a task common to many modern biological datase ts . Classroom discussion is neces sary to both introduce Avidians and explore how research using this system is applicable to biological life. An analogy that he lps students at least initially understand Avidians is a direct comparison to bacteria (Johnson and Lark, 2018) , although further discussion should treat Avidians as being their own system that shares general ities with all life, as with any model study system. Whereas replication has been prohibitively costly with biological systems, c omputational systems such as digital evolution and simulations alike are highly amenable to replication. The fact that Avidian evolution is an instantiation of evolution means that there is a richness and flexibility in the platform that cannot be matched by any tool 25 designed for a specific purpose. Avidians constitute an ideal study system to observe and experiment upon evolutio n in action in an open manner. 26 CHAPTER 2 : D igital Evolution Provides Direct Tests of Phylogenetic Accuracy Introduction Phylogen etics, the inference of evolutionary relationships , is a central goal in biology (Hillis, 1995) . The accuracy of this inference process is crucially important because phylogenies are used to support researc h throughout all of biology. While phylogenetic methodologies have flourished, their accuracy has remained difficult to test since we lack true evolutionary histories observed in nature to compare to those inferred by phylogenetic methods. Therefore, compu tational simulations have been the primary means for evaluation of such methods (Hillis, 1995; Huang et al., 2017) . While simulations provide insight re garding a range of theoretically important conditions, they are less suitable for evaluating combinations of complex factors and c annot address emergent or unknown properties of complex evolving systems (Arenas, 2012) . The generation of known evolutionary histories experimental phylogenetics is a complimentary approac h. - up research by David Hillis, James Bull, and colleagues in the early 1990 s was revolutionary. By presenting a complete T7 bacteriophage evolutionary history, various molecular data sets, and phylogenetic analyses, this work inaugurated the field of experimental phylogenetics. First, Hillis et al. (1992) produced an experimental history in a carefully - controlled laboratory environment and evaluated various methods of phylogenetic inference. Then, other types of molecular evolution data from this experiment were u sed to investigate the accuracy of bootstrap support values (Hillis and Bull, 1993) , characterize in d etail the DNA sequence evolution (Bull et al., 1993) , and evaluate clade resolution between various phylogenetic methods (Hillis et al., 1994) , among other analyses. Together, this experiment and subsequent analyses stand as the foundational and most - impactful work produced in experimental phylogenetics, especially since this field has largely remained dormant for at least the past decade. 27 The experimental approach to evaluating phylogenetic methodologies has suffered from an imbalance of utility , realism, and generality compared with computational simulations. As argued in a critique by Oakley (2009) , the greater biological realism found in experimentally generated evolutionary histories comes at the significant cost of reduced utility via time and resources an d low generality, or universal applicability, to other systems. This critique seems to have been well - placed, as evidenced by a co mplete lack of published experimental argued that digital evolution provides an approach that navigates the tradeoffs among utility, realism, and generality differently , filling a void between computational simulations and experimental evolution with biological organisms. Specifically, digital evolution entails greater biological realism than simulations, and greater utility and generality than biological systems. Here I present research to test the hypothesis that digital evolution using Avida is an effective model system for experimental phyl ogenetics research. I first evaluate the concordance of results from Avidian digital evolution treatments designed to reproduce th e basic molecular evolutionary dynamics and phylogenetic inferences of the work of Hillis and colleagues using the T7 system. The resulting correspondence between these systems under comparable conditions is evidence that Avida is a satisfactory system for experimental phylogenetics. I then further demonstrate that digital evolution entails greater utility and generality than bio logical systems by presenting digital evolution research under a range of theoretical phylogenetically - challenging scenarios, the scope of which the T7 work did not address. Across 21 additional Avidian experimental evolution treatments, the effects of nat ural selection, recombination, and differing extents of lineage evolution within an evolutionary history are investigated. The res ults of these treatments were hypothesized to correspond with predications based on phylogenetics theory and prior research us ing computational systems. The completion of such a series of treatments is evidence of the greater utility and generality of digi tal evolution, because doing so with a biological system would be much more costly if not altogether impossible. 28 Taken togeth results under similar conditions and a further demonstration of the experimental possibility of digital evolution with the correspondence of results with theory and computational systems constitute evidence that digital evolution using Avida is suitable for experimental phylogenetics research. Phylogenetic analysis methods The aim of phylogenetics is to produce hypotheses of evolutionary relatedness subject to falsification and/or statist ical measure, and has progressed from distance - based methods to criterion and model - based methods (Felsenstein, 2004; Nei and Kumar, 2000; Yang, 2006; Yang and Rannala, 2012) . An understanding of the philosophies and limitations underlying these methodologies is important for considering what makes an evolutionary history t heoretically facile versus challenging to infer. Distance - based methods use relatively simple algorithms to construct a phylog eny from a dataset based on overall similarity. With the need to analyze huge amounts of data, especially of molecular sequences, computationally efficient algorithms such as the phenetic distance - based algorithms of UPGMA (unweighted pair group method wit h arithmetic averages, Sokal and Michener 1958) and, especially, NJ (neighbor - joining, Saitou and Nei 1987) were adopted in the late 20 th century. NJ is a clustering method that uses its highly efficient algorithm to produce a single result the NJ tree. This me thod is purported to produce a very good approximation; in fact, if the distance matrix is an exact reflection of the true tree then NJ is guaranteed to determine it. Importantly, the algorithm is efficient enough to easily be manageable for hundreds or mo re taxa. For these r easons, NJ is very useful for analyzing large datasets that have a low degree of sequence divergence. However, a NJ tree is often not relied on to be a good hypothesis of an evolutionary history, since it is an approximation whose resul t cannot be directly compared to other trees within its framework (Felsen stein, 2004) . Ins tead, more sophisticated approaches rely on the optimization of a statistical criterion. 29 Maximum parsimony (MP, Swofford 1998) , maximum likelihood (ML, Felsenstein 1981) , and Bayesian inference (BI, Huelsenbeck et al. 2001) rely on the comparison of phylogenetic trees using an optimality criterion. An optimality criterion is a characteristic upon which a comparison can b e made, and phylogen etic inference is therefore an optimization search among the evolutionary hypotheses evaluated. These methods are generally superior to distance methods because instead of collapsing character state data into a single difference value, thus discarding subs tantial information, they preserve all available information by comparing sequences in the alignment, considering each site (i.e., character) at a time. However, criterion - based methods are not as computationally efficient as distance - b ased methods, especi ally because they rely on evaluating the optimality of all possible trees. This is generally cost - prohibitive, so a heuristic search algorithm is used to search within the space of all possible tree topologies. The optimality criterion under MP is the pars imony score (Swofford, 1998) . This is the tot al number of character - state changes necessary for a phylogenetic inference to explain the observed taxa - character dataset. This criterion provides a philosophically justifiable approach; homology should be assumed a priori and the hypo thesis that requires the fewest ad hoc assumptions is the most preferable. This approach resulted in the distinction between apomorphy and plesiomorphy, and thus informative synapomorphies versus symplesiomorphies, and finally homology from homoplasy. If w e are to deem phylog enetics as valuable, then we should want to maximize its utility. MP was shown to not have maximal utility (Felsenstein, 1978) . When the amount of data gets larger, a statistically consistent method should converge on the correct answer, and MP does not have this p roperty under certai n circumstances. Model - based methods challenge MP regarding statistical justification, since it has been shown that a MP result is an approximation of a ML result only if the rates of change (i.e., branch lengths) are sufficiently small (Felsenstein, 2004) . Statistical phylogenetic inference methods, incl uding ML and BI, use a probabilistic model of evolution to produce robust inferences. These models vary in their level of 30 parameterization with respect to differences in character state frequency and change. For example, with models of nucleotide evolution successively more c omplex, all state changes might have equal rates the one - parameter JC model (Jukes and Cantor, 1969) , unequal base frequencies the four - parameter F81 model (Felsenstein, 1981) , different transition and transversion rates the five - para meter HKY model (Hasegawa et al., 1985) , and unique rates between all character states the ten - parameter generalized - time - reversible, or GTR, model (Tavaré, 1986) . Additionally, among other inclusions to the mo del of evolution, ea ch of these can have one or two parameters of site - to - site rate heterogeneity modeled by allowing a proportion of invariable sites and/or by using the gamma distribution (Yang, 1994) . While ML and BI differ on how and to what degree these parameters are estimated, the aim is to identify the set of parameters that best fit the entire model: the model of evolution, tree topology, and branch lengths. The val ue that summarizes this fit i s the optimality criterion. Maximum likelihood's optimality criterion is the likelihood value (Felsenstein, 1981) , which is the probability of the data given the tree and model of evolution. It is calcu lated using the probabilities for character state changes among all possible ancestral reconstructions, with branch lengths and model of evolution parameters optimized, and by assuming both character and branch independence throughout the tree. Statistical ly, each tree topology is a m odel, and the parameters are the branch lengths and substitution parameters. Thus, under ML inference, very many statistical models are iteratively compared, and classical confidence intervals cannot be constructed (Beerenwi nkel and Siebourg, 2012; Yang and Rannala, 2012) . ML has the desirable statistical properties of being unbiased, consistent in that it approaches the true value with greater data analyzed, and efficient in that it has the sm allest variance among unbiase d estimates, though these properties might not hold under all circumstances, especially if the substitution rate model is inaccurate (Yang, 2006) . Bayesian inference's optimality criterion is the posterior probability (Huelsenbeck et al., 2001) , the probability of the tree and model of evolution given the data. The posterior is directly related, through the Bayes Theorem, to the product of the likelihood calculation and 31 the prior probability of the tree and model of evolution, and in versely to the probability of the data. Therefore, ML only concerns the data information, and its results are interpreted with respect to the data only, while BI uses additional information for the calculation of the posterior, a result that is interpretab le with respect to the evolut ionary inference. An essential difference between ML and BI is the prior. The prior probability of the tree and model is a set of probability distributions that are subject to the researcher's a priori ideas drawn from other da ta sources (Baum and Smith, 2013) and/or attempts to be as uninformative or objective as poss ible so that the calculation of the posterior is strongly inferred from the likelihood (Beerenwinkel and Siebourg, 2012) . In practice, as with many phylogenetic methodologies, (Brown et al., 2010; Yang and Rannala, 2012) . The probability of the data te rm is challenging, since it r equires synthesizing across all possible trees, branch lengths, and model parameters. Markov chain Monte Carlo (MCMC) methods, specifically using the Metropolis algorithm, allow the sufficient sampling of the probability distri bution as long as mixing betw een algorithmic chains has been sufficiently conducted. Together, these distributions result in a major deviation from the ML approach in that under BI the posterior of every parameter is a probability distribution, or credible interval. This results in mu ch greater information than a single ML estimate for each model parameter. For example, the probability of a clade, also called the clade credibility, can be expressed as a point estimate or as a distribution within a probabili ty range, and, theoretically, with a uniform prior the mode of the posterior is equivalent to the ML estimate (Beerenwinkel and Siebourg, 2012) . BI maintains the statistical consistency and other sought properties of ML (Steel, 2013) . The BI posterior probability is what one expects from statistics, since it lets us di rectly compare the probabilit ies of hypotheses given data. The intrinsic problem with classical statistics, including ML, is that these methods produce statements about the probability of the data or the method for analyzing the data, for example the proba bility that identical analyse s of data drawn from a statistical population will contain the true parameter value within a 32 confidence interval. BI produces statements about the probability of the parameter of interest, for example the credible existence of an evolutionary relationship by a posterior probability density interval. The former is a statement on a population of parameter value analyses; the latter is a probabilistic statement about the actual value of the parameter. This philosophical distinction is exemplified by the constr uction and interpretation of measures of support for clade relationships in phylogenetic analyses. Clade support evaluation Clade support or uncertainty in phylogenetic inference is often measured with nonparametric bootstrap s upport values and posterior p robabilities, although these metrics are quite distinct. Under classical statistical approaches, e.g. , using MP and ML, nonparametric bootstrapping is used to craft statements regarding the statistical estimate of the clade giv en the data. In bootstrapping , the characters of the data are sampled with replacement to generate bootstrap pseudo - samples for the taxa - character dataset. Each bootstrap pseudo - sample replicate is analyzed identically to the actual dataset, and this sampl ing and analysis process is r epeated hundreds or thousands of times. The proportion of trees among the bootstrap analyses that contain a particular clade is the bootstrap support value for that clade. This is used to determine the relative influence of the characters in the data. If t he original data sample accurately reflects the phylogenetic information of the taxa, then the bootstrap will reflect experimental repeatability. As such, a clade's bootstrap support value is a function of the likelihood of the phylogenetic data for that c lade. T he bootstrap method has remained difficult for many to interpret (Yang and Rannala, 2012) , and is often erroneously considered a measure of clade accuracy (Hillis and Bu ll, 1993) , while the BI post erior probability is easily and directly interpretable. A posterior probability is a measure of what proportion of the analysis a region of the multidimensional parameter - space is sampled within the BI search process. The entir e space constitutes all the i nformation of the prior probabilities for the tree topology, branch lengths, and free parameters in the model of evolution as well as how that information relates to the observed data. The more often a 33 specific region of parame ter space, for example a clade in tree space, is sampled, the greater the probability it is accurate. The difference between bootstrap support and posterior probability is the fundamental distinction regarding the analysis outcome the probability of the data (ML) and the prob ability of the hypothesis (BI). As bootstrapping is a resampling of the data, MCMC is a resampling of the hypothesis. A bootstrap value is the clade's support with respect to the experimental repeatability of the data, while a posteri or probability is the clade's support with respect to the possible clades containing those taxa. presentation of either fully resolved or partially resolved trees. Ful ly resolved trees incl ude the single best tree resulting from an analysis, such as the NJ result or the tree with the greatest likelihood under ML, lowest parsimony score under MP, or greatest posterior probability under BI. The latter is also termed the m aximum clade credibili ty tree, and although this tree is often fully resolved, it is not guaranteed to be. Nonparametric bootstrap replicates or clade probabilities can be used to construct a fully resolved tree that considers clade uncertainty using the m ajority rule extended (MRe) algorithm, which produces a so - called greedy consensus tree (Bryant, 2003) . This approach constructs a tree by starting with the most - supported clades and successively adding non - conflicting clades in order of greatest support. A MRe tree is fully resolved in that it lacks polyt omies, although it may include clades that have very low support albeit without conflict to other, often greatly supported, clades. A different approach, the standard majority rule (MR) consensus algorithm, uses a threshold value with clades that have insu fficient support being collapsed into polytomies. Any threshold of 50% or greater may be used, and MR trees with greater thresholds, in addition to the MRe tree, are always resolutions of the 50% MR tree, such that they will not inclu de conflicting clades but might contain additional resolved clades (Degnan et al., 2009) . Hillis and Bull (1993) provided the threshold of 70% bootstrap s upport as indicating a ccurate clades by demonstrating that bootstrap values are conservative measures of phylogenetic accuracy (i.e., accuracy greater than 95% when bootstrap support is greater than 34 70%), and this result has become a commonplace rule of th umb ( e.g., Sleator 2011) . In sim ulation studies, bootstrap values have been shown to be highly conservative estimates of clade accuracy, and while posterior probability values have been shown to be closer estimates, they may in fact be overly liberal estimates (Cumm ings et al., 2003; Wilcox et al., 2002) , with values being approximately 100% so often as to merit suspicion (Yang and Rannala, 2012) . In practice, a variety of threshold values are used, and trees presented of ten display bootstrap support values and/or posterior probabilities for every clade, or at least those deemed significant. Recently - published research shows a diversity in threshold values and trees presented. A review of the 19 empirical phylogenetics ana lyses, pre - published o nline, for the August 2018 volume of Molecular Phylogenetics and Evolution presents the diversity of support thresholds currently used in the literature. For example, Psonis et al. (2018) used a node coloration scheme to present six categories of joint disagreement between BI and ML analyses: four colors represented a posterior probability of 100% and bootstrap values of ei (Liu et al., 2018) support. Kim et al. (2018 ) 50% as m oderate support, whereas Zhang et al. (2018) ate support, and < 70% as weak support. The trend of posterior probabilities being much higher than bootstrap support values is widely exhibited with empirical work, as is disagreement among qualitative descriptions of bootstrap support thresholds. These 1 9 st udies also present trees with a diversity of clade support: Single best trees include maximum likelihood or maximum clade credibility trees in addition to MRe consensus trees. Trees exhibiting polytomies or with annotations that clades should be consid ered as such include majority rule thresholds of 50%, 70%, 95%, and 99%. 35 Overall, measures of clade support, including both clade accuracy and resolvability, appear to be highly valued among researchers, although with different weighting and interpretatio n, a nd largely without ground truthing other than the 70% bootstrap support threshold provided by Hillis and Bull (1993) . Overview of T7 phage experimental phylogenetics research Experimental evolutionary history The viral growth environment a nd experimental methodology of Hillis et al. (1992) were designed to provide ideal conditions to create phylogenetically informative variation between lineage s. Phage were grown with a che mical to increase the mutation rate, and repeated bottlenecking occurred throughout the within - branch (i.e., anagenic) evolution of each lineage. These factors increased the extent of evolutionary change that occurred during t he experiment. Additionally, n ew lineages at divisions (i.e., cladogenesis) were seeded with single - cloned populations, therefore eliminating lineage sorting or coalescent - type variatio n by fixing variation segregating in the population . T he mutagen , anage nic bottlenecking , and cladoge nic bottlenecking presumably increase d the phylogenetically informative variation by increasing inferred internal branch lengths. The experimental population sizes, relatively small for a virus, and repeated bottlenecking also decreased phylogenetically mi sleading variation by increasing the influence of genetic drift and thus decreasing the possibility of homoplasious evolution via natural selection with parallel or convergent evolution. The phage were serially grown, div ided, and transferred at periodic intervals in a predetermined manner to create known evolutionary relationships between the resulting nine viral populations. The eight terminal ingroup lineages experienced division at equivalent intervals to create a symm etrical, binary tree - like evolutio nary history, and the outgroup lineage evolved for nearly an equivalent total length of time, 105 lytic cycles compared to 120 total cycles for each ingroup (Bull et al., 1993) . This desired ultrametric topology would be predicted to have exhibited uniform and constant rates of evolution, with all ingroup lineages experiencing the same mutagenic growth enviro nment and having equal duration, a nd 36 therefore an equivalent likelihood of change (Sober, 1993) . However this was not guaranteed to have occurred because the researchers fixed the extent of time between cladogenesis, rather than the degree of evolutionary change, which itself was a natural pro duct of the evolving system (Hillis et al., 1993) . This ultrametric, symmetrical eight - ingroup taxon tree shape (i.e., topology) was chosen due to its predicted ease for phylogenetic inferen ce, and with the intention that it could be used as a null model or best - case scenario for comparison with other topologies. If this best - case scenario viral growth environment and topology failed to produce data that resolved correctly then there would no t be anything gained from attempti ng to evolve organisms in more realistic environments using topologies predicted to be more phylogenetically difficult (Bull et al., 1993; Hillis et al., 1993) . Phylogenetic and molecular analyses The resulting evolutionary history was then used to evaluate phylogenetic methods. Hillis et al. (1992) used restriction - site mapping for 34 restriction enzymes to create a taxon - character dataset containing 202 characters, excluding sites invariant across all taxa. Thes e data were used to infer and then compare the actual history with inference s produced using five phylogenetic methods. These five methods included MP, along with UPGMA, NJ, and two other distance methods, Fitch - Margoliash (Fitch and Margoliash, 1967) and Cavalli - Sforza (Cavalli - Sforza and Edwards, 1967) . For this tree of eight ingroup and one outgroup taxa there are 135,135 possible rooted bifurcating tree topologies (Felsenstein, 2004) , so a correct tree inference was unlikely to occur by chance alone. All five methods correctly inferred relations hips among taxa . The methods varie d in their prediction of branch lengths , and while MP performed best, no method inferred the correct branch lengths . The amount of homoplasy found in their dataset was approximately equivalent to levels found in empirical studies also involving nine taxa (Hillis et al., 1993) . As a follow - up study, Hillis and Bull (1993) evaluated how nonparametric bootstrap proportions compare to clade accuracy. To create very many pseudo - r eplicate datasets that 37 resemble the restriction site data of Hillis et al. (1992) , they used jackknifing (i.e., sampling without replacement) to produce 500 subsamples of 50 characters eac h. Each of these were then bootstrapp ed (i.e., sampl ed with replacement) for 100 replicates each to construct bootstrap support values . Parsimony analyses for each bootstrap sample were conducted to produce a large set of trees from which the proportion of entire true trees as well as individual true clades were evaluated. They concluded that bootstrap support values of 70% or mor e are indicative of a high probability (> 95%) the clade is real. This finding indicate d that bootstrap values may be a suitable albeit conservative measure of phylogenetic accuracy. Additionally, Bull et al. (1993) examined a DNA data set collected from the viral lineag es to characterize the molecular evolution and parameterize parametric bootstraps . These 665 sites, base pair (bp) genome , were chosen due to their likely high rate of substitutions and lack of deletions. Within the ingroup lineage evolution, a total of 18 substitutions occurred in these DNA sequences from two genomic regions overlapping three genes. This a mounted to approximately 0.0019 substitutions per site per ingroup branch, and a total approximate lineage evolution from ancesto r to terminal population of 0.0058. Parametric bootstrapping was used with detailed conversion and reversion rate estimates to produce simulated datasets of restriction site evolution. While both parametric and nonparametric bootstrapping are used to produ ce datasets similar to the original, the former involves parameterizing a simulation using evolutionary rates to create indepen dent datasets while the latter uses bootstrapping to create datasets that lack independence. Tree topology and branch length infe rence were then carried out using MP, NJ , and UPGMA analyses for these data. Bull et al. (1993) found that each method inferred the entirely cor rect tree topology with consistent success. Specifically , NJ outperformed MP and UPGMA, with success rates of 99.1%, 97.8%, and 97.3% respectively. The researchers concluded that each method would usually infer the correct topology upon repeated empiricall y generated phylogenies following their system, but that NJ would perform the best overall. They also determined that MP more a ccurately 38 predicted branch length than the two distance methods tested, although it did not perform perfectly. Hillis et al. (1994) sequenced 1,091 bp across four genes, finding 63 variable sites among the taxa. This dataset had much less ph ylogenetic potential than the restriction site dataset of Hillis et al. (1992) , as it had approximately one - third as many variable sites. MP analyses, using either weighted or unweighted characters, estimated the correct topology, although a second tree with a single clade difference was equally parsimonious. The ot her methods evaluated, ML , NJ, Fitch - Margoliash, and UPGMA, each found a single, incorrect topology differing from the correct tree by one clade . It is not clear whether each of these methods found t he same incorrect topology as one another and whether it parsimonious tree . The researchers sought to compare these DNA sequence data with the restriction site data of Hillis et al. (1992) by creating 1,000 bootstrapped samples for each , while controlling for the amount of variant sites. The percentage of clades accurately resolved varied for each analysis method, with restrict ion site data perfor ming better than DNA sequence data for each method except ML . Overall, the bootstrapped DNA sequence data produced the most accurate tree using MP (approximately 87% accurate clade resolution), and bootstrapped restriction site data produced the most accur ate trees using NJ (approximately 95%). A simple model of evolution, the four - parameter F81 model (Felsenstein, 1981) , was used for the ML analyses, and Hillis et al. (1994) suggested that the empirically determined e xtremely biased substitution matrix (Bull et al., 1993) likely contributed to its poor performance. Digital evolution for experimental phylo genetics research Digital evolution systems should, theor etically, be amenable for use with phylogenetic inference, for example in the inference of Avidian evolutionary relationships. Phylogenetics does not require biological life; these methods have been ported to cultural studies as er or lesser success, for example with the evolution of folktales (Tehrani, 2013) or the classification of plastic bag clips (Lehmer 39 et al., 2011) . The fundamental requirements for phylogenetics include the following: ancestor - descendant relationships among evolving individuals (i.e. , taxa), a series of multifurcations or at least bifurcat ions with respect to evolutionary relatedness (i.e., lineage division or cladogenesis), and heritable traits exhibiting variation (i.e., characters). Additionally, for more advanced modern methods, a set of assumptions regarding character evolution (i.e., model of evolution) is required. Avidians are organisms (taxa) that can self - reproduce by virtue of their computer instruction genomic sequence (characters) and undergo evolutionary change resulting in variation in evolutionary relatedness (cladogenesis). Established phylogenetics software is designed to handle biological sequence data such as DNA or amino acid sequences; i mportantly, the possible instructions used in Avidian genomes can be limited to or recoded as one of these sets of alphabetic characters , such that the software can be used without modification. The utility and degree of information afforded by Avida makes their mutational system of evolution known and configurable while allowing the ir substitutional system to be knowable by tracking fixat ion events (see Chapter 3). Thus, models of Avidian evolution can entail fewer untested assumptions than with biological evolution. Moreover, Avida has already had limited usage in experimental phylo genetics research (Hagstrom et al., 2004; Hang et al., 2007, 2003) . As Hillis et al. (1992) initially established the feasibility of their system using experimental conditions theorized to produce evolut ionary patterns minimally challenging for phylogenetic inference, I do so by evaluating results of Avidian evolution from a pair of treatments designed to produce the basic molecular evolutionary dynamics and phylogenetic inferences of their work. Specific ally, the molecular evolution of T7 presented by Bull et al. (1993) was used to design experimental conditions that in one Avidian evolution tre atment should approximately produce the molecular evolution exhibited by the DNA dataset of Hillis et al. (1994) , and in a second Avidian treatment that of the restriction site dataset of Hillis et al. (1992) . The phylogenetic inference method s of NJ, MP, ML, and BI, and with both the best tree resulting from each analysis and various consensus support trees, are evaluated and compared 40 for both clade accuracy and resolvabilit y. Basic signatures of molecular evolution such as the number of varia ble and parsimony informative sites as well as inferred internal and external branch lengths are also evaluated. A series of 21 additional Avidian experimental evolution treatments are t hen used to evaluate more complex conditions that are predicted to be challenging to phylogenetic inference. Whereas the influence of natural selection was minimized to the extent possible for the T7 work, various selective regimes are investigated with Av idian evolution, including relatively weaker stabilizing selection and much stronger stabilizing selection occurring uniformly throughout an experimental evolutionary history as well as directional selection variously occurring throughout an evolutionary h istory. Treatments with lineages varying in their extent of evolution, and thus inferred branch length, were conducted to investigate such dynamics known to cause issues for phylogenetic inference. And finally, the effects of recombination, as implemented in Avida, are investigated in combination with these other conditions. These experiments are used to examine how clade accuracy relates to support values within the context of individual clades and across entire trees. This large set of Avidian evolution phylogenetic data is used to reexamine the results of Hillis and Bull (1993) for bootstrap support values and additionally investigate how BI posterior support values correspond with clade accuracy. Finally, how clade accuracy and clade re solvability correspond with best, MRe, and various MR consensus thresh old trees are compared across analyses to examine tree accuracy within a whole - tree context. Methods Experimental design Base evolutionary history A total of 23 experimental treatments, each with ten replicates, were conducted, with most treatments sharing a base design. In the base design, eight - taxon Avidian evolutionary 41 histories were produced, with each lineage in the history evolving for an equivalent number of generations for any gi ven experimental treatment (Fig. 2.01a). A single genotype was used as an ancestor in Avida to initiate two independent experimental histories (i.e., branches A and B), together termed tree level 1 for the first set of lineages. Lineage evolution occurred for a set number of generations then from each lineage two new lineage s were identically seeded (e.g., branch A leading to branches C and D, tree level 2). This was again repeated with each of these lineages producing two more descendent lineages (e.g., br anch C leading to branches G and H, tree level 3). Finally, this third set of lineages evolved for that same set number of generations. 3 , or eight. An equivalent length of time, as in the elapsed number of Avidian generations, occurred during each any one extant taxon being its multiple of tree levels. For example, if each branch persisted fo r 100 generations then the number of generations from the ancestor to the extant taxon is 300 generations. 42 Figure 2.01. Representative evolutionary history topologies with branch lengths denotating the number of generations lineages evolved. The base des ign experienced equivalent generations of evolution per branch, either 100 (not shown), 300 (a), or 3,000 generations per branch (not shown). Four designs (b - e) had differing numbers of generations per tree level, with either branc hes among a level. These designs are named by tree level length from external to internal tree levels, and include SLL (b), LSL (c), LLS (d), and LSS (e). An additional treatment, LSS B ( f), used the LSS branching pattern with additional external branches t - branch, resulting in a 32 - taxon asymmetrical history. The scale bar in subplot a is 300 generations and the scale bar in subplots b - f is 3,000 generations. Internal branches are labeled at their terminal node, and all evolutionary his tories were true polytomies at their origin. The ancestor for branches A and B was a cloned 333 - instruction long Avidian. For most experiments, the ancestor was a longer counterpart to t he standard default Avidian genotype, which can replicate its genome t o produce offspring but perform no other meaningful computation, for example a task rewarded by the environment. This default genome consists of two strings of instructions, necessary fo nop - C instructions, fo r a total genome length of 100 instructions. This large nop - C region acts as genomic filler within which mutations can occur that might code for interesting computation, e.g., task perfo rmance. An additional 233 nop - C instructions were added to this filler region to create the 333 - - Using a popul ation size of 43 10,000 to increase the efficiency of selection, the pre - adapted ancestor reached full task proficiency in the logic - 9 task resource environment. After the population contained numerous organisms capable of performing all nine tasks, it evolve d an additional 100 generations to further improve task performance an d reproductive efficiency. In all treatments, genome length was fixed by disallowing insertion and deletion mutations and further requiring a null genome size differential between parent and offspring; together, these settings ensured the preservation of h omology at each position in the genome. After the set number of generations elapsed for each internal branch, new lineages were seeded using the single most abundant genotype in the exta nt population. This occurred in all treatments that lacked recombinati on. In recombination treatments, lineages were seeding with the entire extant population to additionally include the effects of lineage sorting among segregating variation. Experiments t hat included recombination were configured with resulted in complete independent assortment among loci. The per site mutation rate was determined by balancing adherence to the work of Hillis et al. (1992) while maintaining the relative likelihood that Avidians would adapt in selective environments by acquiring task performance within a reasonable amount o f time. Using values reported in Bull et al. (1993) , the substitution rate of the T7 phage history was calculated to be 18/332.5/14/100 = 3.867 * 10 - 5 substitutions/site/branch/generation: 18 substitutions were ob served among the 14 ingroup internal and external branches. Effectively 332.5 sites were used, because mutations only affected the G/C sites, which were approximately half of the 665 seq uenced DNA sites. Approximately 100 generations of evolution occurred along each branch, with an estimate of 2 - 3 generations per T7 lytic cycle and 40 cycles per branch. The aim was to use a mutation rate within one order of magnitude of this empirical T7 substitution rate, so the per site mutation rate was set to 3.867 * 10 - 4 . Avidians reproduce with over - lapping with both organisms persisting following reproduction . Thus, the effective mutation rate is one 44 half this rate, or 1.9335 * 10 - 4 , which is five times the observed substitution rate in the T7 study. With 100 generations per branch, the expected phylogenetic branch length under neutral evolution would therefor e be approximately 2 * 10 - 2 substitutions/site/branch. Finally, with a 333 - site genome, the expected number of substitutions/branch would on average be 6.4. Non - default Avida settings shared by all experiments included the following: The instruction set d isallowed seven instruction s from being incorporated into the genome via mutation. One of these, h - copy , is required for Avidian genome replication and each genotype used as an ancestor included a single copy of this instruction; a mutation changing this p osition would cause the org anism to be inviable. The remaining six disallowed instructions were if - less , set - flow , shift - r , shift - l , dec , and add . This resulted in a total of 20 possible instruction characters, one of which was effectively invariant. For u se in phylogenetic analysis algorithms, these 20 unique instructions were coded using the conventional single - letter amino acid abbreviations, allowing the resulting genetic sequences to pass as biological amino acid sequence data. The birth method was set an offspring is placed randomly into the population instead of near their parent, resulting in a lack of spatially - and genetically - structured populations and increasing the effectiveness of selection because organisms were equa lly likely to compete for s pace with relatives and nonrelatives. All other settings were as the default, most notably including the default logic - 9 task resource (i.e., tr eatments altered the presen ce/absence of a task resource, but none altered reward strength or included tasks outside this set. Avida (Ofria et al., 2009; Ofria and Wilke, 2004) version 2.9.0 was used with only minor modifications to produce custom population output files. Stabilizing selection treatments Six experimental treatments w ere conducted under stabili zing selection with all lineages evolved for an equivalent amount of time within each history. These experiments differed in having relatively weaker or stronger stabilizing selection by using either a naïve or a pre - adapted ance 45 the lineages evolved (100 or 300 per branch). For treatments using the pre - adapted a ncestor, the logic - 9 task r esource environment was used, resulting in strong stabilizing selection for task maintenance. All other stabilizing selection treatments resulted in relatively weak selection because organisms remained under selection to reproduc e efficiently, although wit hout the possibility of drastic fitness reductions due to the loss of task function. Four experimental treatments were conducted under the same relatively weak stabilizing selection but with differing numbers of generation per tr ee level throughout each history, although all branches within a level were of the same length (Fig. 2.01b - e). S treatment conditions are named by their tree leve l size reading from the external branch to the external branches of 3,000 generations in length, a set of four internal branches of 300 generations, then the basal two ingroup branches of 300 generations. Note that these treatments are named in reverse n the LSS treatment experienced an initial 300 generations, lineage division and 300 more generations, and lineage division with a final 3,000 g enerations of evolution. An additional treatment used the LSS branching pattern but differed by having additiona - B In this treatment, the external branches were bifurcated a fter 700 generations, and only one of the resulting two branches was then bifurcated after an additional 800 gen erations, and again one of the resulting two branches was bifurcated after 700 generations, with 800 final generations remaining before the expe riment ceased. For each of these bifurcations, the lineage that underwent no further bifurcations evolved for an additional length of time such that all branches evolved for a total of 3,000 generations. Rather than resulting in eight extant taxa that evol ved in a fully symmetrical pattern as with all other experimental histories, this treatment resulted in a 32 - tax on asymmetrical history. 46 Directional selection treatments Eight experimental treatments were conducted using one of four directional selection r egimes (Fig. 2.02) and in either the absence or presence of recombination, with all branches lasting for 3,000 g - 9 task resource environment. These tasks can be classified into five theoretical difficulty classes, with each class requiring a longer and/or more complex sequence of instructions. There are two tasks pe r each consisted of two environments, with e ach environment rewarding four tasks, one from each of the four easiest classes. Branch A and all its descendent lineages evolved in one environment, and branch B and its descendants in the other. In this way tasks were not selected in parallel between the only had one of the two easiest tasks and B the other. Each of ly, in selection branche s had one additional task of the two in the second difficulty class, and the final tree level branches had a further one additional task of the two in the third difficulty class. In this a unique selective environment. In every selection regime new tasks remained rewarded throughout the duration mple every task rewarded in branch C remained rewarded in branches G and H. 47 Figure 2.02. The four natural sel ection regimes visualized with respect to the base evolutionary history topology. Colored shading indicates shared tasks rewarded among one or m ore branches within a tree. Diversifying and directional selection are inversely related in these designs, with que selective environment on each branch but the weakest directional selection compared to the other regimes. In these environments, the strength of selection, as in the difference in potential fitness between an organism capable of performing none or all tasks selected in the environment, . arded per branch and the maximal advantage for performing all versus no tasks present in the anch and the maximal advantage was either 2x or maximal advantages of 2x, 8x, or 64x, respec tively. Since task performance usually entails a nominal reduction in offspring cost relative to merit and since Avidians tend to evolve tasks sequentially, the realized differences in fitness between contemporaneous organisms is most likely reduced from t hese values. Note that even still, the relatively weaker selection here is still very strong selection compared to biological organisms in most environments. All directional selection treatments used a population size of 1,000 organisms to further lessen t he influence of genetic drift. 48 Finally, four experimental treatments were conducted with directional selection ( regime evolution per tree level. Analyses Taxo n - character datasets For each of the ten replicates per experimental treatment, a taxon - character dataset was cr eated using the eight extant populations, or 32 in the case of LSS B . A random organism was sampled from each population, and therefore more abun dant genotypes had a greater likelihood of being sampled. These sampled genotypes constituted the ingroup taxa f or the replicate. The outgroup taxon was a randomly sampled organism from an extant population of a different replicate of that same experimenta l treatment; therefore, the outgroup taxon evolved for the same total number of generations and experienced simi lar evolutionary conditions to the ingroup taxa. Note that the base of the evolutionary history is a true polytomy of the outgroup, branch B, an d branch C, that is, branch B and C do not share a more recent common ancestor than either does with the outgrou p (Fig. 2.01). The complete genomic sequence of the sampled organism was used. Each sequence consisted of 20 single - letter characters that repre sented the computational instructions that were available to mutation. Three of these characters, J, O and B, di d not have amino acid abbreviation counterparts and were therefore translated to W, Y, and V, respectively. Therefore, phylogenetic programs des igned to handle amino acid character data would treat Avidian sequences as such. Since genome length was fixed t o be a constant 333 characters, sequence alignment was not required, and each locus had perfect homology. Across the 23 experimental treatments with 10 replicates per treatment a total of 220 eight - taxon by 333 - character datasets and, for the LSS B treatmen t, 10 thirty - two - taxon by 333 - character datasets were created. Each of these datasets was used to conduct four different phylogenetic analyses. 49 Phylogenetic analyses Analyses were conducted using settings that were as simplistic as possible, i.e., using de fault settings and minimally parameterized models. This was done to avoid biasing the results in favor of one or other inference method. For exa mple, several settings could be altered for each individual phylogenetic analysis with the aim of improving the inference accuracy; however, this could become untenable when inferring multiple trees from each of 230 datasets. Unless otherwise indicated, de fault settings were used for each program. In all cases, analyses were identically conducted across all experime ntal treatments and replicates. The simplest amino acid model of molecular evolution was used, the Poisson model. The Poisson model is a fixed rate model that assumes equal rates and state frequencies among all 20 characters (Bishop and Friday, 1987) , and is analogous to the JC model for nucleot ide evolution. This is, in fact, the mutational model of molecular evolution as implemented in Avida, where each instruction has an equal probability of mutating to any other. Although higher - parameterized models were variously suggested by such programs a s ProTest (Abascal et al., 2005) , there is no theoretical reason why, for example, a nop - c instruction should behave like a cysteine because both are abbreviated as C. Further, I wanted to conduct analyses as similarly as possible across treatments. Because rate heterogeneity was expecte d to occur in this Avidian evolution, with at least one site being invariable (the h - copy site necessary for reproduction, as discussed previous ly), the model of evolution additionally included rate heterogeneity among sites. The model did so by allowing both a proportion of invariable sites and a discrete Gamma model with four rate categories , since this combination of rates is very commonly used (Stamatakis, 2016) , desp ite criticism that these parameters cannot be optimized independently (Yang, 2006) . Trees were infe rred using NJ, MP, ML, and BI, and fully - consensus trees were created for each analysis, as possible. Neighbor joining trees were constructed using QuickTree, version 2.0 (Howe et al., 2002) . Since this algorithm produce s only a single tree inference, when included in figures comparing consensus trees, the NJ tree is 50 ¬ 1.1.0 ( Hoang et al., 2018, 2017) . The number of initial par simony trees evaluated was increased to 10,000 to better search the set of possible trees, and 1,000 ultrafast bootstrap replicates were conducted. These bootstrap trees were used to construct 50%, 70% , 95%, and 99% analyses reported here is a random selection among the equally parsimonious trees identified in the heuristic search. The source code of MPBoot was modified to enable the printing in the log file of th e parsimony score for each of the trees evaluated in th e set of candidate trees, which was used to calculate the minimum number of equally parsimonious trees. This value is a minimum in that since an exhaustive search was not conducted, there is a possibil ity that a region of tree space with better or equivale nt parsimony scores was not explored. ML was implemented using IQ - TREE, version 1.5.5 (Minh et al., 2017; Nguyen et al., 2015) . For each analysis, a total of 100 nonp arametric bootstrap replicates were also produced and u sed to construct consensus trees with the four above thresholds in addition likelihood value. BI was implemen ted using MrBayes, version 3.2.5 (Ronquist et al., 2012, 2011) , and specifically using the parallel processing implementation and the BEAGLE library (Altekar et al., 2004; Ronquist et al., 2012) . The single non - default sett ing was that the Markov chain was sampled every 100 generations, to provide a greater number of cladogram samples. The post - burn - in sampled trees were used to construct consensus trees with the above t hresholds as eported here is the single cladogram with the greatest tree posterior probability, also termed the maximum clade credibility tree. Among other uses, Newick utilities (Junier, 2011; Junier and Zdobnov, 2010) were used to root and produce consensus trees for each threshold. The Consense program from the PHYLIP package (Felsenstein, 2005) was used to produce MRe trees. FigTree (Rambaut, 2018) and the Iroki web application (Moore et al., 2020) were used fo r tree visualization. Finally, Python, version 2.7, and the following packages, among other general modules, were used to organize and present data: Jupyter, version 0.27.0 (Kluyver et al., 2016; Perez and Granger, 2007) ; Matplotlib, version 51 1.3.1 (Hunter, 2007) ; ETE2, version 2.2.1 (Huerta - Cepas et al., 2016) ; and DedroPy, version 3.12 (Sukumaran and Holder, 2010) . Phylogenetic measures Topological accuracy betwe en the true tree and inference tree was calculated usin g variants of the Robinson - Foulds (RF) distance. Also termed the symmetric difference, RF distance is the number of internal branches (also called edges, partitions, or splits) that are present in one tree and not the other, and vice versa (Robinson and Foulds, 1981) . RF distance is therefore the sum of false positive branches (FP, those appearing in the inferred tree and not the true tree) and false neg ative branches (FN, those appearing in the true tree and not the inferred tree). These can be further calculated as rates. The FP rate is FP divided by the number of internal branches in the true tree, and the FN rate is FN divided by the number of interna l branches in the inferred tree. The arithmetic mean of the FP and FN rates is termed the average topological error (Swenson et al., 2010) . RF distance ca n be normalized by dividing by the maximal possible RF distance, and when the trees under comparison are binary trees (i.e., are fully resolved by lacking polytomies) the average topological error is e quivalent to the normalized RF distance. Since I want t o emphasize the accuracy of inference methods rather f clades correctly inferred; the FN rate complement as metric indicates the overall accuracy of the phylogenetic inference. Examples of the se metrics are shown for a comparison of a known (or correctly inferred) cladogram (Fig. 2.03a) to four variously - inaccurate inferred cladograms (Fig. 2.03b - e). Note that the best tree topology produce d by an analysis is always fully resolved although it m ay include incorrect clades, so for these trees each false positive clade requires a counterpart false negative clade, and thus clade accuracy, clade resolvability, and average topological accuracy are necessarily equivalent values (e.g., Fig. 2.03b,e). Th is holds for all best trees across analyses; even though BI maximum clade 52 credibility trees are not necessarily fully resolved, they were for all instances analyzed here. Trees that are not fully resol ved, for example as resulting from a majority rule cons ensus algorithm, contain at least one false negative clade and may contain zero (e.g., Fig. 2.03c) or ly constructed consensus trees were evaluated for these metrics. Figure 2.03. Example uses for the clade support metrics of Clade Accuracy (CA), Clade Resolvability (CR), and Average Topological Accuracy (Top. Acc.) for comparing a known cladogram to four inaccurate cladograms. This four - ingroup - taxon rooted tree has two true clades to be inferred (II + IO and OI + OO, dark green circle) and therefore as many as two false negative (light green circle) and two false positive (red circle) cl ades per cladogram. Note that the actual topologies of the Avidian evol ution experiments constitute 8 - taxon or 32 - taxon trees, with many more potential combinations of false negative and false positive clades. Additional metrics include comparisons of infe rred branch lengths and the amount of variable and parsimony informativ e sites. Branch lengths are summarized as the median length across all internal branches and, separately, across all external branch lengths. Only branch lengths as inferred for each an on c onsensus trees, are not present and therefore all potential internal branches are included. As with the model of evolution, digital evolution allows the comparison of branch lengths to the 53 actual number of observed substitutions occurring within lineage ev olution. However, such an analysis was not conducted for this set of experiments (although, see Chapter 3). Positions in the taxon - character dataset are considered variable if at least two types of characters exist among the set of taxa. Additionally, posi tions are informative under a MP analysis (and other phylogenetic approaches) if at least two sets of characters are found among at least two taxa each. Sites which lack informativeness are either invariant across taxa, are autapomorphic in that variation is unique to single taxa, or otherwise such that every possible cladogram would score equivalent under MP. These basic measures of taxa - character data richness are directly comparable a cross phylogenetic studies and provide a basic estimate of the phylogen etic signal of the data. Statistics Recognizing that phylogenetic tree metrics lack independence, the measures reported here are limited to comparisons of median values or aggregate sum s across treatments and replicates, as indicated. Various statistical t ests have been used to compare phylogenetics simulation results ( e.g., Hall 2005; Wang et al. 2011) ; althou gh other studies eschew their use ( e.g. , Kuhner and Felsenstein 1994; Barbançon et al. 2013) . It is not clear that the Central Limit Theorem and its derivative parametric statistical tests and even non - parametric tests are applicable in such work since clade and branch inclusion (and therefore branch length) do not exhibit independence. Further, even if statistical tests were suitable for simulation studies, it is not clear that they would remain so for experimental phylogen etics research where each replicate is a separate instance of evolution . Results Treatment comparisons Treatment conditions are labeled in Figures 2.04 - 2.15 by their distinctive selection, recombination, and branch length conditions. They are also arrayed in the order described in the experimental design section, above. Sele ction conditions include the following: starting 54 with the naïve ancestor in an environment rewarding no resources, resulting in relatively weak stabilizing selection; starting with a p re - adapted ancestor in the full selective environment, resulting in rel atively strong and uniform 1 2 3 4 asex ual reproduction or its presence as sex ual reproduction. The number of generations per b ranch include topologies with uniform lengths for all tree levels of 100 , 300 , or 3 , 000 generations. Topologies of branch lengths differing in length per tree level have labels reflecti ng s hort (300 generations) or l ong (3,000 generations) lengths and are denoted from extant to ancestral tree LSS had only long external branches ; and the final of these treatments LSS B b roken - up with additional branches , resulting in a 32 - taxon asymmetrical history. The nu mber of variable and parsimony informative sites found for the taxon - character dataset for each experimental treatment replicate are shown in Figure 2.04. For the first six treatments, i.e., those with stabilizing selection and topologies with uniform 100 - or 300 - generation branches, the number of variable sites strongly differed per treatment, and for each treatment the number of informative sites was approximately half the number of variable sites. The trends among these treatments were that longer - evolve d branches yielded greater numbers of informative and variable sites, and that treatments starting with the pre - adapted ancestor resulted in fewer numbers of sites. For all other treatments, i.e., those with a t least one set of branches that was 3,000 gene rations long, n early the entire genomic sequence (333 loci) exhibited variation , with only 10 - 20 fixed sites. The number of informative sites was nearly this numerous for SL L and LSS B branch length treatments, with treatments LSL and LLS having a third les s, and treatment LSS having less than half as many. The twelve treatments conducted under directional selection and with or without recombination did not exhibit a clear pat ter n with respect to the number of informative sites, although it was reduced for t he last four of these treatments, that is, those with the LSS branch pattern. 55 Figure 2.04. Number of variable sites (blue pentagons) and parsimony informative sites (purpl e stars) for all experimental treatments. Open symbols are for individual treatment replicates, and closed symbols for the treatment median. Experimental treatments are denoted by their selective condition (green labels), recombination condition (orange la bels), and number of generations per branch (cyan labels); see text for further inf ormation on condition notation. Inferred branch lengths for each experimental treatment are summarized in Figure 2.05. Only analyses that resulted in a phylogram with branch lengths expressed as the number of inferred substitutions per site are shown, leav ing MP excluded here. The median of internal branch lengths and the median of external branch lengths are presented for the single best tree of each analysis for each of the ten replicates per treatment. Although differences between the first six treatment s, i.e., those with stabilizing selection and uniform branch lengths, appear slight at this scale, the trends match that for the number of variable and informative sites, wi th longer inferred lengths for topologies with a greater number of generations per branch and shorter inferred lengths for treatments starting with the pre - adapted ancestor. While it is barely noticeable for these first six treatments, the trend of NJ unde rpredicting branch lengths relative to the other methods is readily apparent for al l further treatments and is especially the case for conditions with longer branches. There is also a general trend across nearly all 23 treatments of ML inferring slightly s horter branch lengths than BI. 56 Figure 2.05. Inferred median internal branch lengt hs (yellow) and median external branch lengths (orange) for the single best tree resulting from NJ (square), ML (triangle), and BI (plus) analyses. Open symbols are for indi vidual replicates, and closed symbols for the median across treatment replicates. N ote that MP is not included since it does not report trees with branch lengths expressed as the number of substitutions per site. For the four treatments of stabilizing sele ction with differing numbers of generations per tree level, the pattern of inferred median internal and external branch lengths matches the treatment design (Fig 2.05): long internal branches for treatments SLL and LLS, short internal branches for treatme nts LSL and LSS, short external branches for treatment SLL, and long external branc hes for treatments LSL, LLS and LSS. Note that since internal branches include the two branches at the basal tree level (labeled A and B in Fig. 2.01) and the four branches at the middle tree level (labeled C - F in Fig. 2.01), the median across these branch es is dominated by the middle tree level. Across these treatments, with ML and BI analyses , i nferred long branches are slightly less than ten - fold longer than short branches . For treatment LLS, NJ especially underpredicted internal branch lengths compared to ML and BI. The treatment with broken - up long external branches (L SS B ) had relatively short external branches, consistent with its additional taxa with fewer generations p er branch. For all directional selection treatments, branches were generally reduce d in length from that observed for stabilizing selection treatments that also had branches of 3,000 generations, 57 for example the long external branch es of treatment LSS (Fig . 2.05 ). For ML and BI analyses with treatments either with or without recombinatio n, external branch lengths tended to increase from selection regimes 1 through 4, and internal branch lengths tended to increase from regimes 1 through 3 then slightly decre ase in regime 4 compared to 3. While NJ maintained its trend of underestimating bra nch lengths, it did not exhibit this pattern of chang e with respect to selection regimes. Internal branch lengths are greater for asexual reproduction treatments and externa l branches are greater for sexual reproduction treatments. For the final four direc tional selection treatments with the LSS branch pattern, i nferred internal and external branch lengths were similar to although reduced from the stabilizing selection LSS tr eatment. E xternal branch lengths for these four treatments, while similar to one an other , were increased compared to the standard selection regimes 2 and 3 without recombination. Clade accuracy and clade resolvability results are presented separately in Figures 2.06 - 2.13 for sets of experimental treatments. Within each figure, treatments are separated by dotted blue lines, and conditions are labeled as previously indicated. Within each treatment, all ten replicates of each of sixteen phyloge netic analysis and consensus threshold combinations are presented: The NJ tree, then for each of M and then consensus trees of thresholds 50%, 70%, 95%, and 99%, with dashes differing in width at the figure t op indicating the relative ordering of these trees within the set of trees per analysis. Analyses a re denoted by separate symbols; and sets of trees per analysis are further visually distinguished by a thin vertical line. Clade accuracy was near - perfect fo r five of the six treatments conducted under stabilizing selection with lineages evolved for an equ ivalent amount of time (Fig. 2.06). One replicate each for the 100 - generation per branch treatments under weak stabilizing selection resulted in either a NJ or MP best tree inferring a single incorrect clade . The 100 - generation treatment under strong, unif orm stabilizing selection had decreased clade accuracy for multiple , although NJ and BI best trees performed better o verall as indicated by their medians retaining 100% accuracy. For this same treatment, ML continued to 58 have decreased accuracy for at least one replicate up to and including the 70% consensus threshold. These treatments, stabilizing selection with lineages evolved for an equivalent amount of time , did not score as highly in terms of clade resolvability (Fig. 2.07). Treatments with 300 generations per branch maintained greater clade resolvability than their 100 - generation equivalents. And compared to those w ith weaker stabilizing selection, treatments with strong stabilizing selection had greatly reduced resolvability, with a few replicates having a complete comb - or rake - like topology of 0% resolvability (as in the example of Fig 2.03c) . Comparing the median s across consensus thresholds, BI maintained the best resolvability followed by MP then ML for each treatment demonstrating variation in clade resolvability . Finally, recombination appears to have slightly increased clade resolvability across consensus thr esholds for each analysis when comparing the otherwise equivalent 100 - generation treatments. Figu re 2.06. Clade accuracy, the percentage of clades correctly inferred, for the set of treatments with stabilizing selection and lineages evolved for equivalen t generations per branch. Best and consensus trees are included for each analysis, including NJ (sq uare, best trees only), MP (circle), ML (triangle), and BI (plus), with open symbols for individual replicates and closed for the median across replicates. W ithin each analysis, the best tree then consensus trees in the order of increasing threshold strict ness are arrayed with tick marks at the top indicate these distinctions; see the text for further information on tree type notation. 59 Figure 2.07. Clade res olvability, the percentage of clades correctly resolved, for the set of treatments with stabilizing selection and lineages evolved for equivalent generations per branch. Best and consensus trees of increasing threshold strictness are included for each anal ysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. W eak stabilizing selection with differing numbers of generations per tree level resulted in mixed results for cl ade accuracy (Fig. 2.08) . Clade accuracy was perfect for all analyses and consensus thresholds when all internal branches were long (i.e., SLL) and when external long branches were disrupted by increased taxon sampling , (i.e., LSS B ). However, with long ext ernal branches and at least one set of short internal branches clade accuracy is reduced for many i f not all replicate best trees across analyses and does not reach 100% for all replicates until consensus threshold 70% or 95% for MP and ML and threshold 95 % for BI. Clade resolvability is largely perfect for treatment SLL across analyses and consensus th resholds, and treatment LSS B has reduced resolvability for stricter thresholds ( Fig. 2.09 ). Clade resolvability is quite low for treatments LSL, LLS, and LSS . Treatment LSL had about half the resolution across analyses compared to LLS, and treatment LSS ha d very low resolution, with stricter thresholds having replicates of 0% resolvability. Comparing across consensus thresholds for both clade accuracy (Fig. 2. 08) and resolvability (Fig. 2.09), BI perform ed better than ML, and ML better than MP . 60 Figure 2.0 8. Clade accuracy for the set of treatments with stabilizing selection and lineages evolved for differing generation per tree level. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which only includ es the best tree, with open symbols for individual replicates and closed for the median across replicates. Figure 2.09. Clade resolvability for the set of treatments with stabilizing selection and lineages evolved for differing generation per tree level. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which only includes the best tree, with open symbo ls for individual replicates and closed for the median across replicates. The treatments conducted under directional selection and with a consistent 3,000 generations per branch had essentially perfect clade accuracy across analyses and consensus 61 thresholds (Fig. 2.10) . The single exception was one MP best tree replicate that had one clade incorrect, al though an equally parsimonious tree was identified although not selected for i nclusion here due to random chance. Clade resolvability was generally quite high for most replicates (Fig. 2.11). For both the set of sexual and asexual reproduction treatments, clade resolvability was reduced from regime 1 through 3, with the slight disti nction between regimes 3 and 4. Comparing across consensus thresholds, BI maintained greater clade resolvability than ML, with MP having the lowest. The set of treatments with re combination had slightly lower clade resolvability at stricter consensus thres holds than their equivalent selective regime treatments without recombination. Figure 2.10. Clade accuracy for the set of treatments with directional selection and lineages evo lved for equivalent generations per branch. Best and consensus trees of increa sing threshold strictness are included for each analysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median a cross replicates. 62 Figure 2.11. Clade resolvability for the set of treatments with directional selection and lineages evolved for equivalent generations per branch. Best and consensus trees of increasing threshold strictness are included for each analysis , except for NJ which only includes the best tree, with open symbols for indiv idual replicates and closed for the median across replicates. Clade accuracy shows interesting patterns among the treatments with directional selection regime 2 or 3 and/or the L SS pattern of long external branches, with the presence of recombination being a further complicating factor (Fig. 2.12). For any pair of otherwise identically conducted treatments, selection regime 2 had slightly decreased clade accuracy compared to regim e 3. Topologies with equivalent generations per tree level had the greatest ac curacy irrespective of recombination, as discussed previously. With comparatively short internal branches (i.e., relatively long external branches) of topology LSS, clade accurac y was greatest when directional selection was present and recombination absent , was slightly reduced when stabilizing selection was present, and was lowest under combined directional selection with sexual reproduction. While the trends are more difficult t o observe with clade resolvability (Fig. 2.13), since resolvability was genera lly greatly reduced, all the trends highlighted with respect to clade accuracy were repeated. 63 Figure 2.12. Clade accuracy for treatments with directional selection regimes 2 or 3 and/or the LSS pattern of generations per branch. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. Note that the first five treatments were shown in p rior figures and are included here for comparison. Figure 2.13. Clade resolvability for treatments with directional selection regimes 2 or 3 and/or the LSS pattern of generatio ns per branch. Best and consensus trees of increasing threshold strictness are included for each analysis, except for NJ which only includes the best tree, with open symbols for individual replicates and closed for the median across replicates. Note that t he first five treatments were shown in prior figures and are inc luded here for comparison. 64 Across all treatments, the best tree reported by each analysis often exhibited very high accuracy (Fig. 2.14 ). Recall that s ince the best tree produced by each analy sis lacks polytomies although may include incorrect clades inste ad, each false positive clade necessarily creates a false negative clade, and so for these fully - resolved trees clade accuracy is equal to clade resolvability , and thus their average as topolo gical accuracy (as in the examples of Fig 2.03b,e) . For twelve t reatments, all phylogenetic analyses inferred the true tree for all replicates. An additional three treatments had a single replicate each that resulted in a tree inference with one incorrect clade under either NJ or MP only. The remaining eight treatments had multiple replicates that each had up to three incorrect clades under multiple analyses ; these included the 100 - generation per branch treatment starting with the pre - evolved ancestor, the three stabilizing selection eight - ingroup taxon treatments with at least one set of short internal branches (i.e., treatments LSL, LLS, and LSS), and the final four directional selection treatments with short internal branches. To simplify further comparis ons, results from these eight troubl esome treatments - non - troublesome treatments r each treatment, with open symbols for individual replicates an d closed for the median across replicates. Troublesome treatments are the eight treatments demonstrating considerable variation . 65 Both two - and four - taxon clades were responsible for decrease d resolvability among the eight troublesome treatments, and ther e is considerable clade support variation between replicates ( Fig. 2.15 ). While only BI analyses are presented in Figure 2.15, ML bootstrap support is similar albeit with magnified trends due to generally lower clade proportions. The clades responsible for decreased resolvability , that is, true clades appearing in low proportions of bootstrap or posterior samples, are always those denoted by short branches. For example, when clade resolvability was low in t reatment LSL , it was always due to true two - taxon clades the clades whose successful inference depended on the middle tree level (Fig. 2.15b). Likewise, treatment LLS lacked only true four - taxon clades (Fig. 2.15c) , and replicates for all LS S treatments lacked two - and/or four - taxon clades (Fig. 2.15d - h) . Note that in the eight - ingroup taxon topology there are twice as many two - taxon clades compared to four - taxon clades; this disparity of clades denoted by the middle versus basal tree level i s responsible for the cla de resolvability pattern between treatments LSL and LLS (Fig. 2.15b,c) branch responsible for providing support to fewer overall clades . In the case of the treatment with strong stabilizing selection and 10 0 generations per branch, decreased clade resolution was due to both two - and four - taxon clades (Fig. 2.15a). While previous figures also demonstrated variation among replicates for any given treatment, this variation is especially evident in these target plots (Fig. 2.15). For ex ample, in Figure 2.15a only five replicates had one or more clades with resolution less than 50% and none included false clades, while in Figure 2.15f two replicates had clades with resolution less than 50% and each also had false clades with greater than 50% support. Note that while many additional false clades were present within the trees sampled, only those with support of 50% or more are presented, since only these contribute to decreased clade accuracy. 66 Figure 2.15. BI post erior support for true fo ur - taxon clades (orange), true two - taxon clades (blue), and false clades (red outline) for all eight troublesome treatments. Each spoke per target plot is a replicat e and each eight - ingroup taxon tree replicate had 2 four - taxon cla des and 4 two - taxon clade s, therefore 6 true clades are arrayed linearly from the outer to inner ring for each spoke. The thick target rings are in intervals of 30% and thin rings by 10%, with the absolute center of the target plot as 100% support. Only fa lse clades with 50% or gr eater support are shown. Experimental treatment labeling is as indicated in the text. Accuracy and consensus thresholds The proportion of true and false clades identified by ML and BI analyses show a stark distinction between troub lesome and non - troublesom e treatment sets (Fig. 2.16 ). For both ML and BI, the non - troublesome treatments exhibited a perfect relationship around the 50% bootstrap or posterior support threshold , with no true clade represented in fewer than half the trees sampled and no false clad es in greater than half (Fig. 2.16a,c) . Troublesome treatments had many more, as well as more highly represented, false clades (Fig. 2.16 b,d ), and several infrequently produced true clades, with some even having nil support. For bo th troublesome and non - tr oublesome tree sets, BI greater proportions of clades with high support. 67 Figure 2.16. Frequency of all true and all false clades per treatment set by relative bootstrap support for ML analyses and by posterior probability support for BI analyses. Note th e x - axis is not to scale, including the 95% support category. Troublesome treatments are the eight treatments demonstrating considerable variation among best trees across phylogenetic analyses; and non - troublesome treatments are th e fifteen treatments demo nstrating near - perfect clade accuracy among best trees across phylogenetic analyses . Treatment LSS B is excluded so that only eight - ingroup taxon trees are compared . As a measure of the probability of a clade being correct, the rel ative proportion of true versus false clades is compared across support values in F igure 2.17, following an analysis provided by Hillis and Bull (1993) . These researchers use d jackknifing followed by bootstrapping to determine MP bootstrap proportions for the restriction site dataset of Hillis et al. (1992) . Hillis and Bull (1993) observed tha t bootstrap proportions are lower than the probability of being c orrect for all proportions above 35%, and that proportions 70% or greater indicate a 68 very high probability (>95%) that the clade is real. For ML analyses with the troublesome treatments, boot strap proportions greater than 30% were conservative in accuracy, and very highly so for proportions greater than 70%. BI analyses for the troublesome treatments produced posterior proportions conservative with respect to accuracy for all proportions great er than 30% except for 80%, which was slightly liberal in its rep resentation of clade accuracy . BI was only more conservative than ML for support values of 40% and 50 % . Within the range of 30 - 50%, BI posterior support does not closely approximate clade acc uracy, with all other support values having a posterior probabili ty ± 5% of the probability the clade is true. Figure 2.17. Relationship between clade accuracy as the percent of correct clades for values of bootstrap support for ML analyses and posterior probability support for BI analyses. Results include data from Hi llis and Bull (1993) as estimated from their Figure 4a (black), data from Avidian evolution non - troublesome treatments for both ML and BI analyses (blue, identical relationsh ip), and data from Avidian evolution troublesome treatments for M L analyses (orange) and BI analyses (red dashed). The thin grey line is the one - to - one accuracy - to - support relationship; values above the line are conservative as being an underestimation of accuracy and values below are liberal as being an overestimation of accuracy. Troublesome treatments are the eight treatments demonstrating considerable variation among best trees across phylogenetic analyses; and non - troublesome treatments are the fifteen treatments demonstrating near - perfect clade accuracy among best trees across phylogenetic analyses. Treatment LSS B is excluded so that only eight - ingroup taxon trees are compared. Clade accuracy and clade resolvability, as well as their averaged score, to pological accuracy, are compared within a whole - tree context in F igures 2.18 and 2.19 . Results from each 69 phylogenetic analysis and consensus method are shown for the fifteen non - troublesome treatments (including treatment LSS B ) in Fig. 2.18 and for the eight troublesome treatments in Fig. 2.19 as the proportion of trees that attained thresholds for clade accuracy (subplot a) , clade resolvability (subplot b) , and average topological accuracy (subplot c) . These measures provide overall evaluat ions of the information provided in the trees resulting from the Avidian evolutio n treatments presented. Consensus methods presented include the thresholds previously evaluated as well as the MRe result. Note that ten replicates were performed for each tre atment and that each phylogenetic analysis and consensus method was performed for all such treatments and replicates. Therefore, the data shown within each column of these plots represent 150 trees and 80 trees, respectively , in Fig ures 2.18 and 2.19 . Non - troublesome treatment trees had high clade accuracy yet were often not fully - reso lved and therefore had low overall accuracy for more strict consensus support values (Fig. 2.18). Clade accuracy was perfect (i.e., meeting the 100% threshold ) for all consens us trees for each analysis and for best trees for ML and BI analyses (Fig. 2.18 a) ; of course, these were the criteria - was greater than 95% for NJ and the MP best trees, although for the la tter an equally parsimonious tree was consistently the true tree. A 70% consensus support threshold maintain ed very high clade resolvability (Fig. 2.18 b), with at least 95% of trees being fully resolved for each analysis ( i.e. , meeting the 100% CR threshol d ) . While MP performed slightly better than ML for thresholds of 95% and 99%, BI maintain ed a much greater clade resolvability at these high support thresholds. When weighting clade accuracy and clade resolvability equally as average topological accuracy ( Fig. 2.18 c), BI produced a large proportion of accurate trees even at very strict support thresholds. 70 Figure 2.18. P roportion of trees across all non - troublesome treatments that met or exceeded clade accuracy (a) , clade resolvability (b) , and average topological accuracy (c) thresholds for each analysis and consensus method evaluated . Non - troublesome treatments are the fifteen treatments demonstrating near - perfect clade accuracy among best trees across phylogenetic analyses. 71 Trees from tr oublesome treatments had high clade accuracy but were poorly resolved, with low average topologica l accuracy (Fig. 2.19). Clade accuracy was perfect for only 50% of the best trees produced by each analysis, and it was at least 80% for nearly 80% of the bes t trees produced by each analysis, with MP performing a bit worse (Fig. 2.19 a). As the strictness of the majority rule consensus threshold increased, clade accuracy increased at different rates among analyses, with ML approaching 95% of trees as being perf ect at the 70% consensus threshold and both MP and BI requiring a threshold of 95%. For trees of i ncreasing consensus threshold strictness, clade resolvability decreased substantially, from approximately half the trees being fully resolved to less than 5% of trees (Fig. 2.19 b). MP and ML reach ed this low resolvability at a 95% consensus threshold while BI reache d it only at 99%. Across consensus thresholds, ML consistently produced better - resolved trees than MP , but less so than BI. This trend of BI maintai ning greater clade resolvability stands out when examining trees with at least a 60% clade resolva bility (light blue, Fig. 2.19b) . For 99% consensus trees, MP and ML produced only about 15% of trees with at least 60% clade resolvability while BI produced 4 5%. This increased resolvability caused BI consensus trees to have greater overall accuracy, and M L consensus trees median accuracy compared to MP (Fig. 2.19 c). For MP and BI, 50% majority rule trees ha d slightly lower overall accuracy compared to best tre es by having greater clade accuracy and slightly lesser resolvability, although for ML over 15% mo re trees had reduced overall accuracy. MRe consensus trees for ML and BI had slightly reduced clade accuracy or clade resolvability , with minimal increase in the other metric, if any; therefore, these trees tended to have near - equivalent or reduced overall 72 Figure 2.19. P roportion of trees across all troublesome treatments that met or exceeded clade accuracy (a) , cl ade resolvability (b) , and average topological accuracy (c) thresholds for each analysis and conse nsus method evaluated. Troublesome treatments are the eight treatments demonstrating considerable variation among best trees across phylogenetic analyses. 73 Dis cussion Treatment comparisons The basic pair of treatments exhibited similar molecular evolution a nd comparable phylogenetic inference accuracy in Avida compared to that in the T7 phage research . These 100 - and 300 - generation per branch treatments under asexual evolution and stabilizing selection with the naïve ancestor favorably compared to sets of mo lecular T7 data exhibiting distinct rates of evolution. Hillis et al. (1994) , usi ng a DNA dataset, observed 63 variable sites and a total of 6 9 substitutions across internal and external ingroup branches. In comparison, the 100 - generation pe r branch treatment using Avida resulted in a median of 85 variable sites ( Fig. 2.04 ) and a total of approximately 74 substitutions, as extrapolated from inferred branch lengths across ML and BI best trees ( Fig. 2.05 ). The NJ analysis Hillis et al. (1994) performed had a tree with a single clade as incorrect, which was found for one of the ten NJ replicates, with the others having perfect accuracy ( Fig. 2.06 ). Their MP analysis resulted in the true tree and one other as equally parsimonious tr ee, and this result also occurred for one replicate here, with the other replicates identifying only the true tree as most parsimonious. Whereas the T7 researchers found ML as being incorrect by one clade , all replicates of ML and BI for the 100 - generation per branch treatment produced trees with perfect clade accuracy. Hillis et al. (199 4) performed nonparametric bootstrapping and reported a single clade resolvability score for each analysis, whereas here I report clade resolvability separat ely for consensus trees of increasing threshold strictness. They found that MP maintained much h igher resolvability than ML, with NJ slightly outperforming them all. The consensus trees analyzed for the 100 - generation per branch treatment confirm these res ults ( Fig. 2.07 ), although a bootstrap analysis with NJ was not performed . Consensus thresholds of 70% and greater resulted in one or more replicates producing trees with at least one polytomy, and MP and BI were not as greatly affected by increasing thres hold strictness as was ML. Hillis et al. (1992) , using a restriction site dataset, observed 202 variable sites and a total of 220 substitutions across internal and external ingroup branche s, which is comparable to the 300 - generation treatment which had a median of 186 74 variable sites and approximately 236 substitutions. Hillis et al. (1992) found that their dataset produced correct tree inferences under each method evaluated , and this result is reproduced here, with ea ch of these analyses and consensus threshold s having 100% clade accuracy and resolvability. Comparing the Avidian evolution treatments, i t is reasonable to susp ect that the three - fold increased evolution between lineage division allowed greater fixations t o occur, providing greater support for clade relationships, and therefore resulting in greater variable and parsimony informative sites and thus increased topol ogical accuracy. These treatments approximated the molecular evolution characteristics of datase t size and degree of evolutionary change and found similar phylogenetic success as the T7 experimental phylogenetics research. This demonstrates that Avida was successfully used to reimplement prior experimental phylogenetics work using biological organism s. Three of the four remaining treatments conducted under stabilizing selection with lineages evolved for an equivalent amount of time within each history produ ced results like the prior treatments, with the remaining treatment being phylogenetically probl ematic. When an ancestor pre - adapted to the environment was used, the extent of molecular evolution decreased dramatically. Presumably, strong stabilizing selec tion in maintaining ancestral adaptation to the selective environment caused many fewer fixation s to occur, resulting in reduced variable and parsimony informative sites (Fig. 2.04) , shorter branch lengths (Fig. 2.05), and thus lower clade accuracy and res olvability (Figs. 2.06 - 2.07) . As with the basic pair of stabilizing selection treatments , increased lineage evolution between cladogenesis allowed greater phylogenetic information and inferred topological accuracy for the 300 - generation treatment s . A notab le trend was that clade resolvability was improved by the addition of sex to t he 100 - generation treatment s . This was a curious result, as recombination in these treatments appears to increase clade resolvability while not greatly increasing the number of v ariable or informative sites. Together the set of six stabilizing selection tr eatments with equivalent lineage evolution per branch demonstrates that phylogenetically informative sequence variation and 75 thus topological accuracy is reduced due to relatively stronger stabilizing selection and relatively reduced lineage evolution betwe en cladogenic division , as would be expected . Five treatments were designed to evaluate the effects of differing branch lengths. Evolution along internal branches can produce evi dence of the shared ancestry among latter lineages via the production of synap omorphies. This is especially likely to occur if the population has sufficient opportunity to acquire substitutions, i.e. , on relatively longer branches. On the other hand, evolu tion along external branches can only produce such evidence due to lineage sor ting (i.e., the segregation of variant characters following cladogenesis ) . However, evolution along both internal and external branches may produce homoplasy, which is more likel y to occur along relatively longer branches (Rok as and Carroll, 2006) . Of course, both internal and external branches can also negate evidence of shared ancestry when substitutions occur at sites that previously exhibited s uch evidence. Therefore, internal branches are more important in positively co ntributing to phylogenetic inference and this contribution is relative to their length, and external branches are likely to negatively contribute with increased length. For examp le, Hang et al. (2003) used digital evolution with Avida to demonstrate that relatively longer internal branches improved phylogenetic inference through the generation of synapomorphic variation. T hree Avidian evolution treatments presented here ev aluated this through the placement of a single short - generation) set of branches per tree level - generation) sets of branches : S LL had eight external branches of 300 generations each, LSL had four internal branches of 300 generations, and LLS had two basal - most internal branches of 300 generations. Felsenstein (1978) another through the producti on of homoplasious evolution swamping out positively informative cladogenic si gnal. The fourth Avidian evolution treatment , LSS , had only external long branches to test this long branch attraction effect. Graybeal (1994) used simulations to show that breaking up such long exter nal branches by adding targeted taxa resolves those 76 difficulties. Finally, Avidian evolution treatment LSS B did just that, resulting in a 32 - taxon asymmetrical history. This set of treatments, conducted under weak stabilizing selection with differing numbers of generations per tree level, supported this prior research. Since the primary distinction between parsimony informative sites and variable sites is that the latter additiona lly includes autapomorphic variation, the observed pattern of informative sites in Figure 2.04 for these treatments makes sense. Treatment LSS would be expected to show the greatest proportion of such sites due to greater evolution along the long external branches. Although treatments LSL and LLS have the same long external branches, they also have a set of long internal branches, which provide d a greater opportunity for parsimony informative variation among the resulting taxa. Treatment SLL did not have lo ng external branches during which substantial evolution could take place. And treatment LSS B had longer external branches than SLL, although it also had four times as many taxa, allowing greater opportunity for informative yet misleading variat ion to occur due to parallelism and convergence. NJ underpredict ing long branches for these and other treatments is as expected since the algorithm sensitive to deviations from its model and generally undercounts rates of change (Tateno et al., 1994) . As expected, c lade accuracy was perfect for all analyses and consensus thresholds when all internal branches were long and when external lo ng branches were disrupted by increased tax a breaking up long branches yet was reduced with treatments of long external branches and at least one set of short internal branches ( Fig. 2.08 ). When c lade resolvability was decreased, the responsible clades wer e consistent ly those denoted by short branches ( Fig. 2.15b - d ). Overall, this set of treatments produced the phylogenetic trends as predicted by theory, simulation, and prior digital evolution research . The eight treatments with directional selectio n and lo ng branches throughout the tree support prior experimental phylogenetics research on natural selection using digital evolution. While it is undetermined whether selection aids or hurts phylogenetic inference generally, experimental phylogenetics research u sing biologi cal systems tends to show that selection 77 produces homoplasy (Bull et al., 1997; Cunningham et al., 1997; Fares et al., 1998) , although see Leitner et al. (1996) , while in digital systems it primarily produces greater synapomorphies (Hagstrom et al., 20 04; Hang et al., 2007, 2003) . Four different selective regimes simultaneously altered the strength of selection and the possibility of selection producing parallelism or convergent homoplasious evolution , with the strength of selection decreas ing from reg imes 1 through 4 and environmental diversity among branches increasing from regimes 1 through 4 (Fig. 2.02) . For example, in regime 1 all branches experienced very strong selection for the same tasks, and in regime 4 each branch experienced rel atively weak er selection for distinct combinations of tasks. Recombination is expected to increase the efficiency of natural selection by fixing beneficial and removing deleterious alleles, so its presence is expected to magnify the effects of selection. The pattern o f branch lengths increasing from regimes 1 to 4 for ML and BI inferred best trees (Fig. 2.05) is evidence that selection for new tasks during later periods of lineage evolution was driving greater evolution. Since NJ did not show this pattern, this suggest s greater selection treatments. Since branches were reduced in length from that observed for stabilizing selection treatments that also had branches of 3,000 generati ons (e.g., lo ng external branch treatments in treatment LLS), selection presumably caused fewer fixations to occur. Other than increased selection fixin g beneficial alleles and removing deleterious alleles, presumably stronger selective sweeps were occurri ng in stronge r selection regimes, further causing a decrease in evolutionary change. Internal branch lengths appeared inflated without recombination and external branch lengths inflated with recombination because recombination treatments included the seedi ng of new lin eages with the entire population instead of the most abundant genotype. This should cause lineage sorting dynamics, resulting in fewer overall substitutions, and delaying some fixation s until later in the evolutionary history. The trend observ ed across all 23 treatments of ML inferring slightly shorter branch lengths than BI was especially prevalent for internal branch lengths in these treatments and is best exemplified by the selection regime 3 treatments with or without recombination. While i t is not clea r what is 78 causing this near - systematic difference, perhaps either ML or BI is more sensitive to deviations from the model of evolution metric . Comparing the set of treatments wi th recombinat ion to those without, recombination appears to decrease clade resolvability (Fig. 2.11). This is the opposite trend from that observed for 100 - generation per branch treatments (Fig. 2.07) . Overall, it appears that stronger selection in these t reatments inc reases clade resolvability, with homoplasious evolution presumably not being as significant of a concern in this digital evolution system. The final four treatments evaluated the combinatorial effects of selection, recombination, and varying b ranch lengths in the long branch attraction (LSS) design . Selection and a lack of recombination increased both clade accuracy and resolvability compared to the standard LSS treatment (Figs 2.12 and 2.13 ). Whereas selection in the presence of recombination decreased bot h clade accuracy and resolvability compared to the standard LSS treatment. With either recombination present or absent, both clade accuracy and resolvability slightly increased in the less selective environment, regime 3, countering the trend observed for the 3 , 000 - generation per branch treatments. As with the stabilizing selection LSS treatment, clade resolvability was diminished by both two - and four - taxa clades ( Fig. 2.15 ), owing to the presence of internal branches needing to provide phylog enetic suppor t for clades of both tree levels. Overall, complex combinatorial dynamics were observed with these treatments. Accuracy and consensus thresholds Arguably, the most impactful individual result from the T7 work is the connection of MP bootstrap support value s with clade accuracy, with the threshold of 70% indicating very high ( > 95% ) clade accuracy (Hillis and Bull, 1993) . The expansive set of experimental phylogenetics data presented here was used to revisit this clad e - level accur acy result within the context of ML and BI, and to further consider it within a tree - level context. The relative proportion s of true and false clades were directly compared across support values as a measure of the probability of a clade bein g correct (Fi g. 2.17 ). Avidian evolution ML analyses with the troublesome treatments supported the results of Hillis and Bull (1993) , with 79 bootstrap proportions greater than 30% being conservative in accuracy, and very highly so for proportions greater than 70%. BI analyses for the troublesome treatments provide strong evidence that posterior probabilities are ve ry close estimates of clades accuracy, confirming analyses of simu experi mental conditions, BI posterior support is a closer estimate to clade accuracy than is ML bootstrap support for s upport values greater than 60%. Curiously, the range of 30 - 50% in which BI posterior support fails to closely track clade accuracy does not app ear to be especially under - sampled, and so it is unclear why these median values are overly conservative. While a few simulation studies suggest that BI posterior probabilities may be overly liberal estimates, only two support values (20% and 80%) were ind eed liberal, and neither were overly so. This result suggests that BI posterior probabilities are a much better r eflection of clade accuracy than bootstrap support, at least for most support values of interest to systematists. It also suggests that the com monplace 100% posterior support value may be regarded as a significant indication of clade accuracy rather than a s an overly liberal estimate. While the individual accuracy of a clade is important, clades are rarely just evaluated alone but also in the con text of one another in a tree, and therefore it is important to examine clade accuracy, as well as resolvability, in a whole - tree context (Figs. 2.18 - 16). Greater assurance of clade accuracy (for BI, anyway, based on results in Fig. 2.17 ) using more strict consensus support thresholds entail ed a stark trade - off with clade resolvability even for the non - troublesome tr eatments (Fig 2.18 a - b) , although it had extreme consequences in troublesome treatments (Fig 2.19a - b) . For both sets of treatments, BI maintaine d greater clade resolvability at stricter thresholds and therefore produced the most overall accurate trees even at very strict consensus thresholds. This increased clade resolvability caused BI consensus trees to have much greater average topological accu racy than MP and ML . A systematist might use the MRe method to produce a fully - resolved tree that considers boots trap consensus support information. The data presented here suggest that this would be misguided, as it sacrifices clade accuracy and/or clade resolvability with little gain. 80 Across both clade - level and tree - level measures, a ML 70% consensus support tree is approximately as accurate as a BI 95% consensus support tree. As shown in Fig. 2.17, since ML bootstrap support is so conservative, at the 7 0% threshold clades have approximately a 95% probability of being correct. This compares to the much closer measu re of accuracy provided by BI posterior probability, with a 95% posterior clade support having approximately a 95% probability of being correct . For conditions in which phylogenetics should perform well, as with the non - troublesome treatments, a 70% consen sus ML tree may be approximately equally expected to exhibit perfect clade accuracy, clade resolvability, and therefore average topological acc uracy as a 95% posterior support BI tree (Fig. 2.17). And for more difficult conditions, these trees are similar yet with a few differences; specifically, a 70% consensus ML tree may have a slightly lower chance of having perfect clade accuracy, a slightly higher chance of being fully - resolved, and overall a slightly greater chance of having perfect topological accur acy than a 95% posterior support BI tree (Fig. 2.19). Conclusions As the most ambitious examination of phylogenetic accuracy in an experimental system to date, t his work demonstrates the use of digital evolution in experimental phylogenetics as a powerful tool for the evaluation of phylogenetic inference . This work has shown that Avida can be used to approximate the basic molecular evolution cha racteristics and phylogenetic inference results obtained in the foundational T7 experimental phylogenetics resear ch of Hillis et al. (1992) and Bull et al. (1993) , and supports the bootstra p - accuracy conclusion of Hillis et al. (1994) . Avida can also be used to expand upon these earlier experiments by evaluating a r ange of designs expected to cause phylogenetic inference complication, thereby demonstrating the greater utility and generality of the digital evolution approach over biological systems for experimental phylogenetics. A fundamental phylogenetics phenomeno n has been demonstrated using this system across many treatments that greater evolution aids phylogenetic infer ence except in 81 conjunction with lesser evolution among internal branches; although, uninvestigated here, too much evolution along branches will eventually swamp out positively informative signal. Directional selection has been shown to aid phylogenetic in ference in this system, and does so the stronger selection occurs, although directional selection in combination with recombination and differi ng extents of lineage evolution produces conflicting results. Strong stabilizing selection in this system can hur t tree accuracy if it sufficiently reduces the magnitude of evolutionary change. It is not clear what the effects of sexual recombination, as i mplemented in Avida, are on phylogenetic inference, as it appears to increase clade resolvability in some circums tances and lower both clade resolvability and accuracy in others. This research suggests several general recommendations for systematists. BI p osterior clade support probabilities are very close estimates of clade accuracy at least throughout the range of 60 100%, inclusive. Bootstrap support values in ML analyses are highly conservative measures of clade accuracy for values of 70% and greater. S ince higher consensus thresholds fail to substantially improve accuracy, although they may drastically reduce cla de resolvability, ML 70% bootstrap support values represent an ideal trade - off between accuracy and resolvability. BI maintains greater clade r esolvability at greater consensus support thresholds than does MP or ML. A 50% majority rule consensus tree provi des a fair trade - off between best tree without s acrificing too much resolvability for most analyses. MRe consensus trees are not useful, as they are either equal accuracy, resolvability, or both. If one wishes to have a fully - resolved tree , then the best tree resulting from the analysis should be used instead of the MRe tree. Under either phylogenetically facile or challenging circumstances, ML 70% consensus trees are approximately equivalent to BI 95% consensus trees for both clade - level and tre e - level measures of accuracy. Finally, there are a few outstanding questions that were not addressed in this stud y. Overall, the effects of natural selection and whether its phylogenetic inference effects are 82 different in biological versus digital systems stands to be further investigated. Since treatments reported here provided conflicting results, the influence of recombination (as implemented in Avida) on phylogenetic inference, separately and together with other factors, remains to be evaluated, in addi tion to its underlying mechanistic effects. The molecular genetics occurring within these Avidian populations, es pecially instruction frequencies and rates of change, was surely different from the Poisson model of evolution used for these analyses. Deviati ons from this model presumably accounted for differences in branch length estimation between analyses and potenti ally contributed to differences in clade support. The relative robustness between ML and BI to deviations from its model of evolution should be further explored. A characterization of the Avidian molecular evolution occurring in these treatments and the ef fects of deviation from its model of evolution might provide insight into differences between analyses. While molecular genetics mechanisms, su ch as selective sweeps and lineage sorting, have been suggested here as likely explanations for certain results, a characterization of such patterns in these evolutionary contexts remains to be shown. Finally, it is unclear whether trees have a higher like lihood of being accurate if multiple phylogenetic analyses, e.g., ML and BI, produce identical topologies, or, in the case of polytomies, at least produce non - conflicting topologies. If not, then these results suggest that BI analyses with posterior probab ility tree support of 95% or greater should be used by systematists desiring an assurance of high clade accuracy and reasonably well - resolved trees. 83 CHAPTER 3 : Digital Evolution Address es Intractable Research Questions in Phylogenetics Introduction The l ack of experimental phylogenetics research following the revolutionary work of Hillis et al. (1992) in constructing and evaluating a known biological evolutionary history has been rem arkab le. After an initial flurry then trickle of related research, critiques against this (Oakley, 2009) . In Chapter 1, I argued that the experimental phylogenetics approach had so far suffere d fro m an imbalance of utility, realism, and generality compared with computational simulations due to the historical use of biological study systems. Instead, digital evolution may preserve the greater biological realism found in experimentally generated evolu tionary histories while not sufficiently reducing the utility in time, resources, and expertise and not sufficiently reducing the generality or universal applicability to other systems compared to simulations. In this manner, digital evolution may fil l the current void between computational simulations and biological experimental evolution for the evaluation of phylogenetic methodologies. In Chapter 2, I presented the digital evolution system Avida as a suitable system for experimental phylogenetics. I did so by demonstrating the correspondence between prior research using biological organisms (Bull et al., 1993; Hillis et al., 1994, 1992; Hillis and Bull, 1993) with digital evolution experiments designed to produce similar results. By extending this work to evaluate a greater range of predicted phylogenetically troublesome conditions, I then further demonstrated the utility and generality of the digital approach to experimental phylogenetics. Those results provided a few outstanding areas for investi gatio n, including the impact of natural selection, sexual recombination, and deviations from the model of evolution on phylogenetic accuracy both clade support and branch length inference and a lack of a 84 detailed understanding regarding the molecular evolu tiona ry circumstances of such experimental treatments. The work presented in this chapter addresses those research questions: Can experimental evolution conditions in Avida be configured to closely approximate the biological reality of molecular evolution ary d ynamics? And if so, does natural selection aid phylogenetic inference, as suggested in Chapter 2? Does sexual reproduction always aid this, or are its effects not easily predictable, as suggested in Chapter 2? Finally, are deviations from the model of mole cular evolution used in a phylogenetic analysis highly impactful on phylogenetic accuracy or are deviations well - tolerated? With respect to the impact of selection and sexual recombination, the experimental treatments investigated here varied in thei r pot ential for neutral evolution or natural selection to occur and whether asexual or sexual reproduction occurred. And to even further differentiate the molecular evolutionary trends as summarized in models of evolution, experiments were initiated with e ither a naïve or pre - adapted ancestor genotype. A fully factorial design was implemented to investigate the effects of all eight combinations of these experimental treatments on the phylogenetic accuracy, both topological and branch length accuracy, of Avi dian evolutionary histories. Deviations from the model of evolution were investigated by comparing phylogenetic analyses using the Poisson model of molecular evolution to empirical substitution models created from the detailed tracking of substitutions tha t occ urred across experimental replicates. For all experiments, experimental evolution conditions were set to approximate biological reality, and this too was evaluated using population genetic analyses to determine whether evolution proceeded as expected. The work presented here demonstrates the greater utility and generality of digital evolution over biological systems for evaluating phylogenetic inference. For example, to create the eight empirical models of molecular evolution, substitutions across a to tal o f 1 , 637 , 600 , 000 generations of evolution were tracked, and to characterize the population genetic dynamics of this evolution, the identified 3,291,266 substitutions were evaluated for their fitness effect 85 ory. The collection and characterization of similar molecular evolution data in biological systems would be prohibitively burdensome if not impossible. Additionally, the design of these experiments strove for biological realism, for which analyses conducte d her e provide more so than those in chapter 2. For example, the population sequence diversity, the distribution of mutational effects, and the extent of evolution per lineage between cladogenic events were well within the range observed in biological syst ems ( e.g., Eyre - Walker and Keightley, 2007; Li, 1997; Pin et al., 2001; Simmons, 2012) . The impact s on phylogenetic inference by model divergence, selection, recombination, and complex population dynamics have been underexplored in phylogenetics research . Digital evolution has potential utility in exploring such phylogenetically difficult sample spaces , and the increased utility and generality of digital evolution can be harnessed to explore questions that simulations have aimed to address. The following work was performed within the context of providing specific case - study examples of the digital exper iment al phylogenetics approach, and it has produced several surprising findings that are at least relevant under the evolutionary conditions investigated. This research shows that recombination may have a beneficial role in phylogenetic inference by encour aging substitutions to occur gradually throughout lineage evolution ; t hat neutral evolution can pose greater difficulty for phylogenetic inference than directional selection ; t hat using a more accurate model of evolution in the phylogenetic analysis may no t off er improvement ; t hat inferred branch lengths may often be quite inaccurate despite clade support being accurate ; a nd , that metrics like bootstrap support and posterior clade support may not be close estimates of clade accuracy. The aim of this work is that it may show the phylogenetics and experimental evolution communities alike the potential for future uses of digital evolution to investigate research questions that are intractable with evolving biological populations. 86 Methods Experimental design Base evo lutionary history Eight experimental treatments were conducted in a fully factorial design of neutral evolution or natural selection, and naïve or pre - adapted ancestor, and asexual or sexual reproduction. E ach treatment was replicated ten times, for the ge neration of a total of eighty 1,024 - taxon fully symmetrical Avi dian evolutionary histories. Unlike in Chapter 2, the base of the ingroup taxa was not a true polytomy but was a root lineage uniting all taxa. This first lineage was initiated with a full pop ulation of the starting organism, and each subsequent lineage b egan with the cloned population from the end of the previous lineage. Therefore, lineage seeding in these experiments was consistent with the selection treatments in Chapter 2, and not like the other Chapter 2 treatments that used the single most abundant genotype to initiate lineages. In this manner, population growth did not occur, and the population size remained very close to 1,000 organisms for the entire evolutionary history. Under this ro oted and fully symmetrical design, ten tree levels of evolution proceeded, producing 1,024 (i.e., 2 10 ) external branches and 1,023 (i.e., 2 10 - 1) internal branches. The per site mutation rate was set sufficiently high to maintain the relative likelihood tha t Avidians would adapt in selective environments, yet low enoug h to be within the range of biological reality (Li, 1997) . Mutation rates from 3 * 10 - 5 to 5 * 10 - 6 were initially evaluated using the default organism, population size, a nd under neutral evolution or a similar selective environment a s described in the conditions here. These populations evolved for 10,000 generations and the sequence diversity of the population every 1,000 generations was evaluated. Variation at each locus was transcoded using the DNA bases for each instructional varia nt so that population genetics software could be used to infer nucleotide diversity, (Nei, 1987) . Rather than using a genetic code or having some other biological meaning, this was simply a means to characterize locus diversity; for example, the four most common variants 87 were coded as A, T , G, or C, and for loci with greater than four variants, all ot hers were coded with a dash, although this rarely occurred. Using the program MEGA, version 6.06 (Tamura et al., 2013) , sequence diversity was calculated for each timepoint and compared to the biologically realistic range of 0.0 05 to 0.02 (Pin et al., 2001; Stone et al., 2002) . The only mutation rate evaluated that was within this range for nearly all conditions and gener ations observed was 1 * 10 - 5 . Since Avidians reproduce with over - lapping generations such that the was 5 * 10 - 6 . For comparison, this rate is approximately forty times lower than in Chapter 2. The number of generations each lineage evolved per branch was set so that branch lengths were reasonably short. The objective was to produce branches of lengths o n the shorter range of biological reality, so that future work may use taxon sampling on the resulting evolutionary histories to evaluate longer branch lengths. Of the phylogenetic simulation studies surveyed, many evaluated branches as small as 0.01 subst itutions per site and as large as 0.4 ( e.g., Wiens and Cannatella 1998; Wiens and Soltis 2005; Simmons 2012) . A total of 5,000 generations of evolution per branch was chosen, yielding an expected neutral evolution branch length of 0.025 substitutions per s i te, or expectedly lower under selection. This is approximately the length of the shortest branches inferred in Chapter 2 treatments. The total extent of evolution from root - to - tip experienced by a lineage evolving across eleven tree levels was 55,000 tota l generations and expected to be 0.275 substitutions per site under neutral evolution , which simulation analyses suggest should be within optimal ranges across phylogenetic difficulty conditions (Klopfstein et al., 2017) . Non - default Avida settings shared by all experiments included the following. Both the neutral and default sexual instruction sets allowed only 20 possible characters to mutate. The default instruction did so by disallowing t h e same set of instructions from being incorporated into the genome via mutation as in Chapter 2, except for h - copy , required for Avidian genome which an offspring i s placed randomly into the population instead of near their parent, resulting 88 in a lack of spatially - and genetically - structured populations and increasing the effectiveness of selection because individuals were equally likely to compete for space with rel a tives and nonrelatives alike. Population details, including genotype frequencies, were recorded every 100 generations, resulting in 50 sampling timepoints per branch lineage. All other settings were as the default, and Avida (Ofria et al., 2009; Ofria and Wilke, 2004) version 2.14.0 was used with modifi cations including the neutral i nstruction set and the extra computational tasks available for reward, as explained below. N eutral evolution versus natural selection Treatments varied in whether neutral or adaptive evolution occurred. Avida was modified to incorporate a new instruction s et that consisted of 20 copies of the nop - X instruction (e.g., nop - X A , nop - X B , nop - X C , etc.), varying only in their alphabetic abbreviation. This instruction does not take any computational resources to perform unlike other A vidian instructions, including nop - C , that do to various degrees. Therefore, there should not be any stabilizing selection or the potential for any form of adaptive evolution with Avidian genotypes using this set of instructions. One additional instruction other than a nop - X analogue , c alled repro - sex , was used that was necessary and sufficient for reproduction for these genotypes; st position in the genome and was excluded from all a nalyses. While the reproduction instruction was excluded from being introduced via mutation, it had the possibility to mutate away causing inviability. In this manner, very, very weak stabilizing selection occurred as it was limited to one locus. Selective environments were designed to produce very gradual adaptive evolution throughout the evolutionary history while additionally minimizing the potential of homoplasy. Each of the 2,047 lineages within the evolutionary history were exposed to a distinct selec tive environment to limit homop (Fig. 2.02). As with those treatments, a nesting environmental structure promoted continual adaptation, with all ancestrally rewarded tasks also rewarded in derived branches. Here, every b ranch additionally rewarded ten new tasks. To accomplish this, 215 new tasks were coded, 89 bringing the total number of available Avidian tasks to 348, including one - , two - , and three - input logic or mathematical tasks. Using th ese, relative task difficulty w as roughly approximated by evolving replicate Avidian populations in the full - task environment and recording how many generations elapsed before a mutation conferring it arose. A total of 277 tasks were identified from these approximated difficulty classes and environments were configured such that two tasks from each class were awarded per branch to promote steady adaptation across lineages. Since realistic population genetics were sought, selective environments were designe d to decrease the magnitude and frequency of selective sweeps, e.g., fixation within fewer than 100 generations. Each task was rewarded with a 10% increase in merit, which was thought to be approximately minimally sufficient to promote phenotypic evolution . Note, there was no tradeoff i n merit for performing computationally more difficult tasks, as each was equivalently rewarded. The ten new tasks rewarded each branch therefore allowed a maximal selective advantage of 1.1 10 (i.e., 2.6x). In practice, howeve r, Avidian populations rarely i f ever reached full proficiency in rewarded tasks. At the very most, a difference between an Avidian performing no tasks versus performing all 110 tasks rewarded in an external lineage environment was 35,743x. Since task perf ormance usually entails a nomin al reduction in offspring cost relative to merit and since Avidians tend to evolve tasks sequentially, the realized differences in fitness between contemporaneous individuals are most likely greatly reduced from these values. These conditions varied from t hose of Chapter 2, in which all treatments experienced at least relatively weak stabilizing selection, and in which the most selective environment rewarded nine tasks with a maximal advantage of performing all versus n o tasks of over 30 million. Naïve vers us pre - adapted ancestor The ancestor used to initiate the evolutionary histories varied by being either naïve or pre - evolved. Each ancestor had a genome of 1,000 loci and this size was fixed by disallowing insertion or deletion mutations. Rare Avidian prog rammatic circumstances that may otherwise disassociate positional homology among loci was also controlled by mandating offspring size to 90 be identical to its parent and disallowing unstable genotypes from reproducing. A s in Chapter 2, the naïve ancestor was equivalent to the default Avidian genotype although with additional nop - C instructions as genomic filler between the two strings of instructions necessary for reproduction. In this case 900 such instructions were adde d. As before, this genotype can reprod uce but perform no other meaningful computation, for example a task rewarded by the environment. Whereas the sought distinction between the naïve and pre - adapted ancestor in Chapter 2 was that the latter would be task proficient in the environments encoun tered during the evolutionary history, this was not the rationale between the naïve and pre - evolved ancestors used here. Recognizing the tremendous difference between the model of evolution experienced by the naïve anc estor and a biological organism, the g oal for the pre - evolved ancestor was to create a genotype that had instruction frequency comparable to typical amino acid profiles in biological organisms. The was accomplished by evolving ten replicate Avidian populat ions from the default ancestor through millions of generations of evolution. The selective environment rewarded the remaining 71 tasks that were not rewarded among any branch environment in the selection treatments, although since only five tasks were perf ormed by the identified genotype, the environment largely fostered stabilizing selection. After periodically evaluating random organisms for instruction frequencies comparable to biological taxa, the genotype chosen as the pre - evolved ancestor had experie nced 2,038,000 generations of evolutio n. Table 3.01 presents summary statistics for the pre - evolved and naïve Avidians in addition to several biological taxa, whose empirical amino acid frequencies were obtained using a diversity of methods and with taxa a cross the tree of life. While the naïv e ancestor had 98.8% of its genotype as nop - C instructions and several instructions unrepresented, the pre - evolved ancestor had a frequency distribution approximately comparable to biological organisms, with only 9.9% of its genome as nop - C . Ancestors did not substantively differ in sequence variation between neutral and selection treatments, with the same genotypic sequence of instructions used for each but under differing instruction sets (i.e., 91 the C instruction code d for nop - C in selection treatments an d nop - X C in neutral treatments). The only sequence difference was the presence of the additional final instruction for neutral treatments, which was necessary for reproduction, and was excluded from all phylogenetic an alyses. Table 3.01. Summary statistics for empirical amino acid frequencies for ancestor Avidian instruction frequencies and sets of biological taxa, with values scaled to 0 1,000 for ease of comparison to a 1,000 - loci genome. For example, a minimum of 8 i ndicates that the lowest frequency for any of the twenty amino acid frequencies in the dataset was 8 out of 1000 (i.e., 0.8%). Some empirical frequencies were accessed via the ExPASy resource portal (2012) . Pre - evol ved ancestor Naïve ancestor Brooks et al. (2002) UniProt Consortium (2017) McCaldon and Argos (1988) Hormoz (2013) King and Jukes (1969) Minimum 10 0 8 11 13 9 13 Median 55 0 48 54 53 53 47 Maximum 99 988 89 97 90 102 81 Standard Deviation 23 215 25 22 21 24 20 Asexu al versus sexual reproduction Recombination in sexual reproduction treatments occurred differently than in Chapter 2. In Avida there are four settings that determine the genetical recombination process, with additional settings related to other factors lik e mate choice or the two - fold cost of sex, which are of no concern here. The first setting is RECOMBINATION_PROB , which is the probability that recombination will occur between a pair of mates. A value of 0 was used for the asexual reproduction treatments and 1 for sexual treatments bo th here and in Chapter 2. Note that this setting is different from the recombination rate value used for biological organisms, where a rate of 0 means there is a nil chance of recombination between two specified loci (i.e., co mplete genetic linkage disequi librium) and the maximum rate is 0.5 in which there is a 50% chance of recombination between two loci, (i.e., independent assortment). The second setting s the fixed position of recomb ination breakpoints. Modules divide the genome evenly such that 25 modules for a 100 - length genome would be four loci each. The end of each module is a 92 potential recombination breakpoint. Each module is evaluated independently when recombination occurs bet ween a pair of mates, and each module has a 50% chance that recombination will occur. The other two settings, CONT_REC_REGS and CORESPOND_REC_REGS govern how modules are swapped, and are unimportant here, with values of 0 used for each. In Chapter 2, recom bination occurred via independent assortment among all loci, with the number of modules set as the genome size. Here, the number of modules was set to 0, which causes two breakpoints to be chosen at random within the genome. T hus, linkage disequilibrium be tween two loci is negatively associated with greater genomic distance and will degrade over generations of reproduction as recombination probabilistically disassociates alleles. Therefore, Chapter 2 recombination is biological ly analogous to each locus bei ng like genes far apart on a chromosome or as different chromosomes altogether, while recombination in these experiments is more analogous to a sequence of DNA for which only very far apart loci associate independently. Analys es Distribution of mutational effects The fitness of all one - step mutants was calculated with respect to a randomly chosen Avidian to evaluate whether the distribution of mutational effects was approximately like that observed for biological organisms. Sin ce the genome was 1 ,000 characters in length and with 19 possible mutant instructions at each position, the relative fitness of 19,000 Avidians was recorded for each genotype evaluated. A systematic examination of the distribution of mutational effects was not conducted acro ss an entire evolutionary lineage nor for all extant organisms at a population timepoint. Instead, genotypes were evaluated at random across a few treatments to roughly gauge adherence to biological examples. The genotypes chosen for inc lusion here are app roximately representative of others evaluated from similar evolutionary histories evolved under selection and include the following: The genotype of the pre - evolved ancestor was evaluated in both its ancestral (71 - task) environment and i n the 10 - task envir 93 evaluated in the root selective environment. A descendant of the pre - evolved ancestor at the last timepoint on one external branch was evaluated in that bra - ta sk environment, as well as that for a descendant of the naïve ancestor in the same selective environment. In this manner, the distribution of mutational effects was recorded at the beginning and very end of the evolutionary history for b oth naïve and pre - e volved treatment conditions in addition to the final state of the pre - evolved ancestor in its ancestral environment. Characterization of observed substitutions and empirical models of evolution All alleles that reached fixation in a popu lation were identif ied and tracked throughout their frequency trajectory during segregation. Alleles were considered as segregating within a population if their frequency was between 5% and 95%, above which they were effectively fixed and below which they were too infrequent to track. A substitution occurred when an ancestral fixed allele was supplanted by a derived allele that reached fixation, assessed as greater than 95% frequency. The fifty 100 - generation population timepoints for each branch were evalu ated for all such o ccurrences. If a substitution entirely occurred between data lleles at that locu s were recorded for each timepoint during which it was segregating. For each substitution, its locus, derived and ancestral state, origin and fixed branch, and number of generations to reach fixation were recorded. Data were used to eval uate various distri butions, including the frequency of fixations per branch, frequency of fixations per generation per branch timepoint, substitutions per locus, number of generations to fix, and the generation on the fixed branch. For selection treatments , the relative fitn ess of each substitution was then evaluated throughout its frequency trajectory. This was calculated for each timepoint by dividing the average fitness of all extant organisms with the derived state by the average fitness of all other ex tant organisms in t he population. If at any timepoint there were no organisms that did not have the derived state, then the relative fitness was considered as 1, although such timepoints 94 were not considered when classifying by fitness type, below. If there was only one such timepoint then it was considered a quick sweep substitution. The minimum and maximum relative fitness values across measured timepoints were used to classify substitutions by selection type (Table 3.02), using t hresholds calculated foll owing nearly neutral theory (Ohta and Kimura, 1971) . Alleles of sufficiently small selection coefficients, s, have a probability of fixation primarily due to genetic drift when . The effective population size, , is less than the maximum populat ion siz e 1,000, so the threshold 0.001 used here is conservative in classifying substitutions as neutral. For example, beneficial and deleterious substitutions were never measured as having a relative fitness within the range 1 ± s, s = 0.001 for which gen etic dr ift should be the dominant evolutionary determinant of their probability of fixation. Neutral - beneficial and neutral - deleterious substitutions had at least one timepoint for which their relative fitness was within this range and at least one timepoi nt abov e or below, respectively, and beneficial - deleterious substitutions may or may not have had one or more timepoints within the neutral range although they had at least one timepoint with a fitness above and at least one below. Table 3.02 . Classificati on of s ubstitution s by the minimum and maximum relative fitness recorded across measurable timepoints throughout their frequency trajectory. Threshold values were determined based on ne arly neutral theory (Ohta and Kimura, 1971) for a population size of 1,000, and fi tness t ypes are mutually exclusive. See the text for a description of a seventh type, q uick s weep. Substitution Fitness Type Minimum Relative Fitness Maximum Relative Fitness Beneficial > 1.001 Neutral Beneficial ( neu - ben ) 0.999 1.001 > 1.001 Neutral 0.999 1.001 Neutral - Deleterious ( neu - del ) < 0.999 0.999 1.001 Deleterious < 0.999 Beneficial - Deleterious ( ben - del ) < 0.999 > 1.001 The tracking of ancestral and derived states for all substitutions a lso allowed the creation of empirical fixed - rate models of Avidian instruction substitution. The structure of the 95 models resembles empirically - derived amino acid models ( e.g., Dayhoff et al. 1978; Jones et al. 1992; Whelan and Goldman 2001; Le and Gascuel 2008) of unequal, fixed state frequencies for the 20 amino acid characters and the 190 general time - reversible substitution rates. Each model was parameterized u sing substitution rates and state frequencies pooled across all ten replicates for each of the eight experimental treatments. Note that substitution rates were treated as time - reversible, i.e., not differentiating between ancestor and derived character sta tes in pooling X - to - Y and Y - to - X substitutions together, since greater - parameterized models were not compatible with the phylogenetic programs. Phylogenetic analyses For each of the eighty experimental treatment replicates, taxon - character datasets were c reated using all 1,024 extant populations. As in Chapter 2, a random organism was sampled from each population and the outgroup taxon was a randomly sampled organism from an extant population of a different replicate of the same experimental treatment. The complete sequence of a sampled organism was used, except for the final 1,001 st character for neutra l treatment organisms. As in Chapter 2, three characters, J, O, and B, were translated to amino acid abbreviation counterparts W, Y, and V, respectively, fo r compatibility with phylogenetic programs designed to handle amino acid abbreviations. Perfect posi tional homology was maintained with fixed organism genome lengths, so alignment was not required. Each of these datasets was used to conduct six different p hylogenetic analyses. Phylogenetic analyses generally occurred identically to those in Chapter 2, al though no - joining (NJ) trees were constructed using QuickTree, v ersion 2.0 (Howe et al., 2002) , and maximum parsimony (MP) was implemented using MPBoot, version 1. 1.0 (Hoang et al., 2018, 2017) . As before, the number of parsimony trees evaluated was 10,000 to conduct a more identified equally parsimonious trees. ML was i mplemented using IQ - TREE, version 1.5.5 (Minh et al., 2017; Nguyen et al., 2015) th the greatest 96 likelihood value. A total of 100 nonparametric bootstrap replicates were also produc ed and used to determine the relative proportions of true and false clades with support values over 10%, a cutoff that excluded the excessive number of very infrequently inferred clades. BI was implemented using MrBayes, version 3.2.5 (Ronquist et al., 2012, 2011) , with the parallel processing imp lementation and the BEAGLE library (Altekar et al., 2004; Ronquist et al., 2012) , with the Markov chain sampled every 100 generati single phylogram with the greatest posterior probability, also termed the maximum clade credibility tree. The post - burnin sampled trees were used to determine the relative proportions of true and false clades with support values over 10%. For maximum likelihood (ML) and Bayesian inference (BI), separate analyses were performed using the Poisson fixed - rate model and th e calculated empirical model corresponding to the treatment. For example, analyses of taxon - characte r datasets from the neutral treatment under asexual reproduction and starting from the naïve ancestor used the empirical model as calculated from the substi tution data pooled across all replicates from that treatment. Python, version 2.7, and the following packages were used to organize and present these data: Jupyter, version 0.27.0 (Kluyver et al., 2016; Perez and Granger, 2007) ; Matplotlib, version 1.3.1 (Hunter, 2007) ; ETE2, version 2.2.1 (Huerta - Cepas et al., 2016) ; and DedroPy, version 3.12 (Sukumaran and Holder, 2010) . Topological accuracy between the true tree and inference tree was calculated using the variants of the Robinson - Foulds (RF) distance discussed in Chapter 2. Briefly, the RF distance is the sum of false positive (FP) branches and false negative (FN) branche s, which can also be calculated as rates. To e mphasize the accuracy of these inference methods, I report the FN ra te (i.e., FN divided by the number of internal branches in the inferred tree). The arithmetic mean of the FP and FN rates is the average topological error (S wenson et al., 2010) overall accuracy of the phylogenetic inference by equally weighting correctly resolved clades 97 an d unresolved clades. Examples of these metrics are shown for a comparison of a known (or correctly inferred) cladogram (Fig. 2.03a) to four variously inaccurate inferred cladograms (Fig. 2.03b - e). Note that the trees reported here, i.e., the best tree topo logy produced by each analysis, is fully resol ved for each analysis except BI (since maximum clade credibility trees are not necessarily fully resolved). Since each non - BI best tree may include incorrect clades, each false positive clade requires a counter part false negative clade, and therefore clade accuracy, clade resolvability, and average topological accuracy are equivalent values (e.g., Fig. 2.03b,e). Additional measures include comparisons of inferred branch lengths and the amount of variable and par simony informative sites in the taxon - characte r dataset. Branch lengths are summarized as the median length across all internal branches and, separately, across all external branch lengths. Tracking all substitutions that occurred within each evolutionary history allows the calculation of the true med ian substitutions per site per branch. These rates are compared to the median inferred branch lengths, as well as the expected value based on neutral theory (Kimura, 1983) . Locus positions are variable if at least two types of characters exist among the set of taxa, and parsimony informative if at least two sets of characters are found among at least two taxa each. Statistics Statistical te sts were used to evaluate the significance of differences between treatments and population genetic expectations. When comparing variation across replicates per treatment with a null hypothesis derived from pop ulation genetics theory, single - sample t - tests were conducted. For example, the number of substitutions per evolutionary history was evaluated using single - sample t - tests with the null hypothesis being that they were equivalent to the product of the mutati on rate, genome size, number of generations pe r branch, and number of branches per tree, and this was conducted separately for each treatment. To evaluate significant differences among treatments, one - way ANOVA analyses were used, e.g., if there was a sign ificant difference in the number of substituti ons per evolutionary history among neutral treatments. Post - hoc analyses to determine the pairwise treatments driving the 98 statistical significance were conducted using t - tests with Bonferroni corrections for mu ltiple comparisons. For the relative proportio ns of substitutions by fitness type, Chi - square tests were used. These included a four - by - two test to evaluate a difference among neutral treatments for quick sweep versus those taking greater than 100 generati ons to fix, and a four - by - seven test to evalua te a difference among all fitness types for selection treatments. Following this, the adjusted standardized residuals were evaluated for significance using a Bonferroni adjustment to the z critical value for th e table size, and a four - by - two Chi - square tes t was used with a (Sharpe, 2015) . As in Chapter 2, the focus of the phylogenetic results is on comparisons of median values and trends across treatment replicates due to the recognized lack of independence in phylogenetics, including among topology metrics such as clade accuracy and resolvability and for branch lengths, since the presence of a non - inclusion in the tree. Results Empirical models of e volution Separate empirical fixed - rate models of Avidian instru ction substitution were created for each treatment and parameterized with rates and state frequencies pooled across all ten replicates per treatment. The two models presented in Table 3.03 repr esent the extremes of two trends of lesser versus greater frequ ency variation among the full set of eight models constructed. Within neutral evolution treatments, as shown in the first model in Table 3.03, the character - to - character substitution rates and per character state frequency values were approximately equaliz ed, at least for all non - C characters, since all mutations were equally likely to occur and therefore result in substitution since fitness effects were nil. Within naïve ancestor treatments (e. g., the first model in Table 3.03), substitution rates and char acter state frequencies remained high for all values involving character C (i.e., nop - C in selection or nop - X C in neutral treatments) due to its prevalence within the naïve ancestor genotype (i .e., 99 constituting 988 out of 1,000 characters; Table 3.01). Con versely, and as shown in the second model presented in Table 3.03, selection treatments had greater variation among substitution rates and character state frequencies, at least for all non - C ch aracters, since fitness effects were not equivalent among chara cter states. And pre - adapted ancestor treatments (e.g., the second model in Table 3.03) had lesser variation among substitution rates and character state frequencies for values involving charac ter C due to greater uniformity among character state frequenci es in the pre - adapted ancestor genotype (Table 3.01). For reference, the Poisson model has equivalent rates and state frequencies for all values (Bishop an d Friday, 1987) . Using the scaling in Table 3.03, substitution rates are uniformly 2.63 (i.e., ) and character state frequencies are 5 (i.e., ) under the Poisson model . 100 Table 3 .03 . Two examples of empirical fixed - rate models of Av idian instruction substitution out of the eight models constructed, one per experimental treatment, by pooling substitutions observed across all ten replicates. Character - to - character substitution rates for the neutral, asexual evolution treatment starting with the naïve ancestor are above the diagonal and character state frequencies are the *first line of values below. Substitution rates for the selection, asexual evolution treatment from the pre - evolved an cestor are below the diagonal and state frequencie s are the **second line . Note that the instructions are arrayed by the alphabetical order for full amino acid names, as is the convention for phylogenetic programs. Substitution rates are specified relative to a total of 500 per treatment, and state freque ncies are out of 100 per treatment. Neutral, Asexual, Naïve Treatment Model of Substitution Rates and Character State Frequencies* A R N D C E Q G H I L K M F P S T W Y V Selection, Asexual, Pre - evolved Treatment Model of Rates and Frequencies** A 0 .6 0.6 0.6 21.4 0.6 0.6 0.6 0.5 0.6 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 R 4.6 0.5 0.5 21.2 0.6 0.6 0.5 0.5 0.6 0.5 0.6 0.6 0.6 0.6 0.5 0.6 0.6 0.6 0.6 N 4 5 0.5 21.3 0.6 0.6 0.6 0.5 0.6 0.5 0.5 0.5 0.6 0.5 0.5 0.6 0.5 0.5 0.5 D 3.9 5 4.7 21.2 0.6 0.6 0.6 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.5 0.5 0.5 0.6 0.6 C 5.6 5.5 4.9 5.5 21.6 21.4 21 21.3 21.5 21.4 21 21.5 21.3 21.3 21.3 21.5 21.4 21.3 21.5 E 4.7 5.1 4.7 4.7 5.2 0.6 0.5 0.5 0.6 0.5 0.6 0.5 0.6 0.6 0.6 0.6 0.5 0.6 0.6 Q 4.1 4.6 4.3 4.2 4.7 4.3 0.6 0.6 0.5 0.6 0.6 0.6 0.6 0.6 0.5 0.6 0.6 0.5 0.6 G 0.6 1 0.9 1.1 0.9 0.8 0.5 0.5 0.6 0.5 0.6 0.5 0.6 0.6 0.5 0.6 0.6 0.6 0.6 H 4 4.7 4.6 4.4 4.8 4.4 4 0.9 0.5 0.5 0.5 0.5 0.6 0.5 0.5 0.6 0.5 0.6 0.6 I 3.8 4.5 5 4.2 4.8 4.2 3.8 1 4.2 0.5 0 .5 0.5 0.6 0.6 0.5 0.6 0.5 0.6 0.6 L 1.1 1.3 1.2 1.4 1.4 1.3 1.2 0.3 1.2 1.3 0.5 0.5 0.5 0.5 0.4 0.6 0.5 0.5 0.6 K 2.7 3 3.1 2.6 2.9 2.9 2.7 0.8 2.8 3 1.6 0.5 0.7 0.6 0.6 0.6 0.6 0.6 0.5 M 3.3 3.7 3.9 3.4 4.4 3.3 3.2 0.6 4.2 3.6 0.9 2.3 0.6 0.6 0. 6 0.5 0.5 0.5 0.6 F 0.4 0.9 0.8 1.1 0.7 0.7 0.5 0.6 0.7 0.9 0.1 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 P 1.3 1.7 1.6 1.4 1.8 1.5 1.5 0.3 1.4 1.1 0.3 0.8 1.2 0.5 0.6 0.6 0.5 0.6 0.6 S 4.2 4.7 5.2 3.9 5.3 4.4 3.7 0.8 4.8 4.1 1.1 4 3.3 1.2 1 0.5 0.5 0.6 0.5 T 0.9 0.8 1 0.9 1.1 0.8 0.7 0.4 0.9 1.1 0.4 0.9 0.8 0.3 0.2 0.9 0.6 0.6 0.6 W 2.9 3.1 2.8 2.7 2.9 2.9 2.6 0.8 2.5 2.5 1.4 2.9 1.9 0.5 0.9 2.5 0.7 0.6 0.6 Y 4.1 4.3 5.5 4.2 4.8 4.3 3.8 0.9 4.4 4.7 1.3 2.8 4 0.8 1.3 5.8 1 2.5 0.6 V 4.7 5 4.6 4.7 5.7 4.8 4.3 0.9 4.7 4.3 1.2 2.7 3.6 0.8 1.9 4.6 0.8 2.9 4.3 * 3.2 3.1 3.1 3.1 40.5 3.2 3.2 3.1 3.1 3.1 3.1 3.1 3.1 3.2 3.2 3.1 3.2 3.1 3.1 3.2 ** 6.1 6.8 6.8 6.4 7.3 6.5 5.9 1.4 6.4 6.2 2 4.5 5.2 1.3 2.1 6.5 1.5 4.2 6.5 6.6 Distribution of mutationa l effects Clear trends were found when the fitness of all mutants with a single difference was evaluated for a randomly chosen Avidian from a specified population in its selective 101 environment. The proportion of lethal mutations (i.e., relative fitness equa l to 0) is largely consistent across genotypes and environments, averaging 11% per genotype (Fig. 3.01). Using the conservative nearly neutral theory threshold of mutations being fitness neutral within the fitness range of 1 ± 0.001 (Table 3.02), neutral m utations w ere the largest category of possible mutations, with 42 83% per genotype. Neutral mutations were more common for the two root genotypes than the three genotypes evaluated at the termination of their evolutionary history in the selective environme nt, at 78% and 48% on average respectfully. Terminal genotypes include descendants in the tip environments of the evolutionary history as well as the pre - evolved ancestor in its ancestrally adapted environment. Small effect beneficial mutations, having up to 1% fitn ess effect, were extremely rare for all organisms, averaging less than 0.09% of possible mutations, and large effect beneficial mutations (i.e., relative fitness greater than 1.01) were rare, averaging less than 0.7%. Small effect deleterious mut ations wer e relatively more common than beneficial mutations, with 10% on average for root genotypes and 3% for terminal genotypes. Finally, moderate and large effect deleterious mutations (i.e., relative fitness greater than 0 and less than 0.99) were mor e prevalen t for the three terminal genotypes, averaging 37% compared to the root genotypes at 4%. Note that although relative fitness was calculated with high precision, values are collected here in bins of varying size to highlight proportions of mutants of certain effects. 102 Figure 3.01. The distribution of mutational effects for five genotypes with the relative fitness proportions of all genotypes differing from a randomly chosen genotype by one mutation. Relative fitness bin size varies between 0.1, 0.5 , and a logarithmic scale of base 10, with differences in scaling demarcated by red dashed lines below the x - axis. The genotypes evaluated included the pre - evolved ancestor in its ancestral selection environment (black), the same genotype in the selective environm ent at the root of the evolutionary history (orange), the naïve ancestor in the same root environment (yellow), a final descendant of the pre - evolved ancestor at the tip of the evolutionary history (blue), and a final descendant of the naïve ancest or in th e same tip environment (green). Characterization of observed substitutions Fewer total substitutions occurred than expected under the neutral theory for all treatments. According to neutral theory (Kimura, 1983) , the per locus per generation mutation rate is equal to the substitution rate for neutral alleles. Therefore, across 2,047 lineages (i.e., evolutionary history branches) ea ch of 5,000 generations with a mutation rate of 5 * 10 - 6 , the genome of 1,000 loci should experience 51,175 substitutions. Each of the eight treatments were evaluated for a significant difference from this expected value for neutral alleles and was signifi cantly lower ( t 9 > 3, p < 0.05; Fig. 3.02). 103 Figure 3.02. Number of recorded substitutions per treatment (mean ± 95% CI) relative to the expectation from neutral theory (dashed line). Experimental treatments are denoted by their selective condition (gree n labe ls), starting ancestor genotype (pink labels), and recombination condition (orange labels). Evolutionary histories exhibited variation in total substitutions recorded per treatment condition. Compared to neutral evolution treatments that had a mean o f appr oximately 50,500 substitutions per treatment, the selection treatments had at least 30% fewer total fixations, and within selection treatments those with the pre - evolved ancestor had about 15% fewer fixations than with the naïve ancestor (Fig. 3.02). Among the four neutral treatments there was no significant difference between the mean number of fixations per replicate ( F 3, 36 = 0.141, p = neither the mode of re produc tion nor ancestor genotype had an effect. Among the four selection treatments there was a highly significant difference between the mean number of fixations per replicate ( F 3, 36 = 41.28, p < 0.001). Post - hoc analyses demonstrated a highly significant diffe rence ( t 36 > 7, adjusted p < 0.001) between four of the six pairwise treatments. The exceptions were in the comparison of treatments otherwise identical except for having 104 asexual or sexual recombination. There was a non - significant difference in mean numbe r of fixations per replicate between the recombination treatments with selection and the naïve ancestor ( t 36 = 0.556, adjusted p = 1), shown in Figure 3.02 with confidence intervals labeled pre - e volved ancestor ( t 36 = 2.775, adjusted p = 0.052), although the latter was significant for 95% normal - distribution - evolved ancestor fixed significantly fewer substitu tions than those starting with the naïve ancestor, and while treatments with asexual reproduction fixed fewer substitutions than under sexual reproduction the difference was not significant for the naïve ancestor and borderline significant for the pre - evol ved an cestor. Within selection treatments, natural selection influenced nearly all substitutions, both positively and negatively, for at least one generation timepoint sampled across each Across all treatment s, as shown in Figure 3.03, 95 99% of all substitutions were beneficial for at least one timepoint (i.e., pooling beneficial, neutral - beneficial, and beneficial - deleterious fitness type categories) and 77 90% were deleterious for at least one timepoint (i. e., po oling deleterious, neutral - deleterious, and beneficial - deleterious). M ost substitutions were of fitness type beneficial - deleterious (74 89% across treatments) or beneficial (12 20%). There was inconsistent ordering among the next three fitness classe s, alt hough neutral - beneficial substitutions were usually the third most common (0.75 4%), followed by deleterious (0.28 1.64%), and neutral - deleterious (0.4 m utatio n and fixation happened within the 100 - generation data collection window, was the second least common class (0.04 0.26%) and neutral the very least (0.02 0.09%). There was a significant difference in these proportions by fitness type among treatments ( 2 18 , 1268850 = 37028 , p < 0 .01 ) . Evaluation of the adjusted residuals compared to the Bonferroni adjusted threshold indicated that all but four of the 28 factors drove this significance. Two of the exceptions are not especially enlightening (deleterious substi tutions for the asexual 105 reproduction using the naïve ancestor treatment, and neutral substitutions for the sexual reproduction using the pre - evolved ancestor treatment); and the other two concern the observed proportion of quick sweeps in those same treatm ents, indicating that while they are not different from one another (0.13%), they are different with respect to sexual versus asexual reproduction treatments with the same starting ancestor. In comparison, for the neutral treatments only 0.001 0.002% were found to be quick sweeps (i.e., fixation within 100 generations) and there was no significant difference among treatments for the proportion of quick sweeps versus substitutions taking longer to fix ( 2 3 , 2022416 = 1.996 , p = 0.573 ) . Comparing among o ther e xperimental conditions for the selection treatments, asexual treatments had approximately 10% relatively fewer beneficial - deleterious substitutions than their counterpart for the same starting ancestor. This was compensated with twice to four times g reater proportions of each other class, especially neutral - beneficial substitutions. 106 Figure 3.03. Number of observed substitutions by fitness type per selection experiment. Experiments began with either the naïve ancestor under asexual (a, N = 338,936) or sex ual (b, N = 341,842) reproduction, or with the pre - evolved ancestor under asexual (c, N = 286,780) or sexual (d, N = 301,292) reproduction. For each treatment, data from all replicates are pooled, and substitutions are colored by fitness type as per Table 3.02. The number of fixations were random with respect to position in the genome for neutral treatments and highly variable for each individual treatment replicate for selection experiments starting with the naïve ancestor, and there were greater reg ions o f invariance across treatment replicates for selection experiments starting with the pre - evolved ancestor. All neutral 107 treatments exhibited random variation in which loci experienced substitution, as shown in Figure 3.04a for a single replicate. All select ion treatment replicates exhibited a diversity in the number of substitutions per locus, with many sites being invariant or nearly invariant (e.g., Fig. 3.04c). The identity of sites with exceptionally low substitution rates differed for each replica te of the treatment starting with the naïve ancestor; for example, the variation shown in Figure 3.04c for a single replicate is largely averaged out when all replicates of that treatment are pooled (e.g., Fig. 3.04b). Selection treatment replicates initia ted by the pre - evolved ancestor did not have as much variation in which sites had exceptionally low substitution rates, as shown in the pooled replicate data of Figure 3.04d, with many of the same sites remaining invariant for each replicate. In all select ion tr eatments, the regions approximately corresponding to the reproduction machinery of the original naïve ancestor the approximately five positions at the very beginning and very end of the genome experienced few substitutions; further, the pre - evolved a ncesto naïve ancestor, whose genotype was also its own ancestor prior to millions of generations of descent. After the first few positions, approximately the next fifty exhibite d rela tively higher rates of beneficial as well as deleterious substitutions. Only asexual reproduction treatments are shown because these trends do not appear to vary due to reproductive mode alone (i.e., treatments otherwise identical except for sexual v ersus asexual reproduction have similar distributions). 108 Figure 3.04. Representative treatment patterns for the number of fixations by Avidian locus. Selected patterns include single treatment replicates (a and c) or pooled treatment replicates (b and d) , unde r neutral evolution (a) or selection (b d), and starting with the naïve ancestor (a c) or pre - evolved ancestor (d). Substitutions are colored by fitness type as per Table 3.02. Recombination strongly affected how many substitutions fixed within the f irst couple generations after a population was introduced to a new selective environment. For asexual selection treatments (e.g., Fig. 3.05a), a much greater proportion of substitutio ns fixed every 100 generations during the first 600 generations than over 109 similar trend but with a lesser magnitude of difference acros s the branch (Fig. 3.05b). For either reproduction condition, the samplin g interval which exhibited the greatest fixations was the second 100 generations, during which a much greater proportion of beneficial substitutions fixed. Only treatments initiated w ith the naïve ancestor are shown because these trends do not appear to va ry due to ancestor genotype alone (i.e., otherwise identical treatments have similar distributions). Neutral evolution treatments did not exhibit variation with respect to the generat ion alleles fixed within a lineage. 110 Figure 3.05. Representative patter ns for asexual (a) and sexual (b) treatments under selection for the number of observed substitutions fixed per 100 - generation population timepoint across all lineages (i.e., branches ) in the evolutionary history. For each treatment, data from all replicat es are pooled, and substitutions are colored by fitness type as per Table 3.02. Alleles fixed slower than expected under neutral evolution. The time to fixation for neutral alleles in a haploid population should be 2N e generations (Kimura, 1983) , or at most 2,000 for this experiment. Each of the four neutral evolution treatments was evaluated for a significant difference from this expected value and was significantly higher ( t >500000 > 124, p < 0.001), each with a mean of approximately 2,230 generations and a Poisson - like distribution. 111 Among these treatments there was a significant difference between the mean time to fixation ( F 3, 2022412 = 4 .23, p < 0.01). Post - hoc analyses demonstrated that this was driven by a significant difference in two of the six pairwise comparisons: between asexual treatments using the pre - evolv ed versus naïve ancestor ( t 2022412 = 3.17, adjusted p < 0.01) and between asexual versus sexual reproduction treatments starting with the naïve ancestor ( t 2022412 = 2.66, adjusted p < 0.05), however the largest difference in treatment means was less than 9 generations. Owing to this relatively miniscule effect size and with no clear pattern with respect to the effect of reproductive mode or starting genotype, the statistically significant differences between neutral treatments likely bears little importance . Alleles fixed much slower with sexual compared to asexual reproduction when under selection. For each selection treatment, beneficial, neutral, deleterious, and quick sweep substitutions peak in fixing within their first hundred generations (Fig. 3.06); neutral - beneficial and neutral - deleterious alleles peak in fixing within their second hundred generations; and the mean fixation of beneficial - deleterious alleles is not until after approximately 1,100 generations under asexual reproduction and 1,300 under sexual reproduction after they are introduced to the population. Treatme nts with the pre - evolved ancestor (Fig. 3.06c,d) also fixed beneficial - deleterious alleles quicker compared to counterpart treatments with the naïve ancestor (Fig. 3.06a,b). 112 Figure 3.06. Number of observed substitutions by how many generations elapsed b etween entry in the population until fixation for selection treatments. Under sexual reproduction (b and c) the tail for beneficial - deleterious substitutions extends until 12,000 gene rations. For each treatment, data from all replicates are pooled, and sub stitutions are colored by fitness type as per Table 3.02. Fewer fixations occurred every 100 generations under selection, and especially with sexual reproduction. For example, the sel ection with sexual reproduction treatment (Fig. 3.07d) exhibited fewer fi xations per sampled timepoint compared to under neutral evolution (Fig. 3.07b), or with asexual reproduction (Fig. 3.07c), and far fewer when under neutral evolution 113 with asexual repr oduction (Fig. 3.07a). Only treatments initiated with the naïve ancestor are shown because these trends do not appear to vary due to ancestor genotype alone (i.e., otherwise identical treatments have similar distributions). Figure 3.07. Representative tr eatment patterns for the number of observed substitutions fixed per 100 g enerations. Selected patterns include asexual (a and c) or sexual reproduction (b and d), and under neutral evolution (a and b) or selection (c and d) . For each treatment, data from a ll replicates are pooled, and substitutions are colored by fitness type a s per Table 3.02. Neutral evolution caused stochastic frequency trajectory patterns, and asexual reproduction caused the periodic simultaneous fixation of more numerous substitutions compared to sexual reproduction. Population frequency change appears stoc hastic for 114 substitutions under neutral evolution, for example, as shown in Figure 3.08 for all substitutions occurring within one evolving population (i.e., one branch) in the evoluti onary history. A greater number of fixations occurred together under asex ual versus sexual reproduction, as represented by relative line thickness linking observed substitution frequencies in Figure 3.08. For example, at generation 500 and 3,200 of the ase xual population (Fig. 3.08a), many alleles fixed together, while nearly a ll alleles in the sexual population fixed alone (Fig. 3.08b). Additionally, fewer unique genotypes at any given time contained an allele that would eventually reach fixation under ase xual reproduction compared to sexual reproduction. For example, at the en d of the branch (i.e., generation 5,000) in the asexual population, only one genotype of all that existed in the population is shown because only it contains alleles that reached fixa tion on a subsequent lineage following cladogenesis in the evolutionary h istory. In contrast, at the end of the branch in the example sexual population, more than ten genotypes contain one or more alleles that reach fixation following segregation on one or both subsequent branches. 115 Figure 3.08. Representative patterns for ase xual (a) and sexual (b) treatments under neutral evolution for how alleles that became substitutions changed in frequency between entry in a population (measured as 5% frequency) un til fixation ( 95% frequency). Each panel shows a single population (i.e ., one branch from one treatment replicate) on a middle tree level in the evolutionary history, and all substitutions that were segregating before, during, or after this lineage between cladogenic e vents. Observed substitution frequencies are shown as circ les, and straight lines connect measurements every 100 generations. Line thickness indicates the relative number of substitutions on the same fixation trajectory. Natural selection caused punctuated equilibrium patterns for average population fitness, whic h tended to be caused by beneficial alleles quickly fixing. Average population fitness tended to plateau before increasing by approximately 10%, for example, as shown in one evolving population in F igure 3.09d at generation 1,500. Such large increases in a verage population fitness tended to be caused by a beneficial allele quickly fixing and eliminating all variation at that locus, which is why fitness was indeterminable at that timepoint for the all ele 116 (i.e., orange star in circle at 100% frequency at gene ration 1,500 in Figure 3.09c). Under either mode of reproduction, genotypes that eventually reached fixation were often influenced by positive and negative selection for at least one timepoint in th eir population frequency trajectory, in addition to neutra l evolution caused by genetic drift. For example, the genotype in the example asexual population (Fig. 3.09a,b) that entered the population by generation 1,200 experienced strong negative selection within its first couple hundred generations, was within th e 1% fitness range around neutrality for most of its trajectory and was under strong positive selection for at least the 200 generations before fixing by generation 2,800. As under neutral evolution (Fig. 3.08), asexual reproduction with selection caused m any more alleles to fix together, and fewer unique genotypes at any given time contained an allele that would reach fixation (Fig. 3.09a) compared to sexual reproduction (Fig. 3.09c). 117 Figure 3.09. Representative patterns for asexual (a and b) and sexual (c and d) treatments under natural selection for how alleles that became substitutions changed in population frequency (a and c) and fitness (a d), and their effects on average populatio n fitness (b and d). 118 Figure 3.09 (cont d ). Panels a and b show a single population and panel s c and d show a single population, each on a middle tree level in the evolutionary history, and include all substitutions that were segregating before, during, or after this lineage bet ween cladogenic events. Panels a and c: Observed substitution frequen cies are shown as circles, and straight lines connect measurements every 100 generations. Line thickness indicates the relative number of substitutions on the same fixation trajectory. R elative fitness is shown by coloration with values as indicated in the color bar, which includes the nearly neutral theory thresholds for neutrality (Table 3.02) as dashed lines. Orange stars within circles at 100% frequency indicate that fitness was indet erminable. Panels b and d: Relative fitness of substitutions are show n as solid black lines, with thickness indicating the relative number of substitutions on the same fixation trajectory. Dashed black lines indicate the nearly neutral threshold and dashe d colored lines indicate a 1% selection advantage (blue) and disadvant age (red). Average population fitness is shown by the dotted green line and with values on the secondary y - axis. Phylogenetic accuracy Topological accuracy was high across treatments, i ncluding perfect for sexual reproduction treatments, and asexual repro duction treatments yielded improved accuracy when selection occurred. All analyses performed with greater than 96% topological accuracy (Fig. 3.10), although note that for these 1,024 - in group taxon topologies, a 1% difference in accuracy is approximately e quivalent to 10 clades being incorrectly inferred. Topological accuracy was perfect in sexual recombination treatments, reduced to about 99.8% (i.e., around 2 incorrect clades) for asexu al selection treatments, and reduced to approximately 98% (i.e., aroun d 20 incorrect clades) for asexual neutral evolution treatments. Ancestor genotype did not seem to affect topological accuracy. 119 Figure 3.10. Topological accuracy, the average percenta ge of clades correctly inferred and correctly resolved, for each phylo neighbor joining (NJ), maximum parsimony (MP), and maximum likelihood (ML) and Bayesian inference (BI) with the Poisson or emp irical model of evolution; open symbols for individual replicates and closed for the median across replicates. Experimental treatments are denoted by their selective condition (green labels), starting ancestor genotype (pink labels), and recombination cond ition (orange labels). Differences between phylogenetic analysis perfo rmance among the asexual treatments, which did not reach perfect topological accuracy, was slight. The overall trend was that the median NJ replicate performed slightly worse than other analyses and BI tended to perform the best (Fig. 3.10). The exception is the most inaccurate treatment (i.e., asexual reproduction neutral evolution with the naïve ancestor), for which the median MP tree was more accurate than all other analyses, and this is especially surprising considering that the MP best tree selection w as chosen from between six to 100 equally parsimonious trees across this 120 only slightly varied when conducted with the Poisson versus empirical model of evolution. T he greatest difference was in the selection treatment with asexual reproduction and the naïve ancestor for which the Poisson model had 0.05% greater median accuracy. Unlike in Chapter 2, the number of taxa analyzed here contributed to a lack of full resolv ability in BI tree inferences. All non - BI analyses produced fully - resolved trees, so for these trees, topological accuracy is equivalent to clade accuracy and clade resolvability. Clade resolvability was equivalent for each BI model comparison per treatmen t and very nearly so for each treatment comparison among neutral and among selection treatments, so the very slight differences in topological accuracy between such comparisons are attri butable to slight differences in clade accuracy alone. The greatest di fference in accuracy between any such comparisons is for the selection treatment with asexual reproduction and the pre - evolved ancestor, for which the empirical model had 0.05% greater m edian accuracy and therefore 0.025% greater median topological accurac y. Since this lack of full resolution among BI trees caused such minor difference in median topological accuracy, only topological accuracy is presented here to ease comparison among ana lyses. Bootstrap and posterior probability clade support values are ov erly liberal for neutral evolution treatments, and often overly conservative for selection treatments, with neither type of clade support value being a close approximation of clade accur acy. Clade support values greater than the 50% threshold, which is the minimum value of relevance for consensus support, are conservative estimates of accuracy for selection treatments (filled lines, Fig 3.11). The exception to this trend is that the 90% B I posterior support threshold produced liberal estimates of accuracy. For neutral treatments, clade support values greater than 50% are consistently liberal estimates of accuracy (unfilled lines, Fig 3.11). For either analysis type, ML versus BI (solid ver sus dashed lines), or the model of evolution used, empirical versus Po isson (blue versus orange), no consistent trends were observed. Treatments otherwise identical except for the starting ancestor produced very similar results, so such treatment pairs are pooled here. Unlike the analyses of asexual reproduction treatments, which resulted in both 121 true and false clades across most support thresholds, sexual reproduction treatments resulted in near - perfect sets of bootstrap and posterior sample topologies (i. e., exceedingly low frequencies of false clades), so the treatments sh own here with asexual reproduction . Figure 3.11. Relationship between clade accuracy as the percent of correct clades for values of bootstrap support for ML analyses and posterior prob ability support for BI analyses of asexual reproduction treatments poo led across naïve and pre evolved ancestor conditions. Results include comparisons to bootstrap support for ML analyses (solid lines) or posterior probability support for BI analyses (das hed), using the empirical model of evolution (blue) or Poisson model ( orange), and under selection (filled lines) or neutral evolution (unfilled lines). The grey line is the one - to - one accuracy - to - support relationship; values above are conservative as bein g an underestimation of accuracy and values below are liberal as an ov erestimation of accuracy. Note only support greater than 10% was assessed due to the large number of very low - supported clades. The observed substitution rate (i.e., the empirical equiva lent to branch length) per internal and external evolutionary lineage was less than the neutral theory expectation for all treatments. The expected substitution rate for neutral alleles (Kimura, 1983) is the product of the mutation rate and the number of generations per branch, so a lineage (i.e., evolutionary history branch) of 5,000 generations with a mutation rate of 5 * 10 - 6 should have 0.025 substitutions per site ( Figure 3.12 , solid line ) . Observed inte rnal branch substitution rates ( Figure 3.12, dashed lines) were found to be significantly lower than this expectation for seven 122 treatments ( t 1022 > 2, p < 0.01), with the exception being the treatment of neutral evolution with asexual reproduction starting with the naïve ancestor ( t 1022 = 1.784, p = 0.0747). Observed external branch substitution rates were also found to be highly significantly lower than the neutral theory expectation for seven treatments ( t 1023 > 3, p < 0.001), and although the exception n eutral evolution with asexual reproduction starting with the pre - evolved ancestor was also significantly different, it was not as highly significant ( t 1023 = 2.210, p = 0.0273). Note that the observed substitution rates shown in Figure 3.12 are median valu es for consistency with the inferred branch lengths, while all statistical tests performed are for means. Figure 3.12. Empirical, theoretical, and inferr ed median internal (yellow) and external (orange) branch lengths per treatment. The empirically obser ved rates of substitutions per site (dashed lines) and the expectation under neutral theory (solid line) are included for comparison to the branch lengths inferred for the single best tree resulting from each analysis and model of evolution. Analyses inclu de NJ (square), ML with Poisson model (triangle pointing up), ML with empirical model (triangle pointing down), BI with Poisson model (plus), and BI with e mpirical model (cross); open symbols for individual replicates and closed for median across replicate s. 123 Only for selection treatments was the observed substitution rate, on average, higher for internal branches than external branches (Fig. 3.12). For neutr al treatments, there was no significant difference in mean substitution rate between internal and ext ernal branches for three treatments ( t 2045 < 1.5, p > 0.1), with the exception being the treatment with sexual reproduction and naïve ancestor, which had s ignificantly higher rates for external branches than internal branches ( t 2045 = 2.7, p < 0.01). For e ach selection treatment, the mean substitution rate for internal branches was higher than for internal branches ( t 2045 > 6, p < 0.001). Ancestor genotype a nd mode of reproduction did affect the observed substitution rate for either internal or external bra nches for neutral evolution treatments, however for selection treatments, ancestor genotype and, less often, mode of reproduction affected observation subs titution rates for both internal and external branches (Fig. 3.12). Among neutral treatments, there w as no significant difference in mean substitution rate for internal branches ( F 3, 4088 = 0.36, p = 0.785) and external branches ( F 3, 4092 = 1.45, p = 0.226). Selection treatments demonstrated a highly significant difference in mean substitution rate for inte rnal branches ( F 3, 4088 = 409.6, p < 0.001) and external branches ( F 3, 4092 = 406.1, p < 0.001). Post - hoc analyses for internal branches demonstrated a highl y significant difference ( t 4088 > 8, adjusted p < 0.001) between all pairwise treatments except for t he comparison of asexual and sexual reproduction treatments with the naïve ancestor ( t 4088 = 1.397, adjusted p = 0.975). Similarly, post - hoc analyses for e xternal branches also showed a highly significant difference ( t 4092 > 7, adjusted p < 0.001) between all pairwise treatments except for the comparison of asexual and sexual reproduction treatments with the naïve ancestor ( t 4092 = 1.952, adjusted p = 0.306) . For each treatment and phylogenetic analysis, external branches were inferred to be longer than int ernal branches (Fig. 3.12). And for most treatments and analyses, as demonstrated by the observed median substitution rates (i.e., the empirical equivalent to branch length) for both internal and external branches being in between these sets of inferences, internal branches were inferred as shorter and external branches as longer than 124 what occurred during lineage evolution. Among neutral treatments, sexual r eproduction produced slightly greater inferred branch lengths across analyses, while the ancestor gen otype did not have an effect. NJ tended to infer shorter branches than did ML and BI, and this was especially the case for internal branches. Selection tre atments starting with the naïve ancestor produced more accurate branch length inferences, and this wa s especially true for asexual reproduction. For this treatment pair, BI outperformed NJ and ML with respect to external branches, while ML slightly outperf ormed NJ and BI for internal branches. For selection treatments starting with the pre - evolved ancesto r, the branch length inference accuracy of NJ was greater than BI and ML. This treatment regime with asexual reproduction was the only treatment in which M L and BI markedly differed, with BI even inferring shorter branches than NJ; although, several ML rep licates using the Poisson model inferred similarly short internal and external branches as the median BI and NJ inference, and a few BI replicates using ei ther model inferred similarly long external branches as median ML values. ML consistently inferred mo re accurate lengths than did BI for the sexual reproduction treatment, while each inferred either internal or external branch lengths better than the other for the asexual treatment. There was much greater variation among inferred branch lengths for select ion treatments, with at least a few treatments per analysis and model of evolution for ML or BI being very different from the others. The model of evolutio n made a difference in branch length inference most prominently among ML analyses compared to among B I analyses, which inferred very similar lengths using either model (Fig. 3.12). For neutral treatments, analyses using the Poisson model inferred greater m was especially tr ue for ML analyses of treatments starting with the naïve ancestor. This trend was not consistent among selection treatments, as only selection with sexual reproduction starting from the naïve ancestor produced this trend for both ML and BI, and ML alone di d so for the otherwise equivalent treatment under asexual reproduction. For all others, the empirical model inferred greater branch lengths. Since both ana lyses consistently inferred 125 shorter internal and longer external branch lengths than that the observe d substitution rates, model use impacted branch length accuracy differently for external and internal branches. For example, with neutral treatments the Po isson model was relatively more accurate at inferring internal branch lengths and less so for externa l branches compared to the empirical model. Under neutral evolution every locus in the taxon - character datasets used for phylogenetic inference was parsimo ny informative, and under selection most loci were parsimony informative, but with fewer especially a mong the descendants from the pre - evolved ancestor. There was no variation among neutral treatments for either the number of variable or informative sites, with all replicates having the entire genomic sequence exhibiting variation and each character being informative (Fig. 3.13). One - way ANOVA analyses among selection treatments demonstrated a highly significant difference between the mean number of variabl e sites per replicate ( F 3, 36 = 16.52, p < 0.01) as well as the number of informative sites ( F 3, 36 = 2 8.71, p < 0.01). Post - hoc analyses demonstrated a highly significant difference between four of the six pairwise treatments for both the number of variable sites ( t 36 > 5, adjusted p < 0.01) and informative sites ( t 36 > 7, adjusted p < 0.01). For both meas ures, the exceptions were in the comparison of treatments otherwise identical except for having asexual or sexual recombination ( t 36 < 1, adjusted p = 1). Thus, selection treatments starting with the pre - evolved ancestor had significantly fewer, on average 21, variable and significantly fewer, on average 40, parsimony informative sites than those starting with the naïve ancestor, and recombination did not ca use a difference. 126 Figure 3.13. Number of variable sites (blue pentagons) and parsimony informative sites (purple stars) for all experimental treatment taxon - character datasets, with open symbols for individual replicates and closed for the median across replicates. Discussion The experimental evolution conditions were successfully configured so that dig ital evolution closely approximated the biological reality of molecular evolutionary dynamics. Establishing the congruity in molecular dynamics allows a st ronger evaluation via analogy of whether natural selection aids phylogenetic inferences. Indeed, sele ction , at least directional and stabilizing selection, does increase topological accuracy and may increase branch length accuracy. Sexual reproduction, at least as implemented in this system, aided phylogenetic inference even more so than natural selection . And deviations from the model of molecular evolution used in the phylogenetic analysis made minor impact. Finally, neutral evolutionary dynamics did not entirely proceed as expected in neutral evolution treatments, and branch 127 length inference was especia lly problematic and clade support measures especially overestimated accuracy in these treatments. Distribution of mutational effects The distributions of m utational effects for Avidians evaluated after adaptation to their selective environment were similar to those observed for biological organisms (Fig. 3.01). All potential genotypes accessible by a single mutational change were evaluated for five genotypes and classified as lethal, large (i.e., > 1%) or small effect deleterious or beneficial, and neutral (i.e., relative fitness effect of 0.999 through 1.001) according to the nearly neutral framework (Ohta and Kimura, 1971) for the maximum population size of 1,000 organisms . As with biological organisms (Eyre - Walker and Keightley, 2007) , most potential mutations for these genotypes are neutral with respect to fitness, a large collection are deleterious, a reduced proportion are lethal, and a small proportion are benefici al. The pre - evolved ancestor evaluated in its ancestral environment, having been evolved for millions of generations in that selective environment, mos t closely fit this expected profile. The descendant of either the naïve or pre - evolved ancestor at the ti p of the 55,000 - generation evolutionary history also fit this profile quite well. This is evidence that these Avidians adapted to their selective envir onment in contrast to the ancestors at the root of the tree, which did not have a history of adaptive evo lution in the root environment. Compared to genotypes evaluated after many generations of evolution in their selective environment, the potential singl e - step mutations available to ancestor organisms evaluated at the root of the tree were proportionally mo re neutral (78% versus 48%), more small effect deleterious (10% versus 3%), and less non - lethal large effect deleterious (4% versus 37%, Fig. 3.01). As Avidian adaptation proceeds, greater numbers of loci are necessary for the completion of environmentally rewarded tasks (i.e., fitness - increasing phenotypes); further, some loci evolve to be pleiotropic in that they are necessary for the completion of mul tiple tasks. A mutation that confers a loss of a task within these selective environments yields a merit (and approximate fitness) loss of 9.1% per task, and a mutation that confers the gain of a task yields a 10% increase in merit (and approximately so fo r fitness). 128 The pools of potential deleterious and beneficial mutations observed here are in accord with these values or multiple thereof, with the latter being evidence that numerous loci are pleiotropic . The very large proportion of slightly beneficial a lbeit neutral mutations for the pre - evolved ancestor in the root environment were likely due to slight of fspring cost optimization when not requiring the maintenance of previously phenotype - conferring loci in the ancestral environment. Characterization of observed substitutions Compared to the neutral treatments, the decreased number of fixed alleles (Fig. 3. 02) and reduced branch lengths (Fig. 3.12) for the selection treatments agrees with the population genetics expectation and other data shown here. As e xpected, and shown by the distribution of mutation effects (Fig. 3.01), many mutations were deleterious a nd therefore eliminated by selection, causing a net loss of fixations compared to neutrality. The statistically significant difference in the total num ber of fixations between selection treatments with the naïve versus pre - evolved ancestor (Fig 3.02) would be expected due to greater mutations being deleterious for the latter; that is, stabilizing selection was more strongly occurring for experiments star ting with the pre - evolved ancestor. This was not shown in Figure 3.01, where each tip - environment descend ant had approximately 55% deleterious or lethal mutations, although the statistically significant greater proportion of invariant loci for the pre - evol ved treatments (Fig. 3.12) suggest that a more thorough examination of mutational effects might demonstra te this as not uniformly the case among extant Avidians across tip environments. The relative proportions of substitutions by fitness type for selection treatments broadly agrees with the population genetics expectation and prior research using Avida (Fig. 3.03). Since recombination decreases linkage disequilibrium, it is also expected to magnify the influence of rather than that of the broader genetic backgr ound to which it is linked (Felsenstein, 1974) . This effectively makes selection more efficient by increasing the fixation of beneficial alleles and decreasing that for deleterious alleles. And neutral alleles have a decreased chance of fix ation 129 due to being associated with beneficial alleles. The relative proportions shown in Figure 3.03 do exhibit fewer deleterious, neutral, and neutral - deleterious substitutio ns fixed in sexual treatments. While these treatments show fewer strictly benefic ial substitutions, for each pair of otherwise identical treatments, this proportion is more than compensated by an even greater increase in beneficial - deleterious alleles. Th ere is a remarkable prevalence of individual substitutions being so prominently i nfluenced by a diversity of evolutionary forces (i.e., positive selection, genetic drift, and/or negative selection), as suggested by fitness type proportions (Fig. 3.03). Thi s finding is likely evidence of the pervasiveness of epistasis within evolved Avi dian genomes (Lenski et al., 1999b; Strelioff et al., 2010; Valverde et al., 2012) . For example, Covert et al. (2012) has demonstrated that sign - epistasis, that is, the beneficial - deleterious substitutions under the form ulation here (Table 3.02), is common in Avida and may greatly contribute to adaptation. The prevalence of sign - epistasis has been predicted for biological systems too (Kvitek and Sherlock, 2011) , how ever it is incredibly difficult to collect precise fitness data for individual alleles over multiple generations of population frequency change. It should be noted that the ca lculated prevalence of substitutions with mixed fitness effects throughout their frequency trajectory is likely an undercount since relative genomic fitness was evaluated only every 100 generations . C alculating fitness effects more often (e.g., every gener ation) likely would have shown that substitutions defined as purely beneficial, n eutral, or deleterious substitutions as sampled every 100 generations were influenced by additional evolutionary forces between these point estimates. The nearly neutral thres hold used here is conservative in that the effective population size is not as gr eat as the maximum population size. Yet it is unlikely conservative enough to account for the large proportions of substitutions that cross selection boundaries throughout the ir frequency trajectory, i.e., those that are not strictly neutral, beneficial, o r deleterious. For example, increasing the neutral threshold from 0.001 to 0.005 for these analyses would be purposefully liberal in overestimating neutrality in a population of 1,000 individuals, because 130 0.005 is the neutrality threshold expectation for a n effective population size of 500 organisms. Using a 0.005 threshold still results in very large proportions of substitutions such as beneficial - deleterious, with between 45 67% per treatment (not shown) compared to the 74 89% shown in Figure 3.03. With t his liberal threshold, the proportion of strictly neutral substitutions in selection treatments is still quite low, with between 2.8 6.2% per treatment, although much higher t han the 0.02 0.09% shown in Figure 3.03. A few of the results in neutral evolutio n treatments exhibited deviations from neutral theory expectations and have an as - yet unidentified cause. The number of substitutions for entirely neutral alleles in the neutr al evolution treatments was significantly less than expected (Fig. 3.02). The neu tral theory expectation is that the mutation rate, here 5 * 10 - 6 , should be equal to the substitution rate, which was found to be approximately 4.94 * 10 - 6 pooled across treat ments and branches. And a similarly reduced rate of evolution was found with resp ect to substitutions per site per lineage (i.e., branch length) across internal branches and, separately, external branches, with medians and the neutral expectation shown in Figure 3.12. One suggestion is that since allele frequencies were sampled every 1 00 generations, these and gone uncounted if more than one substitution occurred a t the same locus within those 100 generations. This is impossible, however, since only 34 substitutions or 0.002% across all 80 neutral treatment evolutionary histories were identified as quick sweeps, which is over 700 times less frequent than would need to occur, let alone the improbability of two sweeping at the same site within 100 generations. Another deviation from neutral theory involves the time to fixation of neutral alleles, which in a haploid population should be 2N e generations (Kimura and Ohta , 1969) . Across neutral treatments, the average time to fixation was approximate ly 2,230 generations, which was significantly different from the 2,000 - generation expectation. This value would yield an effective population size of 1,115, which would be imp ossible with the maximum population size of 1,000 experimentally limited here. Co nservatively, with an estimated effective 131 population size of the maximum population size, fixation time was observed to take 230 generations longer than expected. In fact, it took even longer than this since segregation below 5% and greater than 95% in the population was not tracked. While allele frequencies were sampled at population timepoints every 100 generations, this should systemically undercount both their entrance and fixation into the population, and therefore not affect the observed average fixat ion times of neutral alleles. Note that although there was a significant difference between neutral evolution treatments in time to fixation, the relatively miniscule effect s ize of 9 generations and lack of a clear pattern with respect to the effect of re productive mode or starting genotype suggests that this difference bears little importance compared to the overall difference from the neutral theory expectation. The fixation trends exhibited by the neutral and selection treatments agree with the populati on genetics expectations. For selection treatments, each lineage or branch of the evolutionary history experienced a new selective environment. The population tended to experi ence a burst of adaptive evolution, with many beneficial and beneficial - deleterio us substitutions being fixed after a brief lag from its introduction to the environment (Fig. 3.05). For the remaining approximately 75% of the branch, the population underwen t further adaptive evolution although with no trend in when alleles fixed, and wi th little difference in the relative proportion of substitutions by fitness type. Beneficial substitutions fix rapidly and appear to sweep to fixation neutral, deleterious, an d quick sweep alleles, which all peak in fixing within their first hundred genera tions in the population (Fig. 3.06). And the frequency trajectory of beneficial alleles within a population was highly dependent on their relative fitness (Fig. 3.09). For com parison, evolution in neutral treatments seemed uniform and stochastic in that th ere was no variation in which generation along a lineage a fixation occurred, treatment conditions did not alter their time to fixation, and their frequency trajectory in the population appeared to be textbook (Hartl and Clark, 2007) . Curiously, the time to fixati on of beneficial - deleterious alleles in selection treatments seems distinct from alleles of other fitness types within those same treatments (Fig. 3.06). While these alleles d id fix after fewer generations 132 than for alleles in the neutral evolution treatmen ts, their fixation dynamics in aggregate appear to be much closer to neutral in effect size rather than beneficial or deleterious. The fixation trends exhibited by the sexual and asexual treatments also agree with the population genetics expectation. In co mparison to asexual treatments, there is two - to three - fold fewer quick sweep alleles fixed in the sexual treatments (Fig. 3.03). Recombination caused a much lower increased r ate of fixation when a population was first introduced to a new selective environ ment (Fig. 3.05), although this burst involved a relatively greater proportion of beneficial alleles compared to other fitness types (Fig. 3.06). There was a much greater dist inction between substitutions fixed quickly versus slowly upon their introduction to the population, with beneficial and hitchhiking alleles fixing rapidly and quickly decaying in their time to fixation. There was also a more distinct proportion of benefic ial - deleterious alleles with respect to their time to fixation, and they took lon ger to fix on average than their asexual treatment counterparts. The starkest difference between sexual and asexual treatments was in the number of alleles that hitchhiked to fixation among linkage groups, with many fewer alleles fixing every hundred gener ations under sexual reproduction (Fig. 3.07). Clonal interference in asexual population and the Hill - Robertson effect in sexual populations are nicely illustrated in these ev olving Avidian populations. Figure 3.09a illustrates the dynamic under adaptive a sexual reproduction in which new advantageous alleles can only become substitutions once a prior linkage group has fixed (e.g., at 1,100 and 2,700 generations) unless they ari se on a background that is already fixing in the population (e.g., 3,000, 3,200, 4,400 generations, etc.). This is clonal interference (Gerrish and Lenski, 1998) and while we cannot see the frequency trajectories of competitors that had relatively high fitness but went extinct because only alleles that fixed were tracked, the decrease in relative fi tness that made the fixing allele deleterious or neutral from generations 1,900 through 2,300 was likely caused by one or more relatively higher fit competitors. And finally, Figure 3.09b illustrates the Hill - Robertson effect (Hill and Robertson, 1966) in which ad vantageous alleles can be combined onto the same linkage block and increase in frequency together. For example, the very highly 133 advantageous new mutant at generation 1,200 qui ckly rose in frequency and fixed in the population. At generation 1,500, when the allele fixed, at least four other alleles that later fixed in the population were segregating; their maintenance in the population required recombination onto the genetic bac kground that included the highly beneficial allele. In contrast, under clonal int erference in an asexual population, at the time of fixation, only at most one other allele destined for fixation would be segregating in the population (e.g., at 2,700 generat ions, Fig. 3.09a) . Phylogenetic accuracy All phylogenetic analysis methods for all treatments with sexual reproduction produced the correct 1,024 - taxa topology, and when reduced under asexual reproduction, selection entirely clear why sexual reproduction im proved phylogenetic inference compared to asexual reproduction irrespective of natural selection or other deviations to the model of evolution. The few trends that varied due to sexual versus asexual reproduction al one are that fewer fixations per 100 gene rations occurred with sexual reproduction (Fig. 3.07), which was at least in part driven by fewer fixations occurring simultaneously (e.g., Fig. 3.08 and Fig. 3.09), even though there tended to be no difference in t he number of total substitutions (Fig. 3. 02) or the per lineage substitution rate (Fig. 3.12) for otherwise identical treatment conditions. Together these trends suggest that phylogenetic inference is improved by substitutions occurring gradually throughou t lineage evolution, instead of in bursts (e.g., Fig. 3.08). Note that this is a different phenomenon than a molecular clock (Kimura, 1968) , because under neutral evolution an asexual population would still have a molecular clock; this is more akin to how loud versus quiet the clock ticks (i.e., fewer substitut ions fixing simultaneously) rather than t he clock ticking regularly versus irregularly or ticking faster versus slower. This hypothesis could help support why natural selection aided phylogenetic inference under asexual reproduction (Fig. 3.12), because se lection itself causes fewer fixations per 100 generations (Fig. 3.07). 134 While topological accuracy was very high, the population processes and genomic evolution processes occurring in this digital evolution system likely contributed to the inaccuracy obser ved. The lowest performing analyses had a bout 2% of clades incorrectly inferred, which is about 20 clades; surprisingly, this relatively poor performance was the neutral evolution treatments without sex (Fig. 3.10). Since coalescent and other population ge netic processes are commonly evaluated wi th respect to simulations of neutral evolution, this result demonstrates that such modeling is missing important population processes that decrease phylogenetic algorithm performance. And this modeling is additional ly failing to account for other processes that may promote accuracy, such as selection as evidenced here, which had an order of magnitude less inaccuracy, about 0.2% (Fig. 3.10). One explanation for selection, particularly stabilizing selection, improving topological accuracy over neutral evoluti on may be the maintenance of synapomorphic loci important in contributing to rewarded tasks (i.e., phenotypes). As shown in Figure s 3.04 and 3.13, these treatments had greater frequencies of invariant sites and site s that otherwise experienced reduced rate s of change. Branch length inferences were highly inaccurate across most treatments, and there was no relationship between topological accuracy and branch length accuracy for a given treatment. For example, the med ian external branch for the selection tre atment with sexual reproduction and the pre - adapted ancestor had an inferred branch length about 40% longer (for analyses other than NJ) than the median observed substitution rate for those lineages (Fig. 3.12); and this was despite clade accuracy being pe rfect for this treatment (Fig. 3.10). In contrast, the median external branch for the selection treatment with asexual reproduction and the naïve ancestor had an inferred branch length of only about 5% longer than t he median observed substitution rate (Fig . 3.12); and clade accuracy was about 99.8% or around two incorrect clades (Fig. 3.10). Inferred median external branch lengths for neutral treatments were about 40% longer than the observed substitution rates (Fig. 3.12), and although this was consistent across treatments differing by mode of reproduction or ancestor genotype, topological accuracy varied between 98% and 100% (Fig. 3.10). 135 Specific explanations to explain the inaccuracy of inferred branch lengths are lacking. With random genotype population sampling for the taxa, the overestimation of external branches agrees with such random sampling of segregating variation as being interpreted as fixed differences. However, this should not have been the case for in ternal branches, which were generally und erestimated. This underestimation may be attributable to multiple substitutions occurring at the same loci and the models not sufficiently accounting for this (Sullivan and Joyce, 2005) . However, the treatments that produced overestimated internal branches, those starting from the pre - evolved ancestor under selection, were the treatments that had many more invaria nt sites, as shown in Figure s 3.04d and 3.13. While a few sites especially near the beginning of the genome had a much greater turnover rate, there did not seem to be an overall tendency to have a greater number of sites with higher rates of change than on average for these treatments with the naïve ancestor (Fig. 3.04). NJ produced similar internal branch lengths for this pair of treatments, selection with the pre - evolved ancestor, although it showed lower and more accurate branch lengths under asexual rep roduction (Fig. 3.12). For this pair of treatments, ML and BI exhibited a great deal of variation in inferred branch lengths across replicates, and for the treatment under asexual evolution, this variation was subst antial enough that median values for BI w ere about 22% lower than the observed substitution rate while ML produced values about 32% greater (Fig. 3.12). Looking across experimental conditions, it is unclear which of these analyses has greater resiliency in inferring branch lengths under the compl ex biologically realistic phenomena explored here, although all are affected. The inaccuracy of branch length inference was especially surprising for the neutral evolution treatments. With a lack of natural selectio n occurring, the molecular evolutionary d ynamics should be closer to the models of evolution upon which the phylogenetic analyses are based, and therefore lead to greater accuracy, but this was not the case. Especially considering other found deviations fr om neutral theory expectations, such as s ubstitution rate and time to 136 fixation, branch length inaccuracy adds to the evidence that complex population processes in an evolving system are not being captured by models of evolution used in phylogenetics. This work has demonstrated the possibility of evaluating phylogenetic analyses using highly customized fixed - rate models of evolution constructed from close observations of population genetic history across hundreds of thousands of generations. The model used for each experiment was parameterized usi ng the recorded substitutions from across the ten replicates for that experiment, so substitution rates and state frequencies were highly accurate; much more accurate than biological amino acid model counterparts th at estimate values using ancestor reconst ruction for molecular datasets ( e.g., Dayhoff et al. 1978; Jones et al. 1992; Whelan and Goldman 2001; Le and Gascuel 200 8) due to the infeasibility of collectin g observed substitution data in biological populations. And the empirical models were as precise as possible given the general time - reversibility assumptions necessary for use with the phylogenetic programs, despite the character polarity for each recorded substitution being known. However, at least for the phylogenetic difficulty evaluated in these treatments, the empirical model of evolution did not consistently yield more accurate results for inferred branch lengt hs (Fig. 3.12), topological accuracy (Fig . 3.10), or clade support (Fig. 3.11). When model choice made a difference, often the Poisson model outperformed the empirical model, as is especially evident for the internal branch length inferences of the neutral asexual reproduction treatment with the naïve ancestor (Fig. 3.12). It is unclear why the highly accurate empirical models of evolution performed as poorly, if not worse, than the basic Poisson model. Differences in model performance are not attributable to overfitting, as both types of models have fixed rates and are therefore equally (Box, 1976; Sullivan and Joyce, 2005) , it is unclear why the b asic Poisson model functioned relatively well, or at least not substantially worse, than the empirical models. While the Poisson model is accurate with respect to the Avidian mutational model, it would not appear to at all characterize the substitution pat terns occurring across experimental treat ments (e.g., Table 3.03). Model choice 137 may especially have significant consequences over greater extents of evolutionary time, for example in the correction of multiple substitutions per site, and long branches shou ld be affected more so than short branche s (Felsenstein, 1978; Sullivan and Swofford, 2001) . It seems plausible that the symmetrical, equidistant, and short branched topology evolved here did not offer difficulty that models of evolution may improve. Yet the inability for either mode l to closely resolve both branch lengths and topological accuracy for any taxon - character dataset should be highly concerning to systematists, because modern phylogenetics methods rely on the joint optimization of both aspects of a phylogeny. Following the analysis presented in Chapter 2 and as f irst conducted by Hillis and Bull (1993) for their biological experimental phylogenetics research, the relationship between support values and clade accuracy is shown in Figure 3.11. The treatments analyzed in Chapter 2 produced conservati ve estimates of clade accuracy for support values greater than 30%, except for a single BI support value which was slightly conservative. The selection treatments analyzed here reproduced this result for only suppor t values greater than 50%, and the neutra l treatments were the opposite as in being liberal estimates for all such support values. This is a concerning result, as it indicates that the support - accuracy conclusions first shown by Hillis and Bull (1993) and still relied on by resea rchers (e.g., Sleator 2011) may not hold. The distinction between selection and neutral treatments may indicate that selection cannot just improve accuracy but also improve our estimate of it, but more research is warranted u sing other experimental designs and in ot her systems. An alternative hypothesis is that this result is not especially related to the experimental treatment conditions that produced the molecular data analyzed, but rather a factor of the absolute number of true clades insufficiently supported. For example, in the analyses shown in Chapter 2, and also by Hillis and Bull (1993) , the maximum number of incorrect clades observed for any replicate was six, which is also the maximum possible for th ese eight - ingroup taxon evolutionary hist ories, while for these histories the maximum observed was 31 and maximum possible is 1022. Perhaps these larger trees, which are not exceptionally large compared to contemporary work (Li et al., 2015) , have a greater rate 138 of false positive clade identification owing to their size. The implication of this would be that the greater number of alternative topologies necessitates an even greater bootstrap or posterior support samplin g to alleviate the increased random error . Conclusions The work presented here demonstrates the potential of using digital evolution to conduct experimental phylogenetics. I have systematically evaluated a few of the complex processes that may affect phylo genetic accuracy. I did so using a much l arger experimental design than any experimental phylogenetics research to date, and a much more biologically realistic experimental design than in Chapter 2, with the goal of demonstrating the biological realism, ut ility, generality, and overall potential of digital experimental phylogenetics. Compared to evaluating phylogenetic methods using simulations alone, this work has produced molecular evolutionary dynamics that closely resemble those observed in biological s ystems, and these processes led to curious phylogenetic inference results. The work presented here shows that complex population processes occur even in a digi tal system that is experiencing entirely neutral evolution, leading to differences from neutral t heory expectations. When organisms reproduce sexually with recombination, as implemented in Avida, their evolutionary histories can be perfectly inferred, and under asexual reproduction natural selection restores some of the reduced accuracy, suggesting th at both recombination and selection may aid phylogenetics. Digital evolution allows the construction of precise models of evolution, but phylogenetic methods w ere not improved with their use, producing similarly inaccurate branch lengths and clade relation ships under very different models. Clade accuracy is not predictive of branch length accuracy for these evolutionary trees, and the clade support metrics of bo otstrap support and posterior clade support are not close estimates of clade accuracy. Analyses of the substitutions that occurred have begun to show the biological realism - recognized effects of altering the distribution of mutational effects for adapted genotypes, reducing the total num ber of 139 substitutions that occur due to negative and stabilizing selection, and increasing the number of substitutions that co - occur in selective sweeps were ea ch observed. Similarly, the combined effects of recombination and selection occurring together we re observed, such as the fixation of greater numbers of beneficial alleles, greater rates of fixation in novel adaptive environments, and the combining of adva ntageous alleles onto the same linkage block. And yet, differences from neutral theory prediction s indicate that the population genetic and genome evolution processes occurring in an evolving system are complex. For example, with neutral evolution treatmen ts the number of substitutions and time to fixation were significantly different from expectation s, and with selection treatments most substitutions exhibited sign - epistasis. Especially since the phylogenetically worst - performing treatments were those cond ucted under neutral evolution, these results suggest that such complex dynamics may not yet be ad equately investigated using simulation approaches alone. Surprisingly, and at least under the conditions evaluated here, selection and especially recombinatio n restore d reduced phylogenetic accuracy under asexual reproduction and neutral evolution. These results suggest an intriguing and novel mechanism by which phylogenetic patterns may be better inferred due to substitutions occurring throughout lineage evolu tion instead of in simultaneous bursts. This would be a similar dynamic to a molecular clock, but due to natural selection and recombination, and would seem to be relatively more beneficial to phylogenetic inference since neutral evolution treatments showe d reduced accuracy. These results point to a novel benefit that recombination and natural selecti on could provide systematists, albeit these processes in even more complex contexts (i.e., even closer to biological reality) may have more costs than benefit ( e.g., Lanier and Knowles 2012; Adams et al. 2018 ; Pang 2020) . For example, recombination here could not disrupt positional homology since geno mes had fixed length, so alignment was not required. Still, it would be nice to is perceived to be (Martin et al., 2011) . 140 The unexpected phylogenetic benefits of recombination and selection obs erved here should be approached with a nod towards the generality of the experiment and system. A n important caveat to experimental phylogenetics research is that an experiment shows exactly what happened under one set of conditions, but not every set of c onditions. And with digital le systems, including in currently implements two very different approac hes to recombination, and it would be helpful to add a third setting to better approximate recomb ination rate as interpreted by biologists. Currently, one approach is to configure the genetical system for the number of equally sized modules, with each modu le having independent assortment from one another; this was employed in Chapter 2 with the number of modules equal to the genome length so that all loci exhibited independence. A second current approach is to have a single module but with two recombination breakpoints chosen randomly in the genome; this was the approach for the experiments in this cha pter. A not - yet - implemented third approach would be to alter the number of randomly chosen paired breakpoints: zero (no recombination), one (as used here), an d so on up to the size of the genome less one which would entail breakpoints between every locus and be equivalent to independent assortment among all loci. The benefit to this approach would be the ability to increase the recombination rate as biologists interpret it to have better control over how the evolving population approaches linkage equilibri um. For example, humans have much larger linkage blocks than do species like maize or Arabidopsis thaliana (Rafalski and Morgante, 2004; Wall and Pritchard, 2003) , and with this new setting experimenters could better match such differences in molecular evolut ion otherwise produced by differences in life history or evolutionary history. The influence of s election on phylogenetic accuracy might also be system dependent. In Avida, selection greatly improved phylogenetic accuracy under asexual reproduction conditi ons. However, biological e xperimental phylogenetics research tends to show that selection inhibit s phylogenetic inference by produc ing homoplasy (Bull et al., 1997; Cunningham et al., 1997; 141 Fares et al., 1998) , although see Leitner et al. (1996) . The accuracy of inferences of d igital evolut ion histories have been shown to be aided by selection via the production of more synapomorphies (Hagstrom et al., 2004; Hang et al., 2007, 2003) . Unfortunately, the experiments presented in this chapter did not offe r an ideal test of this, since these selective environments were explicitly designed to promote d iversifying selection throughout the entire evolutionary history, so by design there should be greater synapomorphies. A hypothesis that remains to be tested i s that the diversity in the potential genotype - to - phenotype map for digital organisms such as Avi dians may make them especially unlikely to produce homoplasious evolution. For example, there are relatively few constraints to the genetic programming such th at there are innumerable genotypes that code for the same phenotype. However, the impression that prompts this hypothesis may be biased since digital evolution experiments tend lly unburdened by genotype - phenotype limitations due to historical contingency or other constrain ts. Further and biological systems is needed. Digital evolution has great utility in allowing the tracking of data that would be impossible or incredibly burdens ome to collect for biological systems. Digital evolution allowed the construction of models of evolution tailored to the substitution rates and state frequenci es of the evolving populations under study with much greater precision than has been accomplished for biological systems. The sensitivity of model - based approaches was evaluated in this study with great accuracy while still maintaining the increased biolog ical realism of experimental phylogenetics approaches. Although the model of evolution made littl e difference in the work presented, perhaps this would not be so under conditions dissimilar from the fully symmetrical and ultrametric evolutionary histories evolved here. Future work to this end could explore how model choice becomes increasingly importa nt under varying adaptive evolution dynamics in addition to differently long branches, asymmetrical topologies, and other evolutionary divergence conditions. T he construction of empirical models of evolution has 142 fulfilled the expectation of Hagstrom et al. (2004) who remarked that it near future ian systems. Although, the work presented here has not attempted to determine a single model wide ly applicable to differently parameterized Avida experiments. This may be a fruitful research direction that, as predicted by Hagstrom et al. (2004) , would further energize research The phylogenetic accuracy trends presented by these results are curious and their implications should be further investigated. Although clade accur acy was high across all treatments, it was comparatively low for the neutral evolution treatments . The treatments conducted under neutral evolution with asexual reproduction and inferred using the empirical model of evolution were predicted to have the gre atest clade and branch length accuracy, since the evolutionary dynamics occurring should have bee n closest to the model of evolution and other assumptions used in the analysis. Yet these were among the lowest performing. Phylogenetic methods such as maximu m likelihood and Bayesian inference iteratively optimize tree topology and branch lengths concurr ently to produce the overall best tree. However, there was little relationship between clade accuracy and branch length accuracy in these analyses, perhaps sug gesting room for improvement in the model of evolution or its use. And yet the use of very differ ent models of evolution made little difference. Finally, the common metrics of clade support, bootstrap support values under maximum likelihood and posterior c lade support probability under Bayesian inference, were not close estimates of clade accuracy, an d their being under - or overestimations depended on whether natural selection was occurring. More research questions and hypotheses have been proposed than ans wered with these case - study experiments, and digital evolution will be a valuable tool among othe rs in helping phylogeneticists to address them. The T7 phage work of Hillis et al. (Hil lis et al., 1992) presented the experimental an important void in the science of phylogenetic void largely u nfilled, the work presented here aims to spur greater research by presenting the 143 potential contri greater biological realism than simulations and greater utility and generality than biological experimental systems. By navigating these inherent tradeoffs in conducting phylogenet ics decades ago. As we build better computational machinery and discover more about the underlying f acets of molecular evolution, digital evolution has the potential to become more and more powerful and therefore relevant for if reality is but a simulation (Bostrom, 2003) then what is digital evolution if not a proto - reality ? 144 CHAPTER 4 : Shifting Student Understanding of the Importance of Variation for Evolution in a Course Featuring Digital Evolution Introduction (Coyne, 2010) , (Chown, 2013) , and so simple that Huxley remarked that it was stupid of him to not have thought of it (Huxley, 1 887; Kalinowski et al., 2016) . Evolution requires three concepts variation, heredity, and differential reproduction (Godfrey - Smith, 2007) : Mutation, heritable from parent to offspring, is the ultimate source of genetic variation that may affect phenotyp ic variation and variation in reproduction. Natu ral selection is the resulting process if phenotypic variation non - randomly changes in a population due to its effect on reproducti ve ability, and genetic drift is the resulting process if variation randomly changes due to random differences in reproductio n (Gregory, 2009; Tibell and Harms, 2017) . Y et evolution remains elusively difficult to teach , ay, (Mayr, 1982) of the most counterintuitive ideas the human min (Evans, 2008) the tremendous effort to teach well (Dennet, 1995; Gregory, 2009) . Among the many difficulties in understanding evolution, a large set of them are ultimately related to those borne from misapplying the intuitive reasoning strategy of essentialism to understand the concepts of species, i nheritance, mechanisms of evolutionary change including natural selection, and the crux of the matter the importance of individual - level variation for evolution. Essentialist thinking applied to a species entails individuals who are united by something t hat makes them unique, and this essence preclu des evolutionary change. Taken in an absolute sense, essentialist thinking underlies creationist claims that a present - day species cannot be derived from any other species (Ev ans, 2008) . 145 Essentialist reasoning can be incorporated into a conception of evolutionary change as transformationalism, which is distinguished from the scientifically valid explanation referred to as variat ionalism (Mayr, 2001) . In transformationalism, evolution is defined as the change in the overall characteristics of one generatio n to the next. In variationalism, within - popul ation variation is of utmost importance , so evolution is commonly defined as the change in the frequency of alleles, i.e., the genetic constituents of variation, over time (Freeman and Herron, 2014) . Th is distinctio n of valuing individual - level differences in one generation and population - level differences between multiple generations in variationalism versus only valuing the latter in transformationalism may seem slight, but it is integral to understanding the mecha nisms of modern evolutionary biology, especially natural selection. For example, differences between individuals are consequential enough to cause evolution under variationalism via differential reproduction, b ut under transformationalism, any difference b etween individuals is immaterial to evolution because it would be impossible for individuals to differ enough to produce such an effect. I hypothesized that the direct exposure to and experimentation regarding variation and its importance using integrative thinking , statistical reasoning, and computer modeling via Avida - ED would produce transformational - to - variational shifts in student understanding . This hypothesis was tested within the whole - semester course co ntext of a novel undergraduate introductory bi ology course , Integrative Biology: From DNA to Populations . Herein, I discuss a few of the difficulties in understanding, and therefore teaching, evolution, focusing on essentialis m and its evolutionary formula tion as transformationalism as contrast ing the modern scientific formulation variationalism. I then describe the features of the Avida - ED digital evolution platform that allow experiential learning of variationalism and outline further motivation for the curricu lum development of the course . The course features addressed here , of which the digital evolution lab component is both key and one of many that address student understanding of variation, may collectively contribute to shifting student understandi ng of bi ological variation from transformationalism to variationalism. I describe the development , 146 use , and results of a modified transformationalism - variationalism assessment in a pre - /post - course design to measure this shift . And I conclude with suggesti ons rega rding the incorporation of instructional material that engage s students in developing a scientifically - modern understanding of variation and the evolutionary process. A few of the difficulties in understanding and teaching evolution Instruction in evolutio of knowledge, but rather fundamentally altering the way they see the world an ontological shift (Sinatra et al., 2008; Tibell and Harms, 2017) . Difficulty in understanding evolution and natural selection at least in part may rely on incorrect systems thinking, or misattribution at the appropriate level of organization (Chi, 2005) . Humans are accustomed to direct, causal processes that typically have a c entral controlling agent that interacts in a series of sequential steps from a beginning to endpoint. Conversely, emergent process - type phenomena like interac tions be (Cooper, 2017) . A failure to understand statistics and randomness can magnify the effects of misapplying systems thinking (Cooper, 2017; Fiedler et al., 2019; Speth et al., 2014; Tibell and Harms, 2017) how could one understand that patterns such as distributions of trai ts can r esult from emergent processes based, at least in part, on random interactions among lower - level units? Additionally, Tibell and Harms (2017) point out that applying systems thinking in understanding evolutio n across large temporal and spatial scales is especially difficult. A relatable inability in applying systems thinking is the vernacular of evolutionary biology, which includes many terms that have a different common folk application. The folk usa scientific usage applies to a population across generations of evolution (Coley and Ta nner, 2015; Shtulman, 2006) . Biologists may seem to apply agency or intentionality to natural natural selection is a direct causal process (Cooper, 2017; Gregory, 2009) . Among yet other 147 misconceptions (also called naïve intuitions) are ideas related to organisms evolving in response to a need or the use/disuse of body parts (Coley and Tanner, 2012) . These are just a few of the persistent folk ideas that differ from modern scientific thinki ng. Increased undergraduate biology education, at least as has been implemented and assessed, can sometimes lead to the suppression but not supplantation of intuitive means of understanding (Shtulman and Valcarcel, 20 12) , or even strengthen misconceptions based on misapplications of fundamental modes of understanding (Coley and Tanner, 2015) . So - called , , are integral to helping us understand the world, especially as we develop this understanding as children, allowing us to simplify, categorize, and otherwise red uce the amount of information we process to make sense of the complexity that we encounter (Heddy and Sinatra, 2013; Sinatra et al., 2008) . For example, knowing that an organism is a member of a certain species allows you to reliably predict (at least some of) its properties (Shtulman and Schulz, 2008) . However these mental shortcuts cannot b e ap plied in every application in which we might like to use them, and when we apply them incorrectly, they entail a serious cost as persistent and overarching misconceptions (Evans, 2008; Sinatra et al., 2008) . Evolution is perhaps a quintessential example of this, with an array of common misconceptions (Gregory, 2 009) stemming from multiple underlying cognitive construals (also called cognitive biases), including essentialism, teleology, intentionality, and anthropocentric thinking (Coley and Tanner, 2012; Evans, 2008; Sinatra et al., 2008) . Within the field of evolutionary psychology, cognitive construals have been thought of as complex evolutionary adaptations for dealing with complex envi ronments ( Geary, 2007) , and the stickiness or resistance in not may be an adaptation too (Sinatra et al., 2008) . Essentialism is th e ge neral idea that things can separately be placed into real and named categories (Gelman, 2003) that exhibit an underlying innateness (Knobe and Samuels, 2013) , essence (Evans, 2008) or hi dden causal power (Shtulman, 2006) , whether or not we know, or can know, what that underlying natu re is (Coley and Tanner, 2012) . Crucially, this essen ce i s 148 immutable or unchanging. Essentialist thinking can be applied to multiple levels of biology and is overall the idea that a core underlying facet of a level of organization (e.g., biological istent identity (Coley and Tanner, 2015) . Essentialism is an example of natural history theory r ecapitulating the ontogeny of human cognitive developmen t (Shtulman and Schulz, 2008) . As recorded by Aristotle, in ancient Chinese thinking, and in folk ideas throughout other cultures, humans have long exhibited an essentialist understanding of li v ing organisms (Evans, 2008; Shtulman, 2006) . This was a formidable historical obstacle to modern evolutionary theory (Kalinowski et al., 2016) , and Darwin spent great care dismantling it by writing at length on the importance of individual - level variation (Gregory, 2009) . This understanding is also exhibited by children, especially until about 8 - 10 years old, after which they may be able to start to understand concepts like modern sc i entific notions of common descent and other variationalist concepts (Evans, 2008; Sinatra e t al., 2008) . This conceptual change is like a scientific revolution experienced within their own personal understanding (Sinatra et al., 2008) . Transformationalism and variationalism I n essentialism a species has an immutable essence that defines its nature; in transformationalism this essence is mutable and its change over time is evolution. Under t ransformationalism , individuals in a population change all - together over time because th e underlying essence is evolving, for example all individual moths in a population becoming successively darker each generation . Under the scientifically - modern understanding of variationalism, genotypic (or phenotypic) individual variation changes in freq u ency within a population across generations , for example a black moth variant increasing in population frequency over generations (Shtulman, 2006) . Note that someone with a transformational understanding is not blinded to variation within a population. They may admit that slight variation exists between individuals, but that this variation is inco n sequential to the evolutionary process, since evolution must involve a change to the shared essence of the 149 species and therefore must act on all individuals (Coley and Tanner, 2015) . Under both transformationalism and variationalism , individuals vary in terms of their genetic information (i.e., genotype) and/or expressed characteristics (i.e., pheno t ype), and over time generations of organisms change (i.e., evolve). Yet these conceptualizations vary in important respects. In transformationalism, evolution acts in a single step where the environment directly affects the essence of all individuals of a species (Gregory, 2009; Shtulman, 2006) . Mode rn molecular reductionism in biology may even provide a pseudoscientific species essence in the guise of DNA, genes, and chromosomes (Coley and Tanner, 2012; Fodor, 1998; Gelman, 2004) . In my opinion, this could be at least one means by which increased biology education may reify essentialist reasoning for some stu dents. In comparison, variationalism requires at least two steps: Mutation is the originator of variation by creating a variant allele in only one individual in the population. This is followed by an evolutionary process (e.g., natural selection or ge neti c drift) that operates to change the frequency of this variant in the population over successive generations (Gregory, 2009; Speth et al., 2014) . Highlighting the importance of mutation as the sour ce of variation, Gregory (2009) A byproduct of essentialist thinking is the persis tent undervaluation of within - species variation; although individuals within a population may vary in some respects, this variation is limited and not important in an evolutionary context (Coley and Tanner, 2015; Shtulman and Schulz, 2008) (Coley and Mura to re, 2012) . Shtulman (2006) takes this further by claiming that transfor mati onalists would intuit that any individual variation must be non - adaptive or maladaptive. In variationalism, differences between individuals are of critical importance (Coley and Muratore, 2012) and are characterized in terms of population frequencies, i.e., the number of individuals of the generation with a certain characteristic. In transformationalism, the population average (or something similar) is ontologicall y re ified, i.e., made real (Gould and Duve, 1996; Shtulman, 2006) . In variationalism, the population average is an abstraction and 150 not something real in and of itself (Coley and Muratore, 2012) ; it is a statistical property of a distribution of separate individual - level features, an emergent pattern (Cooper, 2017) . Variationalism requires the shift in thinking of natural selection not as an event but as an unbounded process (Gregory, 2009; Sinatra et al., 2008) . Evolution is an emergent process at the population level resulting from random, undirected mutation occurrin g to individuals and inheritance of this genetic variation that may be expressed phenotypically as , ultimately , variation in reproduction among those individuals in the population (Gregory, 2009) . In this way, natural selec tion results from a series of contingent events and continues unendingly (Cooper, 2017) . Transformationalism is much simpler and intuitive because it uses folk understandings of causation individual - level events like mutation and natural selection are neither necessary nor consequ ential, what matters is the direct inheritance among all members of the population of those traits which are suited to the envir onment (Sinatra et al., 2008) . Further, transformationalism m ay lead to saltationist notions of complex traits arising suddenly in a single generation in response to a need (Gregory, 2009) . Trouble dislodging transformationalism Biology education research has shown that novices tend to hold transformational views, and that it can be difficult to transform their thinking towards variationalism. As with other i deas that have a firmly intuitive basis, such as teleological reasoning, the transformationalist idea of a species having a shar ed essence is a sticky concept one that is often impervious to dislodgement (Speth et al., 2014) . Studies have shown that large proportions of children (Evans, 2008; Shtulman and Schulz, 2008) , high school students (Furtak et al., 2014; Shtulman, 2006) , and coll ege students (Richard et al., 2017; Shtulman, 2006; Shtulman and Calabi, 2008) hold a transformational if not essentialist understanding of biol o gical variation. Shtulman and Schulz (2008) found that only people with variational i st thinking could, across a majority of taxonomic and trait - class examples, affirm that within - species variation is both prevalent and probable. When expert and novice biologists have been asked to explain natural selection, a majority of experts exhibite d variationist thinking by valuing this first step the origin of heritable 151 variation enough to include it in their explanations of natural selection while only 10% of undergraduates did so (Nehm a n d Ridgway, 2011; Speth et al., 2014) . Coley and Tanner (2015) reported that a majority of non - biology majo r undergraduates agreed with an essentialist statement, while a minority of biology majors did , perhaps demonstrating a predilection towards opposing essentialist thinking with greater innate interest or curiosity if not increased expertise alone . While Sh t ulman and Schulz (2008) argue that an understanding of variationalism should be considered a necessary condition for und e rstanding natural selection , o ther research has demonstrated that instruction in evolution, especially the variational processes of mutation and natural selection, is insufficient in altering these misconceptions (Gregory, 2009; Kalinowski et al., 2016) . Instead, these authors report that it is most important important that variation is for evolution. When helping students shift from transformational ism to variationalism, the understanding (Evans, 2008) . Shtulman (2006) e - learning progression of Furtak et al. (2014) incompatibility of these modes of understandings. Th e difficulty in fostering a shift in student thinking, and the likelihood of mixed model formation, is related to the cognitive stickiness of the essentialist intuitive reasoning strategy, which is innate and impervious to change (Gelman, 2004; Speth et al., 2014) . Although difficult, Shtulman and Calabi (2008) describe a decrease in undergraduates reporting transformational understandi formulation of evolution by natural selection. However, Richard et al. (2017) discovered no si essentialist statements . Concerningly, the results of Coley and T anner (2015) suggest that formal biology education can reify the relationship between essentialist intuitive reasoning and 152 essentialism among students who hold such views , which c ould strengthen the hold of transf ormational understanding among advanced biology students. Experimentation in Avida - ED emphasizes individual variation Evans (2008) describes the need of an instrument to observe evolution happening in action: change as contingen t and non - directional; in effect, species would morph from one to another as environments change, or dis appear entirely. Yet everyday cognition, mired as it is in a particular time and place, appears to obstruct this view of a dynamic world. What is needed is the equivalent of a microscope or telescope, such as a time - machine that transcends human cognitive and Although we do not yet have such a machine to observe the evolution of biological organisms, we might have the next best th ing in t he Avida - ED model system , which allows students to readily observe the processes of evolution in action for digital organisms. By having a simplified genetic model and the color - cod ing of genotypic, phenotypic, and fitness differences, students obs erve that variation among individuals is created from parent to offspring and that this variation exist s within a population context . And by engaging in what is occurring within rapidly changing populations, students have an opportunity to recognize that v ariation among individuals not only exists but that it is the raw material of evolution. T herefore , Avid a - ED may be especially illustrative of variationalism. Digital organisms have a greatly simplified genetic model compared to biological organisms. This may aid in highlighting genotypic differences between individuals as such variation may be obfuscated by complex genotype to phenotype relationships. In biolog ical organisms, genetic information is transferred across multiple levels of organization and expression within an organism, for example, from DNA to RNA to protein and other molecular machinery to exp ressed phenotype. Avidian genetics is limited to the in formation transfer 153 between the instruction sequence to programmatic machinery (i.e., interactions between instruction ) and the expressed phenotype. T his provides a simplified model that may lead to increased understanding of genetic differences by allowing students to focus on differences between organisms rather than focusing on the transfer of genetic information within an organism. Further, the myriad causes, types, and effects of mutatio ns in biology add further complication to understanding genotype to phenotype relationship s (Tibell and Harms, 2017) . Avidians and Avida - ED provide a straightforward representation of how differences in genotypes are expresse d as differences in phenotypes (Smith et al., 2016) . In Avida - E D , genotypic variation between indivi duals, as well as the cause of that variation, is readily observable. Using the organism viewer, students observe the life process of an Avidian, including reproduction. During reproduction random mutation may create var iation in the genome of a single org anism the offspring leaving the parent unchanged . Transformationalism would require a mutation to occur for both the parent and its offspring , or e generation . D ifferences between th e parent and offspring are observed by evaluating changes in the labeled sequence of instructions , with mutations outlined in black . These genomic sequence differences are also often reinforced with a color difference, s ince sets of Avidian instructions ar e color - coded according to their approximate computational or programmatic function . In Avida - ED, as in biological systems, single mutations may have an appreciable likelihood of result ing in consequential phenotypic and fitness effects in an evolving popu lation. Using the population viewer, sets of organisms with discrete phenotypes (i.e., computational task completion) may be outlined in color, and the number of organisms performing each function is displayed among the population statistics. By default, o rganisms in the population are color - coded by fitness, and the average fitness of a population is displayed among the population statistics. Slight fitness differences often exist even for organisms with shared phenotype s, since genotypic differences may h ave consequences for reproduction efficiency (i.e., offspring cost). These differences are easily observable due to the high sensitivity of the 154 fitness color scal e . Instead of a transformational depiction of all organism s having similar if not identical ge notype, phenotype, fitness, and therefore color, a great deal of variation at each of these levels exists among contemporaneous individuals, producing a multitude of colors at any given timepoint. When the environment i s configured to reward certain tasks (in Avida - ED verbiage, when the resources associated with tasks are present in the environment), then the gain or loss of these phenotypes has significant fitness consequences. Avidians have short lifespans and rapidly reproduce to create large population s that experience hundreds of generations within minutes. Instead of a transformational picture of the entire population gradually changing color altogether over generations, d ifferences among individuals are readily app arent as spatial , patchy changes in color - coded fitness over seconds or minutes. Slightly more efficient Avidians evolve over time, slowly altering the color across the population as their descendants with greater relative fitness increase in frequency ; an d Avidians with de novo mutation s fo r task performance are observed to quickly produce similarly fitness - colored offspring, drastically changing the genotypic, phenotypic, and fitness makeup of the population. This provides a powerful visualization of variationalism as colors appear to rapid ly evolve (i.e., change frequency) onscreen. Of course, it is not the color that is evolving, but ra ther the population of digital organisms. This presentation of fitness driving the evolution of a population is key. However, fitness in biology can be ext remely difficult to measure and track over time. In Avida - ED, fitness is automatically recorded for both individual organisms and the population average, and its change over time is graphed dynamically. These data are quantitative and readily exportable fo r further analyses. The lighter - colored genotypes produce more offspring and the population evolves accordingly genotypic differences are driving fitness differences that drive the evolution of the population. Further, students observe this change quickl y, being able to see how fitness at one time point affects a later time point by observing a series of intermediates. 155 Motivation for curriculum Over the last couple decades, interest in modernizing science education in the United has reached a critical mas s, resulting in the publication of several calls for instruction, curriculum, and assessment reform. With respect to biology content, calls for reform emphasize the curricular centrality of evolution as a major unifying theme. And they emphasize variationa l understanding of the evolutionary process. For example, as established in Science (2013) , instruction i n variationalism begins in first grade by establishing that phenotypic variation exists within a species. In third grade , the connection between genotypic a nd phenotypic variation is established, and, separately, differential fitness leading to adaptation. Middle school life sciences curriculum includes that mutation is the ultimate source of variation, that variation due to mutation may affect fitness, and t hat natural selection changes phenotypic frequency within a population. In high school, the requirem ents of natural selection, heritable variation causing differential fitness, are more greatly explored, connecting concepts established across primary and s econdary education. Similar emphasis on individual - level within - population variation being required for evolution to occur is found within the life sciences core ideas in - (2012a) and is the first big idea in (2011) . The most relevant reform product in the undergraduate biology curriculum space is (2011) , which identifies five core foundational conc epts that unify biological knowledge . Two of these, Evolution, and Information F low , prominently fea ture variationalist understanding, as exhibited within the content - specific statements offered in the BioCore Guide (Brownell et al., 2014) , which operationaliz ed for a general biology curriculum. Calls for reform also emphasize the importan ce of science and engineering practices in STEM education through the engagement of students in inquiry - based and research - based pedagogies (Auchincloss et al., 2014; Corwin et al., 2015; Linn et al., 2015; Weaver et al., 2008) . For example, - 12 Science Educatio (2012a) identified ei ght science and 156 engine ering practices. These practices embody the means by which scientists discover and investigate the natural world and by which engineers evaluate and design the built world (National Research Council, 2012a) , and are applicabl e in undergraduate edu cation (Cooper, 2013) . S (2012 b) , identified inquiry - based laboratory experiences as being especially key for helping students gain a better understanding of evolution. Furthermore, it has been shown that exercises and labs that directly address common misconcept ions can be the most s uccessful at (Grant, 20 09; Robbins and Roy, 2007) , a factor of special and significant concern for evolution education (Nelson, 2012) . (2011) adaptation and genetic variation provide rich opportunities for students to work with relevant data and practice quanti tative analysis and dy Collectively, these documents call for a revolution in how students are taught, with best practices being both evidenced - based and backed by sound theoretical rationale. The Avida - ED lab curriculum has been designed to address the recomme ndations and best practices put forward by the se documents . Primarily four curricular components have been suggested to produce transformational - to - variational shifts in understanding. Since this is a profound conceptual change, Sinat ra et al. (2008) propose d that direct exposure to the evolutionary phenomenon in class, and especially with a high degree of experimental engagement, has the greatest likelihood of success. Shtulman and Schulz (2008) and Shtulman and Calabi (2008) recommend that instructors make a point to highlight within - species variation across a multitude of traits. Cooper (2017) stresses the importance of introducing statistical reasoning and computer simulations or modeling as means to distinguish between the ind ividual and population level understanding of emergent processes like natural selection. And Kalinowski et al. (2010) suggest that explicitly connecting molecul ar genetics concepts to evolutionary phenomenon, or further, co nnecting biological 157 ideas across fields of biology , especially molecular biology and genetics (Smith et al., 2009) , may (Nehm et al., 2009; Speth e t al., 2014) . Engaging across biological fields in this mann er is referred to as (Wake, 2008) . For these reasons , I hypothesized that a course experience featuring direct exposure to and experimentation regarding variation and its importance (i.e., the introduct ion via mutation and change via population processes) using Avi da - ED , and especially embedded within broader integrative biology contexts, would shift students from transformational to variational understanding. Methods University and course population con text Michigan State University is a large public research - inten sive university located in the Midwestern United States, with an enrollment of 39,000 undergraduate students across fourteen colleges. While several introductory biology courses are taught, Int egrative Biology was uniquely designed to be a one - semester cou rse that explores many aspects of what is more commonly a two - semester introductory biology curriculum while maintaining the intellectual rigor required of STEM majors. We , principally Louise M ead and I, tailored the course to its target population of STEM students, consisting of engineering and non - life sciences majors needing only one semester of an introductory biology course and not requiring a separate laboratory course. Fall 2016 through S ummer 2018, a total of 3 5 4 students took the course , with one c ourse section offered each of the six semesters (Table 4.01) . A total of 95 student group research projects and poster presentations were produced at the culmination of these semesters . In - person classes met three times weekly for 50 - minute learning sessio ns, one of which was reserved for the digital evolution lab, for 16 weeks; and online - only classes used an online video conferencing platform , Zoom, to facilitate individual student grou ps in meeting three times weekly for one - hour active learning sessions , two of which were reserved for digital 158 evolution labs, for seven weeks. Demographically, the students were primarily first - year (55%) or second - year (24%) students, and most were men ( 80%). Students were largely enrolled in the College of Engineering (60 %) or College of Natural Sciences (29%), with the remainder enrolled across seven other colleges. Students majoring in 36 distinct major concentrations completed the course, with 85% of these being non - life sciences STEM majors, and 57% being either Mechan ical Engineering, Computer Science, Mathematics, or Physics majors. Table 4.01. Enrollment in Integrative Biology: From DNA to Population s in six consecutive semesters (N = 354). Year 20 16 2017 2018 Semester Fall Spring Summer Fall Spring Summer Enrollme nt 86 72 16 74 74 32 Course design Avida - ED digital evolution lab The accessible presentation of variationalism in Avida - ED is reinforced by the instructional circumstances of the digital evolution lab experience. The lab book is grounded by a core curric ulum sequence of five activities culminating in an independent group - based research project (Smith et al., 2016; Kohn et al. 2018) . The core curriculum has been classroom - tested in full and in part in undergraduate courses and high school Advanced Placemen t courses across the United States (Smith et al. 2016; Kohn et al. 2018; Lark et al. 2018) . Avida - ED curriculum resources, distributed as ancillary to the lab book, include a popular science article on Avidi ans (Zimmer, 2005) , the Avida - ED Quick Start Manual ( accessible from the Help toolbar in Avida - ED ), and curriculum activities exploring additional topics, such as g enetic engineering and mutation rate evolution (Johnson et al. , 2011a, 2011b; Lark et al., 2014) . The first five activities, constituting the core Avida - ED curriculum, are designed to support student understanding of evolution, the Avidian ex perimental system, and experimentation in science generally. F irst presente d by Smith et al. (2016) , the activities were expanded and modified by Kohn et al. (2018) to includ e an additional exercise on genetic drift, 159 furthe r develop student expertise in the features of the Avida - ED system, incorpo rate assessment of science and engineering process skills, and otherwise expand the treatment of evolutionary concepts. Student familiarity with the system is jointly scaffolded wit h fundamental biology concepts. The biology content supports a conceptual p rogression from the initiation of differences between individuals in a population to the consequences of those differences: the origin of biological variation via mutation, the ran domness by which mutations occur, the environment - specific concept of absol ute fitness and population - specific concept of relative fitness , and finally the evolutionary processes of natural selection and genetic drift, highlighting the effects adaptive and non - adaptive change have on variation in an evolving population. The moder n scientific understanding of variationalist evolutionary progression is emphasized, experimented on, and discussed throughout each of these core curriculum activities. Being a true experimental study system , each of these activities involves the collectio n of novel data, and the results of any given Avida - ED lab experiment will likely not be as the instructor would predict for each individual student. Data sharing, statistical analy sis, and classroom discussion uncovers the processes underlying the biologi cal and experimental replicate variation. Since the aggregate results for each activity is highly predictable to the instructor although not to the average student, this portion of the curriculum fits the inquiry as opposed to traditional model of research authenticity (Fig. 4.01 ). Avida - ED activi ties include a ready means of online data collection, analysis, and presentation using established Google Sheets files . This allows students to compare their relatively limited set of data with that collected by a much larger set of students. By contributi ng their de - identified data, students can see their contributions towards an accumulating research dataset. These methods provide an excellent opportunity to discuss the importance of sample size when investigating phenomena fraught with experimental varia tion and, further, the potential for human error. For example, the accumulated results for the first lab exercise show two systematic differences from theoretically predicted outcomes. Such results are organic to t he research experience, being 160 explainable by rare, though repeated student methodological and data transcriptional errors. The ensuing conversation highlights the reality of science being a human endeavor. Figure 4. 0 1 . Models of research authenticity com piled from the literature on modes of inve stigation in laboratory environments (Ballen et al., 2017; Buck et al., 2008; Corwin et al., 2015; Domin, 1999; Go odwin et al., 2021; Seymour et al., 2004) . In Avida - ED curricula such as that used in the present study, a fter completing the core curriculum activities and gain ing a familiarity with the model system specifically and experimentation in science generally, groups of students conduct independent re search projects. Students are expected to draw connections across research systems by exploring biological phenomena within the digital experimental evolution system, and by testing evolutionary mechanisms that can not typically be addressed during other ty pes of biology lab experiences. This aspect of the Avida - ED curriculum highlights the full potential of the model system by encouraging students to conduct their own experiments, all while exploring the nature of s cientific reasoning itself. For instance, students can change environmental and other variables and perform controlled experiments to test their own evolutionary hypotheses. Students can see for themselves how evolutionary hypotheses can be supported by em pirical tests. Thus, students learn first - hand that scientists base conclusions upon repeatable empirical observations to construct arguments from evidence. By doing so, Avida - ED provides an environment for students to directly confront and correct their m isconceptions about the scientific status of evolutionary theory and about the nature of scientific practices. 161 Student groups first brainstorm a research question and take ownership of the resulting research. Early in the process, each group prepares and p resents a research proposal; this requirem ent is designed to prompt careful planning, and it allows the instructors an opportunity to provide formalized feedback early in the process when it can hopefully have the greatest impact. Feedback via critical yet encouraging guidance continues throughout the semester and, as is similar for other experimental evolution labs (Cotner and Hebert, 2016) , is often crucial to helping students overcome common experimen tal pitfalls for example, basic misunder standings regarding Avidian biology, experimental design flaws, or failure to investigate phenomena left unexplored in the introductory exercises. Depending on the originality and experimental design creativity of the students, a proportion of experiments each semester investigate phenomena whose results are unknown to the instructor and potentially the scientific community broadly, and therefore clearly fit the discovery - based inquiry model of research authenticity (Fig. 4.01 ). The remaining experiments ar e inquiry - based, with results expected by the instructor with varying degrees of accuracy. Based on the sophistication of e students during their research process o r provided by the instructors. Related publications may provide students with conceptual or experimental design inspiration for their research, in addition to reinforcing the analogy of Avidian experimentation with biology research in other living systems. Before completing the final draft of their research poster, students participate in a peer review process; this provides an opportunity for students to engage in the critical assessment of the content and practice s involved in the work of others and, idea lly, to metacognitively examine their own work. The conclusion of the digital evolution laboratory experience is a public poster presentation session in which student groups present their work to one another, the i nstructors, and members of the university community. This professional presentation opportunity is similar to that of a scientific conference poster session. 162 Other course features The Integrative Biology curriculum is exemplified by its name using integr ative thinking (Wake, 2008) to u nify the disciplines of biology across spa tial, temporal, and taxonomic scales to holistically examine biological systems and processes. This provides context to and strengthens the variationalist patterns experimented upon in the digital evolution lab. A case - based approach was used with evolution as the organizing force across both the biological world and the course material. As Theodosius Dobzhansky (1973) fa mo minutiae and content is best understood and appreciated through the evolutionary lens. As such, evolution is an ex plicit theme of Integrativ e Biology , which consists of a sequence of seven content units. Each unit, except the first on tree - thinking, focuses on a specific biological system (for example, Caribbean anole lizards) and the scientists actively conducting that work. Integrative think in g allows the students to holistically investigate the biological phenomena of each unit. For example, students may examine the progression of a biological trait, exploring its genotypic basis, genetic expression, intracellular function, and resulting tis su e - and organismal - evident phenotypic expression, and finally its population - level change due to an evolutionary process like natural selection (White et a l., 2013) . Other example progressions entail a different suite of biological subfields: organismal trait expression, population - level frequency variation, ecological mechanisms, and finally ecosystem effects. For each case, evolutionary origins, mechani sm s, and/or consequences are investigated. This integrative approach emphasizes variationalism by highlighting the evolutionary consequences of variation among individuals in a population, and across a range of taxonomic diversity. Further, these biologica l examples reinforce the processes and patterns observed in the digital evolution lab using explicit analogies and discussions connecting the systems. In addition to their research with Avida - ED, students interact with authentic biological data collected b y the scientists researching the biology of the cases under study. Depending on the activity, students work with this data at multiple stages of the scientific process: selecting 163 which variable to research, collecting data, summarizing data statistically, gr aphing results, interpreting evidence, and/or presenting conclusions. One activity is meant to demonstrate how meticulous data collection in science can often be . In this work, students measure several phenotypic characters of individuals among related species. Of course, it also prompts them to become familiar with indivi du al - level variation variation that is then discussed in terms of the macroevolution of several related species. Fulfillment of calls for curriculum reform Throughout the Avida - ED lab curriculum, students engage with each of the eight science and engineeri ng practices - (National Research Council, 2012a) , including both the science and engineering - sp ecific framings (Kohn et al., 2018) . However, some practices incur lesser student engagement in the core activities as compared to the independent research project. For example, while the research topic, experimenta l methodology, and statistical analyses are provided as part of the core activities, students are prompted to engage with each of these through short answer assessment items and instructor - facilitated discussion. These early experiences prepare students to independently participate in the eight practices while conducting their research projects. Working with authentic biological data, students further develo p their science and engineering practices in contexts other than with Avidians . While most Avida - ED activities approach biological phenomena with a scientific lens by asking questions and constructing explanations, a few of the non - Avida - ED activities expl icitly use the lens more associated with engineering, that of identifying problems and designing sol utions. Proposing, evaluating, and iteratively adjusting a scientific model is explicitly practiced in some activities so that students actively engage in e valuating their own mental model of the phenomenon. For example, students iteratively create increas ingly complex models of the interplay between communicative and morphological phenotypes with sexual and natural selection in the context of crickets and th eir parasites. Students also practice data management and basic statistical analyses and interpretat ions, including measures of experimental variation such as confidence 164 interval construction and statistical significance determination, emphasizing how impo rtant variation is for understanding biological mechanisms and interpreting hypotheses and experimen tal results. Most recommended introductory undergraduate biology education content from (2011) and The BioCore Guide (Brownell et al., 2014) is addressed in the c ourse. Across the course, over ninety percent of the BioCore G uide content statements are addressed and assessed at least once, with many revisited in multiple systems and using varied assessment styles. This fact highlights the pervasiveness of integrativ e thinking throughout the course along with the opportunity to show the connection between individual trait variation and its evolution. For example, the unit on selection and convergence in mouse coat color variation , adapted from White et al. (2013) , revisits ten content statements covered earlier in the course and introduces ten new content statements from across four core concepts and all major subdisciplines of biology. Twenty percent of the co ntent statements are specifically explored in the Avida - ED lab core activities. For example, one of the statements investigated in the third Avida - ED lab exercise is: odifications can impact the regulation of gene expression and/ or the structure and function of the gene product. If mutations affect phenotype and lead to This statement corresponds to the core concept of Evolution with c onnections to the biology subfields of Molecular , Cellular , and Developmental Biology. Additional content statements may be revealed by student groups during their research process. Consider the core concept Evolution, sub - discipline Physiology statement, "Physiological systems are constrained by ancestral structures, physical limits, and the requirements of other physiological systems, leading to trade - offs that affect fitness." Some s tudents di scover that the evolution of Avidian functions can involve tra de - offs wherein the or physiological machinery (Lenski et al. 1999) , the loss of a simpler fun ction. Another 165 statement commonly addressed is the Information flow core concept, Ecology/ Evolutionary Biology sub - discipline statement, "A genotype influences the range of possible phenotypes in an individual; the actual phenotype results from interactio ns between alleles and the environment." S ome s tudents come ac ross this genotype - by - environment interaction in which an Avidian genotype inconsistently performs a function. This is because Avidian function performance requires encountering and utilizing ra ndom numbers in the digital environment, and some number combi nations require fewer computational steps to output the product necessary to fulfill a function. Those s tudents that encounter this phenomenon do not have the requisite Avidian taxonomic experti se to understand what is occurring. Thus, this can be an exten uating factor causing experiments to produce unexpected results. Once this genotype - by - environment interaction is elucidated by the instructor, this experience can provide a fantastic learning o by Linn et al. (2015) and Goodwin et al. (2021) . The course curriculum also addresses ten recommendations by Hillis ( 2007) for including evolution in introductory biology education. It demonstrates that evolutionary research is ongoing through units such as those on the E. coli Long Term Evolution Experiment (Lenski et al ., 2015) , in part adapted from the curriculum of White et al. (2013) . The evol utionary processes of mutation, natural selection, genet ic drift, and migration are explored and contrasted in multiple systems, clarifying that evolution is not a synonym for natural selection but rather change in individual variation over time. Fresh exa mples are used and continually updated, for example addi tional research on the evolution of quiet Hawaiian crickets (Balenger and Zuk, 2015) has recently been published (Schneider et al., 2018) and will be included in future iterations of the course. Examples such as the co - evolutionary significance of the human skin and gut microbiome are used to show how evolution is relevant to human lives . Numerous examples of evolutionary biology from popular media are incorporated, including many YouTube videos and NPR interviews of researchers conducting and expl aining their work. The course includes experimental evol ution both with a unit on biological experimentation in the E. 166 coli Long Term Evolution Experiment (B lount et al., 2008; Lenski and Travisano, 1994) and t he digital evolution lab in which students conduct their own experimental research. Instead of being taught only in the context of organismal biology and ecology, Evolution is integrated throughout th e course and is the unifying thread connecting the mater ial , as has been called for by Smith et al. (2009) , among oth ers . Tree - thinking is explored in the first unit of the course to provide a conceptual basis and means of understanding later material. The diversity of life is highlighted using units featuring taxa from across biological diversity, including plants, mamm als, bacteria, insects, mollusks, and reptiles. And finally, the great magnitude of evolutionary time is emphasized through discussions on the origin of sex and photosynthesis. The Avid a - ED laboratory experience shares features with other experiences that have been shown to have a multitude of benefits. Corwin et al. (2015) suggest s that similar combination s of activities , especially in the context of course - ba sed undergraduate research experiences, improves cognitive, psychosocia l, and behavioral outcomes. For example: Avida - ED and other activities throughout the course require students to work in small, cooperative groups, a context with well - documented benef its (Smith, 1996; Smith et al., 2005) . The research project, in particular, provides teamwork experience in addition to the greater student interact ion facilitated by inquiry - based learning (Aditomo et al., 2 013; Felder and Brent, 1996) . Research experiences prompt students t o more greatly identify themselves as being scientists or part of the science community (Carlone and Johnson, 2007) , in addition to promoting interest in their degree and retention in STEM generally (Bangera and Brownell, 2014) , especially for unrepresented persons in these fields (Ea gan Jr. et al., 2013; Espinosa, 2011) . Furthermore, conducting research using Avidians necessitates an explicit interdisciplinary connection between biology and computer science; by s olving conceptual problems in complex interdisciplinary systems student s may improve learning and cognitive skill development (Betz, 1995) . Computational tools in education have been shown to have well - documented benefits with regards to student l earning and engagement. For example, such tools facilitate the connecti on of observed phenomena with underlying causal processes 167 (Magana, 2017) , and allow students to observe the unobs ervable (Trey and Khan, 2008) , for example electrons movi ng in electrical circuits or, as with Avida - ED, variationalism as individual - level variation producing population - level evolutionary change within minutes. In some cases, students prefe r computational tools over conventional tools and knowledge sources, re porting greater interest in the material (Akkoyun, 2017) . Additionally, some students interact with Avida - ED in a manner akin to gamification, which can strongly encourage student engagement (Drace, 2013) . Interest, in turn, influences affective response and persistence, which influences learning (Ainley and Ainley, 2011; Rotgans and Schmidt, 2009, 2014) . Furthermore, by conducting authentic scientif ic investigation as part of a course, students tend to experience great er gains in content knowledge (Minner et al., 2010) . While mos t of these benefits have not yet been shown with the use of Avida - ED specifically, the features of the digital evolution lab experience suggest they may indeed occur. Assessment To asse ss transformational and variational reasoning, I used a three - item asse ssment featuring a graphical representation of color variation in the peppered moth Biston betularia during the industrial revolution in England (Fig. 4.02 ). The assessment stem established that moths darkened over time, that the coloration is heritable, a nd that the proposed samples are representative of the historical populations. Two potential collect ions of moths sampled across 100 years in 25 - year intervals are proposed , Panel A and Panel B. R espondents are prompted to provid e a separate explanation fo r each sample's phenotypic pattern and then choos e and defend which collection would be more likely to have existed . Panel A presents sequential ly darker populations that exhibit individual - level variation in white and dark moth variants, with dark moths i ncreasing in frequency over time (Fig . 4.02 ) . Note that in this panel, the population neither begins with all white moths (i.e., prior to a mutation conferring the dark coloration) nor ends with all black moths (i.e., fixation of the dark variant) , but rat her maintains contemporaneous phenotypic variation at each sampling interval . Panel B presents seque ntial ly 168 darker populations that each exhibit a lack of variation among individuals, with each successive population being slighter darker than the previous sampling interval . Figure 4.02 . The a ssessment regarding melanic moth populations with idealized variational (Panel A) and transformational (Panel B) evolutionary trends, as modified from Shtulman (2006) . Th e assessment tool used in this study is mod eled after a portion of an assessment by Shtulman (2006) that also addresses transformational and variational understanding of 169 biological var iation. The entire assessment of Shtulman (2006) consisted of thirty items that aimed to di stinguish variational and transformational i nterpretations for each of six evolutionary phenomena : variation, inheritance, adaptation, domestication, speciation, and extinction. The five - item subset on variation first presented the peppered moth scenario within an explicit adaptive evolution contex shading a set of 25 moth outlines , arrayed with five moths every 25 year s as per Panel A or B in our assessment, but with no color variation other than that which the participant selected . Participants shaded each moth per row to reflect the color variation they would expect to have been observed historically, choosing from on e of five grayscale values for each moth. O ur goals in modifying the assessment of Shtulman ( 2006) were to exclude the requirement of natural selection , to reduce the measurement error associated with the original scoring system, for which 31% of respondents produced uninterpretable shading patterns , and to facilitate digital survey collection by using standard item formatting . The moth figure used in our assessment (Fig. 4.02 ) reflects two competing scenarios the ideal variational (panel A) and transformational (panel B) shading patterns sought by Shtulman (2006) in his Figure 2. While Shtulman (2006) offered some limited evidence on content validity for his assessment (i.e., three biology doctorates reviewed the items), it is unclear how this evidence relates to the ent ire instrument versus the specific portion u sed here with modifications. Responses Using an online survey platform, s tudents completed the assessment twice, during the first week and then during the last two weeks of instruction for in - person course semesters and during the first and last week of on line - only course semesters. Completion was incentivized with a small amount of extra course credit and was weighted such that completion of both pre - and post - course surveys awarded substantially greater ex tra credit. The transformationalism - variationalism items analyzed here (Fig. 4.02 ) were a subset of the survey, which additionally contained multiple choice and other open - ended items addressing other facets of evolution 170 education . A total of 649 surveys w ere returned, of which 611 were fully completed wi th respect to the transformationalism - variationalism portion, for an overall response rate of 84%. Of these, 103 students completed only the pre - or post - course survey, while 254 students completed both the pre - and post - course surveys for a joint response rate of 70%. Scoring Each survey submission was scored as transformational (T), variational (V), or other (O), with the latter including all responses that were not otherwise classifiable (Fig . 4.03 ). An i nitial scoring rubric was created based on the the oretical distinction between transformational and variational understanding (Coley and Muratore, 2012; Coley and Tanner, 2015; Gregory, 2009; Shtulma n, 2006; Shtulman and Calabi, 2008; Shtulman and Schulz, 2008; Sinatra et al., 2008; Speth et al., 2014) . Multiple scorers examined a set of responses for non - biology majors in a different course and revised the rubric following discussion. Two scorers then independently evaluated all Integrative Biolo gy responses reported here, with discussion and rubric revision occurring throughout this process, before reaching finalized consensus scoring across all responses . Inter - rater reliability was calculated us scorers prior to consensus being reached through discussion and was calculated for both all completed responses and the subset of paired pre - /post - course responses. 171 Figure 4.03 . Melanic moth assess ment rubric including the process for choosing which items to score and t he criteria for awarding scores of variationalism (V), transformationalism (T), and other (O). Underlining indicates a scoring distinction between panels A and B , and colored font fur ther indicates the same response as being scored differently depending on the chosen panel . A single score was given for each participant per survey disbursement. The default score was O, with the response requiring explicit endorsement of an appropriate V or T response. The rationale for the rubric (Fig. 4.03 ) is that although both conceptualizations of variation may recognize that variation exists among individuals in a population, those with a variational understanding will deem it with consequential imp ortance as contributing to evolutionary change over time. Note that only rejecting transformationalism by, for example, stating that the Scoring process determined from item 3 response: If pattern A more likely Evalu ate items 1 and 3 If pattern B more likely Evaluate items 2 and 3 O score Any other response, including choosing both or neither Scoring for items 1 and 3, explanation for panel A: V A variant within the population is changing or should change over t ime. T Explicitly refers to the essence of the species changing. O Any other response, including: The moth (population or species) is evolving, changing, adapting, developing, dominating, etc ; natural selection or other evolutionary process without men tion of phenotypic variation; uncertainty if moth variants can mate or if they are in the same population or species; variation among selective environments without mention of phenotypic variation. Scoring for items 2 and 3, explanation for panel B: V A variant within the population is changing or should change over time and explicitly states that variation exists at a single time. T Explicitly refers to the essence of the species changing ; or the moth (population, species, genotype, phenotype) is evolv ing, changing, adapting, developing, dominating, etc. O Any other response, including: Evolution is occurring rapidly without mention of individual variation ; natural selection or other evolutionary proc ess without mention of phenotypic variation; uncert ainty if moth variants can mate or if they are in the same population or species; variation among selective environments without mention of individual phenotypic variation. 172 population would not change in unison, was insufficient to score a V, as were references to genetics concepts alone, e. g., mutation, allele, dominance, or recessivity. The selection of items r eviewed per participant depended on the item 3 response. If a student indicated that either panel A or panel B was most likely to be observed, then their score was based on their answ er to item 3 and the item corresponding to their selection (item 1 or 2, respectively). The rationale for this differential scoring process is that a student may have difficulty justifying the panel that does not conform to their understanding of biologica l variation and evolutionary process es . The score represents an inability to differentiate a response as transformational or variational (i.e., measurement error) and is not necessarily intended as a determination of a mixed model of understand ing (Evans, 2008) or pre - variationalism (Shtulman, 2006) . An identically worded exp lanation for panel B with it being chosen as most likely may result in a different score than if it was the explanation for panel A by a student who chose panel A. Such instances are highlighted in Figure 4.03 with underlining and examples are provided in Table 4.02 . Because the panels presented a distinction between whether variation among individuals exists within a population at a given time, we deemed of panel A provided some evidence of variational think ing and likewise for a ch oice of panel B and transformational thinking . Therefore, the rubric indicates a lower scoring barrier for T scores when choosing panel B and for V scores when choosing panel A . For example, a vague arded an O if it was an explanation for panel A as being most likely , but a T for panel B (i.e., the blue text in Fig. 4.03 ) . I n describing panel B, a V score required additional explication that variation existed within the popula tion at a given time, as opposed to simply being different between generations as represented in that panel. The additional O response in explanations of panel B, evolution occurring rapidly, was added because this may be a scientifically accurate explanat ion for a variational pro cess, although this was in sufficient by itself to warrant a V. 173 Table 4. 02 . Example scored responses of transformational (T), variational (V), and other (O) for panel chosen as most likely in item 3, with item 1 or 2 reviewed depen ding on panel chosen . Pan el, Score Item 3 Item 1 or 2 Reason A, V Panel A is due to the fact you can see moths evolving in order to adapt to the area they are in. With the predator sight being drawn to a darker colored moth, as the environment started to change, the ability to se e the darker colored moths where difficult compared to seeing the lighter moths. Variant should change over time A, V I would say A is more likely to be observed because populations don't all just change at once. Throug h adaption, one ancestor passes the trait on to offspring One moth was randomly dark and passed this trait on to the offspring. This continued down the line Variant changing over time A, T No such responses provided by Integrative Biology students surveye d A, O A, because it would be gradu al, not each generation being darker than the previous. The moths gradually darkened over time. Moth species is evolving A, O Panel A, as the evolutionary timeline in panel B is a bit fast. 100 years for a complete colour change. The independent populatio n of dark butterflies was more fit than the light ones. Separate populations B, V B is more likely given that it is a more gradual process of color change. There was a more gradual environmental change whic h selected moths with darker color to survive and reproduce over those without that phenotype. Variation existing at a single time changes in the population B, V I think that panel B is more likely because the difference is too great in panel A between th e moths for the two colors to coexist for that lo ng. Over time, a series of random mutations that made the moths slightly darker occurred. These mutations were beneficial enough that they outcompeted the other types and the entire population became gradual ly darker over time. Variation existing at a sing le time changes in the population B, T Panel B because all members of the species darkened and had a chance at an increased survival rate. The whole moth species gradually began to darken as opposed to a rapid darkening in a few members of the species. Mo th species is evolving B, T Panel B is definitely more likely to occur as all of the moths have many of the same survival and reproduction genes making it way more likely for them to darken as a populatio n and gradually rather than individuals and suddenl y. Over the course of the century, the moth population as a whole was slowly developing a slightly darker shade of grey in order for survival. As the years progressed the shades became darker. Moth species is evolving 174 Table 4.02 (continued) B, O Panel B. It seems unlikely that with 100 years of selective pressure that there would still be a light moth in 1900. There was selective pressure on the light moths to darken. Evolutionary process without mention of individual variation; also, rapid evolution B , O panel B because when an invasive species arrives, if its trait was more preferable then the other one it would take over extremely quick were panel A has a slow natural selection occurring. over time the darker the moth became the higher chance of surv iving the moth had allowing for more offspring, which slowly changed the color of the population. Separate species; also, rapid evolution Neither, O Natural selection is more likely to occur because this is how most organisms adapt to their environment. N ot reviewed Neither panel chosen Both, O I think they are both equally likely to occur. Both seem to be caused by the influence of a trait that wasn't there before. Not reviewed Both panels chosen Statistics As an initial control to evaluate whether resp ondents provided longer responses pre - and post - course, repeated - measures t - tests were conducted for the number of words and characters, and for both the complete response provided to all three items as well as just the scored portion of the response. A th ree - factor by two - factor Chi - square test was used to evaluate an overall difference in transformational (T), variational (V), and indeterminable among pre - and post - course scor es among all completed responses and, separately, among t he paired responses . Following this, each was partitioned into orthogonal two - by - two tables (Sharpe, 2015) . This analysis independently evaluated the pair of factors of interest here, i.e., scores of transformational versus variational, and the pair of factors of lesser interest, i.e., scores classifiable as transformational or variational versus other. Since these sets of data were orthogonal, adjustment for multiple comparisons was not needed. A McNemar - Bowker test was used to evaluate symmetry among pre - to - post - course score shifts among pai red responses only ; for example whether the number of students that 175 shifted from transformationalism to variationalism was significantly different from the number that shifted from variationalism to transformationalism (i.e., T - V vs. V - T) . Individual McNem ar tests were then used to evaluate each of the three pairs of comparisons (T - V vs. V - T, T - O vs. O - T, and V - O vs. O - V) and Bonferroni adjustments were conducted for the resulting p - values. All Chi - square and related analyses additionally included Cramér's V, which is a measure of association or effect size an d has the same interpretation as a Pearson correlation coefficient. All statistical tests were completed using Microsoft Excel. Results and Discussion For students that responded to both the pre - and po st - assessment, there was no significant difference in the length of responses provided for either assessment ( Table 4.03 ) regardless of whether all three items or only the scored items were compared. To put these values into perspective, the first set of e xample responses in Table 4.02 for panel choice A & sc ore O, choice B & score V, and choice A & score V are representative of the lower quartile, mean, and upper quartile word and character lengths. It was often extremely difficult to classify student unde rstanding when sparsely detailed responses were provid ed. Since instances of insufficient evidence inflate the number of O scores, it is important that a lack of significant difference in response length pre - and post - course was found. Table 4.03 . Mean wor d and character counts for paired student responses wi th repeated - measures t - tests (N = 254, df = 253) showing no significant differences. Relevant word and character counts include only the subset of items used for scoring , see Figure 4.03 . Mean Word Cou nt Mean Character Count Mean Relevant Word Count Mean Relevant Character Count Pre - course 56.5 324.1 39.2 223.5 Post - course 54.9 319.8 37.7 217.3 p - value 0.4 0.69 0.33 0.49 The rubric appeared to function as expected. Inter - rater reliability was suffic iently high (Hallgren, 2012) between the two scorers. Percent agreeme nt was 87.2% across all responses (N = = 508), percent 176 agreement was 86.8% with a kappa of 0.774. The rubri c was designed to differentiate among means of understanding depending on the p anel chosen as most likely to have been historically observable (item 3) , with a lower barrier for scoring variationalism when panel A was chosen and likewise for transformation alism and panel B. This result was observed. Of the 400 students that chose pan el A, 76% of the responses were scored as variationalism and 0% as transformationalism; and of the 188 that chose panel B, 61% of the responses scored as transformational and 14 % as variationalism. The remaining proportion of students whose understanding w as indeterminable when choosing panel A or B, 24% and 26% respectively, was near identical, suggesting that scoring for this catch - all category was not biased depending on panel chosen . There was a significant difference in the proportion of student unders tanding of variation pre - and post - course across all completed responses (Fig. 4.04 ) . The difference in student understanding was statistically significant in the omnibus analys is of all responses ( 2 2,611 = 15.106, p < 0.001) with a n effect size of 0.157. This result was driven by the pre - /post - course difference s in transformational and variational understanding, as evidenced by its partitioned analysis ( 2 1,445 = 13.401, p < 0.001) with an effect size of 0.174. This contrasts with the orthogonal partition ing of students with transformational or variational thinking compared to students whose understanding was indeterminable, which was non - significant ( 2 1,611 = 1.822, p > 0. 1 ). 177 Figure 4.04 . Percent of all students demonstrating understanding of biologi cal variation pre - course (dark, N = 340) and post - course (light, N = 271), with asterisks indicating significance. There was also a significant difference in student understa nding for the subset of students that completed both pre - and post - course surveys ( Fig. 4.05 , overall states of understanding and black asterisks; Table 4.04 , sums ). The difference in student understanding was statistically significant in the omnibus analy sis comparing changes in the overall states of understanding ( 2 2 , 508 = 12.711 , p < 0.00 1 ) with an effect size of 0. 158. The post - hoc tests demonstrated that both orthogonal partitioned datasets were significant, comparing transformational and variational understanding ( 2 1, 374 = 13.298 , p < 0.001) with an effect siz e of 0.1 89 and comparing transformational or variation al thinking to students whose understanding was indeterminable ( 2 1, 508 = 9.889 , p < 0. 005 ) . Therefore, the proportion of students with transformational understanding significantly decreased and the pr oportion with variational or indeterminable understanding significantly increased. 178 Figure 4.05 . Percent of paired - response students (N = 254) demonstrating understanding of biological variation pre - and post - course, with asterisks indicating significanc e between pre/post states of understanding (black) and directionality of shifts between states (pink). Table 4. 0 4 . Detailed percentage breakdowns for paired - response students (N = 254) demonstrating understanding of biological variation pre - and post - cours e. Pre - course Transformational Other Variational Sum Post - course Transformational 5.9% 2.4% 3.9% 12.2% Other 8.3% 8.3% 13.4% 29.9% Variational 9.8% 12.2% 35.8% 57.9% Sum 24.0% 22.8% 53.1% 100% There was asymmetry in the directionality of shifts between states of understanding for the paired - response students (Fig. 4.05 , colors crossing and pink asterisks; Table 4.04 , off - diagonal percentages ) . The directionality of shifts in student understandin g was statistically significant in the omnibus anal ysis ( 2 3,254 = 14.900, p < 0.005) with an effect size of 0.242. This result was driven by two significant factors : greater transitions from transformational to variational understanding rather than in th e reverse ( 2 1,35 = 6.429, adj. p < 0.05) and great er transitions from transformational to indeterminable understanding than in the reverse ( 2 1,2 7 = 8.333, adj. p < 0. 0 05). In comparison, there was a lack of significance in transitions between variational and indeterminable understanding ( 2 1, 65 = 0.138 , adj. p > 0. 1 ). These results 179 provide the means to explain the significant overall shifts in student understanding among pa ired - response students (Fig. 4.05 , black asterisks) and potentially, albeit untestable, among all students (Fig. 4.04 ). These results are encouraging, especially in comparison to prior work. The pre - course proportion of students identified here as transfor mationalists is similar to that identified for nonbiology majors and biology majors by Richard et al. (2017) , and for the biology m ajors by Coley and Tanner (2015) , although less so than their nonmajors. Unlike Richard et al. (2017) , who saw no chan ge in transformational understanding between entering and advanced biology majors ( i.e., after multiple semesters of instruction ) , the proportion of students here was halved at the conclusion of their one , and likely only , college biology course ( Fig. 4.04 ). At least half of this change is attributable to students shifting from transformational to variational understanding, with the remainder shifting to the catch - (Fig. 4.05 ) . Shtulman and Calabi (2008) , using the assessment of Shtulman (2006) , reported a slightly lower prevalence of transformationa l thinking among biology students than reported here. Unfortunately , their results are not comparable to those reported here because their assessmen t measured variational and transformational understanding on a single continuous interval scale and with res pect to inheritance, adaptation, domestication, speciation, and extinction in addition to variation alone. For example, Shtulman (2006) classified students as pre - variationalists if they held variational views of adaptation and inheritance but transformational views of variation and the other factors assessed. Additionally, it would be ideal if the effect size results observed here could be compared to those in prior work (Sun et al., 2010) ; unfortunately , they were not reported. The proportion of students that could not be classified as having either variation al or - /post - course for paired - response students (Fig. 4.05 ) , although not significantly so for all students (Fig. 4.04 ) . Perhaps the increased proportion of paired - e attributed to their incorporation of new information into mixed models of understanding as students shift from 180 transformational to variationalism (Evans, 2008) or pre - var iationalism (Shtulman, 2006) . This might especially explain the 8.3% of paired - response studen ts that shifted from Table 4.04 ). R esearch will be necessary to test this hypothesis, especially using methods with reduced measu rement error . While slightly more Table 4.04 ), this difference was not significant ; combined, these groups constitute more than a quarter of all paired - response students and are the clearest indicators of measurement error % of ambiguous responses in the original assessment (Shtulman, 2006) , indicating that our asse ssment had limited success in reducing measurement error. It is evident from reviewing student responses th at there are a variety of means by which a student may preserve a transformationalist perspective. For example, students may misunderstand the mutat ional process as one that acts on all individuals in the population simultaneously. In this way, the species the mutation as its means of change (Coley and Tanner, 2012; Fodor, 1998; Gelman, 2004) . Seemingly based on a reductionist, molecular means, this transfo rmational model might appear to have a modern scientific basis; although of course it entails a gross misund erstanding about how mutation occurs. It is unclear if these observations support mixed model creation (Evans, 2008; Shtulman, 2006) , the stre ngthening of transformationalism following biological education (Coley and Tanner, 2015) , or bot h. If such student understanding is a conflation of the concepts of mutation occurring to an individual and substitution occurring within a population then it may be an instance of confusion over levels of the system, i.e., a failure in population thinking (Cooper, 2017; Mayr, 1994) . Or it could instead be a case of students improperly applying terminology for the mechanism of change (e.g ., mutation instead of substitution) or the product of such change (e.g., mutation instead of allele) and therefore being incorrectly classified non - variationalists. 181 Other students seem to propose ecological or physiological processes that were not intende multiple populations or species of moth existed in panel A, an d others proposed phenotypic plasticity to explain panel B. These biologically plausible responses may be du e to either creative thinking or may instead be a cover for an innate essentialist reasoning strategy by suggesting that evolution is not occurring at all. Conclusions This work demonstrated a significant change in student understanding of variation after a single semester of instruction. The cognitive stickiness of the essentialist intuitive reasoning strategy as expressed through a transformationali s t understanding of the importance of individual variation (Gelman, 2004; Speth et al., 2014) was not shown here , as 9.8 % of all students switched from transformational to variationalism, with the potential improvement of up to 18 indicative of mixed - model reasoning ( Table 4.04 ). Stated in a different way, relative to the proportion of students that b egan as transformationalists, 40% of them switched to variational understanding and a total of up to 75% potentially improved their un derstanding. This result might be attributable to the direct exposure to and experimentation regarding variation and its i mportance using integrative thinking , statistical reasoning, and computer modeling via Avida - ED, as hypothesized here and suggested ge nerally by Sinatra et al. (2008) , Kalinowski et al. (2010) , Speth et al. (2014) , and Cooper (2017) . However, the Avida - ED curriculum was one component among many in a course th at emphasized within - species variation and population thinking in considering evolution across multiple levels of biologic al organization , as also hypothesized here and suggested generally by others (Kalinowski et al., 2010; Nehm et al., 2009; Shtulman and Calabi, 2008; Shtulman and Schulz, 2008; Smith et al., 2009; Speth et al., 2014) . Further, throughout the course, observations and results made using Avida - ED were directly compared with those gathered from biological sources such that an understanding of the importance of variation in 182 one context may be tied to an understanding in others . Due to the thor ough integration of Avida - ED in the course and the fact that the assessment was administered not immediately before and after specific Avida - ED curricula, it is impossible to attribute the results shown here to anything other than the course experience as a whole. Additionally, while the results presented here are substantial and intriguing, further assessment development for measur ing s tudent understanding of variation is welcomed. The Avida - ED digital evolution lab experience was a hallmark of the Integra tive Biology course. This course should serve as an exemplar for how to best incorporate integrative thinking using case - based content exploration (see White et al., 2013) and a digital evolution lab using Avida - ED into an undergraduate introductory biology curriculum to engage students in a variational understanding (Kohn et al., 2018) . The goal in designing Integrative Biology wa s to create a rigorous introductory biology course for non - life sciences STEM majors that integrates a case - based approach, introducing students to the levels of biological organization while using evolution as a central organizing framework. As such, it w as designed according to the specific circumstance s of its institution and student body; yet the curricular niche it fills is likely applicable to other universities catering to non - life sciences STEM majors, especially engineers, mathematicians, and physi cists, and its approach to variational understandi ng is applicable more broadly still. While Avida - ED has been used in other courses across a range of geographic ally and educational ly divers e contexts (Lark et al., 2018) , the effect size of the curricular interventions on student understanding of variation reported here may not be equivalent. Lark (2014) has established that Avida - ED implementation success is positively correlated with instructor familiarity and comfort, presentation and exploration during in - person classes and especially laboratory - type learning opportunities and student completion of introductory activities followed by guided inquiry investigation. Integrative Biology was instructed by Avida - ED curriculum developers whom in reduced - class - size computer lab environments implemented the full set of recommended c urricula , including the student driven research pr oject. 183 Additionally, Avidians were explicitly analogized with asexual study systems and experiments of each were discussed side - by - side, numerous biological concepts initially introduced with biological sys tems were contemporaneously explored in Avida - ED a nd vice versa, and students were given the opportunity to learn through research success and, as importantly, failure with instructor guidance. Further, this is, to my knowledge, the first application of a digital evolution lab within a larger course conte xt in which the importance of variation among individuals is repeatedly explored in multiple biological systems and across biological fields using integrative thinking. Instructors interested in similarly s hifting student understanding of variation may, th erefore, find success by incorporating a digital evolution lab using Avida - ED into their classroom and integrating student discoveries therein throughout the curriculum. Care was taken in this study to meas ure student understanding of a difficult concept, and further improvements to the measurement tool would be useful contributions to the field of biology education research. Especially considering the relatively high proportion of measurement error observed here, which was marginally improved over Shtulman (2006) , further efforts to refine our as sessment instrument are welcomed. Biology education assessments are numerous, difficult to compare, and routinely in a state of development (Mead et al., 2019) . Since the assessment used here was closely based on that offered by Shtulman (2006) , the validity of that instrument at least somewhat lends evidence for the validity of this instrument. Even so, o ne or more validity analyses could be conduc ted to confirm that it is measur ing the intended distinction between variational and transformational understanding of biological variation. Further refinement of the rubric may also be warranted, especially if it reduces the classification of indeterminab maintaining its validity. For example, the explicit rejection of transformationalism or variationalism alone should, I think, necessitate evidence of the other mode , although others may disagree, e.g., Furtak et al. (2014) . The original assessment offered evidence of content validity via agreem ent among experts (Shtulman, 2006) , although their expert pool of three biology doctorates was small. The 184 modifications made h ere, while expected to maintain this validity, could be similarly tested and with a larger pool of experts. Additionally, evidence of substantive validity would be most convincing (Campbell and Ne hm, 2013) . Substantive validi ty is often shown through think - aloud interviews demonstrating that the cognitive processes used to answer the assessment are as e as a transformational response in the context of panel B and perhaps clarify what the same statement may mean in the panel A context (i.e., the blue text in Fig. 4.03 ) . Student interviews may also provide further insight ot her than with respect to the rub ric. For example, in our understanding of what students mean when they state or imply that the moth variants in panel A are separate species or populations, or in understanding their nuanced transformational or perhaps mixed - model explanations, especially because expert variationalist biologists may struggle to understand the now - alien transformational form of understanding , despite essentialist thinking in biological contexts being universal during childhood (Evans, 2008; Sinatra et al., 2008) . As an alternative to major revisions to the instrument itself, future uses could routinely include s tructured interviews, especially with classifying a greater proportion of students as holding either transformational or variational understanding. Follow - up work should explore how changes in variational thinking relates to other biology topics. Most notably, Shtulman and Schulz (2008) arg ue that natural selection can only be understood once variational thinking has been established. However, the work of Kalinowski et al. (2016) for the CANS instrument showed that student understanding of v ariation was largely independent of their understanding of evolution generally. While it might not be possible to tease apart a causal relationship, analys es comparing increased variationalism with the understanding of natural selection, genetic drift, mut ation, and other factors would be interesting. It might also be worthwhile to investigate the correlation between variational thinking and acceptance of ev olution, although admittedly transformationalism is still an 185 evolutionary process, albeit closely rel ated and perhaps indistinguishable on the present assessment from the non - evolutionary essentialism. 186 REFERENCES 187 REFERENCES Abascal, F., Zardoya, R ., Posada, D., 2005. ProtTest: selection of best - fit models of protein evolution. Bioinforma. Oxf. Engl. 21, 2104 2105. https://doi.org/10.1093 /bioinformatics/bti263 Adami, C., 2006. Digital genetics: unravelling the genetic basis of evolution. Nat. Rev. G enet. 7, 109 118. https://doi.org/10.1038/nrg1771 Adami, C., Ofria, C., Collier, T.C., 2000. Evolution of biological complexity. Proc. Natl. Ac ad. Sci. 97, 4463 4468. Adams, R.H., Schield, D.R., Card, D.C., Castoe, T.A., 2018. Assessing the Impacts of Posi tive Selection on Coalescent - Based Species Tree Estimation and Species Delimitation. Syst. Biol. 67, 1076 1090. https://doi.org/10.1093/sysbio/ syy034 Aditomo, A., Goodyear, P., Bliuc, A. - M., Ellis, R.A., 2013. Inquiry - based learning in higher education: pr incipal forms, educational objectives, and disciplinary variations. Stud. High. Educ. 38, 1239 1258. Agren, J.A., Williamson, R.J., Campitelli, B.E., Wheeler, J., 2017. Greenbeards in yeast: an undergraduate laboratory exercise to teach the genetics of coo peration. J. Biol. Educ. 51, 228 236. Ainley, M., Ainley, J., 2011. Student engagement with science in early adolescence: The contribution of e Contemp. Educ. Psychol. 36, 4 12. Akkoyun, O ., 2017. New simulation tool for teaching learning processes in engineering education. Comput. Appl. Eng. Educ. 25, 404 410. Altekar, G., Dwark adas, S., Huelsenbeck, J.P., Ronquist, F., 2004. Parallel metropolis coupled Markov chain Monte Carlo for Bayesia n phylogenetic inference. Bioinformatics 20, 407 415. Alters, B.J., Nelson, C.E., 2002. Perspective: Teaching evolution in higher education. Ev olution 56, 1891 1901. American Association for the Advancement of Science, 2011. Vision and change in undergradu ate biology education: a call to action. American Association for the Advancement of Science, Washington, DC. Arenas, M., 2012. Simulation of M olecular Data under Diverse Evolutionary Scenarios. PLoS Comput. Biol. 8, e1002495. https://doi.org/10.1371/journ al.pcbi.1002495 Artimo, P., Jonnalagedda, M., Arnold, K., Baratin, D., Csardi, G., de Castro, E., Duvaud, S., Flegel, V., Fortier, A., Gasteige r, E., Grosdidier, A., Hernandez, C., Ioannidis, V., Kuznetsov, D., Liechti, R., Moretti, S., Mostaguir, K., Reda schi, N., Rossier, G., Xenarios, I., Stockinger, H., 2012. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res. 40, W597 W603. https: //doi.org/10.1093/nar/gks400 188 Auchincloss, L.C., Laursen, S.L., Branchaw, J.L., Eagan, K., Graham, M., Hanauer, D. I., Lawrie, G., McLinn, C.M., Pelaez, N., Rowland, S., Towns, M., Trautmann, N.M., Varma - Nelson, P., Weston, T.J., Dolan, E.L., 2014. Assessmen t of Course - Based Undergraduate Research Experiences: A Meeting Report. CBE - Life Sci. Educ. 13, 29 40. https://do i.org/10.1187/cbe.14 - 01 - 0004 Balenger, S.L., Zuk, M., 2015. Roaming Romeos: male crickets evolving in silence show increased locomotor behaviou rs. Anim. Behav. 101, 213 219. https://doi.org/10.1016/j.anbehav.2014.12.023 Ballen, C.J., Blum, J.E., Brownell, S., Hebert, S., Hewlett, J., Klein, J.R., McDonald, E.A., Monti, D.L., Nold, S.C., Slemmons, K.E., Soneral, P.A.G., Cotner, S., 2017. A Call to Develop Course - Based Undergraduate Research Experiences (CUREs) for Nonmajors Courses. CBE Life Sci. Educ. 16. h ttps://doi.org/10.1187/cbe.16 - 12 - 0352 Bangera, G., Brownell, S.E., 2014. Course - Based Undergraduate Research Experiences Can Make Scientific Re search More Inclusive. CBE - Life Sci. Educ. 13, 602 606. https://doi.org/10.1187/cbe.14 - 06 - 0099 Barbançon, F., Eva ns, S.N., Nakhleh, L., Ringe, D., Warnow, T., 2013. An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica 30, 143 170. https://doi.org/10.1075/dia.30.2.01bar Baum, B.R., Duncan, T., Stuessy, T., 1984. Application of co mpatibility and parsimony methods at the infraspecific, specific, and generic levels in Poaceae, in: Cladistics: Perspectives on the Reconstruction of Evolutionary History. Columbia Univ. Press, New York, pp. 192 220. Baum, D.A., Smith, S.D., 2013. Tree Th inking An Introduction to Phylogenetic Biology. Roberts and Company Publishers Inc, the United States. Beckmann, B.E., McKinley, P.K. , Ofria, C., 2008. Selection for group - level efficiency leads to self - regulation of population size, in: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation. ACM, pp. 185 192. Beerenwinkel, N., Siebourg, J., 2012. Probability, statistics, and computational science, in: Evolutionary Genomics. Springer, pp. 77 110. Betz, J.A., 1995. Computer games: Increase learning in an interactive multidisciplinary environment. J. Educ. Technol. Syst. 24, 195 205. Bishop, B.A., Anderson, C.W., 1990. Student conceptions of natural selection and its role in evolution. J. Res. Sci. Teach. 27, 415 427. Bishop, M.J., F riday, A.E., 1987. Tetrapod relationships: the molecular evidence, in: Molecules and Morphology in Evolution: Conflict or Compromise. Cambridge University Press, Cambridge, England, pp. 123 139. Blount, Z.D., Borland, C.Z., Lenski, R.E., 2008. Historical c ontingency and the evolution of a key innovation in an experimental population of Escherichia coli. Proc. Natl. Acad. Sci. 105, 7899 7906. https://doi.org/10.1073/pnas.0803151105 Bostrom, N., 2003. Are You Living in a Computer Simulation? Philos. Q. 53, 24 3 255. 189 Box, G.E., 1976. Statistics and Science. J Am Stat Assoc 71, 791 799. Brooks, D.J., Fresco, J.R., Lesk, A.M., Singh, M., 2002. Evolution of Amino Acid Frequencies in Proteins Over Deep Time: Inferred Order of Introduction of Amino Acids into the Gen etic Code. Mol. Biol. Evol. 19, 1645 1655. https://doi.org/10.1093/oxfordjournals.molbev.a003988 Brown, J.M., Hedtke, S.M., Lemmon, A .R., Lemmon, E.M., 2010. When Trees Grow Too Long: Investigating the Causes of Highly Inaccurate Bayesian Branch - Length Est imates. Syst. Biol. 59, 145 161. https://doi.org/10.1093/sysbio/syp081 Brownell, S.E., Freeman, S., Wenderoth, M.P., Crowe, A.J., 201 4. BioCore Guide: a tool for interpreting the core concepts of Vision and Change for biology majors. CBE - Life Sci. Educ. 13 , 200 211. Bryant, D., 2003. A classification of consensus methods for phylogenetics. DIMACS Ser. Discrete Math. Theor. Comput. Sci. 61, 163 184. Buck, L.B., Bretz, S.L., Towns, M.H., 2008. Characterizing the level of inquiry in the undergraduate laborator y. J. Coll. Sci. Teach. 38, 52 58. Bull, J.J., Badgett, M.R., Wichman, H.A., Huelsenbeck, J.P., Hillis, D.M., Gulati, A., Ho, C., Mol ineux, I.J., 1997. Exceptional convergent evolution in a virus. Genetics 147, 1497 1507. Bull, J.J., Cunningham, C.W., Moli neux, I.J., Badgett, M.R., Hillis, D.M., 1993. Experimental Molecular Evolution of Bacteriophage T7. Evolution 47, 993 1007. https:// doi.org/10.2307/2409971 Burke, M.K., Dunham, J.P., Shahrestani, P., Thornton, K.R., Rose, M.R., Long, A.D., 2010. Genome - wi de analysis of a long - term evolution experiment with Drosophila. Nature 467, 587 590. https://doi.org/10.1038/nature09352 Burmeister, A.R., Smith, J.J., 2016. Evolution across the Curriculum: Microbiology. J. Microbiol. Biol. Educ. 17, 252 260. https://doi .org/10.1128/jmbe.v17i2.988 Campbell, C.E., Nehm, R.H., 2013. A Critical Analysis of Assessment Quality in Genomics and Bioinformatic s Education Research. CBE Life Sci. Educ. 12, 530 541. https://doi.org/10.1187/cbe.12 - 06 - 0073 Carlone, H.B., Johnson, A., 2 007. Understanding the science experiences of successful women of color: Science identity as an analytic lens. J. Res. Sci. Teach. 44 , 1187 1218. https://doi.org/10.1002/tea.20237 Cavalli - Sforza, L.L., Edwards, A.W., 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21, 550 570. CG, N., LaBar, T., Hintze, A., Adami, C., 2017. Origin of life in a digital microcosm. Phil Trans R Soc A 375, 20160350. https://doi.org/10.1098/rsta.2016.0350 Chi, M.T.H., 2005. Commonsense Conceptions of Eme rgent Processes: Why Some Misconceptions Are Robust. J. Learn. Sci. 14, 161 199. https://doi.org/10.1207/s15327809jls1402_1 190 Chow, S.S ., Wilke, C.O., Ofria, C., Lenski, R.E., Adami, C., 2004. Adaptive Radiation from Resource Competition in Digital Organisms . Science 305, 84 86. https://doi.org/10.1126/science.1096307 he Big Stuff. Faber & Faber. Clune, J., Goldsby, H.J., Ofria, C., Pennock, R.T., 2011. Selective pressures for accurate alt ruism targeting: evidence from digital evolution for difficult - to - test aspects of inclusive fitness theory. Proc. R. Soc. Lond. B Bio l. Sci. 278, 666 674. Clune, J., Misevic, D., Ofria, C., Lenski, R.E., Elena, S.F., Sanjuán, R., 2008. Natural Selection Fa ils to Optimize Mutation Rates for Long - Term Adaptation on Rugged Fitness Landscapes. PLoS Comput. Biol. 4, e1000187. https://doi.org /10.1371/journal.pcbi.1000187 Clune, J., Ofria, C., Pennock, R.T., 2007. Investigating the emergence of phenotypic plastici ty in evolving digital organisms, in: European Conference on Artificial Life. Springer, pp. 74 83. Coley, J., Muratore, T., 2012. Trees, Fish, and Other Fictions: Folk Biological Thought and its Implications for Understanding Evolutionary Biology, in: Evol ution Challenges: Integrating Research and Practice In Teaching and Learning About Evolution. pp. 22 46. https://doi.org/10. 1093/acprof:oso/9780199730421.003.0002 Coley, J.D., Tanner, K., 2015. Relations between Intuitive Biological Thinking and Biological Misconceptions in Biology Majors and Nonmajors. CBE Life Sci. Educ. 14, ar8. https://doi.org/10.1187/cbe.14 - 06 - 0094 Coley, J.D., Tanner, K.D., 2012. Common Origins of Diverse Misconceptions: Cognitive Principles and the Development of Biology Thinking. CB E Life Sci. Educ. 11, 209 215. https://doi.org/10.1187/cbe.12 - 06 - 0074 College Board, 2011. AP Biology Curriculum Framework 2 012 2013. College Board, New York. Connelly, B.D., Zaman, L., Ofria, C., McKinley, P.K., 2010. Social Structure and the Maintenance of Biodiversity., in: ALIFE. pp. 461 468. Cooper, M.M., 2013. Chemistry and the Next Generation Science Standards. J. Chem. Educ. 90, 679 680. https://doi.org/10.1021/ed400284c Cooper, R.A., 2017. Natural selection as an emergent process: instructional imp lications. J. Biol. Educ. 51, 247 260. https://doi.org/10.1080/00219266.2016.1217905 Cooper, T.F., Ofria, C., 2003. Evolutio n of stable ecosystems in populations of digital organisms, in: Artificial Life VIII: Proceedings of the Eighth International Confer ence on Artificial Life. pp. 227 232. Corwin, L.A., Graham, M.J., Dolan, E.L., 2015. Modeling Course - Based Undergraduate Res earch Experiences: An Agenda for Future Research and Evaluation. CBE - Life Sci. Educ. 14, es1. https://doi.org/10.1187/cbe.14 - 10 - 0167 Cotner, S., Hebert, S., 2016. Bean Beetles Make Biology Research Sexy. Am. Biol. Teach. 78, 233 240. https://doi.org/10.152 5/abt.2016.78.3.233 191 Covert, A.W., Lenski, R.E., Wilke, C.O., Ofria, C., 2013. Experiments on the role of deleterious mutations as st epping stones in adaptive evolution. Proc. Natl. Acad. Sci. 110, E3171 E3178. Covert III, A.W., Smith, L., Derrberry, D.Z., Wilke, C.O., 2012. What does sex have to do with it: tracking the fate of deleterious mutations in sexual populations. Artif. Life 1 3, 32 36. https://doi.org/10.7551/978 - 0 - 262 - 31050 - 5 - ch005 Coyne, J.A., 2010. Why evolution is true. Viking, New York. Cummin gs, M.P., Handley, S.A., Myers, D.S., Reed, D.L., Rokas, A., Winka, K., 2003. Comparing bootstrap and posterior probability values i n the four - taxon case. Syst. Biol. 52, 477 487. Cunningham, C.W., Jeng, K., Husti, J., Badgett, M., Molineux, I.J., Hillis, D.M., Bull, J.J., 1997. Parallel molecular evolution of deletions and nonsense mutations in bacteriophage T7. Mol. Biol. Evol. 14, 1 13 116. Cunningham, C.W., Zhu, H., Hillis, D.M., 1998. Best - fit maximum - likelihood models for phylogenetic inference: Empiri cal tests with known phylogenies. Evolution 52, 978 987. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1978. 22 a model of evolutiona ry change in proteins, in: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation Silver Spring, M D, pp. 345 352. de Varigny, H., 1892. Experimental Evolution. MacMillan. de Visser, J.A.G.M., Hermisson, J., Wagner, G.P., Meyers, L .A., Bagheri - Chaichian, H., Blanchard, J.L., Chao, L., Cheverud, J.M., Elena, S.F., Fontana, W., 2003. Perspective: evolutio n and detection of genetic robustness. Evolution 57, 1959 1972. Degnan, J.H., DeGiorgio, M., Bryant, D., Rosenberg, N.A., 2009. Prop erties of consensus methods for inferring species trees from gene trees. Syst. Biol. 58, 35 54. Dangerous Idea: Evolution and the meanings of life. Simon & Schuster. Dobzhansky, T., 1973. Nothing in Biology Makes Sense except i n the Light of Evolution. Am. Biol. Teach. 35, 125 129. https://doi.org/10.2307/4444260 Domin, D.S., 1999. A Review of Labor atory Instruction Styles. J. Chem. Educ. 76, 543. https://doi.org/10.1021/ed076p543 Drace, K., 2013. Gamification of the Laboratory Experience to Encourage Student Engagement 274. https://doi.org/10.1128/jmbe.v14i2.632 Eagan Jr., M.K., Hurtado, S., Chang, M.J., Garcia, G.A., Herrera, F.A., Garibay, J.C., 2013. Making a Difference in Science Educatio n: The Impact of Undergraduate Research Programs. Am. Educ. Res. J. 50, 683 713. https://doi.org/10.3102/0002831213482038 El ena, S.F., Wilke, C.O., Ofria, C., Lenski, R.E., 2007. Effects of population size and mutation rate on the evolution of mutational r obustness. Evolution 61, 666 674. https://doi.org/10.1111/j.1558 - 5646.2007.00064.x 192 Elsberry, W.R., Grabowski, L.M., Ofria, C ., Pennock, R.T., 2009. Cockroaches, drunkards, and climbers: Modeling the evolution of simple movement strategies using digital org 99. Espinosa, L., 2011. Pipelines and Pathways: Women of Color in Undergraduate STEM Majors and the College Experiences That Contribute to Persistence. Harv. Educ. Rev. 81, 209 24 1. https://doi.org/10.17763/haer.81.2.92315ww157656k3u Evans, E.M., 2008. Conceptual Change and Evolutionary Biology: A Deve lopmental Analysis, in: Vosniadou, S. (Ed.), International Handbook of Research on Conceptual Change. Routledge, New York, pp. 263 2 94. Eyre - Walker, A., Keightley, P.D., 2007. The distribution of fitness effects of new mutations. Nat. Rev. Genet. 8, 610 618. https://doi.org/10.1038/nrg2146 Fares, M.A., Barrio, E., Becerra, N., Escarmís, C., Domingo, E., Moya, A., 1998. The foot - and - mou th disease RNA virus as a model in experimental phylogenetics. Int. Microbiol. 1, 311 318. Felder, R.M., Brent, R., 19 96. Navigating the bumpy road to student - centered instruction. Coll. Teach. 44, 43 47. Felsenstein, J., 2005. PHYLIP (Phylogeny Inference Package). Department of Genome Sciences, University of Washington, Seattle. Felsenstein, J., 2004. Inferring phylogeni es. Sinauer associates, Sunderland, MA. Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mo l. Evol. 17, 368 376. Felsenstein, J., 1978. Cases in which parsimony or compatibility methods will be positively misl eading. Syst. Zool. 27, 401 410. Felsenstein, J., 1974. The Evolutionary Advantage of Recombination. Genetics 78, 737 756. Fiedler, D., Sb eglia, G.C., Nehm, R.H., Harms, U., 2019. How strongly does statistical reasoning influence knowledge and acceptance o f evolution? J. Res. Sci. Teach. 56, 1183 1206. Fitch, W.M., Atchley, W.R., 1985. Evolution in inbred strains of mice appears rapid. Scien ce 228, 1169 1175. https://doi.org/10.1126/science.4001935 Fitch, W.M., Margoliash, E., 1967. Construction of phylogen etic trees. Science 155, 279 284. Fodor, J.A., 1998. Concepts: Where cognitive science went wrong. Oxford University Press. Fortuna, M.A., Zaman, L., Ofria, C., Wagner, A., 2017. The genotype - phenotype map of an evolving digital organism. PLOS Comput. Biol . 13, e1005414. https://doi.org/10.1371/journal.pcbi.1005414 Fortuna, M.A., Zaman, L., Wagner, A.P., Ofria, C., 2013. Evolving digital eco logical networks. PLoS Comput. Biol. 9, e1002928. Freeman, S., Herron, J.C., 2014. Evolutionary analysis, Fifth. ed. P earson Prentice Hall, Upper Saddle River, NJ. 193 Furtak, E.M., Morrison, D., Kroog, H., 2014. Investigating the Link Between Learning Progres sions and Classroom Assessment. Sci. Educ. 98, 640 673. https://doi.org/10.1002/sce.21122 Garland, T., Rose, M.R., 200 Experimental Evolution: Concepts, Methods, and Applications of Selection Ex periments. University of California Press, Berkeley, pp. 3 13. Geary, D., 2007. Educating the Evolved Mind: Conceptual Foundations for an Evolutionary Educational Psychology, in: Carlson, J., Levin, J.R. (Eds.), Educating the Evolved Mind: Conceptual Found ations for an Evolutionary Educational Psychology. IAP, pp. 1 99. Gelman, S., 2004. Psychological essentialism in chil dren. Trends Cogn. Sci. 8, 404 409. https://doi.org/10.1016/j.tics.2004.07.001 Gelman, S.A., 2003. The Essential Child: Origins of Essenti alism in Everyday Thought. Oxford University Press, New York. Gerrish, P.J., Lenski, R.E., 1998. The fate of competing beneficial mutations in an asexual population. Genetica 102, 127. Godfrey - Smith, P., 2007. Conditions for evolution by natural selection. J. Philos. 104, 489 516. Goings, S., Clune, J., Ofria, C., Pennock, R.T., 2004. Kin selection: The rise and fall of k in - cheaters, in: Proceedings of the Ninth International Conference on Artificial Life. pp. 303 308. Goldsby, H.J., Cheng, B.H., 2008. Avid a - MDE: a digital evolution approach to generating models of adaptive software behavior, in: Proceedings of the 10th An nual Conference on Genetic and Evolutionary Computation. ACM, pp. 1751 1758. Goldsby, H.J., Cheng, B.H., McKinley, P.K., Knoester, D.B., O fria, C.A., 2008. Digital evolution of . International Conference On. IEEE, pp. 87 96. Goldsby, H.J., Dornhaus, A., Kerr, B., Ofria, C., 2012. Task - switching costs promote the e volution of division of labor and shifts in individuality. Proc. Natl. Acad. Sci. 109, 13686 13691. Goldsby, H.J., Kno ester, D.B., Kerr, B., Ofria, C., 2014a. The effect of conflicting pressures on the evolution of division of labor. PloS One 9, e102713. G oldsby, H.J., Knoester, D.B., Ofria, C., Kerr, B., 2014b. The Evolutionary Origin of Somatic Cells under the Dirty Wor k Hypothesis. PLoS Biol. 12, e1001858. https://doi.org/10.1371/journal.pbio.1001858 Goodwin, E.C., Anokhin, V., Gray, M.J., Zajic, D.E., P odrabsky, J.E., Shortlidge, E.E., 2021. Is This - Based Course Feel Authentic. CBE Life Sci. Educ. 20, ar10. https://doi.org/10.1187/cbe.20 - 07 - 0149 Gould, S.J., Duve, C. de, 1996. Full house: The spre ad of excellence from Plato to Darwin. Nature 383, 771 771. 194 Grabowski, L.M., Bryson, D.M., Dyer, F.C., Pennock, R.T., Ofria, C., 2013. A Case Study of the De Novo Evolution of a Complex Odometric Behavior in Digital Organisms. PLoS ONE 8, e60466. https://d oi.org/10.1371/journal.pone.0060466 rstanding of evolution by natural selection in an introductory biology course. Teach. Issues Exp. Ecol. 6. Graybeal, A., 1998. Is it better to add taxa o r characters to a difficult phylogenetic problem? Syst. Biol. 47, 9 17. Graybeal, A., 1994. Evaluating the Phylogenetic Utility of Genes: A Search for Genes Informative About Deep Divergences Among Vertebrates. Syst. Biol. 43, 174 193. https://doi.org/10.2 307/2413460 Green, D.S., Bozzone, D.M., 2001. A test of hypotheses about random mutation: using classic experiments to teach experimental design. Am. Biol. Teach. 63, 54 58. Green, J.H., Koza, A., Moshynets, O., Pajor, R., Ritchie, M.R., Spiers, A.J., 2011 . Evolution in a test tube: rise of the Wrinkly Spreaders. J. Biol. Educ. 45, 54 59. https://doi.org/10 .1080/00219266.2011.537842 Gregory, T.R., 2009. Understanding Natural Selection: Essential Concepts and Common Misconceptions. Evol. Educ. Outreach 2, 15 6 175. https://doi.org/10.1007/s12052 - 009 - 0128 - 1 Gupta, A., LaBar, T., Miyagi, M., Adami, C., 2016. Evo lution of Genome Size in Asexual Digital Organisms. Sci. Rep. 6, 25786. https://doi.org/10.1038/srep25786 Hagstrom, G.I., Hang, D.H., Ofria, C., Torng, E ., 2004. Using Avida to test the effects of natural selection on phylogenetic reconstruction methods. A rtif. Life 10, 157 166. Hall, B.G., 2005. Comparison of the Accuracies of Several Phylogenetic Methods Using Protein and DNA Sequences. Mol. Biol. Evol. 22, 792 802. https://doi.org/10.1093/molbev/msi066 Hallgren, K.A., 2012. Computing Inter - Rater Reliabil ity for Observational Data: An Overview and Tutorial. Tutor. Quant. Methods Psychol. 8, 23 34. Handelsman, J., Houser, B., Kriegel, H., 1997. Biology bro ught to life: a guidebook to teaching students to think like scientists. McGraw - Hill Primis. Hang, D., Ofria, C., Schmidt, T.M., Torng, E., 2003. The effect of natural selection on phylogeny reconstruction algorithms, in: Genetic and Evolutionary Computati on Conference. Springer, pp. 13 24. Hang, D., Torng, E., Ofria, C., Schmidt, T.M., 2007. The effect of natural selection on the performance of maximum parsimony. BMC Evol. Biol. 7, 94. https://doi.org/10.1186/1471 - 2148 - 7 - 94 Hartl, D.L., Clark, A.G., 2007. Principles of population genetics, Fourth. ed. Sinauer associates, Sunderland, MA. Hasegawa, M., Kishin o, H., Yano, T., 1985. Dating of the human - ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160 174. 195 Heddy, B.C., Sinatra, G.M. , 2013. Transforming misconceptions: Using transformative experience to promote positive affect and con ceptual change in students learning about biological evolution. Sci. Educ. 97, 723 744. Hill, W.G., Robertson, A., 1966. The effect of linkage on limits to artificial selection. Genet. Res. 8, 269 294. Hillis, D.M., 2007. Making evolution relevant and exci ting to biology students. Evolution 61, 1261 1264. Hillis, D.M., 1995. Approaches for assessing phylogenetic accuracy. Syst. Biol. 44, 3 16. Hillis, D.M. , Bull, J.J., 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogen etic analysis. Syst. Biol. 42, 182 192. Hillis, D.M., Bull, J.J., White, M.E., Badgett, M.R., Molineux, I.J., 1993. Experimental Approaches to Phylogenet ic Analysis. Syst. Biol. 42, 90 92. Hillis, D.M., Bull, J.J., White, M.E., Badgett, M.R., Molineux, I.J ., 1992. Experimental Phylogenetics: Generation of a Known Phylogeny. Science 255, 589 592. Hillis, D.M., Huelsenbeck, J.P., Cunningham, C.W., 1994. Appl ication and accuracy of molecular phylogenies. Science 264, 671 677. Hoang, D.T., Vinh, L.S., Flouri, T ., Stamatakis, A., von Haeseler, A., Minh, B.Q., 2018. MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation. BMC Evol. Biol. 18. https://doi.org/10.1186/s12862 - 018 - 1131 - 3 Hoang, D.T., Vinh, L.S., Flouri, T., Stamatakis, A. , von Haeseler, A., Minh, B.Q., 2017. MPBoot version 1.1.0 User Manual. Hormoz, S., 2013. Amino acid composition of proteins reduces deleterious impact o f mutations. Sci. Rep. 3, 2919. https://doi.org/10.1038/srep02919 Howe, K., Bateman, A., Durbin, R., 20 02. QuickTree: building huge Neighbour - Joining trees of protein sequences. Bioinformatics 18, 1546 1547. Huang, H., He, Q., Kubatko, L.S., Knowles, L.L., 2010. Sources of Error Inherent in Species - Tree Estimation: Impact of Mutational and Coalescent Effects on Accuracy and Implications for Choosing among Different Methods. Syst. Biol. 59, 573 583. https://doi.org/10.1093/sysbio/syq047 Huang, H., Sukumaran, J., Smith, S.A., Knowles, Ll., 2017. Cause of gene tree discord? Distinguishing incom plete lineage sorting and lateral gene transfer in phylogenetics. PeerJ Prepr. https://doi.org/10.7287/peerj.preprints.3489v1 Huelsenbeck, J.P., 1995. Performance of phyl ogenetic methods in simulation. Syst. Biol. 44, 17 48. Huelsenbeck, J.P., Nielsen, R., 1999. Effect of nonindependent substitution on phylogenetic accuracy. Syst. Biol. 48, 317 328. Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P., 2001. Bayesi an inference of phylogeny and its impact on evolutionary biology. science 294, 2310 23 14. 196 Huerta - Cepas, J., Serra, F., Bork, P., 2016. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol. Biol. Evol. 33, 1635 1638. https://doi.org/ 10.1093/molbev/msw046 Hunter, J.D., 2007. Matplotlib: A 2D Graphics Environment. Compu t. Sci. Eng. 9, 90 95. https://doi.org/10.1109/MCSE.2007.55 Letters of Charles Darwin, Including an Autobiographical Chapter. Kpjm