LEVERAGING LARGE DATASETS TO INVESTIGATE THE DISTRIBUTION OF FITNESS EFFECTS IN PLANTS By Miles David Roberts A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Genetics and Genome Sciences – Doctor of Philosophy 2025 ABSTRACT Mutations—spontaneous changes in DNA sequences—are fundamental to evolution, and con- tribute to everything from disease susceptibility in people to agriculture. Population geneticists are tasked with understanding mutations - historically categorizing them as either beneficial, harmful, or neutral based on how they affect fitness. However, our understanding of these categories is of- ten limited by incomplete data and an inability to travel back in time to track mutations. Over four chapters, I will explore how large datasets can enhance our understanding of these three funda- mental mutation categories. First, I will use 27 terabases of gene expression data from 300 studies to investigate whether rarely expressed genes accumulate harmful mutations at higher rates than constitutively expressed genes. Next, I will leverage about 205 terabases of DNA-sequencing data and use k-mers (DNA subsequences) to measure neutral variation in 112 natural plant species. The results suggest that current methods for estimating diversity reliant on reference genomes underes- timate genetic variation. The following chapter expands upon the idea of using k-mers to measure genetic diversity, numerically and analytically exploring the relationship between k-mer diversity and nucleotide diversity. Overall, k-mer diversity scales linearly with nucleotide diversity and we showcase the use of bloom filters to decrease the memory burden of the k-mer diversity calcu- lations. Finally, I will use 250,000 evolutionary simulations to train machine learning models to infer the time it took for a fixed beneficial mutation to spread. Overall, large datasets like these hold many opportunities to revisit old biological questions and further our understanding of muta- tion trajectories. To David, Cheree, Blake, and Paige Roberts iii ACKNOWLEDGEMENTS I am infinitely appreciative for the guidance and support of my advisor, Emily Josephs. Emily’s patience, even in the face of my constant stubbornness, and her promotion of curiosity greatly shaped how I view research. This dissertation also greatly benefited from the input of my com- mittee: Bob Vanburen, Shinhan Shiu, and Jeff Connor. I want to give a special thanks to Addie Thompson for being the outside evaluator for my comprehensive exam and for later stepping into my committee for my dissertation defense. I also want to thank Olivia Davis and Robert Williamson for being wonderful collaborators. We will see their awesome work in chapter 3. I deeply appreciate the members of the Josephs lab, past and current, who have given me invalu- able advice and have (usually) been willing to laugh at my puns. These folks are: Husain Agha, Sophie Buysse, Nate Catlin, Derek Denney, Asia Hightower, Daniela Palmer, Rebecca Panko, Gabbie Sandstedt, Mia Stevens, Chrissy Miller, Adrian Platts, Magie Williams, and Maya Wilson Brown. I was also lucky to have many amazing friends that helped keep my sprits high. In no particular order, these folks are: Laura Ford, Nick Johnson, Max Harman, Luke Strickland, Bran- don Webster, Kim Fisher, Brandon Beal, John Botsford, Robin Waterman, and Carolyn Graham. Thank you all. I also want to thank MSU’s Council of Graduate Students and their dozens of representatives across MSU for their work in advocating for graduate and professional student needs and for giving me the opportunity to be their President. It was a great honor to represent my peers in that capacity and the experience shaped my understanding of leadership. Thank you especially to Deanne Arking and Allyn Shaw for helping me immensely with these efforts. I want to acknowledge that it was my undergraduate genetics teacher William Cushwa that encouraged me to continue studying genetics and to engage in undergraduate research, which were both crucial to me pursuing a graduate degree. I did undergraduate research in Dr. Stephanie Porter’s lab at Washington State University in Vancouver, Washington and I received incredibly valuable advice and mentorship that I continue to lean on to this day. I also want to acknowledge mentorship from the former post-docs Emily Helliwell and Camile Wendlandt, as well as former iv graduate students Angeliqua Montoya, Zoie Lopez, and Niall Millar that were willing to train and mentor me during this time and motivated me to go to graduate school. The work in this dissertation, as with all scientific works, leans heavily on the previous efforts of other scientists. I want to acknowledge that in chapters 1 and 2, I make extensive use of public databases with contributions from hundreds of scientists across the world. It is impossible to list all of the contributors here, but I am grateful for their work and for making their data public. Without their effort none of this would have been possible. I am also grateful for the anonymous peer reviewers that have commented on chapters 1-3. I would like to thank the Plant Biology Department and the Ecology, Evolution, and Behavior Program at MSU for providing a welcoming environment where I could always participate even though I had technically no affiliation with either of them. My home program of Genetics and Genome Sciences was also always supportive and flexible in allowing me to pursue my personal educational goals. I would like to thank the Biomolecular Sciences Program for supporting the 1st year of my PhD, the National Science Foundation and the National Institutes of Health for supporting the 2nd, 3rd, and 4th years of my PhD. I would also like to thank MSU’s Institute for Cyber-Enabled Research for providing the computing power and a fellowship to learn cloud-based computing approaches. To my family, David, Cheree, Blake, and Paige Roberts, your support is the reason why I got to pursue an education. I do not know where I would be without it. I love you all very much. To my partner, Riley Pizza, thank you for your endless support. You are the single greatest thing that’s happened to me in graduate school and I cannot wait to see where our journey takes us. - Miles v PREFACE This dissertation is my original work. Chapters 1 and 3 are published in peer-reviewed journals under a CC-BY license, allowing unlimited sharing and reproduction as long as attribution is given. Chapter 2 is accepted for publication at Evolution Letters, but the preprint version is available at biorxiv under a CC-BY license. Chapter 4 is unpublished. I was responsible for all of data collection, data analysis, and writing in Chapters 1, 2, and 4 (with guidance from my advisor and committee). Chapter 3 was co-first authored by myself and Olivia Davis - a former undergraduate (now graduate) of Rose-Hulman Institute of Technology in Terre Haute, Indiana. Olivia carried out the simulations depicted in Figures 3.3 - 3.5, D1, and D2. I performed the simulations depicted in Figure 3.1, made Figure 3.2, and derived equations 3.7, 3.8, and 3.15. Olivia also wrote the text in the section Testing the efficacy of k-mer measures of variation, but all other text was written by me. vi TABLE OF CONTENTS INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 1: WEAKER SELECTION ON GENES WITH TREATMENT SPECIFIC EXPRESSION CONSISTENT WITH A LIMIT ON PLASTICITY EVOLUTION IN ARABIDOPSIS THALIANA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 CHAPTER 2: k-MER-BASED DIVERSITY SCALES WITH POPULATION SIZE PROX- IES MORE THAN NUCLEOTIDE DIVERSITY IN A META-ANALYSIS OF 98 PLANT SPECIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 . . CHAPTER 3: k-MER-BASED APPROACHES TO BRIDGING PANGENOMICS AND POPULATION GENETICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 CHAPTER 4: SUMMARY STATISTICS COMPARABLE TO CONVOLUTIONAL NEURAL NETWORKS FOR INFERRING TIMES TO FIXATION . . . . . . . . 90 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 APPENDIX A: SUPPLEMENTAL FIGURES FOR CHAPTER 1 . . . . . . . . . . . . 139 APPENDIX B: SUPPLEMENTAL METHODS FOR CHAPTER 2 . . . . . . . . . . . . 178 APPENDIX C: SUPPLEMENTAL FIGURES FOR CHAPTER 2 . . . . . . . . . . . . 179 APPENDIX D: SUPPLEMENTAL FIGURES FOR CHAPTER 3 . . . . . . . . . . . . 205 APPENDIX E: SUPPLEMENTAL FIGURES FOR CHAPTER 4 . . . . . . . . . . . . . 207 vii INTRODUCTION Population geneticists study genetic variation within and between populations. To understand variation, we need to understand the fate of new mutations. What determines whether a mutation spreads through a population or goes extinct? There are three frontiers that are currently revolu- tionizing our understanding of this question: (1) the collection of data across many environments, (2) the assembly of multiple genomes per population (i.e. pangenomics), and (3) machine learning (Hahn, 2018). In the course of this dissertation, we will familiarize ourselves with each of these areas, but to prepare for the journey, we will need a mental model as a guide. Let’s imagine the amount of genetic diversity in a population as the level of water in a pool. In this analogy, the process of mutation is the faucet or hose that adds water into the pool. If this were the only dynamic, then our pool would fill up forever. Thus, our pool must have some drains. One important drain is natural selection, which is a process where mutations that cause heritable, delete- rious changes in survival or reproduction will not spread through a population and may eventually go extinct (i.e. purifying selection). On the other hand, mutations that cause heritable, beneficial changes to survival and reproduction will spread and may eventually be fixed (i.e. positive se- lection). If we imagine a world where all mutations are deleterious then we would expect natural selection to constantly remove mutations from our population. However, this expectation clearly does not match reality because essentially all natural populations have genetic variation (Leffler et al., 2012). One way to modify our expectation is to recognize that natural selection is not an omnipotent force - the pool drain only opens under certain circumstances. One important limit to natural selection is that it can only act when a mutation is affecting fit- ness. If a mutation occurs in a non-expressed gene, for instance, then such a mutation may not affect fitness and will be invisible to natural selection. This masking of deleterious mutations can theoretically happen in any gene that is only expressed at certain times or in certain environments (Van Dyken and Wade, 2010). But on the other hand, natural selection often favors conditional ex- pression (Snell-Rood et al., 2010). Many species of plants have evolved specific gene expression responses for specific stimuli, including mechanical stress (Braam and Davis, 1990), light stress 1 (Mishra et al., 2012), and heat stress (Zhang et al., 2017b). Thus, a trade-off arises between the added benefit of conditional expression vs the weakened ability to purify genes of deleterious muta- tions. Though the theory behind this relationship is well-understood (Van Dyken and Wade, 2010), there are no empirical investigations of the relationship between environment-specific expression and deleterious variation. This is in part because gene expression could not be measured across many environments easily until relatively recently. Chapter 1 leverages publicly available gene expression data across 200 different environments in Arabidopsis thaliana to test whether genes with more environment-specific expression have more deleterious variation. What if we assume most mutations are neutral? This idea is the foundation of Neutral The- ory (Kimura, 1983). Under this framework, the amount of variation in a population is simply a product of how many individuals are in the population and the rate of mutation. In other words, more individuals replicating DNA to produce offspring means more opportunities for mutations to occur in DNA. This is a straightforward expectation that is anchored in the foundational works of population genetics and evolutionary biology. Charles Darwin observed that “wide ranging, much diffused and common species vary most” (Darwin (1859), Chapter II) and R. A. Fisher supported this observation mathematically as “a numerous species, with the same frequency of mutation, will maintain a higher variability than will a less numerous species” (Fisher, 1923). However, when sci- entists empirically estimate the scaling between population size and genetic diversity, the result is much weaker than expected (Lewontin, 1974; Leffler et al., 2012; Buffalo, 2021). There are many biological mechanisms that contribute to this paradoxical result (Charlesworth and Jensen, 2022). However, one underexplored explanation is whether our methods of measuring genetic diversity simply do not capture all of the variation present. Chapters 2 and 3 will investigate whether the ubiquitous practice of using a reference genome to estimate diversity contributes to this paradox and how patterns in k-mers (DNA subsequences) provide a potential solution. Finally, what about beneficial mutations? Positive selection can also be thought of as a drain in our pool analogy because fixation of beneficial alleles removes diversity around the site of the beneficial mutaiton - a signature called a selective sweep (Whitehouse and Schrider, 2023). The 2 amount of diversity that positive selection removes is proportional to the time it takes for the mu- tation to fix (Coop, 2020). Shorter times to fixation result in greater loss of diversity because there is less time for mutation and recombination to break up the sweep. After fixation, mutation and recombination will gradually reintroduce variation at the site of the sweep over time. This suggests that the times to fixation and age of a sweep are encoded in patterns of genomic diversity around the sweep site. However, times to fixation rarely estimated in real data, despite being a core quan- tity in studies of neutral theory and selective sweeps (Charlesworth, 2020; Zhao et al., 2013). In chapter 4, we will use machine learning to try to estimate times to fixation using a snapshot of a contemporary population. 3 CHAPTER 1: WEAKER SELECTION ON GENES WITH TREATMENT SPECIFIC EXPRESSION CONSISTENT WITH A LIMIT ON PLASTICITY EVOLUTION IN ARABIDOPSIS THALIANA1 Abstract Differential gene expression between environments often underlies phenotypic plasticity. How- ever, environment-specific expression patterns are hypothesized to relax selection on genes, and thus limit plasticity evolution. We collated over 27 terabases of RNA-sequencing data on Ara- bidopsis thaliana from over 300 peer-reviewed studies and 200 treatment conditions to investigate this hypothesis. Consistent with relaxed selection, genes with more treatment-specific expression have higher levels of nucleotide diversity and divergence at nonsynonymous sites but lack stronger signals of positive selection. This result persisted even after controlling for expression level, gene length, GC content, the tissue specificity of expression, and technical variation between studies. Overall, our investigation supports the existence of a hypothesized trade-off between the environ- ment specificity of a gene’s expression and the strength of selection on said gene in A. thaliana. Future studies should leverage multiple genome-scale datasets to tease apart the contributions of many variables in limiting plasticity evolution. Introduction Organisms must cope with ever-changing environmental conditions to survive and reproduce. If these changes in condition cannot be avoided or escaped, phenotypes that respond to environ- mental variation through phenotypic plasticity may be adaptive. For example, under low light, the same Arabidopsis thaliana genotype will produce more or larger leaves to capture more energy for photosynthesis (Pigliucci and Kolodynska, 2002). Plastic responses are partly controlled through differential gene expression between environments (Scheiner, 1993; Schlichting and Smith, 2002). Understanding the evolution of these condition-specific expression patterns could help reconcile the diversity of plastic responses observed in nature and engineer organisms to overcome environ- mental challenges. 1This chaper is published at the following DOI under a Creative Commons CC-BY License: https://doi.org/10. 1093/genetics/iyad074. This license permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Permission to reuse this article is not required. 4 However, not all organisms can respond plastically to environmental change, so it is crucial to understand the processes that constrain plasticity (Van Kleunen and Fischer, 2005). These con- straints are usually characterized as either costs, where plasticity reduces fitness in some way, or limits to the evolution or maintenance of plasticity (DeWitt et al., 1998). Decades of research has attempted to measure the costs associated with plasticity (reviewed in Schneider 2022) but studies often fail to detect costs or find costs that are weak or restricted to certain environments (Van Kle- unen and Fischer, 2005; Van Buskirk and Steiner, 2009; Auld et al., 2010). Theory also predicts that there will be strong selection to alleviate costs (Murren et al., 2015). Thus, limits may be more important than costs in shaping the evolution of plasticity. Recent work suggests that relaxed selection can limit plasticity evolution (Snell-Rood et al., 2010; Murren et al., 2015). For instance, one hypothesis posits that genes are often under selection for environment-specific expression to minimize deleterious pleiotropy (Snell-Rood et al., 2010; McGuigan et al., 2014; Huber et al., 2017). However, narrowing the range of environments where a gene is expressed also reduces the opportunity for negative selection to act on deleterious muta- tions in the gene (Kawecki, 1994; Whitlock, 1996; Van Dyken and Wade, 2010). The accumulation of deleterious mutations could then cancel out any selective benefits of the environment-specific expression pattern. Thus, a trade-off arises between a gene’s degree of environment-specific ex- pression and the strength of negative selection acting on said gene. If we assume that environment- specific expression generally contributes to phenotypic plasticity, then this trade-off would poten- tially limit the maintenance of plasticity (Kawecki, 1994; Snell-Rood et al., 2010). Whether such a trade-off exists has not yet been tested, but the deposition of expression data from hundreds of experimental treatments across hundreds of labs into public repositories now enables approximat- ing environment specificity as treatment specificity and linking treatment-specific expression to the rate of evolution. One challenge in studying the relationship between treatment specificity and protein evolution is that many factors influence evolutionary rates (for review, see Rocha 2006; Gaut et al. 2011; Koonin 2011; Zhang and Yang 2015) and these factors are hard to disentangle. A protein’s expres- 5 sion level is often considered the best predictor of its evolutionary rate (Rocha, 2006) - a result observed across all domains of life (Zhang and Yang, 2015) and sometimes considered a “law” of genome evolution (Koonin, 2011). Among multicellular organisms, the degree of tissue specificity in expression is also generally predictive of evolutionary rates (Duret and Mouchiroud, 2000; Lar- racuente et al., 2008; Winter et al., 2004; Zhang and Li, 2004; Slotte et al., 2011; Bush et al., 2015; Mukherjee et al., 2016; Groen et al., 2020; Huang, 2022). Additional factors that also influence evolutionary rates include exon edge conservation (Bush et al., 2015), mutational bias (Wang et al., 2004; Ossowski et al., 2010), gene length (Mukherjee et al., 2016), gene age (Moutinho et al., 2022), GC content (Zhang et al., 2002; Mukherjee et al., 2016), expression stochasticity (Groen et al., 2020), involvement in general vs specialized metabolism (Mukherjee et al., 2016), iden- tity as a regulatory or structural gene (Wheeler et al., 2022), recombination rate (Langley et al., 2012), codon-bias (Betancourt and Presgraves, 2002), mating system (Wright et al., 2002; Glémin, 2007; Payne and Alvarez-Ponce, 2018), gene compactness (Larracuente et al., 2008; Mukherjee et al., 2016), co-expression or protein-protein interaction network connectivity (Alvarez-Ponce and Fares, 2012; Masalia et al., 2017; Mähler et al., 2017; Alvarez-Ponce et al., 2017; Josephs et al., 2017), gene body methylation (Takuno and Gaut, 2012), metabolic flux (Colombo et al., 2014), protein structure (Lin et al., 2007), essentiallity (Nembaware et al., 2002; Yang et al., 2003; Davis and Petrov, 2004), and even plant height (Lanfear et al., 2013). This overabundance of possible explanatory variables suggests that massive genome-scale datasets and careful statistical analysis are required to tease out the influence of treatment-specific expression on evolutionary rates. To investigate the influence of treatment-specific expression on evolutionary rates, we com- piled a dataset of gene expression data across over 200 treatments from over 300 peer-reviewed studies in A. thaliana. We annotated RNA-sequencing runs from these studies using standardized ontologies, then processed all of them with the same pipeline. Finally, we combined the result- ing gene expression matrix with estimates of selection based on within-species polymorphism and between-species divergence to investigate whether genes with treatment-specific expression were under weaker negative selection. 6 Methods RNA-seq run annotation We amassed an initial set of RNA-seq runs from the Sustech Arabidopsis RNA-seq database V2 (Zhang et al., 2020) (http://ipf.sustech.edu.cn/pub/athrdb/) excluding any samples not associ- ated with a publication or lacking a tissue type label. On May 24th, 2022 we also downloaded all run metadata from the Sequence Read Archive (SRA) returned by the following search term: (“Arabidopsis thaliana”[Organism] AND “RNA”[Source]) OR (“Arabidopsis thaliana”[Organism] AND “RNA-Seq”[Strategy]) OR (“Arabidopsis thaliana”[Organism] AND “TRANSCRIPTOM- IC”[Source]). All SRA runs were linked to their associated publications, if possible, using Entrez. Any SRA run numbers that we could not link to a PUBMED ID or DOI were omitted. We then man- ually removed all SRA runs that originated from transgenic, mutant, hybrid, grafted, cell culture, polyploid, or aneuploid samples based on information in the SRA metadata and associated publi- cations. Runs from any naturally-occurring A. thaliana accession were included. We also omitted SRA runs that focused on sequencing non-coding RNA (ncRNA-seq, miRNA-seq, lncRNA-seq, sRNA-seq, etc.). After applying these criteria, any bioprojects with 8 or fewer SRA run numbers remaining were also omitted. All runs were labeled with treatment and tissue type descriptions using the Plant Experimen- tal Conditions Ontology (PECO) and the Plant Ontology (PO) (Cooper et al., 2018), respectively, based on information in their associated publications and SRA metadata. In our analysis, control exposure was defined as long day conditions (12 hrs light exposure or longer, but not constant light) and growing temperatures in the range of 18° - 26°, inclusive, without explicit application of stress or nutrient limitation. Warm treatments were defined as 27° or higher, while cold treatments were defined as 17° or lower. Any studies that did not report both day length and growing temperature were omitted. Any runs that could not be linked to treatments based on their annotations in the SRA or Sustech databases were also omitted. Treatment with polyethylene glycol (PEG) was catego- rized as drought exposure. Samples from plants that were recovering from stress were categorized according to the growth conditions of the recovery state instead of the stressed state. When appro- 7 priate, we labeled samples with multiple PECO terms. For example, a sample that was subjected to both heat stress and high light stress would get two PECO terms (one for each stress) and be treated separately from samples subjected to only heat stress or only light stress. Tissue type labels were eventually collapsed to the following categories: whole plant, shoot, root, leaf, seed, and a com- bined category of flower and fruit tissues. The flower and fruit tissue categories were combined because of their developmental relationship and small size relative to the other categories. In the end, we had a dataset of 24,101 sequencing runs from 306 published studies. RNA-seq run processing All RNA-seq runs were processed using the same workflow to remove the effects of bioin- formatic processing differences between studies on expression level. First, runs were downloaded using the SRA toolkit (v2.10.7), but 90 runs were not publicly available and thus failed to download. All successfully downloaded runs were trimmed using fastp v0.23.1 (Chen et al., 2018), requiring a minimum quality score of 20 and a minimum read length of at least 25 bp (-q 20 -l 25). Trimming results were compiled using multiqc v1.7 (Ewels et al., 2016). All trimmed runs were then aligned to a decoy-aware transcriptome index made by combining the primary transcripts of the Araport11 genome annotation (Cheng et al., 2017) with the A. thaliana genome in salmon v1.2.1 (Patro et al., 2017) using an index size of 25bp. The salmon outputs of each run were then combined with a custom R script to create an gene-by-run expression matrix. We omitted 423 runs with a mapping rate < 1 %, 215 runs with zero mapped transcripts, and 18 genes with zero mapped transcripts across all runs from further analysis. We note that although this cut-off does not exclude samples with more modest mapping rates (e.g. 20 - 60 %) the choice to include these samples was to avoid removing large chunks of data as “outliers” and analyzing only those samples that conform to our expectations. Whole genome sequence data processing We downloaded whole genome sequencing data for 1135 A. thaliana accessions from the 1001 genomes project panel (SRA project SRP056687) (Alonso-Blanco et al., 2016) using the SRA toolkit. All runs were trimmed using fastp (Chen et al., 2018), requiring a minimum quality score 8 of 20 and a read length of at least 30 bp (-q 20 -l 30). Trimmed reads were then aligned to the A. thaliana reference genome using BWA v0.7.17 (Li and Durbin, 2009). The alignments were sorted and converted to BAM format with SAMTOOLS v1.11 (Danecek et al., 2021), then optical duplicates were marked with picardtools v2.22.1. Haplotypes were called for each accession, then combined and jointly genotyped with GATK v4.1.4.1 assuming a sample ploidy of 2, heterozygosity of 0.001, indel-heterozyogsity of 0.001, and minimum base quality score of 20. Invariant sites were included in the genotype calls with the –include-non-variant-sites option. All calls were restricted to only coding sequence (CDS) regions based on the Araport11 annotation by supplying a BED file of CDS coordinates made with bedtools (v2.29.2). Following Korunes and Samuk (2021), variant and invariant sites were filtered separately using both GATK and vcftools v0.1.15 (Danecek et al., 2011). Variant sites were filtered if they met any of the following criteria: QD < 2, QUAL < 30, MQ < 40, FS > 60, HaplotypeScore > 13, MQRankSum < -12.5, ReadPosRankSum < -8.0, mean depth < 10, mean depth > 75, missing genotype calls > 20%, being an indel, or having more than 2 alleles. In the end, 1,915,859 variant sites across all coding sequences were retained for further analysis. Invariant sites were filtered if they met any of the following criteria: QUAL > 100, mean depth < 10, mean depth > 75, missing genotype calls > 20%. Finally, variant sites were annotated using snpEff (Java v15.0.2) (Cingolani et al., 2012b) and variants labeled as either missense or synonymous were separated into different files using SnpSift (Cingolani et al., 2012a). Selection estimated from between-species divergence We identified 1:1 orthologs between the primary transcripts of A. thaliana and Arabidopsis lyrata with Orthofinder v2.5.4 (Emms and Kelly, 2019). For each 1:1 ortholog, we aligned their protein sequences with MAFFT L-INS-I v7.475 (Katoh and Standley, 2013), then converted the protein alignments to gapless codon-based alignments using pal2nal v14 (Suyama et al., 2006). Using the gapless codon-based alignments, we estimated dN /dS using the Nei and Gojobori (1986) method implemented as a custom Biopython v1.79 script and implemented through the codeml program in the PAML package v4.9 (Yang, 2007). Unlike codeml, the custom Biopython script also returns counts of nonsynonymous (N ) and synonymous sites (S) within each gene as described in 9 Nei and Gojobori (1986), which we later used to calculate nucleotide diversity per nonsynonymous site (πN ) and per synonymous site (πS). Before proceeding with more analyses, we confirmed that our estimates of dN and dS were consistent between our Biopython script and codeml (Figure A5, Pearson correlations dN : ρ = 0.9998, dS : ρ = 0.9809). The outputs of the Biopython script were used in all subsequent analyses. Selection estimated from within-species polymorphism Nucleotide diversity (π) was calculated for each gene with pixy v1.2.3.beta1 (Korunes and Samuk, 2021) three times: once using all sites (both variant and invariant), once using missense sites plus invariant sites, and once using synonymous sites plus invariant sites. These estimates were then converted to π, πN , and πS, respectively, by first multiplying the per site estimate output from pixy by the number of sites included in the analysis. Then, to get πN and πS, the values from analyses of missense plus invariant, and synonymous plus invariant sites were divided by the N and S values for each gene, respectively, as determined by the method in Nei and Gojobori (1986). We next calculated Tajima’s D for each gene. First, we calculated π and Watterson’s Theta (θW ) for each variant site i within a gene (πi and θW i respectively). In this case, πi was calculated as: ) ( ) ( πi = ni ni − 1 2∑ j=1 1 − p2 ij (1.1) Where ni is the number of sequenced chromosomes with non-missing genotypes for variant i, pi1 is the frequency of the reference allele, and pi2 is the frequency of the alternative allele. Then, θW i was calculated as: Where ai is: θW i = 1 ai ai = ni−1∑ j=1 1 j (1.2) (1.3) This calculation of θW i is equivalent to the usual calculation of θW with the number of segregating 10 sites set to one. Next, the variance in Tajima’s D was calculated for each site as: V ar(πi − θW i) = − 1 ai ni+1 3(ni−1) ai (1.4) This is equivalent to equation 38 in Tajima (1989) with the number of segregating sites set to one. Finally, the results of the above calculations were combined in the following formula: Di = √ πi − θW i V ar(πi − θW i) (1.5) To get Tajima’s D for each gene, we then averaged across the Di values for all the variant sites within a gene. We also calculated the direction of selection (DoS) statistic (Stoletzki and Eyre-Walker, 2011) for each gene. Counts of nonsynonymous and synonymous polymorphisms within each gene (PN and PS, respectively) were determined with bedtools (v2.29.2). The number of nonsynonymous and synonymous differences (DN and DS, respectively) between A. thaliana genes and their 1:1 A. lyrata orthologs, if present, were estimated during the process of calculating dN /dS in Biopython as described above. We then combined these components into the following equation: DoS = DN DN + DS − PN PN + PS (1.6) We chose this metric, as opposed to the proportion of amino acid substitutions driven by positive selection (α), because it is less biased than α (Stoletzki and Eyre-Walker, 2011) and was suc- cessfully used in studies similar to ours (Paape et al., 2013). Furthermore, we found that α often returns uninterpretable negative values when applied to A. thaliana, perhaps because of an excess of slightly deleterious polymorphisms (Nordborg et al., 2005) due to their predominantly selfing mating system (Charlesworth, 1994). Treatment specificity Treatment specificity (τ ) was estimated separately for runs from each tissue type using the following formula (Yanai et al., 2005): ∑ τ = N i=1 1 − xi N − 1 max x 11 (1.7) Where x is the vector of average expression values of a gene in each treatment category, measured in transcripts per million (TPM), and where N is the number of treatment categories. Dividing by N means that τ varies between zero and one, where zero indicates no specificity and one in- dicates exclusive specificity to a single treatment. We used this metric of specificity because it is consistently more robust than others (Kryuchkova-Mostacci and Robinson-Rechavi, 2017) and is normalized by the number of treatments included, making it comparable across data sets. We also applied the same formula to calculate tissue specificity in several different treatment conditions. Simulating correlations between average expression and specificity index Average expression level and measures of expression specificity are correlated by definition because genes with more treatment/tissue-specific expression will have lower average expression across all treatment/tissue categories. We ran two simulations to better illustrate the factors driving the correlation between average expression and the specificity index, τ . In both simulations, we generated 1000 random matrices, where each element xij represented the expression of gene i in experiment j, by sampling from a zero-inflated negative binomial distribution: xij ∼ ZIN egBinom(N, p1, p2) (1.8) Where the size and probability parameters of the negative binomial component were N = 100 and p1 = 0.1, respectively, while the probability of an expression value being non-zero was p2 = 0.4. All matrices included 5 groups of columns, with 5 columns per group, representing replicates of tissue/treatment groups. For both simulations, we averaged across columns within each group to simulate the calculation of tissue/treatment-wide averages. We then applied the formula for τ across the rows of this averaged matrix to get expression specificity. In one simulation, we calculated expression level by averaging across the rows of the expression matrix. In a second simulation, we excluded experiments where a gene was not expressed (xij = 0) from the calculation of average expression. 12 Average expression, length, GC content, family size Calculating the average expression of each gene was a three-step process. First, we averaged together runs with matching SRA experiment IDs because these runs represented technical repli- cates of the same biological sample and treatment conditions. Second, we partitioned our gene-by- experiment expression matrix by the tissue type each sample came from. Finally, for each tissue type’s expression matrix, we averaged across all of the expression values of each gene across all experiments, excluding values < 5 transcripts per million (TPM). We excluded values < 5 TPM from the average expression calculation to avoid a high correlation between average expression and treatment-specificity, as has been reported in previous studies (Slotte et al., 2011). This high correlation occurs because an environment-specific gene will by definition also have low average expression across environments it is rarely expressed in. Furthermore, we excluded values < 5 TPM to avoid including small expression values that could be artifacts of alignment error. The length and GC content of each gene was measured using the bedtools nuc command (v2.29.2) and included each gene’s introns and untranslated regions when present. We included introns and untranslated regions in the estimate of gene length because they play important roles in determin- ing rates of protein evolution (Castillo-Davis et al., 2002; Eisenberg and Levanon, 2003). Finally, the family size for each gene was estimated as the number of A. thaliana genes in their respective orthogroups output by OrthoFinder. Partial correlation analysis Not all treatment-tissue combinations were sampled in the overall RNA-seq dataset, causing confounding between the treatment and tissue labels. We resolved this in two ways. First, we subset the data to only the treatment conditions where all tissue types were represented. Second, we subset the data by tissue type and analyzed each subset separately. For each subset, we calculated partial spearman correlations between treatment specificity and our measures of selection (dN , πN , Tajima’s D, and DoS) after accounting for average expression (excluding values TPM < 5), gene length, and GC content using the ppcor R package (Kim, 2015). For partial correlation analyses involving πN and Tajima’s D, we also controlled for gene family size. We did not account for gene 13 family size in partial correlation analyses involving dN or DoS because these metrics apply to only genes with one family member in this study. When calculating partial correlations involving dN , we excluded any genes with saturating divergence (dS > 1). All statistical analyses and data visualizations used R v4.0.3 and used color palettes in the scico R package (Crameri, 2018; Pedersen and Crameri, 2023). Surrogate variable analysis We recalculated treatment specificity and repeated all partial correlation analyses after correct- ing each data subset for technical between-experiment variation (i.e. batch effects), following an approach from (Fukushima and Pollock, 2020). Batch effects include variables that influence gene expression measurements but are not of interest to this study, such as the sequencing platform and the library prep protocol used in each experiment. First, with our data already subset by tissue type, we further subset to only include treatments with RNA-seq runs from at least two studies. This min- imizes confounding between-treatment variation with the technical between-experiment variation we aimed to account for. We then applied surrogate variable analysis (SVA) using the svaseq() function within the SVA package (Leek and Storey, 2007) to each of these subsets. Briefly, SVA models gene expression as: xij = µi + f (yi) + eij (1.9) Where xij is the expression of gene i in experiment j, µi is the average expression of gene i across all experiments, and yi is the value of a predictor variable of interest for gene i. Furthermore, f (yi) gives the deviation of gene i from its average expression based on the value of yi and eij is the residual error. SVA takes this model and partitions the residual variance, eij, into: L∑ xij = µi + f (yi) + ℓ=1 γℓigℓj + e∗ ij (1.10) ∑ Where L ℓ=1 γℓigℓj gives the summed effects of L unmodeled variables (gℓj) for each gene and e∗ ij gives the gene-specific noise in expression. SVA does not attempt to estimate what the unmodeled variables influencing expression are, but rather find a set of vectors (the surrogate variables) that 14 span the same space as g: xij = µi + f (yi) + K∑ k=1 λkihkj + e∗ ij (1.11) Where each hk is a surrogate variable and each λk gives the effects of each surrogate variable on gene expression. For our analyses, our predictor variable yi was treatment type. To get a measure of expression where the effects of surrogate variables are removed, we then subtracted off the effects of surrogate variables from both sides of the above equation. K∑ xij − λkihkj = µi + f (yi) + e∗ ij (1.12) k=1 Where xij − ∑ K k=1 λkihkj gives us our expression values accounting for the effects of surrogate variables. The net result here is a reduction in the amount of unexplained or seemingly stochastic variation in expression because sources of variation have been attributed to “surrogates” that span the same space as real batch variables. We also conducted principal component analysis in R before and after SVA to verify the removal of batch effects. Results Summary of tissue differentiation, treatment specificity, and selection in overall dataset To understand how treatment specificity of gene expression affects evolutionary rates of pro- teins, we queried the Sequence Read Archive for all A. thaliana RNA-seq experiments published before May 2022. We then annotated these experiments with standardized tissue and treatment on- tology terms, manually filtered the dataset, and then processed all RNA-seq runs with a standard- ized pipeline. The number of sequencing experiments associated with each combination of tissue and treatment labels is summarized in Tables S1. Overall, the most sampled tissue category was leaf (4,642 experiments) followed by root (3,348 experiments), whole plant (2,492 experiments), seed (1,866 experiments), shoot (1,106 experiments), then fruit and flower (266 experiments). The four most sampled treatment categories were control (5,701 experiments), cold air exposure (675 experiments), short day length (561 experiments), and short day length plus Botrytis cinerea expo- sure (407 experiments). Any sequencing runs that shared an SRA experiment ID were averaged to 15 produce individual gene expression values for each SRA experiment. We first looked at the distribution of mapping rates across all RNA-seq runs. The median mapping rate was 72.39 % (Figure A1) and we excluded runs with a mapping rate < 1% from further analyses. We next performed a principal components analysis (PCA) on the expression matrix and observed strong differentiation between root and non-root tissues along PC2 (Figure 1.1). We also observed that nearly all genes had some degree of treatment specificity in their expression (Figures 1.2A, S3). Furthermore, only a small proportion of genes had strong signatures of selection based on dN /dS, πN /πS, DoS, or Tajima’s D (Figure 1.2B-D, Figure A2). The treatment specificity of expression was lower on average in flower and fruit tissue compared to the other tissues (Figure A3). However, tissue specificity did not vary widely depending on the treatment condition (Figure A4). 16 Figure 1.1: Principal components analysis of all expression data. Each point represents a dif- ferent RNA-seq experiment and is colored by its associated tissue type. Experiments from all treatment conditions are included in this analysis. Plotting order was randomized to avoid over- plotting. 17 0100−100−50050PC1: 3.97 %PC2: 2.49 %Tissuefruit OR flowerleafrootseedshoot systemwhole plant Figure 1.2: Density plots of key variables measured in this study. (A) Distribution of treatment specificity in leaf tissue expression across all genes included in this study. The area underneath the curve in a given interval of treatment specificity represents the proportion of genes in this study that fall within that range of treatment specificity. (B) Distribution of dN /dS across all genes included in this study. The area to the right of the dashed line represents the proportion of genes in this study with dN /dS > 1. (C) Distribution of πN /πS across all genes included in this study. The area to the right of the dashed line represents the proportion of genes in this study with πN /πS > 1. (D) Distribution of DoS across all genes in this study. Area to the right of the dashed line represents the propotion of genes with DoS > 0, which is interpreted as evidence of adaptive evolution. 18 012340.40.60.81.0Treatment-specificity (leaf tissue)DensityA0.00.20.4-6-4-20log(dN/dS)DensityB0.00.10.20.30.4-505log(πN/πS)DensityC0123-1.0-0.50.00.5Direction of Selection (DoS)DensityD Omitting samples with low expression disentangles expression level and specificity Genes that are only expressed in one treatment or tissue will, by definition, have low mean expression across all environments or tissues (Wright et al., 2004). Thus, we sought a method of calculating expression level that was independent of treatment specificity. To better understand the relationship between average expression and treatment specificity, we calculated correlations between treatment-specificity and expression level while either including or excluding low expres- sion values (TPM < 5) on our real RNA-seq dataset. We found that excluding low expression values decreased the correlation between average expression and treatment-specificity in leaf tissue sam- ples (Figure 1.3) and other tissues (Figures A34 - A38) and replicated the result by simulating gene expression matrices (Figure A39). Thus, for all later partial correlation analyses (see next section) we quantified each gene’s average expression after dropping experiments where the gene was not expressed (TPM < 5). 19 Figure 1.3: Correlation between the average expression in transcripts per million (TPM) and treatment specificity of genes. Figures show correlations when samples with low expression (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from leaf tissue samples. Line is a smoothing line with 95 % confidence intervals and values in parentheses give spearman correlation. Treatment specificity correlates with levels of nonsynonymous diversity and divergence in genes We next calculated partial correlations between treatment specificity and measures of selection after controlling for average expression, gene length, GC content, and tissue specificity in expres- sion. These partial correlations were calculated separately for expression data on each tissue type and did not account for batch effects (see next section). Among leaf tissue samples, average ex- pression had significant partial correlations with dN (ρ = −0.19, p-value = 2.1 × 10−122) and πN (ρ = −0.17, p-value = 2.8 × 10−175) after controlling for other factors (Figures 1.4A,1.4B). Treatment specificity was more strongly correlated with dN (ρ = 0.10, p-value = 7.6 × 10−31) and πN (ρ = 0.10, p-value = 1.2 × 10−62) than Tajima’s D (ρ = 0.01, p-value = 3.1 × 10−7) and DoS (ρ = 0.04, p-value = 2.3 × 10−06, Figure 1.4C, 1.4D). Furthermore, the top 25% most 20 −10−505100.40.60.81.0treatment specificity of expressionlog(average expression in TPM)leaf (−0.63)A2.55.07.510.00.40.60.81.0treatment specificity of expressionlog(average expression in TPM)leaf (0.16)B treatment-specific genes in leaf tissue for our dataset have average dN and πN values nearly 2.5 times greater than the 25% least treatment-specific genes (dN = 0.025 vs 0.061; πN = 0.0014 vs 0.0032). Meanwhile, the most and least treatment-specific genes have average Tajima’s D values of are -0.44 and -0.43, respectively, and average DoS values of -0.19 and -0.14, respectively. The strongest partial correlation generally occurred between tissue specificity and treatment specificity (Spearman’s ρ = 0.53 − 0.60, Figure 1.4). Gene family size had among the weakest partial cor- relations with πN compared to other covariates, but strongly correlated with treatment specificity (ρ = 0.12, p-value = 6.3 × 10−84, Figure 1.4B). All of these findings generally held when average expression and treatment specificity were calculated on data from other tissues (Table S2, Figures A6-A10). 21 Figure 1.4: Partial correlation analysis between specificity and selection. Analyses include either (A) dN , (B) πN , (C) Tajima’s D, or (D) direction of selection (DoS) as a covariate. Average expression excludes values < 5 TPM and was calculated using only leaf tissue samples. Treatment specificity was also calculated using only leaf tissue samples. Tissue specificity was calculated using only control samples across all tissue categories. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 22 -0.40.00.4Partialspearmancorrelationcoefficient0.170.11-0.020.10.050.53-0.190.02-0.10.09-0.030.08-0.12-0.01-0.15-0.18-0.14-0.15-0.15-0.35-0.12dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 14629A0.540.14-0.080.1-0.020.58-0.170.1-0.150.11-0.02-0.02-0.10.01-0.13-0.130.08-0.16-0.21-0.34-0.160.020.060.050.120.070.030.11πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 26497B0.010.030.6-0.01-0.180.09-0.02-0.10.01-0.130.01-0.18-0.23-0.33-0.160.050.050.120.070.030.11Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 26503C0.050.040.55-0.04-0.130.07-0.05-0.12-0.01-0.15-0.06-0.18-0.18-0.33-0.14Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 14573D Correlations between treatment specificity and nonsynonymous variation persist after controlling for batch effects and dataset imbalance While combining gene expression data across multiple studies can increase the statistical power of an analysis, there are some potential concerns. First, if many tissue-treatment combinations are not sampled, the dataset will be unbalanced and the effects of tissue and treatment variation on expression could be confounded. Consistent with this expectation, there was a high correlation between tissue specificity and treatment specificity in our initial analyses (Figure 1.4, S6-S10). Furthermore, combining data from multiple laboratories could generate batch effects (Leek et al., 2010). To address the issues of imbalance and batch effects, we first subset our data to only include treatments where all tissue types were represented. This subset included the treatments of control, abscisic acid, continuous light, warm/hot air temperature, and cold air temperature. We then used SVA to correct for the influence of unknown batch effects on this data subset (Leek and Storey, 2007). After SVA, treatment specificity positively correlated with dN (ρ = 0.10, p-value = 1.6 × 10−32) and πN (ρ = 0.07, p-value = 1.5×10−23) when average expression and treatment specificity were calculated on combined fruit and flower data (Figures A33). However, treatment specificity in other tissue types generally did not correlate with our measures of selection (Figures A28-S33, Table S4). The inclusion of only five treatments in the above analysis could limit quantification of a gene’s treatment specificity. Thus, in order to include data from a larger number of treatments, avoid dataset imbalance, and avoid batch effects, we split our expression matrix into six subsets by tissue category. We then further removed treatments that only had expression data from one study to avoid confounding treatment effects with study-specific batch effects. We applied SVA (Leek and Storey, 2007) to each of these tissue-specific subsets. After SVA, the expression profiles of most genes appear less treatment-specific (Figures A16-A21 panels A vs B). We also observed less separation in PCA space within treatment groups after SVA (for example, see Figures A16C and A16D). Average expression levels before SVA were generally correlated with expression levels after SVA (Figures A16-A21 panels A and B). In partial correlations on each SVA-corrected subset, treatment specificity significantly correlated with dN (ρ = 0.13, p-value = 6.9 × 10−50) and πN (ρ = 0.16, p- 23 value = 3.9 × 10−128) but less strongly correlated with Tajima’s D (ρ = 0.04, p-value = 6.6 × 10−10) and DoS (ρ = 0.05, p-value = 2.0 × 10−8) for the leaf tissue data subset (Table 1, Figures 1.5). These patterns were similar in other tissue types (Figures A11-A15, Table S3). Pre/post-SVA Selection measure Pre Post Pre Post Pre Post Pre Post dN dN πN πN Tajima’s D Tajima’s D DoS DoS Partial correlation: selection vs treatment- specificity 0.10 0.13 0.10 0.16 0.03 0.04 0.04 0.05 p-value 7.6 × 10−31 6.9 × 10−50 1.2 × 10−62 3.9 × 10−128 3.1 × 10−7 6.6 × 10−10 2.3 × 10−6 2.0 × 10−8 Table 1: Partial correlations between treatment-specificity and different measures of selection pre-SVA and post-SVA. All correlation coefficients are spearman coefficients and are calculated only on leaf tissue samples. All p-values represent whether correlation coefficient significantly differs from 0. 24 Figure 1.5: Partial correlations between specificity and selection after SVA. Analyses include (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on leaf tissue data subset after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. Average expression calculation excludes values < 5 TPM. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 25 -0.40.00.4Partialspearmancorrelationcoefficient0.160.130.03-0.270.02-0.05-0.050.08-0.1-0.14-0.2-0.17-0.25-0.28-0.08dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 13133A0.530.16-0.06-0.260.15-0.05-0.04-0.01-0.07-0.12-0.150.09-0.34-0.24-0.1-0.010.070.10.040.010.07πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 22459B0.040-0.09-0.03-0.08-0.120.01-0.37-0.22-0.10.050.10.040.010.08Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 22461C0.05-0.07-0.09-0.05-0.1-0.14-0.06-0.3-0.25-0.09Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 13087D Discussion Our main finding is that genes with more treatment-specific expression patterns are, on aver- age, under weaker selective constraint in A. thaliana. This is evident by treatment-specific genes generally having higher values of πN and dN , but not higher values of Tajima’s D and DoS, com- pared to genes with more constitutive expression (Figures 1.4,1.5). Our result does not refute the possibility of strong positive selection on treatment-specific genes, as is the case for nucleotide binding site leucine rich repeat proteins (NBS-LRRs) in A. thaliana (Mondragón-Palomino et al., 2002). Rather, treatment-specific genes are simply under weaker selection on average compared to less treatment-specific genes. Altogether, this pattern is consistent with the hypothesis that a trade- off between the strength of selection and the treatment specificity of expression helps maintain variation in plasticity for A. thaliana (Snell-Rood et al., 2010; Van Dyken and Wade, 2010). There are a few ways to think about the biological relevance of the correlations of treatment specificity with πN and dN . First, the magnitude of treatment specificity’s correlation with πN and dN was generally half the magnitude of average expression’s correlation with πN and dN and similar to tissue specificity’s correlation with πN and dN . Both tissue specificity and average ex- pression are thought to be important determinants of protein evolution (Bush et al., 2015; Wu et al., 2022), suggesting the comparable effects of treatment specificity may be important too. Second, the effect of treatment specificity on πN and dN persisted even after simultaneously controlling for expression level, tissue specificity, gene length, GC content, and batch effects. Finally, the top 25% most treatment-specific genes in our dataset have average dN and πN values nearly 2.5 times greater than the 25% least treatment-specific genes (dN = 0.025 vs 0.061; πN = 0.0014 vs 0.0032), but relatively similar Tajima’s D and DoS values (Tajima’s D = -0.44 vs -0.43; DoS = -0.19 vs -0.14). These observations together suggest that treatment specificity is an important determinant of protein evolution. This study disentangles several processes that were often difficult to resolve in previous re- search. First, many previous studies focus mainly on explaining trends in dN /dS (Slotte et al., 2011; Gaut et al., 2011; Bush et al., 2015), but both relaxed negative selection and increased posi- 26 tive selection can lead to increases in dN /dS. To tease apart these two processes, we additionally investigated treatment specificity’s relationship with Tajima’s D and DoS. Treatment specificity’s weaker correlation with Tajima’s D and DoS, compared to dN and πN , suggests that relaxed nega- tive selection plays a larger role than increased positive selection in explaining the high evolution- ary rates of treatment-specific genes. Furthermore, measures of expression specificity are often highly correlated with expression level (Slotte et al., 2011; Alvarez-Ponce and Fares, 2012; Huang, 2022). When calculating a gene’s expression level, we only included samples where said gene was expressed (TPM > 5) to get an estimate of expression level that was still correlated with dN and πN , but was independent of expression specificity, allowing us to better disentangle these factors. Finally, previous studies have struggled to partition the factors that influence selection on genes in the presence of predictor variables with considerable error, such as expression level (Drummond et al., 2006; Plotkin and Fraser, 2007; Yang and Gaut, 2011). Error in expression measurements can often be attributed to unmeasured differences between RNA-sequencing experiments (Leek et al., 2010) and we accounted for these differences using SVA (Leek and Storey, 2007). Even after SVA, treatment specificity was strongly correlated with dN and πN (Figures 1.5A-B), suggesting our re- sults are not an artifact of errors in expression measurement or combining expression data across many studies. Surprisingly, nearly all genes in A. thaliana have some degree of treatment specificity in their expression (Figures 1.2A, A3), reflecting results of previous studies on tissue specificity (Eisen- berg and Levanon, 2003). The high prevalence of treatment specificity in our dataset is partly explained by batch effects because SVA significantly lowered the apparent treatment specificity of most genes (Figures A16B-A21B) and reduced within-treatment differentiation in PCA space (for example, see Figures A16C and A16D). This reduction in treatment-specificity likely happened because batch effects can include unrecorded between-treatment differences (e.g. the humidity of the growth chamber, light intensity, watering schedule, etc.). Controlling for these unrecorded between-treatment differences thus causes the expression of genes to be less treatment-specific. However, even after batch correction most genes still showed some degree of treatment specificity 27 (Figures A16B-A21B), suggesting it is rare for a gene to be expressed at the same level across many environments. We also observed that genes with higher treatment specificity generally belonged to larger gene families. We expected gene family size to correlate with selection because singleton and duplicated genes often evolve at different rates (Jordan et al., 2004; Davis and Petrov, 2004). Theory also suggests that gene duplication leads to relaxation of selection on duplicates, allowing for neo- and sub-functionalization (Lynch and Conery, 2000; Aagaard et al., 2006). We could not investigate how gene family size correlates with dN or DoS because measuring these quantities requires iden- tifying substitutions between orthologous genes. Thus, dN and DoS can only be reliably measured for 1:1 orthologs between A. thaliana and A. lyrata. However, πN and Tajima’s D can be calcu- lated for genes in larger families and we did observe persistent correlations between family size and Tajima’s D (For Figure 1.5C: ρ = 0.05, p-value = 3.1 × 10−12; also see Figures A6C-A15C, A28C-A33C). Altogether, these correlations suggest that processes of gene duplication, neofunc- tionalization, and subfunctionalization could be connected to evolving some degree of treatment specificity. Gene length was generally the second most correlated factor with dN and πN in our study, just behind average expression. This is consistent with previous work suggesting that longer proteins require more energy to synthesize and are thus under stronger selective constraints (Urrutia and Hurst, 2001; Castillo-Davis et al., 2002; Eisenberg and Levanon, 2003; Urrutia and Hurst, 2003). However, while some previous studies in A. thaliana observe this same trend (Bush et al., 2015), others do not (Slotte et al., 2011). This discrepancy could be due to differences in how gene length is defined between studies. In this study, each gene’s length included coding sequence as well as introns and untranslated regions, whereas other studies break down gene length into individual features (Bush et al., 2015). The goal of this study was not to understand differences in evolution between different gene features, so we included all gene features in our estimate of gene length. However, introns and untranslated regions experience different evolutionary patterns than coding sequences; for example, highly expressed genes being under selection for shorter introns (Castillo- 28 Davis et al., 2002; Eisenberg and Levanon, 2003). Therefore, future studies must clearly define even seemingly simple features like gene length to ensure that results are comparable across studies. Although we focused on testing the idea that treatment specificity is responsible for relaxed negative selection in some genes, it is also possible that relaxed selection caused the evolution of treatment specificity. There is some evidence that relaxation of selection occurs before the evolu- tion of expression specificity (Hunt et al., 2011) and may better explain cases of neo- and subfunc- tionalization (Lynch and Conery, 2000; Aagaard et al., 2006). Future experiments that look at the evolution of treatment specificity and sequence evolution across a broader phylogenetic scale may be helpful for determining the order of these processes. In summary, this study investigates a trade-off between the treatment-specific expression of a gene and the strength of selection said gene experiences, which is hypothesized to limit plasticity evolution. Consistent with this hypothesis, genes in A. thaliana with more treatment-specific ex- pression are under weaker selection compared to more evenly expressed genes. While we find that this trade-off exists, we could not dissect the direction of causality in the trade-off or determine how much this trade-off constrains plasticity evolution relative to other processes. However, these are exciting areas of future research. Future studies should ideally generate fully balanced datasets on gene expression acquired across natural environmental gradients. Taking these steps will contribute to a comprehensive understanding of the constraints on plasticity and protein evolution. Data availability Supplemental figures are available in Appendix A. All code for our bioinformatic workflows, data analysis, and figure creation can be found here: https://github.com/milesroberts-123/arabidopsis- conditional-expression. All supplementary tables are available online at: https://doi.org/10.1093/ genetics/iyad074. The tissue type and treatment annotations for RNA-seq runs in our study can be found in Table S5. Genomic references, mapping rates, and a table of expression specificity; nucleotide diversity; and substitution rate values estimated for all A. thaliana genes included in this manuscript’s analyses is available at: https://doi.org/10.5061/dryad.xd2547dnd. The genome assembly and annotation used in this study was originally downloaded from Phytozome: https: 29 //phytozome-next.jgi.doe.gov/. 30 CHAPTER 2: k-MER-BASED DIVERSITY SCALES WITH POPULATION SIZE PROXIES MORE THAN NUCLEOTIDE DIVERSITY IN A META-ANALYSIS OF 98 PLANT SPECIES2 Abstract A key prediction of neutral theory is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. As a first step to testing this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individuals. After excluding 14 species with low coverage or no variant sites called, we compared how different diversity metrics correlated with proxies of population size that account for both range size and population density variation across species. We found that our popula- tion size proxies scaled anywhere from about 3 to over 20 times faster with k-mer diversity than nucleotide diversity after adjusting for evolutionary history, mating system, life cycle habit, cul- tivation status, and invasiveness. The relationship between k-mer diversity and population size proxies also remains significant after correcting for genome size, whereas the analogous relation- ship for nucleotide diversity does not. These results are consistent with the possibility that variation not captured by common SNP-based analyses explains part of Lewontin’s paradox in plants, but larger scale pangenomic studies are needed to definitively address this question. Introduction Understanding the determinants of genetic diversity within populations is key to informing species conservation (Cole, 2003) and breeding efforts (Sanchez et al., 2023). However, most species have far less genetic diversity (commonly estimated as pairwise nucleotide diversity, π) than expected (Frankham, 2012; Corbett-Detig et al., 2015; Buffalo, 2021). If we assume that 2This chapter is, as of writing, accepted for publication at Evolution Letters. The preprint version is currently available at the following DOI under a CC-BY License: https://doi.org/10.1101/2024.05.17.594778 31 the vast majority of genetic variants are neutral, then the determinants of genetic diversity are en- capsulated in neutral theory (Kimura, 1983): E[π] ≈ 4Neµ, where E[π] is the expected level of genetic diversity, Ne is the effective size of a population, and µ is the mutation rate per base pair per generation. Mutations rates for SNPs and small indels vary relatively little across species (Cagan et al., 2022; Bergeron et al., 2023; Quiroz et al., 2023), while the total number of individuals in a species varies massively (Buffalo, 2021). Thus, under neutral theory, population size should be a strong determinant of genetic diversity and species with larger population sizes should be more diverse. However, even some of the most abundant species studied to date have low genetic diver- sity compared to neutral theory expectations. For example, Drosophila simulans has an estimated population size > 1014 and a diversity of π ≈ 0.01, but an expected diversity of π > 0.1 (Buffalo, 2021). This mismatch between expected and observed levels of neutral diversity across populations of varying size is known as Lewontin’s paradox, named after Richard Lewontin who first described the phenomenon (Lewontin, 1974). The potential mechanisms underlying Lewontin’s paradox have been reviewed extensively (Leffler et al., 2012; Slotte, 2014; Ellegren and Galtier, 2016; Charlesworth and Jensen, 2022). Mul- tiple selective and demographic processes likely contribute to Lewontin’s paradox; however, deter- mining the relative importance of these processes remains a contentious area of research. The two most explored mechanisms are historic population size changes (i.e. demography, Charlesworth and Jensen (2022)) and linked selection - whereby fixation or purging of selected alleles causes the loss of linked neutral alleles (Kojima and Schaffer, 1964, 1967; Smith and Haigh, 1974; Charlesworth et al., 1993; Charlesworth, 1994). Linked selection is expected to reduce diversity more in regions of lower recombination and higher functional density (Slotte, 2014) and many studies have tested this hypothesis(Tenaillon et al., 2001; Hellmann et al., 2003; Nordborg et al., 2005; Roselius et al., 2005; Branca et al., 2011; Paape et al., 2012; Corbett-Detig et al., 2015; Silva-Junior and Grat- tapaglia, 2015; Wang et al., 2016; Phung et al., 2016; Mackintosh et al., 2019). These previous investigations often conclude that linked selection contributes to Lewontin’s paradox, but not all report significant results (e.g. Schmid et al. (2005); Roselius et al. (2005); Flowers et al. (2012); 32 Wang et al. (2016)). It’s been argued that studies focused on plant species especially tend to find weaker evidence for linked selection (Slotte, 2014). There is also both empirical and theoretical evidence that linked selection is unlikely to explain the entirety of Lewontin’s paradox, suggesting that demographic factors play an important role too (Coop, 2016; Buffalo, 2021; Charlesworth and Jensen, 2022). There are three main types of demographic changes proposed to contribute to Lewontin’s para- dox: contractions, expansions, and cyclical population size changes (Charlesworth and Jensen, 2022). Population contractions cause loss of diversity. Thus, if many species’ populations recently contracted (due to human activity, for example), then their contemporary diversity would be much lower than expected from their pre-contraction population sizes (Exposito-Alonso et al., 2022). Recent population expansions could cause a similar mismatch. Because it takes many generations for populations to accumulate diversity compared to the timescale of typical expansions, contem- porary diversity levels for an expanded population would be much smaller than expected from a post-expansion population size (Peart et al., 2020; Charlesworth and Jensen, 2022). For a similar reason, species that have seasonal variation in their population sizes will also tend to have diversity levels closer to what one would expect based on their minimum size rather than their peak size (Wright, 1940). Studies investigating Lewontin’s paradox would ideally try to jointly infer these demographic histories alongside selective factors in natural populations. However, issues of model complexity and identifiability often prevent such joint estimation (Johri et al., 2020, 2022b,a), sug- gesting further explorations of Lewontin’s paradox will require new approaches. Two potential, but rarely explored, contributors to Lewontin’s paradox are that (1) current meth- ods for estimating genetic diversity systematically underestimate the true levels of genetic diver- sity in most populations and (2) changes or corrections to the calculation of diversity could yield stronger correlations with population size. Lewontin’s original observations and earlier studies on the population size-diversity relationship were based on allozymes, which detect variants in pro- tein sequences (Lewontin, 1974; Nei and Graur, 1984). More recent studies measure diversity using SNPs at more neutral four-fold degenerate sites (i.e. sites where mutations do not affect protein 33 sequences) in DNA and generally observe greater within-species diversity and between-species divergence compared to allozymes (Li and Sadler, 1991; Makałowski and Boguski, 1998; Bazin et al., 2006; Piganeau and Eyre-Walker, 2009). However, current SNP-based methods are not per- fect either and there is significant evidence that SNPs capture a biased and incomplete picture of genetic diversity. Calling SNPs typically requires aligning reads to a reference genome, meaning any SNPs in regions that are not present or highly diverged from the reference genome will be excluded from analysis and thus downwardly bias diversity estimates (Golicz et al., 2020; Buffalo, 2021). This downward bias is typically assumed to have little effect on the qualitative relationship between diversity and Ne (Buffalo, 2021), but recent pangenomic studies have uncovered troves of non-reference variation across a variety of species (Ebler et al. (2022); Rice et al. (2023), re- viewed in Bayer et al. (2020)). Sometimes alignment errors in regions of non-reference structural variation will also create false positive SNP calls, which will also affect diversity estimates (e.g. Jaegle et al. (2023)). Finally, previous meta-analyses of population size and diversity data rely on scraping diversity estimates from previously published studies (Frankham (2012); Buffalo (2021), except see Corbett-Detig et al. (2015)). However, many studies report inaccurate SNP calls and deflated diversity estimates due to errors in the handling of missing genotype calls (Korunes and Samuk, 2021; Schmidt et al., 2021; Sopniewski and Catullo, 2024) and may filter genotype calls differently, making comparisons across species difficult. Overall, errors in diversity calculations and omission of diversity in genomic regions that are either difficult or impossible to align to could partially explain Lewontin’s paradox. Re-analyzing whole genome sequencing data with a common pipeline and applying correct calculations of nucleotide diversity would make diversity estimates across species more comparable and easier to interpret (Buffalo, 2021; Mirchandani et al., 2024). One useful pangenomics tool for measuring non-reference variation that is readily applicable to common short-read datasets is the k-mer. k-mers are subsequences of length k derived from a larger sequence and they have a long history of use in computer science (Shannon, 1948), genome assembly (Turner et al., 2018), metagenomics (Benoit et al., 2016), and quantitative genetics (Rah- man et al., 2018; Voichek and Weigel, 2020; Kim et al., 2020; Mehrab et al., 2021). Recent studies 34 have also demonstrated the utility of k-mers for measuring heterozygosity and genetic differences between individuals (commonly referred to as “dissimilarity” measures, Ondov et al. (2016); Vur- ture et al. (2017); Ranallo-Benavidez et al. (2020); VanWallendael and Alvarez (2022); Roberts et al. (2024)). Typical analysis of k-mers involves only counting the presence/absence and/or fre- quencies of all k-mers in a set of reads, without aligning the reads to any reference, then deriving measures of genetic difference from such counts (Benoit et al., 2016). Avoiding alignment allows one to incorporate sequences that would otherwise be omitted for lack of alignment to a reference genome (Rahman et al., 2018; Voichek and Weigel, 2020; Wiersma et al., 2024). We revisited Lewontin’s paradox in plants using k-mer-based measures of genetic difference and corrected π calculations, aiming to test whether the inclusion of non-reference variation or modifications to diversity calculations could increase the scaling between population size proxies and diversity. We compared how k-mer dissimilarity and typical SNP-based estimates of nucleotide diversity correlated with population size proxies across a large panel of plant species - all processed through the same bioinformatic pipeline. Our expectation was that if k-mers are better at capturing genomic variation than SNPs, k-mer dissimilarity would scale more rapidly with population size compared to nucleotide diversity. Methods Our entire analysis is packaged as a snakemake workflow stored here: https://github.com/milesroberts- 123/tajimasDacrossSpecies. This workflow includes the code to reproduce all of the steps individ- ually explained below, along with instructions on how to run the code, and yaml files describing the exact configurations of software we used at each step. It also includes an example directed acyclic graph showing the order of steps a typical sample is processed through. The code detailing all initial, exploratory, and confirmatory data analyses as well as figure creation can be found as an R-markdown file in the github repository. The parameters for each software were kept constant across all datasets (except occasionally for the “–ploidy” parameter in GATK HaplotypeCaller) to ensure that variation in bioinformatic processing did not bias our results. All statistical analyses used R v4.2.2 (R Core Team, 2022) and all color palettes used in figure creation come from the 35 scico R package (Pedersen and Crameri, 2023) to ensure color-blind accessibility. Population-level sequencing data collection We started by building a list of species with high quality, publicly available reference genomes as well as population-level sequencing data. The source for the genome assembly and annotation used for each species in this study is listed in Table S1. We first downloaded all genomes in Phy- tozome (https://phytozome-next.jgi.doe.gov/) with unrestricted data usage. We then downloaded all genomes for species from Ensembl plants (https://plants.ensembl.org/index.html) that were not already represented in Phytozome. Next, we downloaded genomes for additional species from the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/) that were not already present in either Phytozome or Ensembl and met all of the following criteria: • matched filters: eukaryotic, plants, land plants, and exclude partial • included assemblies of nuclear DNA (i.e. not just plastid genomes) • included annotations of coding sequences We also downloaded a genome for Nicotiana tabaccum from the Sol genomics network (https: //solgenomics.net/). Finally, we omitted 8 species (Aegilops tauschii, Hordeum vulgare, Lens culi- naris, Pisum sativum, Thinopyrum intermedium, Trifolium pratense, Triticum aestivum, Triticum turgidum) that had at least one chromosome longer than 229 bp (about 537 Mb) from all down- stream analyses because tabix indexing, which is often utilized for SNP-calling pipelines, does not support chromosomes exceeding this length. In the end, we were left with genome assemblies and annotations for 112 plant species (see Table S1). Note that, similar to previous studies (Corbett-Detig et al., 2015; Buffalo, 2021), many of the plant species in this set of 112 are domesticated (see Table S1). This means that many of the species in our dataset have likely undergone recent demographic changes. However correctly accounting for demography is a general limitation of Lewontin’s paradox studies, as we can only estimate proxies of Ne for contemporary populations (Corbett-Detig et al., 2015; Buffalo, 2021). We include cultivation status in our downstream modeling to help account for systematic differences between cultivated and wild species (see Statistical analysis) and also repeat our analysis with multiple 36 types of population size proxies. For each species with a reference genome, we searched for DNA-seq runs in the National Center for Biotechnology Information’s Sequence Read Archive (SRA) with a name in the organism field that matched the species name (e.g. search for Arabidopsis lyrata[Organism] to get Arabidopsis lyrata runs). We downloaded the run info for each search and found the study with most sequenced individuals for inclusion in our analysis. Most datasets came from individual studies, with the exception of Zea mays, which included several studies described in Bukowski et al. (2017). The datasets used for each species are listed in Table S1. We limited the size of each species’ dataset to no more than 7.5 × 1012 bp and no more than 1200 individuals because this defined the amount of data our workflow could process without the peak memory limit exceeding 50 TB and the time limit for genotype calling exceeding 7 days. If a species’ dataset exceeded either 1200 individuals or 7.5 × 1012 bp, we randomly dropped one individual at a time until both of these limits were satisfied. We downloaded the SRA runs associated with each individual using the SRA toolkit (v2.10.7), then trimmed low-quality base calls with fastp (v0.23.1, Chen et al. (2018)), requiring a minimum quality score of 20 and a minimum read length of 30 base pairs. For each species, we summarized the results of fastp trimming using multiqc (v1.18, Ewels et al. (2016)). After trimming, any fastq files that were technical replicates of the same individual were concatenated. Concatenated fastq files were then processed through two different workflows: SNP-calling and k-mer counting. SNP calling pipeline We aligned sequencing reads for each individual to their respective reference genome using BWA MEM (v0.7.17, Li and Durbin (2009); Li (2013)), sorted the resulting BAM files with sam- tools (v1.11, Danecek et al. (2021)), and marked optical duplicates with picardtools (picard-slim v2.22.1, Institute (2019)). Next, we called SNPs with GATK HaplotypeCaller (v4.1.4.1, McKenna et al. (2010); Poplin et al. (2018)). We varied the –ploidy parameter for HaplotypeCaller be- tween species depending on the actual ploidy recorded in the literature and whether individual subgenome assemblies were available. However, the vast majority of species in our dataset had 37 a –ploidy paramter of 2. We restricted genotype calling to only 4-fold degenerate sites within the nuclear genome and included all nuclear genome scaffolds, as identified by degenotate (v1.1.3, Mirchandani et al. (2024)), to focus solely on neutral diversity. Notably, we chose to not call SNPs in non-coding sequences in plants because plant genomes tend to be highly repetitive and are largely composed of transposable elements, making short read mapping difficult For example, even in the small-genome plant species Capsella grandiflora, SNPs can only be confidently called in 10% of intergenic sites (Williamson et al., 2014). Runs for each species were then combined with GATK GenomicsDBImport, then genotyped with GATK GenotypeGVCFs, including invari- ant sites as done in Korunes and Samuk (2021). Variant and invariant sites were separated with bcftools (v1.17, Danecek et al. (2021)) and then filtered separately, as recommended by Korunes and Samuk (2021). Variant sites were removed from our analyses if they met at least one of the following criteria: number of alleles > 2, indel status = TRUE, fraction of missing genotypes > 0.2, QD < 2.0, QUAL < 30.0, MQ < 40.00, FS > 60.0, HaplotypeScore > 13.0, MQRankSum < -12.5, and ReadPosRankSum < -8.0. In short, these filters reflect the standard population genetics practices of removing non-biallelic SNPs and SNPs with a lot of missing genotype calls. The exact filter cutoffs also come from GATK Best Practices for hard filtering (Caetano-Anolles, 2023) that is widely used (e.g. (Yu et al., 2024; Ramirez-Ramirez et al., 2024; Glasenapp and Pogson, 2024; Ritter et al., 2024)). For each species, we also required that each variant site have a minimum read depth of 5, but no more than 3 times the genome-wide average read depth at variant sites for that species. We chose a minimum read depth of 5 to maintain comparability with our k-mer dissimi- larity calculations which required minimum k-mer counts of 5 (see k-mer counting), but we also filtered out calls with very high read depths because these could represent read mismapping due to repetitive or paralogous regions. Meanwhile, invariant sites were removed from our analyses if they met at least one of the following criteria: QUAL > 100.0, read depth ≤ 5, or read depth ≥ 3 times the genome-wide average read depth at invariant sites for that species. Finally, invariant and variant sites were concatenated into a single VCF file per scaffold using bcftools. For Brassica na- pus and Miscanthus sinensis, scaffolds named “LK032656” (195,249 bp, 0.023 % of the genome) 38 and “scaffold04645” (2,838 bp, 0.000136 % of the genome), respectively, were omitted from our analyses because an error in SLURM job cancellation caused snakemake to prematurely delete intermediate files for these scaffolds. It is worth noting that different choices of genotype callers and filtering parameters could lead to different estimates of nucleotide diversity. However, our workflow is representative of SNP calling workflows used in many published population genetic analyses. Using the SNP genotypes called from our pipeline, we then calculated genome-wide average nucleotide diversity at four-fold degenerate sites (¯π) using the filtered set of variant and invariant sites. To do this, we first calculated heterozygosity at each four-fold degenerate site (i) according to Hahn (2018): ( ni ni − 1 πi = ) ( ) 1 − ai∑ j=1 p2 ij (2.1) where ni is the number of sequenced chromosomes with non-missing genotypes for site i, ai is the number of alleles for site i, and pij is the frequency of the jth allele at site i. For each invariant site, the equation reduces to πi = 0 because pi1 = 1 and ai = 1. To get ¯π, we then calculated the average value of πi across all M sites in the genome (including both variant and invariant sites): ∑ M i=1 πi M ¯π = (2.2) Importantly, this calculation of nucleotide diversity does not assume that all missing genotype calls are homozygous for the reference genotype, meaning that our estimates of π will not be down- wardly biased by missing data (Korunes and Samuk, 2021). k-mer counting pipeline For each species, we counted 30-mers in the coding sequences of their respective reference genome using KMC (v3.2.1, Kokot et al. (2017)). We chose to count k-mers of 30 base pairs (i.e. 30-mers) for all species in our dataset because previous k-mer-based analyses in plants typically analyzed k-mers in the range of 20 - 40 base pairs (Voichek and Weigel, 2020; Kim et al., 2020; 39 VanWallendael and Alvarez, 2022; Ruperao et al., 2023) and because k-mers in this range can be reliably sequenced with short reads while capturing the majority of unique genomic sequences (Shajii et al., 2016; Ondov et al., 2016; Roberts et al., 2024). Next, we removed any 30-mers that matched 30-mers found in each species’ corresponding set of coding sequences. This step intended to focus our k-mers down to a set that is evolving more neutrally on average, analogously to how we focused on only 4-fold degenerate SNPs in our SNP-calling pipeline. The justification for this approach is that non-coding sequences generally have weaker signals of interspecies conservation compared to coding sequences (Woolfe et al., 2005; Siepel et al., 2005; Johnsson et al., 2014). Although, similarly to 4-fold degenerate sites, many studies have observed non-coding sequences under selective constraints (Margulies et al., 2003; Guo et al., 2007). Thus, similar to the com- mon analysis of 4-fold degenerate sites, our k-mer analysis is limited by an inability to completely remove the effects of selection on sequence diversity. Although comparing our k-mer and nucleotide diversity metrics will be affected by differences between coding and non-coding sequences, many previous studies found that non-coding regions and 4-fold degenerate sites have very similar levels of diversity (Moriyama and Powell, 1996; Makałowski and Boguski, 1998; Halushka et al., 1999; Zwick et al., 2000; Tenaillon et al., 2001; Nordborg et al., 2005; Branca et al., 2011; Williamson et al., 2014; Wang et al., 2016; Phung et al., 2016; Mattila et al., 2017). Previous investigations of Lewontin’s paradox also found that diversity levels across species vary much more than diversity levels across different categories of putatively neutral sequences (Leffler et al., 2012; Buffalo, 2021) and subsequently pooled estimates of neutral diversity across different categories of sites. Our analysis choices here are thus in line with previous studies; however, larger scale pangenomic analyses will be helpful in relaxing this assumption. For most species in this study, we identified hundreds of millions of unique 30-mers. It would be computationally expensive to analyze all the 30-mers for every species. However, previous studies have shown that one can randomly downsample k-mer sets with very minimal effects on measures of genomic dissimilarity (Fofanov et al., 2004; Benoit et al., 2020; Roberts et al., 2024). Thus, we randomly downsampled each species’ 30-mer list to 10 million 30-mers with a frequency 40 ≥ 5 in at least one sample in the species’ 30-mer list. We chose to downsample to 10 million 30- mers to decrease disk space burden of storing k-mer counts and because several previous studies show that subsets of only 1 million k-mers or less can reliably estimate genetic dissimilarity in many systems (Ondov et al., 2016; Benoit et al., 2020; VanWallendael and Alvarez, 2022). The reason we also included a frequency cut-off of 5 is to omit k-mers containing sequencing errors, which predominantly occur as very low-frequency k-mers (Ranallo-Benavidez et al., 2020). We chose a frequency cutoff of 5 to to include more k-mers in some of our lower coverage datasets, but frequency cut-offs anywhere from 2-10 are used in plants (Voichek and Weigel, 2020; VanWallen- dael and Alvarez, 2022). We then joined the subset k-mer counts for each individual into a single matrix for each species. We used this k-mer frequency matrix to measure genetic distance in two ways. First, we calculated Jaccard dissimilarity (JD, Ondov et al. (2016)) between each pair of individuals in a species’ dataset as: JD(X, Y ) = 1 − X ∩ Y X ∪ Y (2.3) where X and Y represent sets of unique k-mers identified as present in two different read sets. A k-mer is defined as present if it’s frequency in a sample is ≥ 5. To get the genome-wide average Jaccard dissimilarity ( ¯JD), we took the average of all the pairwise Jaccard dissimilarities. Jaccard dissimilarity is likely the most commonly used k-mer-based diversity measure (Ondov et al., 2016). However, whether a k-mer reaches the frequency threshold needed to be identified as preset in a sample depends on the sequencing depth for the sample (VanWallendael and Alvarez, 2022). Thus, we also calculated Bray-Curtis dissimilarity (BD) between each pair of individuals in a species’ dataset as: BD(X, Y ) = 1 − 2 ∑ i min(m∗ k ∑ i m∗ k i (X), m∗ i (Y )) i (X) + m∗ i (Y ) (2.4) where m∗ i (X) gives the normalized frequency of k-mer i in genome X. The normalized frequencies are calculated by taking each frequency mi(X) and dividing it by the sum of the raw frequencies 41 as in Dubinkina et al. (2016): m∗ i (X) = mi(X)∑ i mi(X) (2.5) This step accounts for variation in coverage between samples on k-mer frequency. To get the genome-wide average Bray-Curtis dissimilarity ( ¯BD), we again took the average of all the pairwise Bray-Curtis dissimilarities. Note that both Jaccard and Bray-Curtis dissimilarity are scaled in their denominators by either the total number of unique k-mers or total number of k-mers respectively, analogous to how nucleotide diversity is scaled by the number of sites included in the calculation. Population size proxies Following similar methods to Corbett-Detig et al. (2015) and Buffalo (2021), we defined cur- rent census population size (N ) as the product of species range size (R) in square kilometers and population density (D) in individuals per square kilometer: N = RD (2.6) Estimation of both R and D are handled separately below. Importantly, these methods have the same drawback as described in Corbett-Detig et al. (2015) and Buffalo (2021): contemporary estimates of R and D do not necessarily reflect the historical values of R and D. However, since nearly all the species in this study lack long-term historical data on their population size, it is not currently possible to estimate long-term historical N without making strong assumptions. Range size estimation from GBIF occurrence data We first estimated range size based on Global Biodiversity Information Facility (GBIF) oc- currence data from the rgbif package (Chamberlain and Boettiger, 2017). For each species, we identified its GBIF taxon key(s). If the species is domesticated, we used the taxon key(s) for a wild relative with an overlapping range when possible. We then downloaded all records associated with each taxon key that had an occurrence status of “PRESENT”, had coordinates that mapped to land, had any basis of record other than “FOSSIL SPECIMEN”, and recorded anywhere in a year ≥ 1943 42 and ≤ 2023. In addition, the records could not have any GBIF issue codes, except for the issue codes listed in Appendix C. Similar to previous studies (Corbett-Detig et al., 2015; Buffalo, 2021), we estimated range size for domesticated species using GBIF occurrences from closely-related wild relatives because it is difficult to distinguish the native and introduced ranges of globally cultivated crop species with only occurrence data. Note, however, that we also used an additional method for estimating range size that is not burdened by this same assumption (see Range size estimation from WCVP distribution maps). The relatives used for each domesticated species is detailed in Table S1. We followed methods of Buffalo (2021) to estimate range size from each species’ set of GBIF occurrence data using the package alphahull (Pateiro-Lopez and Rodriguez-Casal, 2022). We started with splitting the occurrence data by continent, in order to avoid estimating ranges that overlapped with oceans. We also only kept occurrences with unique latitude-longitude values to reduce the computational burden of alphahull’s algorithms. We then added a small amount of random jitter (normally distributed with µ = 0 and σ = 1 × 10−3) to the latitude-longitude coordi- nates of each unique occurrence to avoid errors in the triangulation algorithm of alphahull, which can break when there are lots of colinear points. Finally, we filtered out any continents which had fewer than 20 unique occurrences of a species. The only exceptions to this rule were Solanum stenotomum, Dioscorea alata, and Rhododendron griersonianum, for which we only required 8, 6, and 3 occurrences respectively due to the rarity of these species and thus a paucity of occurrence data. We then used alphahull to compute the alpha shape of each continent subset, which can be thought of as the smallest possible convex shape that encloses a set of points in a plane. We defined the alpha parameter for the alphahull package to be 200. We then used the R packages sf (Pebesma, 2018) and rworldmap (South, 2011) to measure the sizes of the alpha shapes in square kilometers after projecting them onto the Earth’s surface. Finally, we took the estimated range polygons and filtered out ones that resided on continents in the introduced range of the species, as defined by the World Checklist of Vascular Plants (WCVP) (Govaerts et al., 2021). The sum of the areas of the remaining polygons was our estimate of range size. 43 Range size estimation from WCVP distribution maps We also estimated range size from expert-drawn species distribution maps instead of species occurrence data. We used the rWCVP package (Brown et al., 2023) to download distribution maps from WCVP (Govaerts et al., 2021). We then estimated range size for each species as either (1) the sum of the areas of all map elements labeled as “native” or “extinct” for that species or (2) the sum of the areas of all map elements labeled as “native”, “invaded”, or “extinct” for that species. Regions with an occurrence label of “dubious” were excluded from downstream analyses. In contrast to GBIF-derived ranges, we used distribution maps for domesticated species in this estimate of range size because the maps discriminate between the native and introduced ranges of species. Population density estimation from plant height Similarly to previous studies, we use plant height as a proxy for plant population density (Corbett-Detig et al., 2015). While it would be ideal to use actual population densities in our anal- yses, we could not find published estimates of population densities for many of the species in our dataset and all previous studies investigating Lewontin’s paradox rely on population size proxies (Leffler et al., 2012; Corbett-Detig et al., 2015; Filatov, 2019; Buffalo, 2021). We elaborate further on the limitations of using proxies in the Discussion, but at the time of writing this manuscript using proxies is the only way to achieve a sufficient sample size for investigating Lewontin’s paradox. We decided to use plant height rather than plant mass (Deng et al., 2012) as our measure of body size because plant height measurements are available for many more species in our dataset and also to make our results more comparable to previous studies that also use plant height (Corbett-Detig et al., 2015). According to theory outlined in Deng et al. (2012), where D is population density, M is plant mass, and h is plant height, D ∝ M −3/4 and M ∝ h8/3. Combining these two relationships gives D ∝ (h8/3)−3/4 which simplifies to D ∝ h−2. Adding this density-height relation to equation 2.6 gives our main proxy for population size: N ∝ R h2 44 (2.7) In our subsequent analyses, we refer to Equation 2.7 as the range size-squared height ratio and we convert R to square meters and h to meters to make the ratio unitless. As Equation 2.7 suggests, we do not expect the range size-squared height ratio to exactly equal the true population size or be interpretable as a number of individuals. Rather, it is a quantity we expect to scale with population size. To calculate the range size-squared height ratio for each species, we downloaded plant height data from the EOL, which mainly comprised records summarized from the TRY database. If no height measurements were available for a species in the EOL, then we used estimates we found in published scientific literature. The only exceptions to this were Vanilla planifolia and Rhododen- dron griersonianum, where our height estimates came from the Kew Botanical Gardens’ and the American Rhododendron Society’s websites, respectively. The sources used for each height value are cited in Table S1. Labeling species with genome size, mating system, ploidy, cultivation status, and life cycle habit Table S1 contains citations for all studies that were used to label each species in our study with a genome size, mating system, ploidy level, cultivation status, and life-cycle habit. For determining genome size, we used estimates from flow cytometry and k-mer-spectra analyses whenever possible instead of using assembly size, since most assemblies do not contain the entire genome of the sequenced species. Most of our genome size estimates were 1C values acquired from publications cited in the Plant DNA C-values Database (Pellicer and Leitch, 2020). Any estimates in terms of picograms (pg) of DNA were converted to base pairs using the following conversion factor: DNA in Mb = DNA in pg ×0.978 × 109 (Doležel et al., 2003). If genome sizes in terms of pg were not available for a species, then we used the size of the species’ genome assembly as the genome size. We next labeled each species with a mating system (selfing, outcrossing, mixed, or clonal), cultivation status (wild or cultivated), and life cycle habit (annual, biennial, perennial, or mixed) because previous studies showed these factors to be important determinants of diversity in plants (Chen et al., 2017). For classifying species into different mating systems, we used methods similar to a previous study (Opedal et al., 2023) and generally considered species with outcrossing rate < 10 % as “selfing”, species with outcrossing rate between 10 - 90 % as “mixed”, and species with 45 outcrossing rate > 90 % as “outcrossing” when estimates of outcrossing rates were available. In the absence of outcrossing rate data, we also labeled species described as generally self-incompatible as “outcrossing” and we labeled species described as selfing as “selfing”. The only exception to this was Oryza brachyantha for which we could not find mating system descriptions in peer-reviewed literature. Thus, we assumed that this species was most likely outcrossing because most of the other wild Oryza species in the dataset were classified as outcrossing. Because of the low number of mixed (14) and clonal (2) species in our dataset, we collapsed the selfing, mixed, and clonal species into a single “not outcrossing” category for later downstream analysis. Similarly, for life cycle habit, our dataset contained only 1 biennial species and 2 species that had a mixture of annual, biennial, and perennial forms. We combined these species with the perennial category to create a single “not annual” category. For cultivation status, we looked up each species in the EOL and classified species that had documented human uses (such as for food, fiber, fodder) or had some countries known to cultivate the species as “cultivated”. All other species that did not meet these criteria were classified as “wild”. The only exception to this was Lactuca sativa, which did not have any human uses listed in EOL at the time of writing this paper; however, it is commonly known as lettuce so we classified it as “cultivated”. Finally, for ploidy levels, when more than one cytotype was described as present within a species we labeled the species with it’s most common naturally-occurring cytotype. Citations to relevant literature used for each classification decision can be found in Table S1. Statistical analysis The ultimate goal of our statistical analyses was to estimate the effect of our population size proxies on measures of diversity, comparing the effects of using k-mer-based or nucleotide diver- sity. To do this, we used the caper R package (Orme et al., 2018) to implement a statistical approach similar to Whitney et al. (2010). We performed partial phylogenetic least squares regressions con- trolling for evolutionary history (using a phylogeny obtained from timetree.org, Kumar et al. (2017, 2022)), mating system (outcrossing vs not outcrossing), cultivation status (wild vs cultivated), and life cycle habit (annual vs not annual). We did not include sequencing coverage as a covariate at 46 this step because we would later exclude species with low coverage data from our analyses (see Results) and the relationship between coverage and diversity saturates at higher levels of coverage (Figures S3A, S3D, and S3G). We also did not include number of individuals as a covariate because this did not correlate with any of our diversity measures (Figures S3C, S3F, and S3I). Similar to Whitney et al. (2010), we also scaled the dependent variables to be unitless with a mean of zero and unit variance across species (using the scale() function in R) before performing regression to make slopes more comparable across models and account for the inherent differences in unit between nucleotide and k-mer diversity metrics. This approach can be summarized as follows: scale(diversity) = β0 + β1×log10(population size proxy) + β2×mating system + β3×cultivation status + β4×life cycle habit + ϵ where population size proxy refers to either Equation 2.7 or it’s components (range size and plant height), diversity was estimated using either SNPs (log10(¯π)) or k-mers ( ¯JD or ¯BD), and the scale() function performs a z-transformation to make diversity unitless with mean of zero and unit variance. We also constructed a separate set of models where we included genome size as a covariate: scale(diversity) = β0 + β1×log10(population size proxy) + β2×mating system + β3×cultivation status + β4×life cycle habit + β5×log10(genome size) + ϵ We controlled for genome size in a separate set of models because we had conflicting expecta- tions on whether genome size would be a confounder or a mediator of the population size-diversity relationship. In other words, the effect of population size on diversity could act through genome size, since small populations may not experience strong enough selection to purge deleterious inser- tions (Lynch and Conery, 2003). Including genome size as a covariate in this case would artificially diminish the estimated effect of population size on diversity. Alternatively, genome size could fun- damentally alter the mode of adaptation in plant species (Mei et al., 2018), making genome size a confounder of the population size-diversity relationship. After constructing our models, we visualized the relationship between population size and di- versity or genome size and diversity with partial regression plots, following methods from Riddell 47 (1977) and Blomberg et al. (2012). Beginning with our initial phylogenetic least squares model: y = Xβ + ϵ (2.8) where y is a vector of diversity values, X is the design matrix, β is a vector of regression coef- ficients, and ϵ is a vector of residuals distributed normally about 0 with phylogenetic variance- covariance matrix Ω. We first performed Cholesky decomposition on Ω to get matrix C such that: Ω = CC T (2.9) We then took the inverse matrix C −1 and left-multiplied both sides of our regression equations to get: Which we will rewrite as: C −1y = C −1Xβ + C −1ϵ y∗ = X ∗β + ϵ∗ where y∗ = C −1y, X ∗ = C −1X, and ϵ∗ = C −1ϵ. In vector form, this equation is now: y∗ = β0x∗ 0 + β1x∗ 1 + β2x∗ 2 + ... + βn−1x∗ n−1 + ϵ∗ (2.10) (2.11) (2.12) where β0x∗ 0 is our intercept (Note that x0 was initially a column of 1’s before being transformed by C −1). After fitting this model to our data with the standard lm() function in R, we collected all terms besides the primary variable of interest, x∗ k (which would be a population size proxy or genome size in our case), and subtracted them from both sides of the equation to get: 48 y∗ − ∑ i̸=k βix∗ i = βkx∗ k + ϵ∗ (2.13) We then plotted the values of x∗ k against y∗ − ∑ i̸=k βix∗ i , interpreting the slope (βk) as the effect of the primary variable on the response, scaled for phylogenetic relationships and adjusted for the effects of confounding factors. Results Low diversity species explained by low mean coverage There were 112 species in our initial dataset, each with estimates of population size proxies, nucleotide diversity, and k-mer diversity (Figure 2.1). Out of these 112 species, 102 were diploids, 9 were tetraploids, and one was hexaploid, with haploid genome sizes ranging from 105 Mb to 5.06 Gb (Table S1). These species were further broken down into 57 annual species vs 55 not annual species (which were predominately perennial), 31 wild vs 81 cultivated species, and 55 outcrossing vs 57 not outcrossing species (which were predominantly selfing). Species classified as annual also tended to not be classified as outcrossing (χ2 = 18.9, p = 1.4 × 10−5, Figure C1C). However, cultivation status was independent of both life cycle habit (χ2 = 4.07 × 10−31, p = 1, Figure C1A) and mating system (χ2 = 0.53, p = 0.47, Figure C1B). The number of individuals sampled in each species varied from 3 to 1200 and the average depth of sequencing per individual varied from 0.028x to 79.7x (Figure C2). Variation in the depth of sequencing between individuals, quantified as the coefficient of variation in base pairs sequenced, varied about 50-fold from 0.030 to 1.6 (Figure C2). There were no missing values for any of the variables investigated in this study, but there were three species with zero variant sites called (Capsicum annuum, Heliosperma pusillum, and Papaver somniferum) because they had very low coverage sequencing datasets (average coverages per individual of 0.035x, 0.169x, 0.043x, respectively, Table S1). We omitted these species from all downstream analyses. Before testing our central hypothesis, we investigated whether technical sequencing variables could explain any of the diversity values observed in our dataset. Mean coverage correlated with both nucleotide diversity (ρ = 0.33, p = 0.00033, Figure C3A) and k-mer diversity (Jaccard: ρ= 49 Figure 2.1: Our study includes 112 plant species across a wide range of population sizes and diversity levels. Species labeled in purple were considered outliers and omitted from downstream analyses (see Figure 2.2), but species labeled in green were retained. The phylogenetic tree is scaled in millions of years. The WCVP full ratio is a unitless population size proxy equal to the ratio of range area, estimated using WCVP range maps, to squared plant height and is log-transformed (base 10, see Equation 2.7). Nucleotide diversity is genome-wide average diversity at four-fold degenerate sites, log-transformed (base 10, see Equation 2.2). Capsicum annuum, Heliosperma pusillum, and Papaver somniferum had nucleotide diversity values of zero and so have bars at the plotting limit (log(0) = −∞). Bray-Curtis dissimilarity is average pairwise Bray-Curtis dissimilar- ity across all pairs of individuals in a species’ sample (see Equation 2.4). 50 -0.53, p = 2.6×10−9, Figure C3D; Bray-Curtis: ρ = -0.34, p = 0.00021, Figure C3G). Coefficient of variation in bp sequenced correlated strongly with k-mer diversity (Jaccard: ρ = 0.36, p = 0.00013, Figure C3E; Bray-Curtis: ρ = 0.42, p = 4.7 × 10−6, Figure C3H) but not nucleotide diversity (ρ = -0.088, p = 0.36, Figure C3B). The number of individuals sequenced did not correlate with either nucleotide diversity or k-mer diversity (Figure C3C, C3F, C3I). Given that mean coverage strongly correlated with both k-mer dissimilarity and nucleotide di- versity, we decided to investigate this relationship further. We would expect that low coverage sequencing would produce artificially low nucleotide diversity and artificially high k-mer diver- sity. This is because our SNP-calling pipeline is generally tuned for high-coverage datasets (see SNP calling pipelne), so low coverage datasets produce few confident SNP calls. Conversely, low coverage sequencing only samples a small proportion of the total k-mer space in a set of reads. Thus, two independent low coverage samples are unlikely to share many k-mers in common and will have inflated k-mer diversity. Overall, we observed that species with very low mean coverage had both low nucleotide and high k-mer diversity (Figure 2.2A, Figure C4A). In contrast, there was no clear mapping of these abnormal diversity values with coefficient of variation in base pairs sequenced (Figure C5) or the number of individuals sequenced (Figure C6). Based on these results, we removed 10 species from our dataset with mean coverage per individual ≤ 0.5x. This included three species (Capsicum annuum, Heliosperma pusillum, and Papaver somniferum) with zero vari- ant sites called because they had very low coverage (0.035x, 0.169x, 0.043x, respectively, Table S1). We also dropped 4 additional species with fewer than 1000 variant sites called: Camelina sativa , Sesamum indicum, Setaria italica, Actinidia chinensis, the first of which were just slightly above our coverage cutoff (0.63x, 0.92x, 0.63x, respecitively, Table S1). These filtering steps re- moved many of the species with abnormally low diversity and high dissimilarity values excluding (Figure 2.2B). In total, we kept data for 98 species for downstream hypothesis testing. Range size-squared height ratio varies over more orders of magnitude than nucleotide diversity We next investigated whether Lewontin’s paradox applied to our dataset by comparing diversity estimates against population size proxies. For each species, we estimated range size using either 51 Figure 2.2: Low coverage and low numbers of variant calls explains species with abnormally low diversity. (A) shows the relationship between k-mer diversity and nucleotide diversity without omitting species with ≤ 0.5x coverage or ≤ 1000 SNP calls. (B) shows the same relationship, except species with ≤ 0.5x coverage or ≤ 1000 SNP calls are omitted. Each data point is a species. All species’ points are colored by the log (base 10) of average genome-wide coverage per individual for that species. Purple lines are loess smoothing lines with 95% confidence intervals shaded in gray. Values across the top of each plot are Spearman correlation coefficients (ρ) and p-values that test whether each correlation coefficient differs from zero. 52 GBIF occurrence data or WCVP range maps. Estimates from these two methods were significantly correlated no matter whether invaded ranges (as defined in the WCVP range maps) were included (ρ = 0.31, p = 0.00096, Figure C7A) or excluded (ρ = 0.48, p = 7.3×10−8, Figure C7B). The omission of invaded ranges lowered the range size of several plant species based on WCVP range maps (Figure C7C) but had less effect on ranges estimated from GBIF occurrence data (Figure C7D). We then calculated the ratio of range size to squared plant height (Equation 2.7) using height values from the EOL. We used this ratio as our primary population size proxy in downstream anal- yses. After excluding species with < 0.5x coverage and < 1000 variant sites called (Figure 2.2), nucleotide diversity varied over about 4 orders of magnitude for the species in our dataset (from 0.00021 to 0.117, Table S2), while the ratio of range size to squared plant height based on WCVP and GBIF range estimation methods (including both native and invaded ranges) varied over 10 (from 8.9 × 108 to 1.7 × 1018) and 13 (from 8.6 × 105 to 1.5 × 1018) orders of magnitude, re- spectively (Table S2). Mean pairwise Bray-Curtis dissimilarity values varied about 4.9-fold across species, from 0.13 to 0.64, while mean pairwise Jaccard dissimilarity varied about 22-fold, from 0.040 to 0.87 (Table S2). Bray-Curtis dissimilarity values correlated with Jaccard dissimilarity values across species (ρ = 0.76, p < 2.2 × 10−16, Figure C8). k-mer diversity scales with population size proxies more than nucleotide diversity The core of Lewontin’s paradox is that a population’s diversity does not scale much with popu- lation size. If k-mers capture a wider range of genetic variation compared to SNPs, population size will scale more with k-mer diversity than nucleotide diversity. If we did not control for shared evo- lutionary history or any confounding variables (mating system, life cycle habit, cultivation status, or genome size), then none of our diversity measures significantly correlated with the range size- squared height ratio (Figure C9). After controlling for confounding variables, nucleotide diversity marginally scaled with the range size-squared height ratio (β = 0.14, SE = 0.056, p = 0.017, Fig- ure C10A). However, the relationship between k-mer diversity and the range size-squared height ratio was highly significant, with generally a greater slope (Jaccard: β = 0.64, SE = 0.096 p = 53 2.2 × 10−9, Figure C10B; Bray-Curtis dissimilarity: β =0.79, SE = 0.11, p = 7.3 × 10−11, Figure C10C). We observed the same qualitative trend when we included both native and invaded ranges in the range size-squared height ratio (Figure C10D-F), or used the GBIF-based range estimates instead of WCVP-based estimates (Figure C11). Interestingly, we often observed Bray-Curtis dis- similarity having a larger slope with the range size-squared height ratio compared to Jaccard dis- similarity (β = 0.64 vs 0.79 Figure C10B-C), but models where Bray-Curtis dissimilarity was the response variable generally had lower adjusted R2 (e.g. R2 = 0.51 vs 0.41, models 2 and 3 in Table S4). We also analyzed range size and plant height separately as population size proxies (Figure C12- C14). Overall, WCVP-estimated range size significantly affected nucleotide diversity (β = 0.29, SE = 0.072, p = 0.00011, Figure C12A) and k-mer diversity (Jaccard: β = 0.92, SE = 0.13, p = 9.9×10−11, Figure C12B; Bray-Curtis: β = 1.2, SE = 0.13,p = 3.2×10−14, Figure C12C), and this trend held when we estimated range size from GBIF occurrences (Figure C13A-C) or included invaded range area (Figure C12D-F and Figure C13D-F). On the other hand, nucleotide diversity did not scale with plant height (β = 0.13, SE = 0.19, p = 0.5, Figure C14A), but k-mer diversity marginally scaled downward with plant height (Jaccard: β = −0.78, SE =0.38, p = 0.046, Figure 14B; Bray-Curtis: β = −0.77, SE = 0.44, p = 0.088, Figure C14C). Finally, we repeated our partial phylogenetic regressions controlling for genome size as an additional covariate. In this case, nucleotide diversity did not scale with the range size-squared height ratio (β = 0.035, SE = 0.063, p = 0.58, Figure 2.3A), but k-mer diversity did (Jaccard: β = 0.54, SE = 0.093, p = 8.8×10−8, Figure C15; Bray-Curtis: β = 0.7, SE = 0.098, p = 2.2×10−10, Figure 2.3B). Again, we got qualitatively similar results when we excluded invaded ranges in our range size estimates (Figure C16), used GBIF occurrences to estimate range size-squared height ratio (Figure C17) or used WCVP range size as the population size proxy (Figure C18). However, GBIF range size by itself did not scale with Jaccard dissimilarity (Figure C19B, C19E). Increased plant height associated with decreased k-mer diversity, but had no significant relationship with nucleotide diversity (Figure C20). 54 Figure 2.3: k-mer diversity scales with population size proxies after controlling for genome size, life cycle habit, mating system, and cultivation status. Purple lines are partial phylogenetic regression lines between diversity levels and the population size proxy. The y-axis (diversity) is scaled and adjusted according to Equation 2.13: scaling diversity levels to a standard normal distribution, followed by correcting for phylogenetic relatedness and adjusting for the confounding variables (genome size, life cycle habit, mating system, and cultivation status). The values at the top of each plot give the slope of the partial regression ± one standard error and p-values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. WCVP full ratio is a population size proxy estimated as the ratio of range size recorded in WCVP range maps (including invaded ranges) to squared plant height. 55 k-mer diversity scales with genome size more than nucleotide diversity We also investigated the relationship between diversity and genome size because we expected genome size to potentially play a role in the mechanism underlying the greater scaling of k-mer diversity with population size. Genome size is often a strong predictor of diversity (Lynch and Conery, 2003). Among eukaryotes, variation in genome size is largely explained by variation in transposable element abundance (Flavell et al., 1974; Kidwell, 2002; Lynch and Conery, 2003; Muñoz-Diez et al., 2012; Tenaillon et al., 2011; Nystedt et al., 2013; Ibarra-Laclette et al., 2013), which contribute substantially to the repetitive sequence content of genomes and increase the dif- ficulty of aligning short reads to a reference genome (reviewed in Goerner-Potvin and Bourque (2018)). Thus, our expectation was that k-mer-based diversity measures are more sensitive to genome size variation compared to nucleotide diversity. Increasing genome size was associated with decreasing k-mer diversity (Jaccard: β = −3.7, SE = 0.42, p = 8.4×10−14, Figure C21; Bray-Curtis: β = −4.2, SE = 0.45, p = 4.5×10−15, Figure 2.4B) and nucleotide diversity (β −1.8, SE = 0.29, p = 1.4×10−8, Figure 2.4A), after controlling for variation in the range size-squared height ratio, mating system, life cycle habit, cultivation status, and evolutionary history. We got qualitatively similar results when the population size proxy we corrected for excluded invaded ranges (Figure C22), or if our population size proxy was based on GBIF occurrences (Figure C23), or we used range size or plant height individually to control for population size variation (Figure C24-C26). Across all of these analyses, the partial regression relationship between genome size and diversity was always significantly negative. Discussion Our goal was to investigate whether genomic approaches that can capture more genetic vari- ation than reference-based methods can improve the scaling between population size proxies and diversity. This was motivated by literature suggesting suggesting SNPs called against a single ref- erence provide an incomplete picture of genome-wide polymorphism (Schmidt et al., 2021; Van- Wallendael and Alvarez, 2022; Jaegle et al., 2023; Sopniewski and Catullo, 2024). In total, we processed >205 terabases of publicly available sequencing data from the SRA over approximately 56 Figure 2.4: k-mer diversity is more sensitive to genome size variation than nucleotide diversity. Purple lines are partial phylogenetic regression lines between diversity levels and genome size (see Equation 2.13) after scaling diversity levels to a standard normal distribution (mean = 0, variance = 1), followed by scaling diversity levels and population sizes according to their phylogenetic relatedness, and finally adjusting for the confounding effects of mating system, cultivation status, life cycle habit, and population size. Here we used the ratio of range size to squared plant height, where range size was estimated from ranges in WCVP range maps (including invaded ranges). The values at the top of each plot give the slope of the partial regression ± one standard error and p- values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. 57 12 months of wall time, split between a maximum of 512 cores and 50 TB of disk space. After careful accounting for potential technical and phylogenetic confounding, the standardized slope between k-mer-based diversity and the range size-squared height ratio was up to 20 times larger than the same standardized slope for nucleotide diversity (β = 0.035 vs 0.7, Figure 2.3). We observed similar results across the two different measures of range size (Figure C17) and k-mer diversity (Figure C15). We also observed that k-mer-based diversity is more sensitive to variation in genome size compared to nucleotide diversity (Figure 2.4). Overall, these results are consistent with the possibility that diversity missed by reference-based analyses or changes in the diversity calculation itself can partly explain the weak scaling between population size and diversity. How- ever, a larger pangenome scale analysis will be required to fully understand the genetic variants underlying this result. Extending our findings to Lewontin’s paradox directly is complicated by a few factors. First, it would be ideal to compare our diversity estimates to neutral expectations of diversity vs population size. For SNPs, the expected diversity under neutrality would be 4Neµ or a similar formula. For k-mers the exact value of µ is not clear, as k-mers also reflect non-SNP variation. Several k-mer- based measures of diversity are nearly perfectly correlated with π in neutral models where only SNPs are considered (Roberts et al., 2024), but the inclusion of non-SNP variation would probably change this relationship. Identifying the concrete variants underlying k-mers would help us better understand µ for k-mer, but this is currently not possible for mutations other than SNPs and small indels without pangenomic references (Uricaru et al., 2015; Gauthier et al., 2020). The eventual release of pangenomic references across many species will provide better context for increased scaling of k-mer diversity with population size. Similar to previous Lewontin’s paradox studies, we also use both non-coding sequences (for k-mers) and 4-fold degenerate sites (for SNPs) as standards for estimating neutral diversity (Leffler et al., 2012; Buffalo, 2021), but there could be differences between coding and non-coding values of µ that contribute to results observed in our work and previous work. More estimates of µ are needed for non-SNP mutations (Quiroz et al., 2023), but we further caution that many estimates of µ will be limited in that they usually reflect 58 contemporary populations. Variation in µ over time, such as through TE bursts (Belyayev, 2014), could play an important role in diversity vs population size scaling and warrant further research. In addition to limitations estimating µ and π, estimates of Ne also tend to be limited in inves- tigations of Lewontin’s paradox. Like previous analyses, our analysis assumes that contemporary population size estimates are good proxies for historic population sizes (Corbett-Detig et al., 2015; Buffalo and Coop, 2020). Common population size proxies such as range size and plant height only reflect the current census population size of a species. However, it is the long-term harmonic mean of the effective population size, not just the current size, that determines diversity levels within a population (Wright, 1940). There is some literature correlating aspects of species history with contemporary diversity levels (López￿Delgado and Meirmans, 2022) and incorporating these metrics into Lewontin’s paradox is an exciting avenue for future research. As a first step in this direction, we used plant range maps into native and invaded ranges to test the robustness of our results to invasion-related range size changes (Brown et al., 2023). Overall, our observations were remarkably similar no matter whether we included or excluded invaded ranges in our population size proxies (Figure C17A-C vs Figure C17D-F). Part of this apparent robustness was due to the insensitivity of our GBIF-based range size estimates to the inclusion of invaded ranges (Figure C7D). However, our WCVP-based range size estimates were drastically altered by the inclusion of invaded ranges (Figure C7C) and still yielded similar results (Figure 2.3, C15, C16). Although we cannot rule out the possibility that older historical events have affected contemporary diversity levels, our results appear to be robust to recent biological invasions. As with all regression-based analyses, our results are also ultimately sensitive to error in the measurement of both covariates (population size proxies, genome size, mating system, life cycle habit, or cultivation status) and outcome variables (nucleotide or k-mer diversity). For example, we did not filter out potentially contaminating sequences from our k-mer analysis, which could add noise to our k-mer dissimilarity values. However, this could only explain the increased scaling be- tween k-mer diversity and population size if plants with larger population size proxies had a higher diversity of contaminants, which seems unlikely given that many of the sequencing datasets in our 59 study are from laboratory cultivated plants (see Table S1) and that we excluded low-frequency k- mers from our analyses. Our study is also unique in the multiple steps we took to limit the influence of systematic measurement errors on our coefficients, including reanalyzing all population-level sequencing data with one pipeline to limit the impact of bioinformatic parameter choices on our analysis (Mirchandani et al., 2024), filtering out low coverage datasets(Sandell et al., 2022), ac- counting for missing data in calculations of nucleotide diversity (Schmidt et al., 2021; Korunes and Samuk, 2021)and estimating range size with two different methods (WCVP range maps and GBIF occurrence records, Figure C6). Although we could not control for some covariates (Willis, 1922; Romiguier et al., 2014; Guo et al., 2024) due to a dearth of data, our study is still the largest reanalysis of population-level sequencing data in plants that we know of to date. The availability of our workflow also makes it easy for our study to be extended as more population-level sequencing data is released. Interestingly, the estimated effect of our population size proxies on diversity was often slightly larger for Bray-Curtis dissimilarity than Jaccard dissimilarity (for example, β = 0.7 vs 0.54 from Figure 2.3B vs Figure C15, Table S4). In contrast, the range size-squared height ratio was of- ten slightly more predictive of Jaccard dissimilarity than Bray-Curtis dissimilarity (Table S4). We could not test whether these trends were statistically significant, but the benefits of different k-mer metrics in predicting measures of population size warrant further study. Our expectation is that k-mer diversity measures based on frequency, such as Bray-Curtis dissimilarity, better capture di- versity compared to measures based on purely k-mer presence/absence, such as Jaccard dissimilar- ity, because they explicitly measure copy number variation. However, accurately measuring k-mer frequencies likely requires higher sequencing coverage than calling presence/absence, which could explain why Bray-Curtis dissimilarity generally scaled more with population size but had a lower R2 compared to Jaccard dissimilarity (Table S4). Future studies using higher coverage population level sequencing data could help test this hypothesis. k-mer frequencies are known to be highly informative of genomic structure, with one common application of k-mers being the estimation of genome size (Vurture et al., 2017; Pflug et al., 2020). 60 Similar to previous studies, we observed that nucleotide diversity was negatively correlated with genome size (Lynch and Conery, 2003; Chen et al., 2017), but we observed an even stronger neg- ative correlation for k-mer diversity (β = −1.8, SE = 0.29 vs β = -3.7, SE = 0.42 in Figure 2.4). k-mers also appeared to explain diversity patterns that scaled with population size beyond those explained by genome size, while nucleotide diversity did not. After controlling for genome size, the relationship between our population size proxies and nucleotide diversity was not significant (Figure 2.3A, C17-C19 panels A and D), but the relationship between k-mer diversity and popula- tion size proxies was often still highly significant (Figure 2.3B, C17-C19 panels B; C; E; F). The only exception was that Jaccard dissimilarity did not significantly scale with GBIF-based estimates of range size (Figure C19B, C19E). This additional scaling of k-mer diversity with population size beyond just the effects of genome size and confounding variables suggests that k-mers capture some element of the population size-diversity relationship that is absent from nucleotide diversity. Our results do not negate the fact that other important factors also underlie Lewontin’s paradox, such as past demographic fluctuations and linked selection. However, our results do suggest that future studies of Lewontin’s paradox may want to consider diversity outside one reference genome. The increasing availability of pangenomes across species (Göktay et al., 2021; Zhou et al., 2022; Rice et al., 2023; Wang et al., 2023) offers many opportunities to revisit this classic population genetics question. Ideal future studies would use pangenomic genotyping methods across a wide range of species with a standardized pipeline, combined with multiple proxies of population size, and understandings of each species’ demographic history. Altogether, these methodological devel- opments will hopefully reveal a more wholistic picture of variation across the tree of life. Data availability Supplemental figures are in appendix C. Our entire analysis is packaged as a snakemake work- flow stored here: https://github.com/milesroberts-123/tajimasDacrossSpecies. All supplementary tables will be made available online at Evolution Letters. Table S1 contains the metadata for all of the datasets used in this study, including sources for genome assemblies, genome annotations, population-level sequencing datasets, and GBIF observations. Table S2 contains all of the covari- 61 ate and response variable values used for fitting our phylogenetic least squares models. Table S3 contains the estimated coefficients of all of our phylogenetic least squares models and their related statistics, including p-values and standard errors. Table S4 contains the model-level statistics for each phylogenetic least squares model, including R2 values and F-test results. 62 CHAPTER 3: k-MER-BASED APPROACHES TO BRIDGING PANGENOMICS AND POPULATION GENETICS3 Abstract Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning references to build pangenomes can be challenging for many species, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k- mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π = 0.025 (R2 = 0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π = 0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting bloom filters, highlighting a potential avenue to decreasing the memory burden of k- mer based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods to identifying selected loci using k-mers. Introduction Two decades ago, assembling one reference genome for one eukaryotic species was an inter- national, herculean effort (Venter et al., 2001). Now, individual laboratories can readily assemble and align multiple reference-quality genomes from the same species into pangenomes (Golicz et al., 2020). This shift toward pangenomes as the basis for genetic studies is already transforming our 3This chaper is published at the following DOI under a Creative Commons CC BY License: https://doi.org/10. 1093/molbev/msaf047. This license permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Permission to reuse this article is not required. 63 understanding of genetic variation in populations. Analysis of pangenomes has uncovered vast quantities of genetic variation previously missed by the ubiquitous practice of aligning short reads to a single reference genome (Wong et al., 2020; Sirén et al., 2021; Ebler et al., 2022; Zhou et al., 2022; Rice et al., 2023; Liao et al., 2023), increased the power of trait mapping (Song et al., 2020; Chin et al., 2023), and resolved complex structural variations (Hickey et al., 2020; Song et al., 2024). Pangenomes have even revised our understanding of previously cataloged variation, be- cause single nucleotide polymorphisms (SNPs) once identified by mapping reads against a single reference can sometimes be resolved as alignment errors due to structural variation (Jaegle et al., 2023). Altogether, pangenomes provide a more accurate representation of genetic variation, re- ducing the commonly observed phenomenon where one’s choice of reference genome shapes the ultimate conclusions of a study (i.e. reference bias, Gage et al. (2019); Günther and Nettelblad (2019); Chen et al. (2021); Ebler et al. (2022)). This better ability to capture and explain pat- terns of genetic variation, combined with recent developments in pangenome assembly algorithms (Garrison et al., 2024; Hickey et al., 2020) and an explosion in pangenome sequencing for many non-model organisms (Lei et al., 2021), means that pangenomes will likely be the standard for population genetics analysis in the near future. However, pangenomes can be difficult to assemble and tune in some contexts (Hahn, 2018; Song et al., 2024). First, pangenomes by definition require more sequencing data to assemble compared to a single reference genome, which can make building pangenomes expensive for large genome species or study systems with fewer resources. Furthermore, whole genome alignment (WGA) is key for pangenome assembly but also difficult to tune, affecting pangenome analysis. Modern WGA algorithms are impressive and are being used to produce pangenomes for a wide range of species (Garrison et al., 2024); however, current WGA algorithms are mainly developed and tuned to align human and model species genomes, and so do not always generalize well to genomes that are large, highly repetitive, highly diverse, polyploid, or containing high levels of structural variation (reviewed in Song et al. (2024)). For example, corn (Zea mays) has a famously repetitive and structurally variable genome and new WGA approaches needed to be developed just 64 to properly align corn genomes (Song et al., 2022). There are also still many open questions on how to best represent complex, nested variations in such alignments and tune alignment parameters (Song et al., 2024). A researcher’s exact choice of alignment parameters and software can dras- tically affect the shape of a pangenome graph (Rice et al., 2023) and downstream genotype calls (O’Rawe et al., 2013; Li, 2014; Bush et al., 2020; Betschart et al., 2022; Sopniewski and Catullo, 2024). Given the challenges of computing and tuning alignments, approaches that skip alignment altogether could valuably complement or help guide pangenome assembly. In this review we argue that the k-mers deserve more attention from population geneticists be- cause they can complement the study of pangenomes. A k-mer is a sub-sequence of length k within a larger sequence. For example, the sequence “ATGCA”, contains the unique 2-mers AT, TG, GC, and CA. The main benefit of k-mers is that they can be analyzed without alignment. Instead, one simply counts all of the unique k-mers present in a sample of reads (where k-mers that are reverse complements of each other are typically considered identical and are counted together Kokot et al. (2017)), then uses the resulting count matrix for downstream analysis (see Figure 3.2). Thus, k- mers derived from regions far diverged, absent, or otherwise unalignable in a given reference will not be automatically excluded from an analysis, allowing one to get a picture of pangenomic vari- ation. k-mers have a long history of use in metagenomics (McClelland, 1985; Rosen et al., 2008; Dubinkina et al., 2016), phylogenetics (Kolekar et al., 2012; Haubold, 2014; Zielezinski et al., 2017, 2019; Bussi et al., 2021; Beichman et al., 2023; Jenike et al., 2024), computer science (Shannon, 1948), and quantitative genetics (Voichek and Weigel, 2020; Kim et al., 2020; Gupta, 2021; Onetto et al., 2022; Lemane et al., 2022). However, while applications of k-mers receive much attention in other disciplines, population genetic investigations using k-mers remain limited. Our goal is to review k-mer-based approaches for identifying, measuring, and explaining pat- terns of genetic variation in populations. At the same time, we investigate the behavior of k-mer- based measures of variation to the choice of k, the depth of sequencing coverage, and the degree of data compression. We finally highlight some avenues to explore in k-mer-based (i.e. reference-free or alignment-free) population genetics. Overall, we advocate that k-mer-based approaches can be 65 valuable complements to common reference-based population genetics methods. Box 1: What is the “best” value for k? A common first question in k-mer based analyses is: What value(s) of k should be analyzed? The “best” k for an analysis is ultimately determined by a trade-off between the length of k and sequencing error: longer k-mers are more likely to represent unique genomic sequences, but are also more likely to contain a sequencing error (Rahman et al., 2018). In practice, many studies use a k of around 20-40 bp (Ponsero et al., 2023) because k-mers in this range can be reliably sequenced with short read data and often align uniquely to their source genome (Wu et al., 1991; Becher et al., 2022). For example, k = 32 captures 85.7 % of unique sequences in the human genome (Shajii et al., 2016), while k = 21 distinguishes many eukaryote, bacteria, and archea species (Bussi et al., 2021). However, there are many past studies that propose criteria for choosing specific values of k. Choosing multiple values of k One “brute force” approach to test the sensitivity of results to k is to simply repeat an analysis multiple times for different values of k. This approach is especially common among genome as- sembly algorithms (Chikhi and Medvedev, 2014; Durai and Schulz, 2016) but could be applied to almost any analysis in theory. The main drawback, however, is the high computational burden of performing the same analysis multiple times. It is also difficult to know without more information whether analyses performed for certain values of k produce more accurate results than analyses for other values of k, motivating the need for k selection criteria that can be either minimized or maximized. Choosing k based on the number of unique non-erroneous k-mers Higher values of k generally allow greater detection of unique sequences, but increase the prob- ability of observing k-mers containing at least one sequencing error (Chikhi and Medvedev, 2014; Rahman et al., 2018). Each sequencing error can result in up to k erroneous k-mers, making it important to prevent errors from dominating one’s analysis. Thus, choosing a k that maximizes the number of unique non-erroneous k-mers in a dataset is generally considered optimal for tasks like 66 genome assembly (Chikhi and Medvedev, 2014). This approach generally involves measuring the k-mer frequency spectrum and then fitting a model to the distribution to estimate which parts of the spectrum come from erroneous k-mers (Chikhi and Medvedev, 2014). Usually, the low-frequency end of the spectrum is dominated by erroneous k-mers because sequencing errors are unlikely to generate the same erroneous k-mers many times and usually convert real k-mers into k-mers not found in the source genome (Kelley et al., 2010). Although this criterion for choosing k could be applied to population genetic datasets, it is not if clear the resulting optimal k would vary signifi- cantly between genomes within the same species. Presumably, if genomes within the same species have considerable variation in repetitive content (Haberer et al., 2020), size (Schmuths et al., 2004), or ploidy (reviewed in Kolář et al. (2017)) then the optimal k could vary. Determining the value of k that maximizes the number of unique non-erroneous k-mers across all individuals in a population may be of interest for future population genetics studies. Choosing k based on the probability of chance k-mer matches between samples Another way to choose k is to think about the probability that a k-mer from one genome is also found in a second genome by chance alone. For example, it would be unsurprising for almost any pair of reasonably long, naturally-occurring DNA sequences to share the k-mer “AGC” because this k-mer is very short and could occur many times in a random sequence. How large must k be then before finding a k-mer in two different sequences is unlikely by chance alone? To discuss this question, we next present equations similar to ones in Ondov et al. (2016), except we generalize to account for variation in base composition between sequences being compared. Let’s begin by imagining we have two genome sequences we wish to compare: X1 and X2. First, to generate X1, we sample with replacement the letters A, T, G, and C a total of L times with probabilities pA1, pT 1, pG1, and pC1, respectively where pA1 + pT 1 + pG1 + pC1 = 1. In other words, we sample our DNA from a multinomial distribution where we assume that each base is sampled independently of the preceding bases. This gives us one strand for X1 and we can then generate the complementary strand by pairing A with T and G with C. We repeat this whole process once more to construct X2 using probabilities pA2, pT 2, pG2, and pC2, which may or may not match the 67 probabilities for X1, except we only sample k bases. This gives us a k-mer, K, in the genome sequence of X2. We now wish to find the probability that K occurs in X1, assuming that the sampling of bases for K was independent of sampling bases for X1. If we imagine constructing X1 and K at the same time, then the probability of drawing the same letter for both sequences at a given position is: ΣF = pA1pA2 + pT 1pT 2 + pG1pG2 + pC1pC2 (3.1) However, we also want to consider sequences that are reverse complements as identical, so the probability of drawing a pair of bases that match as reverse complements is: ΣR = pA1pT 2 + pT 1pA2 + pG1pC2 + pC1pG2 (3.2) The probability of drawing k pairs of bases in a row that are either identical or reverse comple- ment matches is thus Σk F and Σk R, respectively, assuming that each base is sampled independently of all others. The complementary probability of not getting k matches in a row on the forward strand is 1 − Σk F and, comparably, 1 − Σk R for the reverse strand. Because X1 contains L bases, there are a total of L − k + 1 k-mers in X1, so the probability of a K not matching at any k-mers R)L−k+1. It should be noted that at this step in X1 is approximately (1 − Σk F )L−k+1 and (1 − Σk we are assuming that the chance of K mismatching at a given position in X1 is independent of the chance that K mismatches at other positions in X1, which is not strictly true. For example, if K is the sequence “AAAAA” and we compare K to a subsequence in X1 that is “AACAA” we would say that K and this subsequence do not match. However, with this information we would also know with certainty that adjacent k-mers in X1 will also not match K because they will still contain a “C”. However, we will continue with assuming that the mismatch between X1 and K is independent at all positions because this makes our final equations more conservative (decreasing the Σk F term in Equation 3.5 will decrease the overall value of Equation 3.5). We can now say that the probability of finding the k-mer K at least once by chance alone at 68 any of the L − k + 1 positions in the sequence X1 is: P (K ∈ X1F ) = 1 − P (K ∈ X1R) = 1 − ( ( )(L−k+1) 1 − Σk F )(L−k+1) 1 − Σk R Assuming k << L, Equations 3.3 and 3.4 approximate to: P (K ∈ X1F ) ≈ 1 − P (K ∈ X1R) ≈ 1 − ( ( )L 1 − Σk F )L 1 − Σk R (3.3) (3.4) (3.5) (3.6) We confirmed that equations 3.5 and 3.6 work as expected in simulations (Figure 3.1). Similar to Fofanov et al. (2004), we can then solve these equations to give the minimum k-mer length required to achieve a desired probability of chance k-mer matching of q = P (K ∈ X1F ) = P (K ∈ X1R): ⌈ ( kF = logΣF 1 − (1 − q)1/L ⌈ ( kR = logΣR 1 − (1 − q)1/L )⌉ )⌉ (3.7) (3.8) and now given a choice of q we are willing to tolerate, we can then use max(kF , kR) as a potential choice of k. Equations 3.7 and 3.8 demonstrates that k = 19 reduces the probability of two 3 Gb genomes of random sequence sharing the same k-mer by chance to just 1 % (Ondov et al., 2016) (assuming 69 all bases are in equal proportion in both genomes, pA1 = pT 1 = pG1 = pC1 = pA2 = pT 2 = pG2 = pC2 = 1/4) while k = 27 gives q = 1×10−6 for a 10 Gb genome where both X1 and X2 have a GC conent of 42 % (pA1 = pT 1 = pA2 = pT 2 = 0.29 and pG2 = pC2 = pG2 = pC2 = 0.21). Slightly longer k-mers are required when the proportion of each base is not 25 % in both genomes because it is more likely then to have low-complexity strings of bases which make spurious matches more likely. For example, the required k-mer length increases to 34 bp if the GC content of both genomes is 21 % (pA1 = pT 1 = pA2 = pT 2 = 0.395 and pG1 = pC1 = pG2 = pC2 = 0.105), given a 10 Gb genome and q = 1 × 10−6. In summary, if k is high enough, two genomes in a population are highly unlikely to share k-mers just through the accumulation of random sequences alone, suggesting that shared k-mers usually reflect shared ancestry. Developing additional k-mer selection criteria that explicitly model the influence of shared ancestry on the probability of k-mer sharing could improve approaches to choosing k. 70 Figure 3.1: The relationship of k-mer length and base composition to chance matching. We verified (A) Equation 3.5 and (B) Equation 3.6 by randomly generating 5.5 million pairs of k-mers and genomes. Each simulation had 100,000 trials. For each trial we randomly generated a genome of length 10,000 bases using a given base composition (pA, pT , pG, pC in barchart) and also gener- ated a random k-mer (color gradient shows k) using the same base composition. We then checked whether the random k-mer was in the random genome. In total, the 5 different base compositions × 11 different k-mer lengths × 100,000 trials per simulation gives 5.5 million trials total. Grey line is where the theoretical (i.e. Equations 3.5 and 3.6) and empirical probabilities (i.e. from sim- ulations) of a chance match are equal. Dotted black line shows the resolution of the simulations; because we did 100,000 trials per simulation we cannot empirically estimate probabilities smaller than 1/100,000. Choosing k based on the balance of shared vs differing k-mers A final class of approaches for choosing k focuses on how if k is small, most k-mers are shared between most samples, but as k increases more sample-specific k-mers occur until no shared k- mers remain (Zhang et al., 2017a). Neither of these scenarios is usually desirable - the former makes all samples appear to be identical, while the latter makes all samples appear to be completely different. Thus, choosing k boils down to balancing the number of shared vs differing k-mers. Some statistics for informing this choice include cumulative relative entropy (Sims et al., 2009), relative sequence divergence (Sims et al., 2009), average number of common features (Zhang et al., 2017a), Shannon’s diversity index (Zhang et al., 2017a), and χ2 tests (Bai et al., 2017). However, there are two main drawbacks of these statistics. First, they were mainly developed for phylogenetic studies focused on short amino acid k-mers, so their utility for population genetics (which would probably focus on longer nucleotide k-mers) is untested. Furthermore, different values of k will appear 71 optimal for genomes of different sizes (Zhang et al., 2017a), making it hard to imagine that one optimal k exists for a population exhibiting substantial genome size variation. Extending these or similar approaches to optimize k for populations with genome size variation will be very useful for future studies. Box 2: What’s the expected value of k-mer diversity in a neutrally evolving population? We argue that k-mers can be useful for initial assessments of genetic diversity, but two key questions are likely of interest to population geneticists: can k-mer diversity be related to nucleotide diversity (the common measure of genetic diversity) and what is the expected value of k-mer metrics for populations evolving under a given model? Here, we derive a simple bound on the expected number of k-mer differences between a pair of individuals in a neutrally evolving population. Our result is similar to (Shi et al., 2024), except we generalize beyond haploid organisms. We start by assuming that (1) k is sufficiently long to capture all of the unique sequences in a genome, (2) each genome is sequenced at sufficient coverage to confidently identify all of the k- mers present, (3) k-mers are sequenced without error, and (4) only SNPs contribute to differences between genomes. Let i and j be individuals in a neutrally evolving population with ploidy level x and let y and z be two haploid genomes in the pool of i and j that differ by origin. Next, let Ki and Kj represent the set of k-mers identified in individual i and j’s genomes respectively and let |Ki| denote the size of the set Ki. If all of the genomes within i and j are identical, then all k-mers are shared between them. If we start by assuming i and j are haploid, this means a maximum of 2k k-mers are not shared between i and j for every pairwise difference between i and j (Iqbal et al., 2012; Younsi and MacLean, 2015). This relationship can be written as: |Ki ∪ Kj| − |Ki ∩ Kj| ≤ 2kDyz (3.9) where Dyz represents the number of pairwise differences between haplotypes y and z and the left side represents the number of k-mers exclusive to either i or j. The reason for the using a “≤” in Equation 3.9 is to account for three possibilities: (1) if two SNPs are less than k bases apart, there 72 will be fewer than 4k k-mers not shared between i and j, (2) it’s possible for a SNP to turn one k-mer into a different k-mer that’s already present elsewhere in a genome and (3) at higher ploidy levels, if a segregating site is heterozygous in i and j then i and j will share all k-mers between them. To generalize beyond haploid organisms, we will replace the conversion factor of 2 in equation 3.9 above with a general function a(x): |Ki ∪ Kj| − |Ki ∩ Kj| ≤ a(x) ∑ y10x) and low sequencing error rates. These criteria can potentially exclude long-read datasets (Vurture et al., 2017) or reduced-representation datasets where genomes are sequenced at lower coverage to cut costs. Approaches that apply k-mers in lower coverage or 87 higher error-rate datasets are needed. Third, k-mer-based analyses can frequently involve hundreds of millions or billions of unique k-mers. Storing and processing this many k-mers at once is often not feasible even on high performance computing systems. Two approaches to solving this issue are to subset one’s k-mers (Ondov et al., 2016; Benoit et al., 2020; Zhao, 2019; Yi et al., 2021), or, as we discuss here, compress vectors of k-mer counts into an array of smaller size (Melsted and Pritchard, 2011). However, these solutions are usually developed for specific contexts, such as measuring genomic dissimilarity, and may not be applicable to every population genetic analysis. Future studies need to carefully consider common approaches to decreasing the disk burden of k-mer-based analyses and should explore new potential solutions. These challenges to studying k-mers highlight how k-mer based methods should be viewed as complementary to, instead of better than, pangenomes in many situations. Among other benefits, pangenomes have ready biological interpretations and provide greater information on linkage be- tween sequences. However, the fact that k-mers can be studied with fewer computing resources and simpler sequencing data sets makes them potentially useful for either preliminary pangenomic analyses or situations where constructing many reference quality genomes is not yet feasible. Conclusions Current literature demonstrates that k-mers are useful for identifying, measuring, and explain- ing variation within populations without performing alignment. Through our own simulations of neutrally evolving populations, we find that k-mer dissimilarity reliably scales with nucleotide diversity and k-mer matrices can be compressed with minimal loss of dissimilarity information. However, further development is needed to make k-mers amenable to a wider array of popula- tion genetic tasks, especially the identification of selected loci and population structure. Further developing alignment-free approaches to population genetics tasks will ultimately help guide and complement the analysis of pangenomes. Future directions • Can we develop criteria for choosing values of k that account for population-level variation in genome size or genome content? 88 • Can we develop k-mer-based estimates of population differentiation that are more similar to FST ? • Can we identify putative selective sweeps using k-mers? • Can we develop applications of k-mers that are amenable to lower-coverage or pooled se- quencing samples that are common in population genetics? • k-mers are useful for getting preliminary estimates of important genome parameters that are useful for tuning pangenome assembly (Ranallo-Benavidez et al., 2020). What other types of preliminary pangenome analyses might k-mers be useful for? • Can k-mers, either in combination with or without a pangenome reference, be useful for rapidly estimating allele frequencies and performing demographic inference? Data availability Supplemental figures can be found in Appendix D. Scripts for simulation and data analysis can be found at: https://github.com/williarj/kmers2024. 89 CHAPTER 4: SUMMARY STATISTICS COMPARABLE TO CONVOLUTIONAL NEURAL NETWORKS FOR INFERRING TIMES TO FIXATION4 Abstract Times to fixation (tf ) are an essential ingredient for understanding neutral theory and positive selection. However, studies of tf have largely subsided in recent years and few methods exist to predict the time to fixation for a fixed beneficial allele. A key difficulty in this prediction problem is disentangling tf from the age of a sweep (ta) because multiple different combinations of tf and ta can result in similar polymorphism patterns around the site of the sweep. We test whether convolutional neural networks (CNNs) can potentially perform better on this problem by simulating approximately 250,000 selective sweeps across five different demographic scenarios. The CNNs achieve comparable performance to approximate bayesian compuation methods used in previous approaches. The CNNs are better able to disentangle tf from ta for populations that are growing, but also give statistically worse predictions than ABC when trained on populations cycling between two different sizes. Altogether this suggests that perhaps few undiscovered signals remain in single timepoint unphased genotype data that can be used to disentangle tf and ta in singe population demographic scenarios. Introduction Adaptation requires increasing the frequency of beneficial alleles in a population, but these fre- quency changes take time. How long does it take for beneficial alleles to spread? This question was at the top of mind of naturalists in the early 20th century as there was doubt over how rea- sonable natural selection was as an evolutionary mechanism (see Charlesworth (2020) for a fuller historical treatment). The thinking was that if beneficial alleles change frequency only very slowly, then maybe natural selection is not important. However, early models outlined how selection can rapidly change allele frequencies, though most beneficial alleles are lost by drift Fisher (1923); Hal- dane (1927). Nonetheless the rate of new mutations is high enough such that new beneficial alleles are continually introduced and some eventually fix (Zhao et al., 2013). As it became appreciated that multiple fixations can build up between species, further investigations of fixation times were 4At the time of writing, this chapter is unpublished. 90 motivated by observations of molecular substitution rates (Kimura and Ohta, 1969) and trait degra- dation as a result of relaxed selection (Kimura, 1980). The theories from this foundational work have since been extended to understand how time to fixation is shaped by changing environments (Cui and Yuan, 2018; Kaushik and Jain, 2021) and population structure (Greven et al., 2016). However, despite this foundation of knowledge, studies of times to fixation have subsided in recent years such that two different papers spaced 8 years apart both argue that times to fixation need more attention from modern methods (Zhao et al., 2013; Charlesworth, 2020). It is much more common for studies to estimate coalescence times (Y. C. Brandt et al., 2022), variant age (Bisschop et al., 2021), or allele frequency trajectories of currently segregating alleles (Stern et al., 2019). One reason for this focus is that coalescent times are foundational to important population genetic structures like ancestral-recombination graphs (Lewanski et al., 2024). Another reason is because tf , in a simple model of a single additive de novo beneficial mutation, is tightly correlated to the selection coefficient. The mean time to fixation for a single additive beneficial mutation in a constant sized population is 2ln(2cNe − 1)/s where c is ploidy level (Otto and Whitlock, 2013). This equation is sensitive to s more than Ne because s is outside of the logarithm. Thus, under this model one could study the selection coefficient and get essentially the same information as would be provided by tf . However, the exact relationship between time to fixation and s is more complicated when arbitrary dominance, mating system variation, and demographic changes are considered (Glémin, 2012). Selection coefficients and coalescent times thus do not fully convey the timescale of fixation by themselves outside of simple scenarios. How can we estimate tf ? Assuming a single de novo additive mutation that is sampled imme- diately after sweep completion, there are simple models describing how the dip in diversity around a selected site varies with tf (Coop, 2020). There are more complex models requiring different sets of assumptions (He et al., 2020; Bisschop et al., 2021); however, not all study systems and situa- tions will appropriately fit the assumptions in these models. As a result, there are several methods that forgo explicit models and instead simulate sweeps and use approximate bayesian computation (ABC) to estimate the timing of selection (Przeworski, 2003; Ormond et al., 2016; Nakagome et al., 91 2019). Machine learning algorithms can also be trained on simulations and have proven useful for a variety of population genetics tasks (Kern and Schrider, 2018; Flagel et al., 2019; Torada et al., 2019; Sanchez et al., 2021; Whitehouse and Schrider, 2023), but have not yet been used to infer tf . Our goal is to use machine learning models that can infer tf and are amenable to unphased genotype data. Our hypothesis is that machine learning models will outperform ABC models and could pick up on new signatures to differentiate between old, fast and young, slow sweeps (Hahn and Mishra, 2025). Methods Our entire analysis is available as a snakemake workflow, packaged here: https://github.com/ milesroberts-123/selection-demography-cnn. This workflow includes all of the scripts used to sim- ulate data, fit models, and generate figures as well as configuration files specifying the exact soft- ware versions used for each step. Simulations Our simulated selective sweeps are generated using SLiM (v4.0.1, Haller and Messer (2019)). The parameters for each simulation were generally drawn from uniform or log-uniform distribu- tions. The exact distributions used for each parameter are listed in Table 4.1. Each simulation initializes a diploid population of size NA individuals (i.e. the ancestral population size) and a chromosome of size L bp, then burns-in for 10NA generations with mutation rate µ and recombi- nation rate R. At κ generations post-burn-in, a sweep mutation is introduced at position L/2 bp (i.e. the middle of the simulated region). The beneficial mutation initially has a selection coefficient of 0, but once the frequency of the mutation reaches a value f0 the selection coefficient is changed to a value s and given a dominance coefficient of h. After the frequency of the mutation reaches a value f1, the sweep mutation is switched back to being neutral. After burn-in, the population either has a custom demography (specified as a vector of time steps and a corresponding vector of population sizes), or a demography determined by the following logistic map: Nt+1 = rNt ) ( 1 − Nt K + Nt 92 (4.1) where Nt gives the population size at generation t and N0 = NA. This equation has the follow- ing special properties (Phatak and Rao, 1995): • r = 0, gives a constantly sized population. • 0 < r < 2 and Nt < K gives a population that grows. • 0 < r < 2 and Nt > K gives a decaying population. √ • 2 < r < √ 6 gives a population that cycles between two different values. • 6 < r < 3 gives a population that chaotically changes size. In practice, since our popula- tion size is a discrete number of individuals, the population can have a long cycle if happens to return to its initial population size of NA. From now on, we will refer to these respective demographies with the terms “constant”, “growth”, “decay”, “cycling”, or “chaotic”, respectively. For growth or decay demographies, we restrict r to be 0 < r < 0.5 to simulate gradual, rather than nearly instantaneous, population size changes. Because forward simulations slow down for larger populations, we also restrict K to be within 1 % - 50 % of NA. For each simulation, we then track how long it takes for the beneficial mutation to fix. After fixation, we allow the simulation to run for an additional τ generations before taking a random sample of size n individuals from the population. If a sweep was lost from a population due to drift, we restarted the simulation to its post-burn-in state and initialized to a new random seed. If a simulation was restarted 1000 times and still did not result in a complete sweep, then we omitted that simulation from downstream analyses. At the end of each simulation, we recorded the tf s from the completed simulation and randomly sample n individuals (n = 128 for all analyses). We then downsampled the set of completed simu- lations to create a uniform distribution of log10(tf ). Our downsampling procedure involved taking the initial log10(tf ) distribution, dividing it up into bins of length 0.1, collecting the bin heights, and randomly sampling simulations from each bin (without replacement) until all bins were the same height. Our goal with this procedure was to include simulations with fixation times spanning approximately 50 - 20,000 generations while still retaining > 11000 simulations per demographic scenario. After downsampling, we randomly partitioned the remaining simulations to either the 93 training dataset (80 %), the validation dataset (10 %) or the testing dataset (10 %). For our simulations, we kept the following parameters constant, but it is possible to configure different values in our workflow: • L = 100 Kb • κ = 1 • n = 128 • f0 = 0 • f1 = 1 In other words, we simulated only complete hard sweeps on chromosomes of size 100 Kb introduced 1 generation after burn-in and randomly sampled 128 individuals, which is a similar or larger sample size than most other sweep-related CNNs (Flagel et al., 2019; Torada et al., 2019; Whitehouse and Schrider, 2023). Parameter NA s h µ R τ r K constant U(1000, 10000) LU( log10(1/NA), 0) U(0,1) LU(-8.5, -7.5) LU(-9, -7) LU(0, 4) 0 NA growth U(1000, 10000) LU( log10(1/NA), 0) U(0,1) LU(-8.5, -7.5) LU(-9, -7) LU(0, 4) U(0,0.5) NA×U(1.01, 2) decay U(1000, 10000) LU( log10(1/NA), 0) U(0,1) LU(-8.5, -7.5) LU(-9, -7) LU(0, 4) U(0,0.5) NA×U(0.5, 0.99) cycling U(1000, 10000) LU( log10(1/NA), 0) U(0,1) LU(-8.5, -7.5) LU(-9, -7) LU(0, 4) 6) U(2, NA×U(0.8, 1.2) √ chaotic U(1000, 10000) LU( log10(1/NA), 0) U(0,1) LU(-8.5, -7.5) LU(-9, -7) LU(0, 4) √ 6, 3) U( NA×U(0.8, 1.2) Table 4.1: Sampling distributions for key simulation parameters for the five different demo- graphic scenarios included in this study. U is the uniform distribution and LU is the log uniform distribution. The first number for each distribution is the minimum and the second number is the maximum. Selective sweep statistics We calculated a suite of selective sweep summary statistics that were compatible with unphased genotype data and useful for selective sweep analysis (Kern and Schrider, 2018), including the num- ber of segregating sites (S), nucleotide diversity (π), Watterson’s theta (θW ), Tajima’s D (Tajima, 94 1989), variance in Tajima’s D, number of unique genotypes (Kern and Schrider, 2018), unphased versions of h1; h2; h12; h123; h2h1 (Kern and Schrider, 2018), variance; skew; and kurtosis in the distribution of genotype mismatches (i.e. the gkl statistic from Kern and Schrider (2018)), aver- age genotypic correlation between pairs of SNPs (Rogers and Huff’s R2,Rogers and Huff (2009)), Kim’s ω (Kim and Stephan, 2002), and unphased Messer’s hscan (Schlamp et al., 2016). Image-based sweep representation We converted the VCF file output from each simulation into an image representation using a custom R script with ggplot (Wickham, 2016), ggnewscale (Campitelli, 2022), and cowplot (Wilke, 2020). This image representation is similar to (Flagel et al., 2019) but with some modifications for unphased genotype data. Each image was a gray-scale matrix with n rows corresponding to the sampled individuals and l columns corresponding to the l SNPs closest to the selective sweep site. We clustered the rows using according to their manhattan distance using the complete clustering algorithm in stats R package’s hclust function. The color of each element in the matrix could be either black (homozygous genotype 0/0), grey (heterozygous genotype 0/1), or white (homozygous genotype 1/1). We allowed l to reach a maximum of 128 SNPs and padded each image with black columns if there were fewer than 128 SNPs in the sampled simulated population. Because each column of the image represents a SNP we also created a vector of SNP positions to go along with each image. These SNP positions were min-maxed normalized such that the SNP in the left-most column of an image was position 0 and the SNP at the right-most column of the image was position 1. Convolutional Neural Networks We used the image-based representations of genotypes in the regions of selective sweeps to train a convolutional neural network (CNN). Our CNN architecture was similar to Flagel et al. (2019), with two main branches: one to process a sweep image and another to process the vector of SNP positions associated with the columns of the sweep image. The image processing branch had three convolutional layers, each followed by a pooling and dropout layer. The final dropout layer was flattened and fed into a dense layer. The branch for processing the position information began with 95 an input layer with 128 neurons (one for each column in the image), followed by a dense layer and a dropout layer. The final dense layer for both the image-processing and position-processing branch were concatenated and fed into a final dense layer with dropout before being fed into a single output neuron. We performed 60 iterations of bayesian hyperparameter tuning implemented in keras (v2.11.0, Chollet and others (2015)) on each CNN. We chose 60 iterations because a random parameter search with 60 iterations will sample a model in the top 5 % of the performance range with probability 95 % (Bergstra and Bengio, 2012). Each iteration involved 2 epochs of model fitting to the training simulations and the final performance of the iteration was measured as mean squared error on the validation simulations. During tuning, each convolutional layer could have any multiple of 16 filters between 16 and 128 and any layer could have a dropout rate between 0 and 0.99. All dense layers could have any multiple of 32 neurons between 32, and 512. The first convolutional layer was fixed to have a kernal size of 7 and a stride of 2, while the second and third convolutional layers had a kernal size of 3 and stride of 1. All activation functions were rectified linear unit (ReLU) functions (Banerjee et al., 2019). After hyperparameter tuning, we then trained a final CNN with the best performing hyperpa- rameters. Each training epoch fit the CNN in batches of 32 images. We trained the CNN until validation performance did not improve for 20 epochs. After 20 epochs of no improvements, we reverted the CNN to the configuration with the best performance on the validation data. This early stopping criterion is meant to avoid overfitting the model on the training data. Final model perfor- mance was assessed on the testing data and we performed 100 iterations of Monte-Carlo sampling on each testing image to an estimate the uncertainty for each prediction, which we quantified as the standard deviation in predictions across the Monte-Carlo samples. The mean of the predictions across all the Monte-Carlo samples was used as the point estiamte for the outcome variable (tf ) from the CNN. 96 Dense Neural Networks We also constructed dense neural networks (DNNs) that were trained on only summary statis- tics. Our DNNs had a input layer of 17 neurons (one for each summary statistic) followed by three dense layers with dropout. Similar to the training of CNNs, we did 60 iterations of hyperpar- mater tuning using bayesian hyperparameter optimization. During tuning, each of the dense layers could have a multiple of 8 neurons anywhere from 16 to 512 and a dropout rate anywhere from 0 to 0.99. Final model performance was assessed on the testing data and we performed 100 iterations of Monte-Carlo sampling on each testing image. The mean of the predictions across the Monte-Carlo samples was used as our point estimate for the outcome variable (tf ). Approximate Bayesian Computation We also performed ABC to estimate tf using the rabc package (v2.2.1, Csilléry et al. (2012)) in R (R Core Team, 2022). Just like tuning hyperparameters of CNNs, performing ABC requires making several choices, including the choice of regression method, the tolerance level, and the type of point estimate used (i.e mean, median, or mode of the posterior distribution). Thus, similar to ML, it is typically desirable to try many hyperparameter choices and use the combination of hyperparameters with the best performance for final predictions. ABC approaches are typically tuned with a cross-validation procedure: randomly drop a simulation from the training set and then use the remaining training set to make a prediction for the dropped-out example, repeating many times Csilléry et al. (2012). However, since part of our hypothesis hinged on comparing the performance of ABC to CNNs and DNNs, we tuned ABC similarly to CNNs and DNNs to promote comparability. For each set of simulations, we tried 60 different configurations of ABC (similar to doing 60 iterations of hyperparameter tuning for CNNs and DNNs), including 4 different methods (rejection, ridge regression, local linear regression, or neural network), 5 different tolerance values (0.025, 0.05, 0.1, 0.15, or 0.2), and three different point estimates (mean, median, or mode). For each model configuration, we predicted tf for each simulation in the validation set using the training set as the baseline. We quantified the performance of the model as the Pearson correlation between 97 the model’s predictions and the true value. For the best performing ABC model, we then measured final performance by making predictions for the testing dataset, using the training dataset as the background. Partial R2 calculations Another aspect we were interested in was the benefit of adding more selective sweep statistics to explain variation in tf + ta because many studies that scan for sweeps simply focus on a handful of site frequency spectrum (SFS) statistics like π and Tajima’s D (Teshima et al., 2006). To investigate this, we calculated partial R2 values for the following statistics: S, log10(π), log10(θW ), Tajima’s D, Variance in Tajima’s D, and the number of unique genotypes. These partial R2 values can be interpreted as the proportion of variation not explain by these statistics that becomes explained by our remaining statistics. The procedure to calculate partial R2 is to first fit a ordinary least squares linear model, which we will call the full model: log10(tf + ta) ∼ S + log10(π) + log10(θW ) + D + Var[D] + number of unique genotypes + h1 + h2 + h12 + h123 + h2h1 + gkl variance + gkl skew + gkl kurt + Messer’s hscan + Rogers and Huff’s R2 + log10(Kim’s ω) Then, fit a second model which we will call the reduced model: log10(tf + ta) ∼ S + log10(π) + log10(θW ) + D + Var[D] + number of unique genotypes Given the full and reduced model, the partial R2 measuring the added predictive power of including more than just the SFS statistics is: R2 partial = SSE(reduced) − SSE(f ull) SSE(reduced) (4.2) where SSE is the sum of squared errors. 98 Results Differentiating old, fast sweeps from slow, young sweeps We performed a total of 250,000 SLiM simulations across 5 different demographic scenarios (50,000 simulations per scenario), more than 79% of which successfully produced hard sweeps with valid values of all 17 selective sweep statistics. For each of these simulations, we recorded tf and ta for the sweep (Figure 4.1A). Most of these statistics produced clear, monotonic correlations with tf + ta (Figure 4.1B, Figure E4-E7). However, for all of these statistics, there were divergent combinations of tf and ta that produced similar values of the selective sweep statistics. For exam- ple, sweeps that were old (ta > 1000) and fast (tf < 100) generally had similar π values to sweeps that were younger and slower (tf > 1000, Figure 4.1C). Figure 4.1: Separately estimating tf and ta from summary statistics leads to unidentifiability. (A) Definition of time to fixation and sweep age. Time to fixation is the time between when a sweep first arises and when it fixes. Sweep age is the time between the sweep fixes and when the population is sampled. Example allele frequency trajectory comes from simulating a beneficial allele with selection coefficient 0.5, dominance 0.5, and initial frequency of 1/1000 in a Wright- Fisher population of size 1000. (B) Nucleotide diversity in the sweep region linearly scales with the total lifetime of the sweep: tf + ta. (C) Young, slow sweeps (high tf and low ta) have similar nucleotide diversity as old, fast sweeps (low tf and high ta). Partial R2 of site frequency spectra statistics Across the selective sweep statistics, the site frequency spectra statistics were generally posi- tively correlated with each other and negatively correlated with the haplotype frequency statistics (Figure 4.2A). The two statistics related to linkage disequilibrium, Rogers and Huff’s R2 and Kim’s ω, were generally not correlated with any of the other statistics, except for being slightly correlated to each other (Figure 4.2A). Across the 5 different demographic models, the partial R2 (i.e. the 99 added predictive power) of including statistics beyond simple SFS statistics was significant, with values > 0.2 (Figure 4.2B). The growth demographic model had slightly higher partial R2 (0.26) than the other scenarios and the chaotic demographic model had the lowest (partial R2 = 0.20). Figure 4.2: Including more than site frequency spectrum statistics explains more variation in tf + ta. (A) Pairwise Spearman correlations between unphased selective sweep statistics in (B) Partial-R-squared values for a model with S, log10(π), the constant demography scenario. log10(θW ), Tajima’s D, Tajima’s D variance, and number of diplotypes. In other words, these values describe the amount of additional variation in tf + ta that explained when more than just these selective sweep statistics are added to the model. Convolutional neural networks vs summary statistics for predicting tf For each of the 5 demographic scenarios, we built (1) Approximate Bayesian Computation, (2) dense neural network, and (3) convolutional neural network models to predict tf . The pearson correlation between the true value of tf and the predicted value of tf for the best performing version of each model was generally > 0.7 (Figure 4.3, 4.4, E1-E3). By this measure of performance, the three types of models were not significantly different for most of the demographic scenarios tested (e.g. 95 % confidence intervals for constant demography: CNN = [0.705, 0.750], DNN = [0.719, 0.762], ABC = [0.731, 0.773]; Figure 4.3) with the exception of the cycling demography where the CNN performed worse (r = 0.656 (CNN) vs 0.728 (DNN) vs 0.74 (ABC), Figure E3). All three types of models also generally produced less accurate predictions for sweeps with a short tf , but 100 ta > 1000, incorrectly predicting that these sweeps had a much longer tf (Figure 4.3). The one exception was the growth demography scenario where the CNN and DNN appeared to confuse old, fast and slow, young sweeps less often (Figure 4.4). Figure 4.3: CNNs perform similarly to DNNs and ABC in a constantly sized population. Per- formance of best (A) CNN, (B) DNN, and (C) ABC models at predicting tf under a constant de- mographic scenario. For each model, we list Pearson correlation coefficients between predicted and true values (r), the number of test set examples (n), the 95 % confidence intervals for r, and the p-value for testing if r is different from 0 (p). The purple line is the 1-1 reference line where predictions and truth are equal. The yellow line is a least squares regression line. All points are colored according to the age of the selective sweep in generations (log10 transformed). 101 Figure 4.4: CNNs and DNNs and ABC in a growing population. Performance of best (A) CNN, (B) DNN, and (C) ABC models at predicting tf under a constant demographic scenario. For each model, we list Pearson correlation coefficients between predicted and true values (r), the number of test set examples (n), the 95 % confidence intervals for r, and the p-value for testing if r is different from 0 (p). The purple line is the 1-1 reference line where predictions and truth are equal. The yellow line is a least squares regression line. All points are colored according to the age of the selective sweep in generations (log10 transformed). Discussion We simulated approximately 250,000 selective sweeps across a range of parameters and demo- graphic scenarios, then used those simulations to train CNNs, DNNs, and ABC models to predict tf . Our hypothesis was that CNNs could potentially better disentangle tf from ta and thus be bet- ter at predicting tf . Our simulations reflected a core challenge in the problem of distinguishing tf from ta: sweeps with high tf and low ta (i.e. slow, young sweeps) leave similar signatures as sweeps with low tf and high ta (i.e. fast, old sweeps, Figure 4.1C). Each of the 17 selective sweep statistics we calculated explained additional variation in tf +ta (Figure 4.2). In the end, our models trained on summary statistics (DNN and ABC) performed similarly to CNNs, which were trained on images of genotype matrices (Figure 4.3, 4.4, E1-E3). Similar to previous studies on estimating coalescent times (Y. C. Brandt et al., 2022), we noticed a general bias where models overestimate low values of tf and underestimate high values of tf (Figure 4.3, 4.4, E1-E3). This is mostly likely due to a general identifiability problem where old, fast sweeps leave similar signatures as young, slow sweeps (Figure 4.1, E4-E7). However, in the 102 growth demography scenario CNNs and DNNs were apparently better able to disentangle tf and ta (Figure 4.4A and 4.4B) compared to ABC (Figure 4.4 C). This is possibly because two important signatures of sweeps, an excess of common alleles in site frequency spectra and excessive linkage, persist for longer in populations with higher Ne (Przeworski, 2002). Thus, the signatures of a short tf persist for longer and prevent (or at least delay) confusion of old, fast sweeps with slow, young sweeps. The second case where ABC and CNNs diverged was the cyclic demography scenario, where the correlation between true and predicted tf was lower for the CNN than the DNN and ABC, even though the DNN and ABC maintained a similar correlation compared to the constant demography scenario (Figure E2). This is potentially because repeated population bottlenecks have several consequences for the dynamics of sweeps (Wilson et al., 2014). In our case, if population bottlenecks are softening the sweeps, then it’s possible that the CNN has trouble making predictions for this type of sweep compared to DNNs and ABC which are fed statistics specifically tuned for soft sweeps (H2H1, H123, etc.). Summary statistics were overall effective at predicting tf from unphased genotype data (Figure 4.3B-C, 4.4B-C). Our hypothesis was that CNNs could potentially identify new signatures besides known summary statistics and thus better predict tf (Hahn and Mishra, 2025). However, the fact that ABC based on summary statistics performs the same or slightly better than a CNN suggests unphased genotype matrices do not contain signals for tf that are not already captured by sum- mary statistics. Many CNNs that classify selective sweeps use summary statistics partly to ease interpretability (Kern and Schrider, 2018; Lauterbur et al., 2023), but previous studies have also show CNNs trained on genotype matrices largely learn to reproduce summary statistics (Cecil and Sugden, 2023). It is still possible though that additional signatures of tf not already described by summary statistics do arise in other demographies not explored in this study. Two general directions forward for improving tf predictions are to tweak the prediction algo- rithms or include additional sources of data. Inclusion of time series data (Whitehouse and Schrider, 2023) or phasing information could conceivably aid tf prediction, but these are limited to cases where the researchers have ancient samples or an understanding of a recombination landscape. 103 Another option would be to include data from more than one population (Yair et al., 2021). Popu- lation differentiation provides information on the split time between populations and if a sweep is shared across populations, that would suggest that the sweep happened before the population split (Yair et al., 2021). The difficulty here is that one needs to know or derive a model of the popula- tion split and admixture beforehand so that it can be properly simulated. A combination of both including additional data types and tuning models will most likely be the fruitful path forward. Data availability Supplemental figures are in Appendix E. Our entire analysis is available as a snakemake work- flow, packaged here: https://github.com/milesroberts-123/selection-demography-cnn. 104 BIBLIOGRAPHY Aagaard, J. E., Willis, J. H., and Phillips, P. C. (2006). Relaxed selection among duplicate floral regulatory genes in Lamiales. Journal of Molecular Evolution, 63(4):493–503. Adams, C. J., Conery, M., Auerbach, B. J., Jensen, S. T., Mathieson, I., and Voight, B. F. (2023). Regularized sequence-context mutational trees capture variation in mutation rates across the hu- man genome. PLOS Genetics, 19(7):e1010807. Publisher: Public Library of Science. Aflitos, S. A., Severing, E., Sanchez-Perez, G., Peters, S., de Jong, H., and de Ridder, D. (2015). Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data. BMC Bioinformatics, 16(1):352. Aggarwala, V. and Voight, B. F. (2016). An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nature Genetics, 48(4):349–355. Number: 4 Publisher: Nature Publishing Group. Akagi, T., Henry, I. M., Tao, R., and Comai, L. (2014). A Y-chromosome–encoded small RNA acts as a sex determinant in persimmons. Science, 346(6209):646–650. Publisher: American Association for the Advancement of Science Section: Report. Al Maruf, M. A. and Shatabda, S. (2019). iRSpot-SF: Prediction of recombination hotspots by incorporating sequence based features into Chou’s Pseudo components. Genomics, 111(4):966– 972. Alachiotis, N. and Pavlidis, P. (2018). RAiSD detects positive selection based on multiple signa- tures of a selective sweep and SNP vectors. Communications Biology, 1(1):1–11. Publisher: Nature Publishing Group. Alonso-Blanco, C., Andrade, J., Becker, C., Bemm, F., Bergelson, J., Borgwardt, K. M., Cao, J., Chae, E., Dezwaan, T. M., Ding, W., Ecker, J. R., Exposito-Alonso, M., Farlow, A., Fitz, J., Gan, X., Grimm, D. G., Hancock, A. M., Henz, S. R., Holm, S., Horton, M., Jarsulic, M., Kerstetter, R. A., Korte, A., Korte, P., Lanz, C., Lee, C.-R., Meng, D., Michael, T. P., Mott, R., Muliyati, N. W., Nägele, T., Nagler, M., Nizhynska, V., Nordborg, M., Novikova, P. Y., Picó, F. X., Platzer, A., Rabanal, F. A., Rodriguez, A., Rowan, B. A., Salomé, P. A., Schmid, K. J., Schmitz, R. J., Seren, ￿., Sperone, F. G., Sudkamp, M., Svardal, H., Tanzer, M. M., Todd, D., Volchenboum, S. L., Wang, C., Wang, G., Wang, X., Weckwerth, W., Weigel, D., and Zhou, X. (2016). 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell, 166(2):481– 491. Publisher: Elsevier. Alvarez-Ponce, D. and Fares, M. A. (2012). Evolutionary rate and duplicability in the Arabidopsis thaliana protein–protein interaction network. Genome Biology and Evolution, 4(12):1263–1274. Alvarez-Ponce, D., Feyertag, F., and Chakraborty, S. (2017). Position matters: network centrality considerably impacts rates of protein evolution in the human protein–protein interaction network. Genome Biology and Evolution, 9(6):1742–1756. 105 Asif, H., Alliey-Rodriguez, N., Keedy, S., Tamminga, C. A., Sweeney, J. A., Pearlson, G., Clementz, B. A., Keshavan, M. S., Buckley, P., Liu, C., Neale, B., and Gershon, E. S. (2021). GWAS significance thresholds for deep phenotyping studies can depend upon minor allele fre- quencies and sample size. Molecular Psychiatry, 26(6):2048–2055. Publisher: Nature Publish- ing Group. Audano, P. A., Ravishankar, S., and Vannberg, F. O. (2018). Mapping-free variant calling using haplotype reconstruction from k-mer frequencies. Bioinformatics, 34(10):1659–1665. Auld, J. R., Agrawal, A. A., and Relyea, R. A. (2010). Re-evaluating the costs and limits of adaptive phenotypic plasticity. Proceedings of the Royal Society B: Biological Sciences, 277(1681):503– 511. Publisher: Royal Society. Aylward, A. J., Petrus, S., Mamerto, A., Hartwick, N. T., and Michael, T. P. (2023). PanKmer: k-mer-based and reference-free pangenome analysis. Bioinformatics, 39(10):btad621. Bai, X., Tang, K., Ren, J., Waterman, M., and Sun, F. (2017). Optimal choice of word length when comparing two Markov sequences using a chi-squared statistic. BMC Genomics, 18(6):732. Banerjee, C., Mukherjee, T., and Pasiliao, E. (2019). An Empirical Study on Generalizations of the ReLU Activation Function. In Proceedings of the 2019 ACM Southeast Conference, pages 164–167, Kennesaw GA USA. ACM. Baumdicker, F., Bisschop, G., Goldstein, D., Gower, G., Ragsdale, A. P., Tsambos, G., Zhu, S., Eldon, B., Ellerman, E. C., Galloway, J. G., Gladstein, A. L., Gorjanc, G., Guo, B., Jeffery, B., Kretzschumar, W. W., Lohse, K., Matschiner, M., Nelson, D., Pope, N. S., Quinto-Cortés, C. D., Rodrigues, M. F., Saunack, K., Sellinger, T., Thornton, K., van Kemenade, H., Wohns, A. W., Wong, Y., Gravel, S., Kern, A. D., Koskela, J., Ralph, P. L., and Kelleher, J. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics, 220(3):iyab229. Bayer, P. E., Golicz, A. A., Scheben, A., Batley, J., and Edwards, D. (2020). Plant pan-genomes are the new reference. Nature Plants, 6(8):914–920. Publisher: Nature Publishing Group. Bazin, E., Glémin, S., and Galtier, N. (2006). Population Size Does Not Influence Mitochondrial Genetic Diversity in Animals. Science, 312(5773):570–572. Publisher: American Association for the Advancement of Science. Becher, H., Sampson, J., and Twyford, A. D. (2022). Measuring the invisible: The sequences causal of genome size differences in eyebrights (Euphrasia) revealed by k-mers. Frontiers in Plant Science, 13. Bedo, J., Goudey, B., Wazny, J., and Zhou, Z. (2016). Information theoretic alignment free variant calling. PeerJ Computer Science, 2:e71. Publisher: PeerJ Inc. Behrens, K. A., Koblmüller, S., and Kocher, T. D. (2022). Sex chromosomes in the tribe Cyprichro- mini (Teleostei: Cichlidae) of Lake Tanganyika. Scientific Reports, 12(1):17998. Number: 1 Publisher: Nature Publishing Group. 106 Beichman, A. C., Robinson, J., Lin, M., Moreno-Estrada, A., Nigenda-Morales, S., and Harris, K. (2023). Evolution of the mutation spectrum across a mammalian phylogeny. Molecular Biology and Evolution, 40(10):msad213. Belyayev, A. (2014). Bursts of transposable elements as an evolutionary driving force. Journal of Evolutionary Biology, 27(12):2573–2584. Benoit, G., Mariadassou, M., Robin, S., Schbath, S., Peterlongo, P., and Lemaitre, C. (2020). fast and resource frugal de novo comparative metagenomics. Bioinformatics, SimkaMin: 36(4):1275–1276. Benoit, G., Peterlongo, P., Mariadassou, M., Drezen, E., Schbath, S., Lavenier, D., and Lemaitre, C. (2016). Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science, 2:e94. Publisher: PeerJ Inc. Bergeron, L. A., Besenbacher, S., Zheng, J., Li, P., Bertelsen, M. F., Quintard, B., Hoffman, J. I., Li, Z., St. Leger, J., Shao, C., Stiller, J., Gilbert, M. T. P., Schierup, M. H., and Zhang, G. (2023). Evolution of the germline mutation rate across vertebrates. Nature, 615(7951):285–291. Publisher: Nature Publishing Group. Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(null):281–305. Bernard, G., Ragan, M. A., and Chan, C. X. (2016). Recapitulating phylogenies using k-mers: from trees to networks. F1000Research, 5:2789. Betancourt, A. J. and Presgraves, D. C. (2002). Linkage limits the power of natural selection in Drosophila. Proceedings of the National Academy of Sciences, 99(21):13616–13620. Publisher: Proceedings of the National Academy of Sciences. Bethune, J., Kleppe, A., and Besenbacher, S. (2022). A method to build extended sequence con- text models of point mutations and indels. Nature Communications, 13(1):7884. Number: 1 Publisher: Nature Publishing Group. Betschart, R. O., Thiéry, A., Aguilera-Garcia, D., Zoche, M., Moch, H., Twerenbold, R., Zeller, T., Blankenberg, S., and Ziegler, A. (2022). Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment. Sci- entific Reports, 12(1):21502. Publisher: Nature Publishing Group. Bisschop, G., Lohse, K., and Setter, D. (2021). Sweeps in time: leveraging the joint distribution of branch lengths. Genetics, 219(2):iyab119. Blanca, A., Harris, R. S., Koslicki, D., and Medvedev, P. (2022). The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches. Journal of Compu- tational Biology, 29(2):155–168. Publisher: Mary Ann Liebert, Inc., publishers. Blomberg, S. P., Lefevre, J. G., Wells, J. A., and Waterhouse, M. (2012). Independent contrasts and PGLS regression estimators are equivalent. Systematic Biology, 61(3):382–391. 107 Boddé, M., Makunin, A., Ayala, D., Bouafou, L., Diabaté, A., Ekpo, U. F., Kientega, M., Le Goff, G., Makanga, B. K., Ngangue, M. F., Omitola, O. O., Rahola, N., Tripet, F., Durbin, R., and Lawniczak, M. K. (2022). High-resolution species assignment of Anopheles mosquitoes using k-mer distances on targeted sequences. eLife, 11:e78775. Publisher: eLife Sciences Publications, Ltd. Braam, J. and Davis, R. W. (1990). Rain-, wind-, and touch-induced expression of calmodulin and calmodulin-related genes in Arabidopsis. Cell, 60(3):357–364. Branca, A., Paape, T. D., Zhou, P., Briskine, R., Farmer, A. D., Mudge, J., Bharti, A. K., Woodward, J. E., May, G. D., Gentzbittel, L., Ben, C., Denny, R., Sadowsky, M. J., Ronfort, J., Bataillon, T., Young, N. D., and Tiffin, P. (2011). Whole-genome nucleotide diversity, recombination, and linkage disequilibrium in the model legume Medicago truncatula. Proceedings of the National Academy of Sciences, 108(42):E864–E870. Publisher: Proceedings of the National Academy of Sciences. Brown, M. J. M., Walker, B. E., Black, N., Govaerts, R. H. A., Ondo, and Nic Lughadha, E. R., World Checklist of Vascular Plants. https://onlinelibrary.wiley.com/doi/pdf/10.1111/nph.18919. (2023). rWCVP: a companion R package for New Phytologist, 240(4):1355–1365. I., Turner, the _eprint: Buffalo, V. (2021). Quantifying the relationship between genetic diversity and population size suggests natural selection cannot explain Lewontin’s Paradox. eLife, 10:e67509. Publisher: eLife Sciences Publications, Ltd. Buffalo, V. and Coop, G. (2020). Estimating the genome-wide contribution of selection to temporal allele frequency change. Proceedings of the National Academy of Sciences, 117(34):20672– 20680. Publisher: Proceedings of the National Academy of Sciences. Bukowski, R., Guo, X., Lu, Y., Zou, C., He, B., Rong, Z., Wang, B., Xu, D., Yang, B., Xie, C., Fan, L., Gao, S., Xu, X., Zhang, G., Li, Y., Jiao, Y., Doebley, J. F., Ross-Ibarra, J., Lorant, A., Buffalo, V., Romay, M. C., Buckler, E. S., Ware, D., Lai, J., Sun, Q., and Xu, Y. (2017). Construction of the third-generation Zea mays haplotype map. GigaScience, 7(4). Bush, S. J., Foster, D., Eyre, D. W., Clark, E. L., De Maio, N., Shaw, L. P., Stoesser, N., Peto, T. E. A., Crook, D. W., and Walker, A. S. (2020). Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines. GigaScience, 9(2):giaa007. Bush, S. J., Kover, P. X., and Urrutia, A. O. (2015). Lineage-specific sequence evolu- tion and exon edge conservation partially explain the relationship between evolutionary rate and expression level in A. thaliana. Molecular Ecology, 24(12):3093–3106. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.13221. Bussi, Y., Kapon, R., and Reich, Z. (2021). Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLOS ONE, 16(10):e0258693. Publisher: Public Library of Science. Caetano-Anolles, D. (2023). Hard-filtering germline short variants. 108 Cagan, A., Baez-Ortega, A., Brzozowska, N., Abascal, F., Coorens, T. H. H., Sanders, M. A., Law- son, A. R. J., Harvey, L. M. R., Bhosle, S., Jones, D., Alcantara, R. E., Butler, T. M., Hooks, Y., Roberts, K., Anderson, E., Lunn, S., Flach, E., Spiro, S., Januszczak, I., Wrigglesworth, E., Jenk- ins, H., Dallas, T., Masters, N., Perkins, M. W., Deaville, R., Druce, M., Bogeska, R., Milsom, M. D., Neumann, B., Gorman, F., Constantino-Casas, F., Peachey, L., Bochynska, D., Smith, E. S. J., Gerstung, M., Campbell, P. J., Murchison, E. P., Stratton, M. R., and Martincorena, I. (2022). Somatic mutation rates scale with lifespan across mammals. Nature, 604(7906):517– 524. Publisher: Nature Publishing Group. Campitelli, E. (2022). ggnewscale: Multiple Fill and Colour Scales in ’ggplot2’. Carlson, J., Locke, A. E., Flickinger, M., Zawistowski, M., Levy, S., Myers, R. M., Boehnke, M., Kang, H. M., Scott, L. J., Li, J. Z., and Zöllner, S. (2018). Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nature Communications, 9(1):3753. Number: 1 Publisher: Nature Publishing Group. Castillo-Davis, C. I., Mekhedov, S. L., Hartl, D. L., Koonin, E. V., and Kondrashov, F. A. (2002). Selection for short introns in highly expressed genes. Nature Genetics, 31(4):415–418. Number: 4 Publisher: Nature Publishing Group. Cecil, R. M. and Sugden, L. A. (2023). On convolutional neural networks for selection inference: revealing the lurking role of preprocessing, and the surprising effectiveness of summary statis- tics. Pages: 2023.02.26.530156 Section: New Results. Chamberlain, S. A. and Boettiger, C. (2017). R Python, and Ruby clients for GBIF species occur- rence data. Technical Report e3304v1, PeerJ Inc. ISSN: 2167-9843. Charlesworth, B. (1994). The effect of background selection against deleterious mutations on weakly selected, linked variants. Genetical Research, 63(3):213–227. Charlesworth, B. (2020). How Long Does It Take to Fix a Favorable Mutation, and Why Should We Care? The American Naturalist, 195(5):753–771. Publisher: The University of Chicago Press. Charlesworth, B. and Jensen, J. D. (2022). How can we resolve Lewontin’s Paradox? Genome Biology and Evolution, 14(7):evac096. Charlesworth, B., Morgan, M. T., and Charlesworth, D. (1993). The effect of deleterious mutations on neutral molecular variation. Genetics, 134(4):1289–1303. Chen, J., Glémin, S., and Lascoux, M. (2017). Genetic diversity and the efficacy of purifying selection across plant and animal species. Molecular Biology and Evolution, 34(6):1417–1428. Chen, N.-C., Solomon, B., Mun, T., Iyer, S., and Langmead, B. (2021). Reference flow: reducing reference bias using multiple population genomes. Genome Biology, 22(1):8. Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17):i884–i890. 109 Cheng, C.-Y., Krishnakumar, V., Chan, A. P., Thibaud￿Nissen, F., Schobel, S., the Arabidop- _eprint: and Town, C. D. sis https://onlinelibrary.wiley.com/doi/pdf/10.1111/tpj.13415. The Plant Journal, 89(4):789–804. thaliana reference genome. a complete reannotation of Araport11: (2017). Chikhi, R. and Medvedev, P. (2014). Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31–37. Chin, C.-S., Behera, S., Khalak, A., Sedlazeck, F. J., Sudmant, P. H., Wagner, J., and Zook, J. M. (2023). Multiscale analysis of pangenomes enables improved representation of genomic diver- sity for repetitive and clinically relevant genes. Nature Methods, 20(8):1213–1221. Number: 8 Publisher: Nature Publishing Group. Choi, I., Ponsero, A. J., Bomhoff, M., Youens-Clark, K., Hartman, J. H., and Hurwitz, B. L. (2019). Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons. GigaScience, 8(2):giy165. Chollet, F. and others (2015). Keras. Published: urlhttpskeras.io. Chu, J., Rong, J., Feng, X., and Li, H. (2024). ntsm: an alignment-free, ultra-low-coverage, se- quencing technology agnostic, intraspecies sample comparison tool for sample swap detection. GigaScience, 13:giae024. Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., and Lu, X. (2012a). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in Genetics, 3. Publisher: Frontiers Media SA. Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., and Ruden, D. M. (2012b). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly, 6(2):80–92. Cole, C. T. (2003). Genetic variation in rare nual Review of Ecology, Evolution, https://doi.org/10.1146/annurev.ecolsys.34.030102.151717. and Systematics, and common plants. 34(1):213–237. An- _eprint: Colombo, M., Laayouni, H., Invergo, B. M., Bertranpetit, J., and Montanucci, L. (2014). Metabolic flux is a determinant of the evolutionary rates of enzyme-encoding genes. Evolution, 68(2):605– 613. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/evo.12262. Compeau, P. E. C., Pevzner, P. A., and Tesler, G. (2011). Why are de Bruijn graphs useful for genome assembly? Nature biotechnology, 29(11):987–991. Coop, G. (2016). Does linked selection explain the narrow range of genetic diversity across species? Pages: 042598 Section: Contradictory Results. Coop, G. (2020). Population and Quantitative Genetics. github, 3 edition. 110 Cooper, L., Meier, A., Laporte, M.-A., Elser, J. L., Mungall, C., Sinn, B. T., Cavaliere, D., Car- bon, S., Dunn, N. A., Smith, B., Qu, B., Preece, J., Zhang, E., Todorovic, S., Gkoutos, G., Doonan, J. H., Stevenson, D. W., Arnaud, E., and Jaiswal, P. (2018). The Planteome database: an integrated resource for reference ontologies, plant genomics and phenomics. Nucleic Acids Research, 46(D1):D1168–D1180. Corbett-Detig, R. B., Hartl, D. L., and Sackton, T. B. (2015). Natural selection constrains neutral diversity across a wide range of species. PLOS Biology, 13(4):e1002112. Publisher: Public Library of Science. Cornet, L. and Baurain, D. (2022). Contamination detection in genomic data: more is not enough. Genome Biology, 23(1):60. Cornish, A. and Guda, C. (2015). A comparison of variant calling pipelines using genome _eprint: BioMed Research International, 2015(1):456479. in a bottle as a reference. https://onlinelibrary.wiley.com/doi/pdf/10.1155/2015/456479. Crameri, F. (2018). Geodynamic diagnostics, scientific visualisation and StagLab 3.0. Geoscientific Model Development, 11(6):2541–2562. Publisher: Copernicus GmbH. Csilléry, K., François, O., and Blum, M. G. B. (2012). abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3(3):475–479. Cui, F. and Yuan, B. (2018). Fixation probability of a beneficial mutation conferring decreased generation time in changing environments. BMC Systems Biology, 12(S4):48. Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., and Group, . G. P. A. (2011). The variant call format and VCFtools. Bioinformatics, 27(15):2156–2158. Publisher: Oxford Academic. Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., and Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2):giab008. Darwin, C. (1859). On the Origin of Species. Simon & Schuster. Num Pages: 9. Davis, J. C. and Petrov, D. A. (2004). Preferential duplication of conserved proteins in eukaryotic genomes. PLOS Biology, 2(3):e55. Publisher: Public Library of Science. Deng, J., Zuo, W., Wang, Z., Fan, Z., Ji, M., Wang, G., Ran, J., Zhao, C., Liu, J., Niklas, K. J., Hammond, S. T., and Brown, J. H. (2012). Insights into plant size-density relationships from models and agricultural crops. Proceedings of the National Academy of Sciences, 109(22):8600– 8605. Publisher: Proceedings of the National Academy of Sciences. Denti, L., Previtali, M., Bernardini, G., Schönhuth, A., and Bonizzoni, P. (2019). MALVA: Geno- typing by Mapping-free ALlele detection of known VAriants. iScience, 18:20–27. 111 Deorowicz, S., Kokot, M., Grabowski, S., and Debudaj-Grabysz, A. (2015). KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576. Publisher: Oxford Aca- demic. DeWitt, T. J., Sih, A., and Wilson, D. S. (1998). Costs and limits of phenotypic plasticity. Trends in Ecology & Evolution, 13(2):77–81. Doležel, J., Bartoš, J., Voglmayr, H., and Greilhuber, J. (2003). Nuclear DNA content _eprint: Cytometry Part A, 51A(2):127–128. and genome size of trout and human. https://onlinelibrary.wiley.com/doi/pdf/10.1002/cyto.a.10013. Drummond, D. A., Raval, A., and Wilke, C. O. (2006). A single determinant dominates the rate of yeast protein evolution. Molecular Biology and Evolution, 23(2):327–337. Dubinkina, V. B., Ischenko, D. S., Ulyantsev, V. I., Tyakht, A. V., and Alexeev, D. G. (2016). As- sessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC Bioin- formatics, 17(1):38. Durai, D. A. and Schulz, M. H. (2016). Informed kmer selection for de novo transcriptome assem- bly. Bioinformatics, 32(11):1670–1677. Duret, L. and Mouchiroud, D. (2000). Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Molecular Biology and Evo- lution, 17(1):68–070. Ebler, J., Ebert, P., Clarke, W. E., Rausch, T., Audano, P. A., Houwaart, T., Mao, Y., Korbel, J. O., Eichler, E. E., Zody, M. C., Dilthey, A. T., and Marschall, T. (2022). Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nature Genetics, 54(4):518–525. Number: 4 Publisher: Nature Publishing Group. Eisenberg, E. and Levanon, E. Y. (2003). Human housekeeping genes are compact. Trends in Genetics, 19(7):362–365. Ellegren, H. and Galtier, N. (2016). Determinants of genetic diversity. Nature Reviews Genetics, 17(7):422–433. Number: 7 Publisher: Nature Publishing Group. Emms, D. M. and Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology, 20(1):238. Ewels, P., Magnusson, M., Lundin, S., and Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19):3047–3048. Exposito-Alonso, M., Booker, T. R., Czech, L., Gillespie, L., Hateley, S., Kyriazis, C. C., Lang, P. L. M., Leventhal, L., Nogues-Bravo, D., Pagowski, V., Ruffley, M., Spence, J. P., Toro Arana, S. E., Weiß, C. L., and Zess, E. (2022). Genetic diversity loss in the Anthropocene. Science, 377(6613):1431–1435. Publisher: American Association for the Advancement of Science. Fan, L., Cao, P., Almeida, J., and Broder, A. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293. 112 Filatov, D. A. (2019). Extreme Lewontin’s Paradox in Ubiquitous Marine Phytoplankton Species. Molecular Biology and Evolution, 36(1):4–14. Fisher, R. A. (1923). XXI.—On the Dominance Ratio. Proceedings of the Royal Society of Edin- burgh, 42:321–341. Flagel, L., Brandvain, Y., and Schrider, D. R. (2019). The Unreasonable Effectiveness of Convo- lutional Neural Networks in Population Genetic Inference. Molecular Biology and Evolution, 36(2):220–238. Flavell, R. B., Bennett, M. D., Smith, J. B., and Smith, D. B. (1974). Genome size and the propor- tion of repeated nucleotide sequence DNA in plants. Biochemical Genetics, 12(4):257–269. Fletcher, K., Zhang, L., Gil, J., Han, R., Cavanaugh, K., and Michelmore, R. (2021). AFLAP: assembly-free linkage analysis pipeline using k-mers from genome sequencing data. Genome Biology, 22(1):115. Flowers, J. M., Molina, J., Rubinstein, S., Huang, P., Schaal, B. A., and Purugganan, M. D. (2012). Natural selection in gene-dense regions shapes the genomic pattern of polymorphism in wild and domesticated rice. Molecular Biology and Evolution, 29(2):675–687. Fofanov, Y., Luo, Y., Katili, C., Wang, J., Belosludtsev, Y., Powdrill, T., Belapurkar, C., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt, B. M. (2004). How independent are the appearances of n-mers in different genomes? Bioinformatics, 20(15):2421–2428. Fong, L. J. M., Darolti, I., Metzger, D. C. H., Morris, J., Lin, Y., Sandkam, B. A., and Mank, J. E. (2023). Evolutionary history of the Poecilia picta sex chromosomes. Genome Biology and Evolution, 15(3):evad030. Frankham, R. (2012). How closely does genetic diversity in finite populations conform to predic- tions of neutral theory? Large deficits in regions of low recombination. Heredity, 108(3):167– 178. Number: 3 Publisher: Nature Publishing Group. Frenkel, S., Kirzhner, V., Frenkel, Z., and Korol, A. B. (2016). Organizational heterogeneity of the human genome: significant variation of recombination rate of 100 kbp sequences within GC ranges. In 2016 Second International Symposium on Stochastic Models in Reliability Engineer- ing, Life Science and Operations Management (SMRLO), pages 414–420. Fukushima, K. and Pollock, D. D. (2020). Amalgamated cross-species transcriptomes reveal organ- specific propensity in gene expression evolution. Nature Communications, 11(1):4459. Number: 1 Publisher: Nature Publishing Group. Gage, J. L., Vaillancourt, B., Hamilton, J. P., Manrique-Carpintero, N. C., Gustafson, T. J., Barry, K., Lipzen, A., Tracy, W. F., Mikel, M. A., Kaeppler, S. M., Buell, C. R., and de Leon, N. (2019). Multiple Maize Reference Genomes Impact the Identification of Variants by Genome- Wide Association Study in a Diverse Inbred Panel. The Plant Genome, 12(2):180069. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.3835/plantgenome2018.09.0069. 113 Gardner, S. N. and Hall, B. G. (2013). When whole-genome alignments just won’t work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes. PLOS ONE, 8(12):e81760. Publisher: Public Library of Science. Gardner, S. N. and Slezak, T. (2010). Scalable SNP analyses of 100+ vacterial or viral genomes. Journal of Forensic Research, 01(03). Garrison, E., Guarracino, A., Heumos, S., Villani, F., Bao, Z., Tattini, L., Hagmann, J., Vorbrugg, S., Marco-Sola, S., Kubica, C., Ashbrook, D. G., Thorell, K., Rusholme-Pilcher, R. L., Liti, G., Rudbeck, E., Golicz, A. A., Nahnsen, S., Yang, Z., Mwaniki, M. N., Nobrega, F. L., Wu, Y., Chen, H., de Ligt, J., Sudmant, P. H., Huang, S., Weigel, D., Soranzo, N., Colonna, V., Williams, R. W., and Prins, P. (2024). Building pangenome graphs. Nature Methods, 21(11):2008–2012. Publisher: Nature Publishing Group. Gaut, B., Yang, L., Takuno, S., and Eguiarte, L. E. (2011). The patterns and causes of variation in plant nucleotide substitution rates. Annual Review of Ecology, Evolution, and Systematics, 42(1):245–266. Gauthier, J., Mouden, C., Suchan, T., Alvarez, N., Arrigo, N., Riou, C., Lemaitre, C., and Peter- longo, P. (2020). DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics. PeerJ, 8:e9291. Publisher: PeerJ Inc. Georgakopoulos-Soares, I., Yizhar-Barnea, O., Mouratidis, I., Hemberg, M., and Ahituv, N. (2021). Absent from DNA and protein: genomic characterization of nullomers and nullpeptides across functional categories and evolution. Genome Biology, 22(1):245. Gerland, U. and Hwa, T. (2002). On the selection and evolution of regulatory DNA motifs. Journal of Molecular Evolution, 55(4):386–400. Glasenapp, M. R. and Pogson, G. H. (2024). Selection Shapes the Genomic Landscape of Intro- gressed Ancestry in a Pair of Sympatric Sea Urchin Species. Genome Biology and Evolution, 16(6):evae124. Glémin, S. (2007). Mating systems and the efficacy of selection at the molecular level. Genetics, 177(2):905–916. Glémin, S. (2012). Extinction and fixation times with dominance and inbreeding. Theoretical Population Biology, 81(4):310–316. Goerner-Potvin, P. and Bourque, G. (2018). Computational tools to unmask transposable elements. Nature Reviews Genetics, 19(11):688–704. Publisher: Nature Publishing Group. Golicz, A. A., Bayer, P. E., Bhalla, P. L., Batley, J., and Edwards, D. (2020). Pangenomics comes of age: from bacteria to plant and animal applications. Trends in Genetics, 36(2):132–145. Gourlé, H., Karlsson-Lindsjö, O., Hayer, J., and Bongcam-Rudloff, E. (2019). Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics, 35(3):521–522. 114 Govaerts, R., Nic Lughadha, E., Black, N., Turner, R., and Paton, A. (2021). The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Scientific Data, 8(1):215. Number: 1 Publisher: Nature Publishing Group. Greven, A., Pfaffelhuber, P., Pokalyuk, C., and Wakolbinger, A. (2016). The fixation time of a strongly beneficial allele in a structured population. Electronic Journal of Probability, 21(none):1–42. Publisher: Institute of Mathematical Statistics and Bernoulli Society. Groen, S. C., Ćalić, I., Joly-Lopez, Z., Platts, A. E., Choi, J. Y., Natividad, M., Dorph, K., Mauck, W. M., Bracken, B., Cabral, C. L. U., Kumar, A., Torres, R. O., Satija, R., Vergara, G., Henry, A., Franks, S. J., and Purugganan, M. D. (2020). The strength and pattern of natural selection on gene expression in rice. Nature, 578(7796):572–576. Number: 7796 Publisher: Nature Publishing Group. Grytten, I., Dagestad Rand, K., and Sandve, G. K. (2022). KAGE: fast alignment-free graph-based genotyping of SNPs and short indels. Genome Biology, 23(1):209. Guo, Q., Qian, H., Zhang, J., species age and range size. https://onlinelibrary.wiley.com/doi/pdf/10.1111/jbi.14809. and Liu, P. Journal of Biogeography, 00(n/a):1–9. The relationships between _eprint: (2024). Guo, X., Wang, Y., Keightley, P. D., and Fan, L. (2007). Patterns of selective constraints in non- coding DNA of rice. BMC Evolutionary Biology, 7(1):208. Gupta, P. K. (2021). pangenome and SNPs to SVs and k-mers. https://onlinelibrary.wiley.com/doi/pdf/10.1002/bies.202100109. GWAS for genetics of complex quantitative traits: Genome to _eprint: BioEssays, 43(11):2100109. Gyorgy, A. (2023). Competition and evolutionary selection among core regulatory motifs in gene expression control. Nature Communications, 14(1):8266. Number: 1 Publisher: Nature Pub- lishing Group. Göktay, M., Fulgione, A., and Hancock, A. M. (2021). A new catalog of structural variants in 1,301 A. thaliana lines from Africa, Eurasia, and North America reveals a signature of balancing selection at defense response genes. Molecular Biology and Evolution, 38(4):1498–1511. Günther, T. and Nettelblad, C. (2019). The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLOS Genetics, 15(7):e1008302. Publisher: Public Library of Science. Haberer, G., Kamal, N., Bauer, E., Gundlach, H., Fischer, I., Seidel, M. A., Spannagl, M., Marcon, C., Ruban, A., Urbany, C., Nemri, A., Hochholdinger, F., Ouzunova, M., Houben, A., Schön, C.-C., and Mayer, K. F. X. (2020). European maize genomes highlight intraspecies variation in repeat and gene content. Nature Genetics, 52(9):950–957. Publisher: Nature Publishing Group. Hahn, M. W. (2018). Molecular Population Genetics. Oxford University Press. Google-Books-ID: 3BDkswEACAAJ. 115 Hahn, M. W. and Mishra, S. R. (2025). Estimating recombination using only the allele frequency spectrum. Pages: 2025.02.01.635998 Section: New Results. Haldane, J. B. S. (1927). A Mathematical Theory of Natural and Artificial Selection, Part V: Selection and Mutation. Mathematical Proceedings of the Cambridge Philosophical Society, 23(7):838–844. Haller, B. C., Galloway, J., Kelleher, J., Messer, P. W., and Ralph, P. L. (2019). Tree-sequence recording in slim opens new horizons for forward-time simulation of whole genomes. Molecular Ecology Resources, 19(2):552–566. Haller, B. C. and Messer, P. W. (2019). SLiM 3: forward genetic simulations beyond the Wright– Fisher model. Molecular Biology and Evolution, 36(3):632–637. Halushka, M. K., Fan, J.-B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R., and Chakravarti, A. (1999). Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nature Genetics, 22(3):239–247. Publisher: Nature Publishing Group. Haubold, B. (2014). Alignment-free phylogenetics and population genetics. Briefings in Bioinfor- matics, 15(3):407–418. Haubold, B., Krause, L., Horn, T., and Pfaffelhuber, P. (2013). An alignment-free test for recom- bination. Bioinformatics, 29(24):3121–3127. Haubold, B. and Pfaffelhuber, P. (2012). Alignment-free population genomics: an efficient esti- mator of sequence diversity. G3 Genes|Genomes|Genetics, 2(8):883–889. Haubold, B., Reed, F. A., and Pfaffelhuber, P. (2011). Alignment-free estimation of nucleotide diversity. Bioinformatics, 27(4):449–455. He, Z., Dai, X., Beaumont, M., and Yu, F. (2020). Estimation of Natural Selection and Allele Age from Time Series Allele Frequency Data Using a Novel Likelihood-Based Approach. Genetics, 216(2):463–480. Hellmann, I., Ebersberger, I., Ptak, S. E., Pääbo, S., and Przeworski, M. (2003). A neutral explana- tion for the correlation of diversity with recombination rates in humans. The American Journal of Human Genetics, 72(6):1527–1535. Hickey, G., Heller, D., Monlong, J., Sibbesen, J. A., Sirén, J., Eizenga, J., Dawson, E. T., Garrison, E., Novak, A. M., and Paten, B. (2020). Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology, 21(1):35. Ho, E. K. H., Macrae, F., Latta, IV, L. C., Benner, M. J., Sun, C., Ebert, D., and Schaack, S. (2019). Intraspecific variation in microsatellite mutation profiles in Daphnia magna. Molecular Biology and Evolution, 36(9):1942–1954. 116 Hrytsenko, Y., Daniels, N. M., and Schwartz, R. S. (2022). Determining population structure from k-mer frequencies. In Proceedings of the 13th ACM International Conference on Bioinformat- ics, Computational Biology and Health Informatics, BCB ’22, page 1, New York, NY, USA. Association for Computing Machinery. Huang, Y.-F. (2022). Dissecting genomic determinants of positive selection with an evolution- guided regression model. Molecular Biology and Evolution, 39(1):msab291. Huber, C. D., DeGiorgio, M., Hellmann, I., and Nielsen, R. (2016). Detecting recent selec- tive sweeps while controlling for mutation rate and background selection. Molecular Ecology, 25(1):142–156. Huber, C. D., Kim, B. Y., Marsden, C. D., and Lohmueller, K. E. (2017). Determining the fac- tors driving selective effects of new nonsynonymous mutations. Proceedings of the National Academy of Sciences, 114(17):4465–4470. Publisher: Proceedings of the National Academy of Sciences. Hunt, B. G., Ometto, L., Wurm, Y., Shoemaker, D., Yi, S. V., Keller, L., and Goodisman, M. A. D. (2011). Relaxed selection is a precursor to the evolution of phenotypic plasticity. Proceedings of the National Academy of Sciences, 108(38):15936–15941. Publisher: National Academy of Sciences Section: Biological Sciences. Häntze, H. and Horton, P. (2023). Effects of spaced k-mers on alignment-free genotyping. Bioin- formatics, 39(Supplement_1):i213–i221. Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G., Pérez-Torres, C. A., Carretero-Paulet, L., Chang, T.-H., Lan, T., Welch, A. J., Juárez, M. J. A., Simpson, J., Fernández-Cortés, A., Arteaga- Vázquez, M., Góngora-Castillo, E., Acevedo-Hernández, G., Schuster, S. C., Himmelbauer, H., Minoche, A. E., Xu, S., Lynch, M., Oropeza-Aburto, A., Cervantes-Pérez, S. A., de Jesús Ortega- Estrada, M., Cervantes-Luevano, J. I., Michael, T. P., Mockler, T., Bryant, D., Herrera-Estrella, A., Albert, V. A., and Herrera-Estrella, L. (2013). Architecture and evolution of a minute plant genome. Nature, 498(7452):94–98. Publisher: Nature Publishing Group. Institute, B. (2019). Picard toolkit. Broad Institute, GitHub repository. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., and McVean, G. (2012). De novo assembly and geno- typing of variants using colored de Bruijn graphs. Nature Genetics, 44(2):226–232. Number: 2 Publisher: Nature Publishing Group. Jaegle, B., Pisupati, R., Soto-Jiménez, L. M., Burns, R., Rabanal, F. A., and Nordborg, M. (2023). Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity. Genome Bi- ology, 24(1):44. Jenike, K. M., Campos-Domínguez, L., Boddé, M., Cerca, J., Hodson, C. N., Schatz, M. C., and Jaron, K. S. (2024). Guide to k-mer approaches for genomics across the tree of life. arXiv:2404.01519. 117 Johnsson, P., Lipovich, L., Grandér, D., and Morris, K. V. (2014). Evolutionary conservation of long noncoding RNAs; sequence, structure, function. Biochimica et biophysica acta, 1840(3):1063– 1071. Johri, P., Aquadro, C. F., Beaumont, M., Charlesworth, B., Excoffier, L., Eyre-Walker, A., Keight- ley, P. D., Lynch, M., McVean, G., Payseur, B. A., Pfeifer, S. P., Stephan, W., and Jensen, J. D. (2022a). Recommendations for improving statistical inference in population genomics. PLOS Biology, 20(5):e3001669. Publisher: Public Library of Science. Johri, P., Charlesworth, B., and Jensen, J. D. (2020). Toward an evolutionarily appropriate null model: jointly inferring demography and purifying selection. Genetics, 215(1):173–192. Johri, P., Eyre-Walker, A., Gutenkunst, R. N., Lohmueller, K. E., and Jensen, J. D. (2022b). On the prospect of achieving accurate joint estimation of selection with population history. Genome Biology and Evolution, 14(7):evac088. Jordan, I. K., Wolf, Y. I., and Koonin, E. V. (2004). Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evolutionary Biology, 4(1):22. Josephs, E. B., Wright, S. I., Stinchcombe, J. R., and Schoen, D. J. (2017). The relationship be- tween selection, network connectivity, and regulatory variation within a population of Capsella grandiflora. Genome Biology and Evolution, 9(4):1099–1109. Kane, N. C. and Rieseberg, L. H. (2007). Selective Sweeps Reveal Candidate Genes for Adap- tation to Drought and Salt Tolerance in Common Sunflower, Helianthus annuus. Genetics, 175(4):1823–1834. Kaplinski, L., Möls, M., Puurand, T., Pajuste, F.-D., and Remm, M. (2021). KATK: Fast genotyping of rare variants directly from unmapped sequencing reads. Human Mutation, 42(6):777–786. _- eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/humu.24197. Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780. Publisher: Oxford Academic. Kaushik, S. and Jain, K. (2021). Time to fixation in changing environments. Genetics, 219(3):iyab148. Kawecki, T. J. (1994). Accumulation of Deleterious Mutations and the Evolutionary Cost of Being a Generalist. The American Naturalist, 144(5):833–838. Publisher: The University of Chicago Press. Ke, S., Zhang, X. H.-F., and Chasin, L. A. (2008). Positive selection acting on splicing motifs reflects compensatory evolution. Genome Research, 18(4):533–543. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. 118 Kelley, D. R., Schatz, M. C., and Salzberg, S. L. (2010). Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116. Kent, W. J. (2002). The human genome browser at UCSC. Genome Res., 12(6):996–1006. Kern, A. D. and Schrider, D. R. (2018). diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3 Genes|Genomes|Genetics, 8(6):1959–1970. Kidwell, M. G. (2002). Transposable elements and the evolution of genome size in eukaryotes. Genetica, 115(1):49–63. Kim, J.-H., Park, J.-S., Lee, C.-Y., Jeong, M.-G., Xu, J. L., Choi, Y., Jung, H.-W., and Choi, H.- K. (2020). Dissecting seed pigmentation-associated genomic loci and genes by employing dual approaches of reference-based and k-mer-based GWAS with 438 Glycine accessions. PLOS ONE, 15(12):e0243085. Publisher: Public Library of Science. Kim, S. (2015). ppcor: an R package for a fast calculation to semi-partial correlation coefficients. Communications for statistical applications and methods, 22(6):665–674. Kim, Y. and Stephan, W. (2002). Detecting a local signature of genetic hitchhiking along a recom- bining chromosome. Genetics, 160(2):765–777. WOS:000174097600036. Kimura, M. (1980). Average time until fixation of a mutant allele in a finite population under continued mutation pressure: Studies by analytical, numerical, and pseudo-sampling methods. Proceedings of the National Academy of Sciences, 77(1):522–526. Publisher: Proceedings of the National Academy of Sciences. Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge University Press. Google-Books-ID: olIoSumPevYC. Kimura, M. and Ohta, T. (1969). The Average Number of Generations until Fixation of a Mutant Gene in a Finite Population. Genetics, 61(3):763–771. Klassmann, A. and Gautier, M. (2022). Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data. PLoS ONE, 17(1):e0262024. Kojima, K. and Schaffer, H. E. (1964). Accumulation of epistatic gene complexes. Evolution, 18(1):127–129. Kojima, K.-i. and Schaffer, H. E. (1967). Survival Process of Linked Mutant Genes. Evolution, 21(3):518–531. Publisher: [Society for the Study of Evolution, Wiley]. Kokot, M., Długosz, M., and Deorowicz, S. (2017). KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761. Publisher: Oxford Academic. Kolekar, P., Kale, M., and Kulkarni-Kale, U. (2012). Alignment-free distance measure based on return time distribution for sequence analysis: Applications to clustering, molecular phylogeny and subtyping. Molecular Phylogenetics and Evolution, 65(2):510–522. 119 Kolář, F., Čertner, M., Suda, J., Schönswetter, P., and Husband, B. C. (2017). Mixed-ploidy species: progress and opportunities in polyploid research. Trends in Plant Science, 22(12):1041–1055. Publisher: Elsevier. Koonin, E. V. (2011). Are there laws of genome evolution? PLOS Computational Biology, 7(8):e1002173. Publisher: Public Library of Science. Korunes, K. L. and Samuk, K. (2021). pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Molecular Ecology Resources, 21(4):1359–1368. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/1755-0998.13326. Kryuchkova-Mostacci, N. and Robinson-Rechavi, M. (2017). A benchmark of gene expression tissue-specificity metrics. Briefings in Bioinformatics, 18(2):205–214. Kumar, S., Stecher, G., Suleski, M., and Hedges, S. B. (2017). TimeTree: a resource for timelines, timetrees, and divergence times. Molecular Biology and Evolution, 34(7):1812–1819. Kumar, S., Suleski, M., Craig, J. M., Kasprowicz, A. E., Sanderford, M., Li, M., Stecher, G., and Hedges, S. B. (2022). TimeTree 5: an expanded resource for species divergence times. Molecular Biology and Evolution, 39(8):msac174. Lanfear, R., Ho, S. Y. W., Jonathan Davies, T., Moles, A. T., Aarssen, L., Swenson, N. G., Warman, L., Zanne, A. E., and Allen, A. P. (2013). Taller plants have lower rates of molecular evolution. Nature Communications, 4(1):1879. Number: 1 Publisher: Nature Publishing Group. Langley, C. H., Stevens, K., Cardeno, C., Lee, Y. C. G., Schrider, D. R., Pool, J. E., Langley, S. A., Suarez, C., Corbett-Detig, R. B., Kolaczkowski, B., Fang, S., Nista, P. M., Holloway, A. K., Kern, A. D., Dewey, C. N., Song, Y. S., Hahn, M. W., and Begun, D. J. (2012). Genomic variation in natural populations of Drosophila melanogaster. Genetics, 192(2):533–598. Larracuente, A. M., Sackton, T. B., Greenberg, A. J., Wong, A., Singh, N. D., Sturgill, D., Zhang, Y., Oliver, B., and Clark, A. G. (2008). Evolution of protein-coding genes in Drosophila. Trends in Genetics, 24(3):114–123. Lauterbur, M. E., Munch, K., and Enard, D. (2023). Versatile detection of diverse selective sweeps with Flex-sweep. Pages: 2022.11.15.516494 Section: New Results. Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739. Number: 10 Publisher: Nature Publishing Group. Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLOS Genetics, 3(9):e161. Publisher: Public Library of Science. Leffler, E. M., Bullaughey, K., Matute, D. R., Meyer, W. K., Ségurel, L., Venkat, A., Andolfatto, P., and Przeworski, M. (2012). Revisiting an old riddle: what determines genetic diversity levels within species? PLOS Biology, 10(9):e1001388. Publisher: Public Library of Science. 120 Leggett, R. M., Ramirez-Gonzalez, R. H., Verweij, W., Kawashima, C. G., Iqbal, Z., Jones, J. D. G., Caccamo, M., and MacLean, D. (2013). Identifying and classifying trait linked polymor- phisms in non-reference species by walking coloured de Bruijn graphs. PLOS ONE, 8(3):e60058. Publisher: Public Library of Science. Lei, L., Goltsman, E., Goodstein, D., Wu, G. A., Rokhsar, D. S., and Vogel, J. P. (2021). Plant pan-genomics comes of age. Annual Review of Plant Biology, 72(1):411–435. _eprint: https://doi.org/10.1146/annurev-arplant-080720-105454. Lemane, T., Chikhi, R., and Peterlongo, P. (2022). kmdiff, large-scale and user-friendly differential k-mer analyses. Bioinformatics, 38(24):5443–5445. Lewanski, A. L., Grundler, M. C., and Bradburd, G. S. (2024). The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLOS Genetics, 20(1):e1011110. Lewontin, R. C. (1974). The genetic basis of evolutionary change. Columbia University Press. Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio]. Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage sam- ples. Bioinformatics, 30(20):2843–2851. Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler trans- form. Bioinformatics, 25(14):1754–1760. Li, W. H. and Sadler, L. A. (1991). Low Nucleotide Diversity in Man. Genetics, 129(2):513–523. Li, Y., Patel, H., and Lin, Y. (2022). Kmer2SNP: Reference-free heterozygous SNP calling using k-mer frequency distributions. In Ng, C. and Piscuoglio, S., editors, Variant Calling: Methods and Protocols, pages 257–265. Springer US, New York, NY. Liao, Q., Du, R., Gou, J., Guo, L., Shen, H., Liu, H., Nguyen, J. K., Ming, R., Yin, T., Huang, S., and Yan, J. (2020). The genomic architecture of the sex-determining region and sex- related metabolic variation in Ginkgobiloba. The Plant Journal, 104(5):1399–1409. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/tpj.15009. Liao, W.-W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., Lu, S., Lucas, J. K., Monlong, J., Abel, H. J., Buonaiuto, S., Chang, X. H., Cheng, H., Chu, J., Colonna, V., Eizenga, J. M., Feng, X., Fischer, C., Fulton, R. S., Garg, S., Groza, C., Guarracino, A., Harvey, W. T., Heumos, S., Howe, K., Jain, M., Lu, T.-Y., Markello, C., Martin, F. J., Mitchell, M. W., Munson, K. M., Mwaniki, M. N., Novak, A. M., Olsen, H. E., Pesout, T., Porubsky, D., Prins, P., Sibbesen, J. A., Sirén, J., Tomlinson, C., Villani, F., Vollger, M. R., Antonacci-Fulton, L. L., Baid, G., Baker, C. A., Belyaeva, A., Billis, K., Carroll, A., Chang, P.-C., Cody, S., Cook, D. E., Cook- Deegan, R. M., Cornejo, O. E., Diekhans, M., Ebert, P., Fairley, S., Fedrigo, O., Felsenfeld, A. L., Formenti, G., Frankish, A., Gao, Y., Garrison, N. A., Giron, C. G., Green, R. E., Haggerty, L., Hoekzema, K., Hourlier, T., Ji, H. P., Kenny, E. E., Koenig, B. A., Kolesnikov, A., Korbel, J. O., Kordosky, J., Koren, S., Lee, H., Lewis, A. P., Magalhães, H., Marco-Sola, S., Marijon, P., 121 McCartney, A., McDaniel, J., Mountcastle, J., Nattestad, M., Nurk, S., Olson, N. D., Popejoy, A. B., Puiu, D., Rautiainen, M., Regier, A. A., Rhie, A., Sacco, S., Sanders, A. D., Schneider, V. A., Schultz, B. I., Shafin, K., Smith, M. W., Sofia, H. J., Abou Tayoun, A. N., Thibaud-Nissen, F., Tricomi, F. F., Wagner, J., Walenz, B., Wood, J. M. D., Zimin, A. V., Bourque, G., Chaisson, M. J. P., Flicek, P., Phillippy, A. M., Zook, J. M., Eichler, E. E., Haussler, D., Wang, T., Jarvis, E. D., Miga, K. H., Garrison, E., Marschall, T., Hall, I. M., Li, H., and Paten, B. (2023). A draft human pangenome reference. Nature, 617(7960):312–324. Publisher: Nature Publishing Group. Lichilín, N., Salzburger, W., and Böhne, A. (2023). No evidence for sex chromosomes in natural populations of the cichlid fish Astatotilapia burtoni. G3 Genes|Genomes|Genetics, 13(3):jkad011. Lin, Y.-S., Hsu, W.-L., Hwang, J.-K., and Li, W.-H. (2007). Proportion of solvent-exposed amino acids in a protein and rate of protein evolution. Molecular Biology and Evolution, 24(4):1005– 1011. Linard, B., Swenson, K., and Pardi, F. (2019). Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics, 35(18):3303–3312. Liu, G., Liu, J., Cui, X., and Cai, L. (2012). Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. Journal of Theoretical Biology, 293:49–54. Liu, S., Zheng, J., Migeon, P., Ren, J., Hu, Y., He, C., Liu, H., Fu, J., White, F. F., Toomajian, C., and Wang, G. (2017). Unbiased k-mer analysis reveals changes in copy number of highly repet- itive sequences during maize domestication and improvement. Scientific Reports, 7(1):42444. Number: 1 Publisher: Nature Publishing Group. Liu, Z. and Samee, M. (2023). Structural underpinnings of mutation rate variations in the human genome. Nucleic Acids Research, 51(14):7184–7197. Luczak, B. B., James, B. T., and Girgis, H. Z. (2019). A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Briefings in Bioinformatics, 20(4):1222–1237. Lynch, M. and Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science, 290(5494):1151–1155. Publisher: American Association for the Advancement of Sci- ence. Lynch, M. and Conery, J. S. (2003). The origins of genome complexity. Science, 302(5649):1401– 1404. Publisher: American Association for the Advancement of Science. López￿Delgado, J. and Meirmans, P. G. (2022). History or demography? Determining the drivers of genetic variation in North American plants. Molecular Ecology, 31(7):1951–1962. Mackintosh, A., Laetsch, D. R., Hayward, A., Charlesworth, B., Waterfall, M., Vila, R., and Lohse, K. (2019). The determinants of genetic diversity in butterflies. Nature Communications, 10(1):3466. Number: 1 Publisher: Nature Publishing Group. 122 Makałowski, W. and Boguski, M. S. (1998). Evolutionary parameters of the transcribed mam- malian genome: An analysis of 2,820 orthologous rodent and human sequences. Proceedings of the National Academy of Sciences, 95(16):9407–9412. Publisher: Proceedings of the National Academy of Sciences. Margulies, E. H., Blanchette, M., Program, N. C. S., Haussler, D., and Green, E. D. (2003). Identification and characterization of multi-species conserved sequences. Genome Research, 13(12):2507–2518. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Masalia, R. R., Bewick, A. J., and Burke, J. M. (2017). Connectivity in gene coexpression net- works negatively correlates with rates of molecular evolution in flowering plants. PLOS ONE, 12(7):e0182289. Publisher: Public Library of Science. Mattila, T. M., Tyrmi, J., Pyhäjärvi, T., and Savolainen, O. (2017). Genome-Wide Analysis of Colonization History and Concomitant Selection in Arabidopsis lyrata. Molecular Biology and Evolution, 34(10):2665–2677. McClelland, M. (1985). Selection againstdam methylation sites in the genomes of DNA of enter- obacteriophages. Journal of Molecular Evolution, 21(4):317–322. McGuigan, K., Collet, J. M., Allen, S. L., Chenoweth, S. F., and Blows, M. W. (2014). Pleiotropic mutations are subject to strong stabilizing selection. Genetics, 197(3):1051–1062. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Re- search, 20(9):1297–1303. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Mehrab, Z., Mobin, J., Tahmid, I. A., and Rahman, A. (2021). Efficient association mapping from k-mers—An application in finding sex-specific sequences. PLOS ONE, 16(1):e0245058. Publisher: Public Library of Science. Mei, W., Stetter, M. G., Gates, D. J., Stitzer, M. C., and Ross-Ibarra, J. (2018). Adaptation in plant genomes: Bigger is different. American Journal of Botany, 105(1):16–19. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/ajb2.1002. Melsted, P. and Pritchard, J. K. (2011). Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12(1):333. Mirchandani, C. D., Shultz, A. J., Thomas, G. W. C., Smith, S. J., Baylis, M., Arnold, B., Corbett- Detig, R., Enbody, E., and Sackton, T. B. (2024). A fast, reproducible, high-throughput variant calling workflow for population genomics. Molecular Biology and Evolution, 41(1):msad270. 123 Mishra, Y., Johansson Jänkänpää, H., Kiss, A. Z., Funk, C., Schröder, W. P., and Jansson, S. (2012). Arabidopsis plants grown in the field and climate chambers significantly differ in leaf morphology and photosystem components. BMC Plant Biology, 12(1):6. Mondragón-Palomino, M., Meyers, B. C., Michelmore, R. W., and Gaut, B. S. (2002). Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana. Genome Research, 12(9):1305–1315. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Moriyama, E. N. and Powell, J. R. (1996). Intraspecific nuclear DNA variation in Drosophila. Molecular Biology and Evolution, 13(1):261–277. Moutinho, A. F., Eyre-Walker, A., and Dutheil, J. Y. (2022). Strong evidence for the adaptive walk model of gene evolution in Drosophila and Arabidopsis. PLOS Biology, 20(9):e3001775. Publisher: Public Library of Science. Mukherjee, D., Mukherjee, A., and Ghosh, T. C. (2016). Evolutionary rate heterogeneity of pri- mary and secondary metabolic pathway genes in Arabidopsis thaliana. Genome Biology and Evolution, 8(1):17–28. Murray, K. D., Webers, C., Ong, C. S., Borevitz, J., and Warthmann, N. (2017). kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity. PLOS Computational Biology, 13(9):e1005727. Publisher: Public Library of Science. Murren, C. J., Auld, J. R., Callahan, H., Ghalambor, C. K., Handelsman, C. A., Heskel, M. A., Kingsolver, J. G., Maclean, H. J., Masel, J., Maughan, H., Pfennig, D. W., Relyea, R. A., Seiter, S., Snell-Rood, E., Steiner, U. K., and Schlichting, C. D. (2015). Constraints on the evolution of phenotypic plasticity: limits and costs of phenotype and plasticity. Heredity, 115(4):293– 301. Bandiera_abtest: a Cc_license_type: cc_y Cg_type: Nature Research Journals Number: 4 Primary_atype: Reviews Publisher: Nature Publishing Group Subject_term: Evolutionary theory Subject_term_id: evolutionary-theory. Muñoz-Diez, C., Vitte, C., Ross-Ibarra, J., Gaut, B. S., and Tenaillon, M. I. (2012). Using nextgen sequencing to investigate genome size variation and transposable element content. In Grand- bastien, M.-A. and Casacuberta, J. M., editors, Plant Transposable Elements: Impact on Genome Structure and Function, pages 41–58. Springer, Berlin, Heidelberg. Myers, S., Freeman, C., Auton, A., Donnelly, P., and McVean, G. (2008). A common sequence mo- tif associated with recombination hot spots and genome instability in humans. Nature Genetics, 40(9):1124–1129. Number: 9 Publisher: Nature Publishing Group. Mähler, N., Wang, J., Terebieniec, B. K., Ingvarsson, P. K., Street, N. R., and Hvidsten, T. R. (2017). Gene co-expression network connectivity is an important determinant of selective constraint. PLoS Genetics, 13(4):e1006402. Nakagome, S., Hudson, R. R., and Di Rienzo, A. (2019). Inferring the model and onset of natural selection under varying population size from the site frequency spectrum and haplotype structure. 124 Proceedings of the Royal Society B: Biological Sciences, 286(1896):20182541. Publisher: Royal Society. Nei, M. and Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution, 3(5):418–426. Nei, M. and Graur, D. (1984). Extent of protein polymorphism and the neutral mutation theory. Evolutionary Biology, 17:73–118. Nei, M. and Li, W. H. (1979). Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences, 76(10):5269–5273. Publisher: Proceedings of the National Academy of Sciences. Nei, M. and Tajima, F. (1981). DNA polymorphism detectable by restriction endonucleases. Ge- netics, 97(1):145–163. Nembaware, V., Crum, K., Kelso, J., and Seoighe, C. (2002). Impact of the presence of paralogs on sequence divergence in a set of mouse-human orthologs. Genome Research, 12(9):1370–1376. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Neves, C. J., Matzrafi, M., Thiele, M., Lorant, A., Mesgaran, M. B., and Stetter, M. G. (2020). Male linked genomic region determines sex in dioecious Amaranthus palmeri. Journal of Heredity, 111(7):606–612. Nordborg, M., Hu, T. T., Ishino, Y., Jhaveri, J., Toomajian, C., Zheng, H., Bakker, E., Calabrese, P., Gladstone, J., Goyal, R., Jakobsson, M., Kim, S., Morozov, Y., Padhukasahasram, B., Plagnol, V., Rosenberg, N. A., Shah, C., Wall, J. D., Wang, J., Zhao, K., Kalbfleisch, T., Schulz, V., Kreitman, M., and Bergelson, J. (2005). The pattern of polymorphism in Arabidopsis thaliana. PLOS Biology, 3(7):e196. Publisher: Public Library of Science. Nordström, K. J. V., Albani, M. C., James, G. V., Gutjahr, C., Hartwig, B., Turck, F., Paszkowski, U., Coupland, G., and Schneeberger, K. (2013). Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature Biotechnology, 31(4):325–330. Number: 4 Publisher: Nature Publishing Group. Novembre, J. and Stephens, M. (2008). Interpreting principal component analyses of spatial popu- lation genetic variation. Nature Genetics, 40(5):646–649. Publisher: Nature Publishing Group. Nystedt, B., Street, N. R., Wetterbom, A., Zuccolo, A., Lin, Y.-C., Scofield, D. G., Vezzi, F., Delhomme, N., Giacomello, S., Alexeyenko, A., Vicedomini, R., Sahlin, K., Sherwood, E., Elf- strand, M., Gramzow, L., Holmberg, K., Hällman, J., Keech, O., Klasson, L., Koriabine, M., Kucukoglu, M., Käller, M., Luthman, J., Lysholm, F., Niittylä, T., Olson, ￿., Rilakovic, N., Ritland, C., Rosselló, J. A., Sena, J., Svensson, T., Talavera-López, C., Theißen, G., Tuominen, H., Vanneste, K., Wu, Z.-Q., Zhang, B., Zerbe, P., Arvestad, L., Bhalerao, R., Bohlmann, J., Bousquet, J., Garcia Gil, R., Hvidsten, T. R., de Jong, P., MacKay, J., Morgante, M., Ritland, K., Sundberg, B., Lee Thompson, S., Van de Peer, Y., Andersson, B., Nilsson, O., Ingvarsson, 125 P. K., Lundeberg, J., and Jansson, S. (2013). The Norway spruce genome sequence and conifer genome evolution. Nature, 497(7451):579–584. Publisher: Nature Publishing Group. Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., and Phillippy, A. M. (2016). Mash: fast genome and metagenome distance estimation using Min- Hash. Genome Biology, 17(1):132. Onetto, C. A., Sosnowski, M. R., Heuvel, S. V. D., and Borneman, A. R. (2022). Population genomics of the grapevine pathogen Eutypa lata reveals evidence for population expansion and intraspecific differences in secondary metabolite gene clusters. PLOS Genetics, 18(4):e1010153. Publisher: Public Library of Science. Opedal, ￿. H., Armbruster, W. S., Hansen, T. F., Holstad, A., Pélabon, C., Andersson, S., Campbell, D. R., Caruso, C. M., Delph, L. F., Eckert, C. G., Lankinen, ￿., Walter, G. M., Ågren, J., and Bolstad, G. H. (2023). Evolvability and trait function predict phenotypic divergence of plant populations. Proceedings of the National Academy of Sciences, 120(1):e2203228120. Publisher: Proceedings of the National Academy of Sciences. O’Rawe, J., Jiang, T., Sun, G., Wu, Y., Wang, W., Hu, J., Bodily, P., Tian, L., Hakonarson, H., Johnson, W. E., Wei, Z., Wang, K., and Lyon, G. J. (2013). Low concordance of multiple variant- calling pipelines: practical implications for exome and genome sequencing. Genome Medicine, 5(3):28. Orme, D., Freckleton, R., Thomas, G., Petzoldt, T., Fritz, S., Isaac, N., and Pearse, W. (2018). caper: Comparative Analyses of Phylogenetics and Evolution in R. Ormond, L., Foll, M., Ewing, G. B., Pfeifer, S. P., and Jensen, J. D. (2016). ring the age of a fixed beneficial allele. Molecular Ecology, 25(1):157–169. https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.13478. Infer- _eprint: Ossowski, S., Schneeberger, K., Lucas-Lledó, J. I., Warthmann, N., Clark, R. M., Shaw, R. G., Weigel, D., and Lynch, M. (2010). The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science, 327(5961):92–94. Publisher: American Association for the Advancement of Science. S. and Otto, ties https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470015902.a0005464.pub3. (2013). John Wiley & Sons, and Whitlock, Fixation Ltd. P. Times. eLS. M. C. In Probabili- _eprint: Ou, M., Yang, C., Luo, Q., Huang, R., Zhang, A.-D., Liao, L.-J., Li, Y.-M., He, L.-B., Zhu, Z.-Y., Chen, K.-C., and Wang, Y.-P. (2017). An NGS-based approach for the identification of sex- specific markers in snakehead (Channa argus). Oncotarget, 8(58):98733–98744. Paape, T., Bataillon, T., Zhou, P., J. Y. Kono, T., Briskine, R., Young, N. D., and Tif- Selection, genome-wide fitness effects and evolutionary rates in the _eprint: fin, P. (2013). model legume Medicago truncatula. Molecular Ecology, 22(13):3525–3538. https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.12329. 126 Paape, T., Zhou, P., Branca, A., Briskine, R., Young, N., and Tiffin, P. (2012). Fine-scale popula- tion recombination rates, hotspots, and correlates of recombination in the Medicago truncatula genome. Genome Biology and Evolution, 4(5):726–737. Pajuste, F.-D., Kaplinski, L., Möls, M., Puurand, T., Lepamets, M., and Remm, M. (2017). FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads. Sci- entific Reports, 7(1):2537. Number: 1 Publisher: Nature Publishing Group. Pateiro-Lopez, B. and Rodriguez-Casal, A. (2022). alphahull: Generalization of the convex hull of a sample of points in the plane. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4):417–419. Number: 4 Publisher: Nature Publishing Group. Payne, B. L. and Alvarez-Ponce, D. (2018). Higher rates of protein evolution in the self-fertilizing plant Arabidopsis thaliana than in the out-crossers Arabidopsis lyrata and Arabidopsis halleri. Genome Biology and Evolution, 10(3):895–900. Peart, C. R., Tusso, S., Pophaly, S. D., Botero-Castro, F., Wu, C.-C., Aurioles-Gamboa, D., Baird, A. B., Bickham, J. W., Forcada, J., Galimberti, F., Gemmell, N. J., Hoffman, J. I., Kovacs, K. M., Kunnasranta, M., Lydersen, C., Nyman, T., de Oliveira, L. R., Orr, A. J., Sanvito, S., Valtonen, M., Shafer, A. B. A., and Wolf, J. B. W. (2020). Determinants of genetic variation across eco- evolutionary scales in pinnipeds. Nature Ecology & Evolution, 4(8):1095–1104. Publisher: Nature Publishing Group. Pebesma, E. (2018). Simple Features for R: Standardized support for spatial vector data. The R Journal, 10(1):439–446. Pedersen, T. and Crameri, F. (2023). scico. Pellegrina, L., Pizzi, C., and Vandin, F. (2020). Fast approximation of frequent k-mers and appli- cations to metagenomics. Journal of Computational Biology, 27(4):534–549. Publisher: Mary Ann Liebert, Inc., publishers. Pellicer, J. and Leitch, I. J. (2020). The plant DNA C-values database (release 7.1): an updated on- line repository of plant genome size data for comparative studies. New Phytologist, 226(2):301– 305. _eprint: https://nph.onlinelibrary.wiley.com/doi/pdf/10.1111/nph.16261. Pflug, J. M., Holmes, V. R., Burrus, C., Johnston, J. S., and Maddison, D. R. (2020). Measuring genome sizes using read-depth, k-mers, and flow cytometry: Methodological comparisons in beetles (Coleoptera). G3 Genes|Genomes|Genetics, 10(9):3047–3060. Phatak, S. C. and Rao, S. S. (1995). Logistic map: A possible random-number generator. Physical Review E, 51(4):3670–3678. Phung, T. N., Huber, C. D., and Lohmueller, K. E. (2016). Determining the effect of natural selec- tion on linked neutral divergence across species. PLOS Genetics, 12(8):e1006199. Publisher: Public Library of Science. 127 Piganeau, G. and Eyre-Walker, A. (2009). Evidence for Variation in the Effective Population Size of Animal Mitochondrial DNA. PLOS ONE, 4(2):e4396. Publisher: Public Library of Science. Pigliucci, M. and Kolodynska, A. (2002). Phenotypic plasticity to light intensity in Arabidop- sis thaliana: invariance of reaction norms and phenotypic integration. Evolutionary Ecology, 16(1):27–47. Plotkin, J. B. and Fraser, H. B. (2007). Assessing the determinants of evolutionary rates in the presence of noise. Molecular Biology and Evolution, 24(5):1113–1121. Ponsero, A. J., Miller, M., and Hurwitz, B. L. (2023). Comparison of k-mer-based de novo compar- ative metagenomic tools and approaches. Microbiome Research Reports, 2(4):null–null. Pub- lisher: OAE Publishing Inc. Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Auwera, G. A. V. d., Kling, D. E., Gauthier, L. D., Levy-Moonshine, A., Roazen, D., Shakir, K., Thibault, J., Chandran, S., Whelan, C., Lek, M., Gabriel, S., Daly, M. J., Neale, B., MacArthur, D. G., and Banks, E. (2018). Scaling accurate genetic variant discovery to tens of thousands of samples. Pages: 201178 Section: New Results. Przeworski, M. (2002). The Signature of Positive Selection at Randomly Chosen Loci. Genetics, 160(3):1179–1189. Przeworski, M. (2003). Estimating the Time Since the Fixation of a Beneficial Allele. Genetics, 164(4):1667–1676. Puurand, T., Kukuškina, V., Pajuste, F.-D., and Remm, M. (2019). AluMine: alignment-free method for the discovery of polymorphic Alu element insertions. Mobile DNA, 10(1):31. Quiroz, D., Lensink, M., Kliebenstein, D. J., and Monroe, J. G. (2023). Causes of mutation rate variability in plant genomes. Annual Review of Plant Biology, 74(1):751–775. _eprint: https://doi.org/10.1146/annurev-arplant-070522-054109. R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Rahman, A., Hallgrímsdóttir, I., Eisen, M., and Pachter, L. (2018). Association mapping from sequencing reads using k-mers. eLife, 7:e32920. Publisher: eLife Sciences Publications, Ltd. Raijman, D., Shamir, R., and Tanay, A. (2008). Evolution and selection in yeast promoters: An- alyzing the combined effect of diverse transcription factor binding sites. PLOS Computational Biology, 4(1):e7. Publisher: Public Library of Science. Ramirez-Ramirez, A. R., Bidot-Martínez, I., Mirzaei, K., Rasoamanalina Rivo, O. L., Menéndez- Grenot, M., Clapé-Borges, P., Espinosa-Lopez, G., and Bertin, P. (2024). Comparing the per- formances of SSR and SNP markers for population analysis in Theobroma cacao L., as al- ternative approach to validate a new ddRADseq protocol for cacao genotyping. PLOS ONE, 19(5):e0304753. 128 Ranallo-Benavidez, T. R., Jaron, K. S., and Schatz, M. C. (2020). GenomeScope 2.0 and Smudge- plot for reference-free profiling of polyploid genomes. Nature Communications, 11(1):1432. Publisher: Nature Publishing Group. Renny-Byfield, S. and Baumgarten, A. (2020). Repetitive DNA content in the maize genome is uncoupled from population stratification at SNP loci. BMC Genomics, 21(1):1–10. Number: 1 Publisher: BioMed Central. Rice, E. S., Alberdi, A., Alfieri, J., Athrey, G., Balacco, J. R., Bardou, P., Blackmon, H., Charles, M., Cheng, H. H., Fedrigo, O., Fiddaman, S. R., Formenti, G., Frantz, L. A. F., Gilbert, M. T. P., Hearn, C. J., Jarvis, E. D., Klopp, C., Marcos, S., Mason, A. S., Velez-Irizarry, D., Xu, L., and Warren, W. C. (2023). A pangenome graph reference of 30 chicken genomes allows genotyping of large and complex structural variants. BMC Biology, 21(1):267. Riddell, W. C. (1977). Prediction in Generalized Least Squares. The American Statistician, 31(2):88–90. Publisher: [American Statistical Association, Taylor & Francis, Ltd.]. Ritter, E. J., Cousins, P., Quigley, M., Kile, A., Kenchanmane Raju, S. K., Chitwood, D. H., and Niederhuth, C. (2024). From buds to shoots: insights into grapevine development from the Witch’s Broom bud sport. BMC Plant Biology, 24(1):283. Roberts, M. D., Davis, O., Josephs, E. B., and Williamson, R. J. (2024). k-mer-based approaches to bridging pangenomics and population genetics. Version Number: 1. Rocha, E. P. C. (2006). The quest for the universals of protein evolution. Trends in Genetics, 22(8):412–416. Rogers, A. R. and Huff, C. (2009). Linkage Disequilibrium Between Loci With Unknown Phase. Genetics, 182(3):839–844. Romiguier, J., Gayral, P., Ballenghien, M., Bernard, A., Cahais, V., Chenuil, A., Chiari, Y., Dernat, R., Duret, L., Faivre, N., Loire, E., Lourenco, J. M., Nabholz, B., Roux, C., Tsagkogeorga, G., Weber, A. a.-T., Weinert, L. A., Belkhir, K., Bierne, N., Glémin, S., and Galtier, N. (2014). Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature, 515(7526):261–263. Number: 7526 Publisher: Nature Publishing Group. Roselius, K., Stephan, W., and Städler, T. (2005). The relationship of nucleotide polymorphism, recombination rate and selection in wild tomato species. Genetics, 171(2):753–763. Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., and Sokhansanj, B. (2008). Metagenome frag- ment classification using n-mer frequency profiles. Advances in Bioinformatics, 2008:1–12. Roy, R. S., Bhattacharya, D., and Schliep, A. (2014). Turtle: Identifying frequent k -mers with cache-efficient algorithms . Bioinformatics, 30(14):1950–1957. Ruperao, P., Gandham, P., Odeny, D. A., Mayes, S., Selvanayagam, S., Thirunavukkarasu, N., Das, R. R., Srikanda, M., Gandhi, H., Habyarimana, E., Manyasa, E., Nebie, B., Deshpande, S. P., and Rathore, A. (2023). Exploring the sorghum race level diversity utilizing 272 sorghum accessions genomic resources. Frontiers in Plant Science, 14. 129 Růžička, M., Kulhánek, P., Radová, L., Čechová, A., Špačková, N., Fajkusová, L., and Réblová, K. (2017). DNA mutation motifs in the genes associated with inherited diseases. PLOS ONE, 12(8):e0182377. Publisher: Public Library of Science. Sanchez, D., Sadoun, S. B., Mary-Huard, T., Allier, A., Moreau, L., and Charcosset, A. (2023). Improving the use of plant genetic resources to sustain breeding programs’ efficiency. Proceed- ings of the National Academy of Sciences, 120(14):e2205780119. Publisher: Proceedings of the National Academy of Sciences. Sanchez, T., Cury, J., Charpiat, G., and Jay, F. Deep learning for popu- lation size history inference: Design, comparison and combination with approximate _eprint: Bayesian computation. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1755-0998.13224. Molecular Ecology Resources, 21(8):2645–2660. (2021). Sandell, F. L., Stralis-Pavese, N., McGrath, J. M., Schulz, B., Himmelbauer, H., and Dohm, J. C. (2022). Genomic distances reveal relationships of wild and cultivated beets. Nature Communi- cations, 13(1):2021. Number: 1 Publisher: Nature Publishing Group. Scheiner, S. M. (1993). Genetics and evolution of phenotypic plasticity. Annual Review of Ecology and Systematics, 24(1):35–68. _eprint: https://doi.org/10.1146/annurev.es.24.110193.000343. Schlamp, F., Van Der Made, J., Stambler, R., Chesebrough, L., Boyko, A. R., and Messer, P. W. (2016). Evaluating the performance of selection scans to detect selective sweeps in domestic dogs. Molecular Ecology, 25(1):342–356. Schlichting, C. D. and Smith, H. (2002). Phenotypic plasticity: linking molecular mechanisms with evolutionary outcomes. Evolutionary Ecology, 16(3):189–211. Schmid, K. J., Ramos-Onsins, S., Ringys-Beckstein, H., Weisshaar, B., and Mitchell-Olds, T. (2005). A multilocus sequence survey in Arabidopsis thaliana reveals a genome-wide depar- ture from a neutral model of DNA sequence polymorphism. Genetics, 169(3):1601–1615. Schmidt, T. L., Jasper, M.-E., Weeks, A. R., and Hoffmann, A. A. (2021). Unbiased population heterozygosity estimates from genome-wide sequence data. Methods in Ecology and Evolution, 12(10):1888–1898. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210X.13659. Schmuths, H., Meister, A., Horres, R., and Bachmann, K. (2004). Genome size variation among accessions of Arabidopsis thaliana. Annals of Botany, 93(3):317–321. Schneider, H. (2022). Characterization, costs, cues, and future perspectives of phenotypic plasticity. Annals of botany. Shajii, A., Yorukoglu, D., William Yu, Y., and Berger, B. (2016). Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics, 32(17):i538–i544. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423. Conference Name: The Bell System Technical Journal. 130 Shi, G., Dai, Y., Zhou, D., Chen, M., Zhang, J., Bi, Y., Liu, S., and Wu, Q. (2024). An alignment- and reference-free strategy using k-mer present pattern for pop- ulation genomic analyses. Mycology, 0(0):1–15. Publisher: Taylor & Francis _eprint: https://doi.org/10.1080/21501203.2024.2358868. Shi, H., Schmidt, B., Liu, W., and Müller-Wittig, W. (2010). Quality-score guided error correction for short-read sequencing data using cuda. Procedia Computer Science, 1(1):1129–1138. ICCS 2010. Shi, Z. J., Nayfach, S., and Pollard, K. S. (2023). Identifying species-specific k-mers for fast and accurate metagenotyping with Maast and GT-Pro. STAR Protocols, 4(1):101964. Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L. W., Richards, S., Weinstock, G. M., Wilson, R. K., Gibbs, R. A., Kent, W. J., Miller, W., and Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8):1034–1050. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Silva-Junior, O. B. and Grattapaglia, D. (2015). Genome-wide patterns of recombination, linkage disequilibrium and nucleotide diversity from pooled resequencing and single nucleotide poly- morphism genotyping unlock the evolutionary history of Eucalyptus grandis. New Phytologist, 208(3):830–845. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/nph.13505. Sims, G. E., Jun, S.-R., Wu, G. A., and Kim, S.-H. (2009). Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences, 106(8):2677–2682. Publisher: Proceedings of the National Academy of Sciences. Sirén, J., Monlong, J., Chang, X., Novak, A. M., Eizenga, J. M., Markello, C., Sibbesen, J. A., Hickey, G., Chang, P.-C., Carroll, A., Gupta, N., Gabriel, S., Blackwell, T. W., Ratan, A., Tay- lor, K. D., Rich, S. S., Rotter, J. I., Haussler, D., Garrison, E., and Paten, B. (2021). Pange- nomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 374(6574):abg8871. Publisher: American Association for the Advancement of Science. Slotte, T. (2014). The impact of linked selection on plant genomic variation. Briefings in Functional Genomics, 13(4):268–275. Slotte, T., Bataillon, T., Hansen, T. T., St. Onge, K., Wright, S. I., and Schierup, M. H. (2011). Genomic determinants of protein evolution and polymorphism in Arabidopsis. Genome Biology and Evolution, 3:1210–1219. Smith, J. M. and Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genetics Research, 23(1):23–35. Snell-Rood, E. C., Van Dyken, J. D., Cruickshank, T., Wade, M. J., and Moczek, A. P. the costs, Toward a population genetic framework of developmental evolution: _eprint: BioEssays, 32(1):71–81. (2010). limits, and consequences of phenotypic plasticity. https://onlinelibrary.wiley.com/doi/pdf/10.1002/bies.200900132. 131 Song, B., Buckler, E. S., and Stitzer, M. C. (2024). New whole-genome alignment tools are needed for tapping into plant diversity. Trends in Plant Science, 29(3):355–369. Publisher: Elsevier. Song, B., Marco-Sola, S., Moreto, M., Johnson, L., Buckler, E. S., and Stitzer, M. C. (2022). An- chorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sci- ences, 119(1):e2113075119. Publisher: Proceedings of the National Academy of Sciences. Song, J.-M., Guan, Z., Hu, J., Guo, C., Yang, Z., Wang, S., Liu, D., Wang, B., Lu, S., Zhou, R., Xie, W.-Z., Cheng, Y., Zhang, Y., Liu, K., Yang, Q.-Y., Chen, L.-L., and Guo, L. (2020). Eight high- quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nature Plants, 6(1):34–+. Place: Berlin Publisher: Nature Portfolio WOS:000508165000002. Sopniewski, J. and Catullo, R. A. (2024). Estimates of heterozygosity from single nucleotide polymorphism markers are context-dependent and often wrong. Molecular Ecology Resources, n/a(n/a):e13947. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/1755-0998.13947. South, A. (2011). rworldmap: A new R package for mapping global data. The R Journal, 3(1):35– 43. Standage, D. S., Brown, C. T., and Hormozdiari, F. (2019). Kevlar: A mapping-free framework for accurate discovery of de novo variants. iScience, 18:28–36. Stern, A. J., Wilton, P. R., and Nielsen, R. (2019). An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data. PLOS Genetics, 15(9):e1008384. Publisher: Public Library of Science. Stoletzki, N. and Eyre-Walker, A. (2011). Estimation of the neutrality index. Molecular Biology and Evolution, 28(1):63–70. Suyama, M., Torrents, D., and Bork, P. (2006). PAL2NAL: robust conversion of protein se- quence alignments into the corresponding codon alignments. Nucleic Acids Research, 34(suppl_- 2):W609–W612. Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymor- phism. Genetics, 123(3):585–595. Tajima, F. (1996). The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics, 143(3):1457–1465. Takuno, S. and Gaut, B. S. (2012). Body-methylated genes in Arabidopsis thaliana are functionally important and evolve slowly. Molecular Biology and Evolution, 29(1):219–227. Tenaillon, M. I., Hufford, M. B., Gaut, B. S., and Ross-Ibarra, J. (2011). Genome size and transpos- able element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biology and Evolution, 3:219–229. 132 Tenaillon, M. I., Sawkins, M. C., Long, A. D., Gaut, R. L., Doebley, J. F., and Gaut, B. S. (2001). Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proceedings of the National Academy of Sciences, 98(16):9161–9166. Publisher: Proceedings of the National Academy of Sciences. Teshima, K. M., Coop, G., and Przeworski, M. (2006). How reliable are empirical genomic scans for selective sweeps? Genome Research, 16(6):702–712. Torada, L., Lorenzon, L., Beddis, A., Isildak, U., Pattini, L., Mathieson, S., and Fumagalli, M. (2019). ImaGene: a convolutional neural network to quantify natural selection from genomic data. BMC Bioinformatics, 20(9):337. Turner, I., Garimella, K. V., Iqbal, Z., and McVean, G. (2018). Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics, 34(15):2556–2565. Uricaru, R., Rizk, G., Lacroix, V., Quillery, E., Plantard, O., Chikhi, R., Lemaitre, C., and Peter- longo, P. (2015). Reference-free detection of isolated SNPs. Nucleic Acids Research, 43(2):e11. Urrutia, A. O. and Hurst, L. D. (2001). Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics, 159(3):1191–1199. Urrutia, A. O. and Hurst, L. D. (2003). The signature of selection mediated by expression on human genes. Genome Research, 13(10):2260–2264. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Van Buskirk, J. and Steiner, U. K. (2009). ization and plasticity. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1420-9101.2009.01685.x. Journal of Evolutionary Biology, 22(4):852–860. The fitness costs of developmental canal- _eprint: Van Dyken, J. D. and Wade, M. J. (2010). The genetic signature of conditional expression. Genetics, 184(2):557–570. Van Kleunen, M. and Fischer, M. (2005). tive phenotypic plasticity in plants. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-8137.2004.01296.x. Constraints on the evolution of adap- _eprint: New Phytologist, 166(1):49–60. VanWallendael, A. and Alvarez, M. (2022). Alignment-free methods for polyploid genomes: Quick and reliable genetic distance estimation. Molecular Ecology Resources, 22(2):612–622. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/1755-0998.13499. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., Skupski, M., Sub- ramanian, G., Thomas, P. D., Zhang, J., Gabor Miklos, G. L., Nelson, C., Broder, S., Clark, A. G., Nadeau, J., McKusick, V. A., Zinder, N., Levine, A. J., Roberts, R. J., Simon, M., Slay- man, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, 133 K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chan- dramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Francesco, V. D., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A. E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman, T. J., Higgins, M. E., Ji, R.-R., Ke, Z., Ketchum, K. A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G. V., Milshina, N., Moore, H. M., Naik, A. K., Narayan, V. A., Neelam, B., Nusskern, D., Rusch, D. B., Salzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z. Y., Wang, A., Wang, X., Wang, J., Wei, M.-H., Wides, R., Xiao, C., Yan, C., Yao, A., Ye, J., Zhan, M., Zhang, W., Zhang, H., Zhao, Q., Zheng, L., Zhong, F., Zhong, W., Zhu, S. C., Zhao, S., Gilbert, D., Baumhueter, S., Spier, G., Carter, C., Cravchik, A., Woodage, T., Ali, F., An, H., Awe, A., Baldwin, D., Baden, H., Barnstead, M., Barrow, I., Beeson, K., Busam, D., Carver, A., Center, A., Cheng, M. L., Curry, L., Danaher, S., Davenport, L., Desilets, R., Dietz, S., Dodson, K., Doup, L., Ferriera, S., Garg, N., Gluecksmann, A., Hart, B., Haynes, J., Haynes, C., Heiner, C., Hladun, S., Hostin, D., Houck, J., Howland, T., Ibegwam, C., Johnson, J., Kalush, F., Kline, L., Koduru, S., Love, A., Mann, F., May, D., McCawley, S., McIntosh, T., McMullen, I., Moy, M., Moy, L., Murphy, B., Nelson, K., Pfannkoch, C., Pratts, E., Puri, V., Qureshi, H., Reardon, M., Rodriguez, R., Rogers, Y.-H., Romblad, D., Ruhfel, B., Scott, R., Sitter, C., Smallwood, M., Stewart, E., Strong, R., Suh, E., Thomas, R., Tint, N. N., Tse, S., Vech, C., Wang, G., Wetter, J., Williams, S., Williams, M., Windsor, S., Winn-Deen, E., Wolfe, K., Zaveri, J., Zaveri, K., Abril, J. F., Guigó, R., Campbell, M. J., Sjolander, K. V., Karlak, B., Kejariwal, A., Mi, H., Lazareva, B., Hatton, T., Narechania, A., Diemer, K., Muruganujan, A., Guo, N., Sato, S., Bafna, V., Is- trail, S., Lippert, R., Schwartz, R., Walenz, B., Yooseph, S., Allen, D., Basu, A., Baxendale, J., Blick, L., Caminha, M., Carnes-Stine, J., Caulk, P., Chiang, Y.-H., Coyne, M., Dahlke, C., Mays, A. D., Dombroski, M., Donnelly, M., Ely, D., Esparham, S., Fosler, C., Gire, H., Glanowski, S., Glasser, K., Glodek, A., Gorokhov, M., Graham, K., Gropman, B., Harris, M., Heil, J., Hender- son, S., Hoover, J., Jennings, D., Jordan, C., Jordan, J., Kasha, J., Kagan, L., Kraft, C., Levitsky, A., Lewis, M., Liu, X., Lopez, J., Ma, D., Majoros, W., McDaniel, J., Murphy, S., Newman, M., Nguyen, T., Nguyen, N., Nodell, M., Pan, S., Peck, J., Peterson, M., Rowe, W., Sanders, R., Scott, J., Simpson, M., Smith, T., Sprague, A., Stockwell, T., Turner, R., Venter, E., Wang, M., Wen, M., Wu, D., Wu, M., Xia, A., Zandieh, A., and Zhu, X. (2001). The sequence of the human genome. Science, 291(5507):1304–1351. Publisher: American Association for the Advancement of Science. Voichek, Y. and Weigel, D. (2020). Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nature Genetics, 52(5):534–540. Number: 5 Publisher: Nature Publishing Group. Vurture, G. W., Sedlazeck, F. J., Nattestad, M., Underwood, C. J., Fang, H., Gurtowski, J., and Schatz, M. C. (2017). GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics, 33(14):2202–2204. Wang, H.-c., Singer, G. A. C., and Hickey, D. A. (2004). Mutational bias affects protein evolution in flowering plants. Molecular Biology and Evolution, 21(1):90–96. Wang, J., Street, N. R., Scofield, D. G., and Ingvarsson, P. K. (2016). Natural selection and re- combination rate variation shape nucleotide polymorphism across the genomes of three related Populus species. Genetics, 202(3):1185–1200. 134 Wang, J., Yang, W., Zhang, S., Hu, H., Yuan, Y., Dong, J., Chen, L., Ma, Y., Yang, T., Zhou, L., Chen, J., Liu, B., Li, C., Edwards, D., and Zhao, J. (2023). A pangenome analysis pipeline provides insights into functional gene identification in rice. Genome Biology, 24(1):19. Watterson, G. A. (1975). On the number of segregating sites in genetical models without recombi- nation. Theoretical Population Biology, 7(2):256–276. Wheeler, L. C., Walker, J. F., Ng, J., Deanna, R., Dunbar-Wallis, A., Backes, A., Pezzi, P. H., Palchetti, M. V., Robertson, H. M., Monaghan, A., de Freitas, L. B., Barboza, G. E., Moyroud, E., and Smith, S. D. (2022). Transcription factors evolve faster than their structural gene targets in the flavonoid pigment pathway. Molecular Biology and Evolution, 39(3):msac044. Whitehouse, L. S. and Schrider, D. R. (2023). Timesweeper: accurately identifying selective sweeps using population genomic time series. Genetics, 224(3):iyad084. Whitlock, M. C. (1996). The Red Queen Beats the Jack-Of-All-Trades: The Limitations on the Evolution of Phenotypic Plasticity and Niche Breadth. The American Naturalist, 148:S65–S77. Publisher: [University of Chicago Press, American Society of Naturalists]. Whitney, K. D., Baack, E. J., Hamrick, J. L., Godt, M. J. W., Barringer, B. C., Bennett, M. D., Eckert, C. G., Goodwillie, C., Kalisz, S., Leitch, I. J., and Ross-Ibarra, J. (2010). A role for nonadaptive processes in plant genome size evolution? Evolution, 64(7):2097–2109. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. Wiersma, A. T., Hamilton, J. P., Vaillancourt, B., Brose, J., Awale, H. E., Wright, E. M., Kelly, J. D., and Buell, C. R. (2024). k-mer genome-wide association study for anthracnose and BCMV resistance in a Phaseolus vulgaris Andean Diversity Panel. The Plant Genome, n/a(n/a):1–17. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/tpg2.20523. Wilke, C. O. (2020). cowplot: Streamlined Plot Theme and Plot Annotations for ’ggplot2’. Williamson, R. J., Josephs, E. B., Platts, A. E., Hazzouri, K. M., Haudry, A., Blanchette, M., and Wright, S. I. (2014). Evidence for Widespread Positive and Negative Selection in Coding and Conserved Noncoding Regions of Capsella grandiflora. PLOS Genetics, 10(9):e1004622. Publisher: Public Library of Science. Willis, J. C. (1922). Age and Area: A Study in Geographical Distribution and Origin of Species. The University Press. Google-Books-ID: yBs4AAAAMAAJ. Wilson, B. A., Petrov, D. A., and Messer, P. W. (2014). Soft Selective Sweeps in Complex Demo- graphic Scenarios. Genetics, 198(2):669–684. Winter, E. E., Goodstadt, L., and Ponting, C. P. (2004). Elevated rates of protein secretion, evolu- tion, and disease among tissue-specific genes. Genome Research, 14(1):54–61. Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. 135 Wong, K. H. Y., Ma, W., Wei, C.-Y., Yeh, E.-C., Lin, W.-J., Wang, E. H. F., Su, J.-P., Hsieh, F.-J., Kao, H.-J., Chen, H.-H., Chow, S. K., Young, E., Chu, C., Poon, A., Yang, C.-F., Lin, D.-S., Hu, Y.-F., Wu, J.-Y., Lee, N.-C., Hwu, W.-L., Boffelli, D., Martin, D., Xiao, M., and Kwok, P.-Y. (2020). Towards a reference genome that captures global genetic diversity. Nature Communications, 11(1):5482. Number: 1 Publisher: Nature Publishing Group. Woolfe, A., Goodson, M., Goode, D. K., Snell, P., McEwen, G. K., Vavouri, T., Smith, S. F., North, P., Callaway, H., Kelly, K., Walter, K., Abnizova, I., Gilks, W., Edwards, Y. J. K., Cooke, J. E., and Elgar, G. (2005). Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biology, 3(1):e7. Wright, S. (1940). Breeding structure of populations in relation to speciation. The American Naturalist, 74(752):232–248. Publisher: [University of Chicago Press, American Society of Naturalists]. Wright, S. I., Lauga, B., and Charlesworth, D. (2002). Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Molecular Biology and Evolution, 19(9):1407–1420. Wright, S. I., Yau, C. B. K., Looseley, M., and Meyers, B. C. (2004). Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Molecular Biology and Evolution, 21(9):1719–1726. Wu, D. Y., Ugozzoli, L., Pal, B. K., Qian, J., and Wallace, R. B. (1991). The effect of temperature and oligonucleotide primer length on the specificity and efficiency of amplification by the poly- merase chain reaction. DNA and Cell Biology, 10(3):233–238. Publisher: Mary Ann Liebert, Inc., publishers. Wu, M., Haak, D. C., Anderson, G. J., Hahn, M. W., Moyle, L. C., and Guerrero, R. F. (2021). Inferring the genetic basis of sex determination from the genome of a dioecious nightshade. Molecular Biology and Evolution, 38(7):2946–2957. Wu, Z., Cai, X., Zhang, X., Liu, Y., Tian, G.-b., Yang, J.-R., and Chen, X. (2022). Expression level is a major modifier of the fitness landscape of a protein coding gene. Nature Ecology & Evolution, 6(1):103–115. Number: 1 Publisher: Nature Publishing Group. Y. C. Brandt, D., Wei, X., Deng, Y., Vaughn, A. H., and Nielsen, R. (2022). Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics, 221(1):iyac044. Yair, S., Lee, K. M., and Coop, G. (2021). The timing of human adaptation from Neanderthal introgression. Genetics, 218(1):iyab052. Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., Bar-Even, A., Horn-Saban, S., Safran, M., Domany, E., Lancet, D., and Shmueli, O. (2005). Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specifica- tion. Bioinformatics, 21(5):650–659. Yang, J., Gu, Z., and Li, W.-H. (2003). Rate of protein evolution versus fitness effect of gene deletion. Molecular Biology and Evolution, 20(5):772–774. 136 Yang, L. and Gaut, B. S. (2011). Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Molecular Biology and Evolution, 28(8):2359–2369. Yang, Z. (2007). PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24(8):1586–1591. Yeang, C.-H. (2010). Quantifying the strength of natural selection of a motif sequence. In Moulton, V. and Singh, M., editors, Algorithms in Bioinformatics, Lecture Notes in Computer Science, pages 362–373, Berlin, Heidelberg. Springer. Yi, H., Lin, Y., Lin, C., and Jin, W. (2021). Kssd: sequence dimensionality reduction by k-mer sub- string space sampling enables real-time large-scale datasets analysis. Genome Biology, 22(1):84. Younsi, R. and MacLean, D. (2015). Using 2k+2 bubble searches to find single nucleotide poly- morphisms in k-mer graphs. Bioinformatics, 31(5):642–646. Yu, H., Zhang, K., Cheng, G., Mei, C., Wang, H., and Zan, L. (2024). Genome-wide analysis reveals genomic diversity and signatures of selection in Qinchuan beef cattle. BMC Genomics, 25(1):558. Zhang, H., Zhang, F., Yu, Y., Feng, L., Jia, J., Liu, B., Li, B., Guo, H., and Zhai, J. (2020). A com- prehensive online database for exploring 20,000 public Arabidopsis RNA-seq libraries. Molec- ular Plant, 13(9):1231–1233. Zhang, J. and Yang, J.-R. (2015). Determinants of the rate of protein sequence evolution. Nature Reviews Genetics, 16(7):409–420. Number: 7 Publisher: Nature Publishing Group. Zhang, L. and Li, W.-H. (2004). Mammalian housekeeping genes evolve more slowly than tissue- specific genes. Molecular Biology and Evolution, 21(2):236–239. Zhang, L., Vision, T. J., and Gaut, B. S. (2002). Patterns of nucleotide substitution among si- multaneously duplicated gene pairs in Arabidopsis thaliana. Molecular Biology and Evolution, 19(9):1464–1473. Zhang, Q., Jun, S.-R., Leuze, M., Ussery, D., and Nookaew, I. (2017a). Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. Scientific Reports, 7(1):40712. Publisher: Nature Publishing Group. Zhang, S.-S., Yang, H., Ding, L., Song, Z.-T., Ma, H., Chang, F., and Liu, J.-X. (2017b). Tissue- Specific Transcriptomics Reveals an Important Role of the Unfolded Protein Response in Main- taining Fertility upon Heat Stress in Arabidopsis. The Plant Cell, 29(5):1007–1023. Zhao, L., Lascoux, M., Overall, A. D. J., and Waxman, D. (2013). The Characteristic Trajectory of a Fixing Allele: A Consequence of Fictitious Selection That Arises from Conditioning. Genetics, 195(3):993–1006. Zhao, L., Xie, J., Bai, L., Chen, W., Wang, M., Zhang, Z., Wang, Y., Zhao, Z., and Li, J. (2018). Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics, 19(10):912. 137 Zhao, X. (2019). BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 35(4):671–673. Zhong, L., Zhu, Y., and Olsen, K. M. (2022). Hard versus soft selective sweeps during do- mestication and improvement in soybean. Molecular Ecology, 31(11):3137–3153. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.16454. Zhou, Y., Zhang, Z., Bao, Z., Li, H., Lyu, Y., Zan, Y., Wu, Y., Cheng, L., Fang, Y., Wu, K., Zhang, J., Lyu, H., Lin, T., Gao, Q., Saha, S., Mueller, L., Fei, Z., Städler, T., Xu, S., Zhang, Z., Speed, D., and Huang, S. (2022). Graph pangenome captures missing heritability and empowers tomato breeding. Nature, 606(7914):527–534. Number: 7914 Publisher: Nature Publishing Group. Zielezinski, A., Girgis, H. Z., Bernard, G., Leimeister, C.-A., Tang, K., Dencker, T., Lau, A. K., Röhling, S., Choi, J. J., Waterman, M. S., Comin, M., Kim, S.-H., Vinga, S., Almeida, J. S., Chan, C. X., James, B. T., Sun, F., Morgenstern, B., and Karlowski, W. M. (2019). Benchmarking of alignment-free sequence comparison methods. Genome Biology, 20(1):144. Zielezinski, A., Vinga, S., Almeida, J., and Karlowski, W. M. (2017). Alignment-free sequence comparison: benefits, applications, and tools. Genome Biology, 18(1):186. Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21):2669–2677. Zwick, M. E., Cutler, D. J., and Chakravarti, A. (2000). Patterns of genetic variation in Mendelian and complex traits. Annual Review of Genomics and Human Genetics, 1:387–407. 138 APPENDIX A: SUPPLEMENTAL FIGURES FOR CHAPTER 1 General data curation Figure A1: Histogram of mapping rates for RNA-seq runs. Bin width is 5 %. 139 Figure A2: Histograms of other key variables for the genes included in this study. 140 0.00.20.40.62.55.07.510.0log(average expression in TPM)DensityA0500010000150000102030Number of genes in orthogroup (family size)Number of orthogroupsB012012Tajima's DDensityC0510150.30.40.50.60.7GC contentDensityD Figure A3: Violin plot of treatment-specificity in gene expression for six tissue categories. 141 0.250.500.751.00fruit OR flowerleafrootseedshoot systemwhole plantTissueTreatment−specificity Figure A4: Violin plot of tissue-specificity in gene expression by treatment type. Only the five treatments shown had samples from all six tissue types. 142 0.000.250.500.751.00PECO:0001062 control exposurePECO:0007105 abscisic acid exposurePECO:0007162 continuous light exposurePECO:0007319 warm/hot air temperature exposurePECO:0007332 cold air temperature exposureTissue−specificity of expression Figure A5: Comparing dN and dS values returned from codeml and the custom biopython script used in this study. Line shows exact match between codeml and biopython methods. Genes with saturating divergence (dS > 1) were excluded from partial correlation analyses. 143 Partial correlations on overall dataset Figure A6: Partial correlation analysis for root tissue. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on root data. Average expression ex- cludes values < 5 TPM and was calculated using only root tissue samples. Tissue-specificity was calculated using only control runs across all tissue categories. Treatment-specificity was calculated using only root tissue runs. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 144 -0.40.00.4Partialspearmancorrelationcoefficient0.170.11-0.010.090.040.44-0.260.02-0.130.15-0.050.08-0.140.01-0.18-0.18-0.15-0.13-0.23-0.3-0.12dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 14112A0.540.12-0.070.1-0.030.48-0.270.13-0.170.1-0.03-0.02-0.110.01-0.15-0.120.07-0.15-0.29-0.29-0.14-0.010.080.060.070.010.020.08πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 24513B0.020.030.5-0.03-0.210.08-0.03-0.120-0.150-0.17-0.31-0.27-0.140.050.060.070.010.020.09Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 24515C0.030.060.45-0.09-0.170.13-0.06-0.140.01-0.18-0.05-0.16-0.26-0.27-0.13Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 14064D Figure A7: Partial correlation analysis for whole plant tissue. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on whole plant data. Average expression excludes values < 5 TPM and was calculated using only whole plant tissue samples. Tissue-specificity was calculated using only control runs across all tissue categories. Treatment- specificity was calculated using only whole plant tissue runs. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 145 -0.40.00.4Partialspearmancorrelationcoefficient0.160.0500.120.040.33-0.270.02-0.230.08-0.040.08-0.16-0.02-0.12-0.18-0.17-0.24-0.14-0.28-0.1dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 13151A0.530.08-0.040.13-0.040.39-0.260.14-0.260.07-0.02-0.02-0.140-0.11-0.110.06-0.25-0.19-0.27-0.13-0.020.080.060.060.050.010.09πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 22292B0.010.030.4-0.02-0.290.04-0.03-0.14-0.01-0.110-0.27-0.21-0.25-0.120.050.060.060.070.010.1Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 22294C0.020.060.34-0.06-0.260.05-0.05-0.16-0.02-0.12-0.04-0.25-0.17-0.25-0.11Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13104D Figure A8: Partial correlation analysis for shoot tissue. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on shoot data. Average expression ex- cludes values < 5 TPM and was calculated using only shoot tissue samples. Tissue-specificity was calculated using only control runs across all tissue categories. Treatment-specificity was calculated using only shoot tissue runs. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 146 -0.40.00.4Partialspearmancorrelationcoefficient0.150.07-0.030.050.070.51-0.270.02-0.190.08-0.040.08-0.1-0.09-0.12-0.17-0.15-0.11-0.26-0.26-0.11dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 13500A0.530.08-0.060.070.010.55-0.260.14-0.230.08-0.02-0.02-0.09-0.06-0.11-0.10.07-0.11-0.31-0.23-0.13-0.030.080.030.090.040.020.11πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 22727B00.040.56-0.02-0.260.06-0.02-0.09-0.07-0.110.01-0.12-0.31-0.21-0.130.050.030.090.060.020.11Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 22729C0.030.010.51-0.07-0.210.06-0.05-0.11-0.08-0.12-0.05-0.12-0.29-0.22-0.13Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13455D Figure A9: Partial correlation analysis for seed tissue. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on seed data. Average expression excludes values < 5 TPM and was calculated using only seed tissue samples. Tissue-specificity was calculated using only control runs across all tissue categories. Treatment-specificity was calculated using only seed tissue runs. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 147 -0.40.00.4Partialspearmancorrelationcoefficient0.160.15-0.020.030.050.57-0.12-0.010.09-0.13-0.020.07-0.05-0.1-0.19-0.16-0.17-0.1-0.17-0.35-0.14dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 14223A0.530.16-0.090.0700.63-0.140.040.08-0.14-0.02-0.03-0.04-0.07-0.19-0.110.05-0.11-0.2-0.35-0.1700.070.030.110.020.020.08πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 25237B00.040.65-0.040.06-0.15-0.03-0.04-0.08-0.190-0.13-0.21-0.34-0.170.050.030.120.020.020.08Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 25242C0.040.030.58-0.030.08-0.14-0.05-0.05-0.1-0.19-0.05-0.12-0.19-0.34-0.16Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 14172D Figure A10: Partial correlation analysis for fruit or flower tissue. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on flower and fruit data. Average expression excludes values < 5 TPM and was calculated using only fruit and flower tissue samples. Tissue-specificity was calculated using only control runs across all tissue categories. Treatment-specificity was calculated using only fruit and flower tissue runs. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 148 -0.40.00.4Partialspearmancorrelationcoefficient0.150.12-0.010.070.030.33-0.27-0.01-0.10.1-0.040.07-0.150.01-0.14-0.2-0.18-0.24-0.04-0.29-0.09dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 13550A0.530.15-0.080.06-0.020.38-0.280.13-0.140.07-0.04-0.01-0.140.03-0.14-0.130.06-0.28-0.09-0.25-0.11-0.020.080.070.050.050.010.07πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 22908B0.010.010.39-0.05-0.20.05-0.03-0.140.03-0.14-0.01-0.31-0.1-0.22-0.110.050.070.050.060.010.08Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 22913C0.040.030.34-0.08-0.140.08-0.05-0.150.01-0.14-0.06-0.28-0.06-0.25-0.11Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13506D Partial correlations on data subset by tissue type after SVA Figure A11: Partial correlation analysis of root tissue after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on root tissue data after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 149 -0.40.00.4Partialspearmancorrelationcoefficient0.170.090.04-0.280.010.1-0.070.08-0.07-0.18-0.23-0.16-0.27-0.32-0.12dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 13984A0.530.15-0.04-0.290.140.05-0.05-0.01-0.06-0.15-0.170.09-0.34-0.29-0.1400.070.090.010.020.07πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 24104B0.05-0.040-0.03-0.07-0.140-0.38-0.25-0.140.050.10.010.010.07Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 24107C0.05-0.090.08-0.06-0.07-0.17-0.07-0.31-0.27-0.13Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 13937D Figure A12: Partial correlation analysis of whole plant tissue after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on whole plant tissue data after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 150 -0.40.00.4Partialspearmancorrelationcoefficient0.150.10.06-0.290.020.01-0.040.08-0.09-0.09-0.19-0.16-0.28-0.24-0.07dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 12909A0.530.14-0.03-0.30.16-0.02-0.03-0.02-0.07-0.08-0.120.07-0.35-0.2-0.1-0.030.080.10.0500.09πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 21557B0.05-0.02-0.06-0.03-0.07-0.070-0.37-0.18-0.10.050.10.0600.09Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 21562C0.05-0.07-0.02-0.06-0.09-0.08-0.05-0.33-0.2-0.08Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 12865D Figure A13: Partial correlation analysis of shoot tissue after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on shoot tissue data after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 151 -0.40.00.4Partialspearmancorrelationcoefficient0.160.080.04-0.280.010.19-0.040.08-0.02-0.07-0.18-0.19-0.2-0.19-0.04dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 12343A0.530.1-0.01-0.30.160.15-0.04-0.02-0.01-0.07-0.120.06-0.24-0.14-0.07-0.040.090.090.0400.09πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 20318B0.04-0.020.13-0.04-0.01-0.06-0.01-0.26-0.11-0.070.050.090.0600.1Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 20321C0.05-0.060.17-0.05-0.02-0.07-0.03-0.23-0.14-0.05Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 12308D Figure A14: Partial correlation analysis of seed tissue after SVA.Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on seed tissue data after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 152 -0.40.00.4Partialspearmancorrelationcoefficient0.170.08-0.04-0.240.010.51-0.090.070.09-0.24-0.29-0.170.18-0.48-0.15dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 13958A0.520.08-0.07-0.250.110.5-0.0800.04-0.21-0.270.120.11-0.39-0.160.010.07-0.01-0.09-0.02-0.03πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 24748B-0.03-0.040.5-0.030.04-0.19-0.030.09-0.35-0.140.05-0.01-0.1-0.03-0.03Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 24752C0.05-0.120.51-0.080.08-0.24-0.130.18-0.46-0.16Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 13907D Figure A15: Partial correlation analysis of fruit or flower tissue after SVA.Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) based on flower and fruit tissue data after applying SVA. Data was further subset to include only treatment groups with data from more than one study before applying SVA. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 153 -0.40.00.4Partialspearmancorrelationcoefficient0.140.10.04-0.300.12-0.050.070.02-0.1-0.22-0.19-0.08-0.22-0.03dSTreatment specificityAverage expressionGClengthdNdSTreatment specificityAverage expressionGCn = 12331A0.520.1-0.04-0.320.160.06-0.06-0.010.04-0.09-0.160.07-0.15-0.16-0.04-0.040.070.040.04-0.020.05πSTreatment specificityAverage expressionGClengthFamily sizeπNπSTreatment specificityAverage expressionGClengthn = 20327B0.01-0.030.03-0.030.04-0.07-0.01-0.17-0.12-0.030.040.030.06-0.020.06Treatment specificityAverage expressionGClengthFamily sizeTajima's DTreatment specificityAverage expressionGClengthn = 20332C0.03-0.090.09-0.060.02-0.1-0.06-0.12-0.17-0.04Treatment specificityAverage expressionGClengthDoSTreatment specificityAverage expressionGCn = 12291D PCA before and after SVA on data subset by tissue type Figure A16: Effect of surrogate variable analysis on structure of leaf tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of leaf tissue data before SVA. (D) Principal component analysis of leaf tissue data after SVA. Color in the PCA plots represent different experimental treatments - legend is Figure A22. 154 2.55.07.510.02.55.07.510.0log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVALeafA0.000.250.500.751.000.40.60.81.0Treatment−specificity before SVATreatment−specificity after SVALeafB−300−200−1000100−1000100200300PC1: 15.14 %PC2: 7.01 %Leaf − Before SVAC−200−1000100−1000100200300PC1: 18.46 %PC2: 7.48 %Leaf − After SVAD Figure A17: Effect of surrogate variable analysis on structure of root tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of root tissue data before SVA. (D) Principal component analysis of root tissue data after SVA.Color in the PCA plots represent different experimental treatments - legend is Figure A23. 155 2.55.07.510.012.52.55.07.510.012.5log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVARootA0.250.500.751.000.40.60.81.0Treatment−specificity before SVATreatment−specificity after SVARootB−100−50050100−200−1000PC1: 12.05 %PC2: 5.62 %Root − Before SVAC−1000100−150−100−50050PC1: 11.55 %PC2: 4.72 %Root − After SVAD Figure A18: Effect of surrogate variable analysis on structure of whole plant tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of whole plant tissue data before SVA. (D) Principal component analysis of whole plant tissue data after SVA.Color in the PCA plots represent different experimental treat- ments - legend is Figure A24. 156 2.55.07.510.02.55.07.510.0log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVAWhole plantA0.000.250.500.751.000.40.60.81.0Treatment−specificity before SVATreatment−specificity after SVAWhole plantB−150−100−50050100−1000100200PC1: 19.95 %PC2: 5.99 %Whole plant − Before SVaC−150−100−50050100−200−1000100200300PC1: 18.74 %PC2: 4.91 %Whole plant − After SVAD Figure A19: Effect of surrogate variable analysis on structure of shoot tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of shoot tissue data before SVA. (D) Principal component analysis of shoot tissue data after SVA. Color in the PCA plots represent different experimental treatments - legend is Figure A25. 157 2.55.07.5369log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVAShootA0.250.500.751.000.40.60.81.0Treatment−specificity before SVATreatment−specificity after SVAShootB−1000100200−200−1000100PC1: 26.51 %PC2: 7.34 %Shoot − Before SVAC−200−1000100−1000100200PC1: 22.86 %PC2: 6.65 %Shoot − After SVAD Figure A20: Effect of surrogate variable analysis on structure of seed tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of seed tissue data before SVA. (D) Principal component analysis of seed tissue data after SVA. Color in the PCA plots represent different experimental treatments - legend is Figure A26. 158 2.55.07.510.012.52.55.07.510.0log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVASeedA0.000.250.500.751.000.250.500.751.00Treatment−specificity before SVATreatment−specificity after SVASeedB−800−600−400−20000100200300400PC1: 2.53 %PC2: 1.58 %Seed − Before SVAC−300−200−1000100200300−750−500−2500PC1: 1.91 %PC2: 1.2 %Seed − After SVAD Figure A21: Effect of surrogate variable analysis on structure of fruit and flower tissue data. (A) Effect of SVA on average expression. (B) Effect of SVA on treatment-specificity. (C) Principal component analysis of fruit and flower tissue data before SVA. (D) Principal component analysis of fruit and flower tissue data after SVA. Color in the PCA plots represent different experimental treatments - legend is Figure A27. 159 2.55.07.510.0369log average expression (excluding TPM < 5) before SVAlog average expression (excluding TPM < 5) after SVAFlower or fruitA0.000.250.500.751.000.250.500.751.00Treatment−specificity before SVATreatment−specificity after SVAFlower or fruitB−200−1000100−200−1000100PC1: 23.87 %PC2: 8.02 %Flower − Before SVAC−400−2000200−1000100200300PC1: 17.14 %PC2: 6.29 %Flower − After SVAD Legends - PCA before and after SVA on data subset by tissue type Figure A22: Legend for PCA before and after SVA, leaf data. 160 TreatmentPECO:0001021 ozone exposurePECO:0001034 Pseudomonas syringae plant exposurePECO:0001062 control exposurePECO:0007048 sodium chloride exposurePECO:0007048 sodium chloride exposurePECO:0007319 warm/hot air temperature exposurePECO:0007075 high light intensity exposurePECO:0007075 high light intensity exposurePECO:0007319 warm/hot air temperature exposurePECO:0007105 abscisic acid exposurePECO:0007162 continuous light exposurePECO:0007162 continuous light exposurePECO:0007315 zero gravity exposurePECO:0007200 short day length exposurePECO:0007200 short day length exposureflagellin 22 exposurePECO:0007200 short day length exposurePECO:0001050 limited nitrogen exposurePECO:0007200 short day length exposurePECO:0007075 high light intensity exposurePECO:0007200 short day length exposurePECO:0007233 fungal plant exposureBotrytis cinereaPECO:0007200 short day length exposurePECO:0007246 bacterial plant exposurePseudomonas fluorescensPECO:0007200 short day length exposurePECO:0007246 bacterial plant exposurePseudomonas syringaePECO:0007203 far red light exposurePECO:0007246 bacterial plant exposureVibrio vulnificusPECO:0007270 continuous dark (no light) exposurePECO:0007319 warm/hot air temperature exposurePECO:0007332 cold air temperature exposurePECO:0007373 mechanical damage exposurePECO:0007404 drought environment exposure Figure A23: Legend for PCA before and after SVA, root data. 161 PECO:0001046 limited phosphate exposurePECO:0001050 limited nitrogen exposurePECO:0001062 control exposurePECO:0007162 continuous light exposurePECO:0007162 continuous light exposurePECO:0007315 zero gravity exposurePECO:0007165 growth hormone exposureACCPECO:0007200 short day length exposurePECO:0007200 short day length exposureflagellin 22 exposurePECO:0007200 short day length exposurelimited copper exposurePECO:0007200 short day length exposurePECO:0001046 limited phosphate exposurePECO:0007200 short day length exposurePECO:0001049 limited magnesium exposurePECO:0007200 short day length exposurePECO:0007397 phosphorus nutrient exposurePECO:0007242 iron nutrient exposurePECO:0007246 bacterial plant exposurePECO:0007284 nitrogen macronutrient exposurePECO:0007319 warm/hot air temperature exposurePECO:0007373 mechanical damage exposure Figure A24: Legend for PCA before and after SVA, whole plant data. 162 Treatmentflagellin 22 exposurePECO:0001036 chemical stress exposuretunicamycinPECO:0001038 osmotic stress exposuremannitolPECO:0001046 limited phosphate exposurePECO:0001062 control exposurePECO:0007048 sodium chloride exposurePECO:0007105 abscisic acid exposurePECO:0007162 continuous light exposurePECO:0007162 continuous light exposurePECO:0007001 UV−B light exposurePECO:0007162 continuous light exposurePECO:0007048 sodium chloride exposurePECO:0007162 continuous light exposurePECO:0007075 high light intensity exposurePECO:0007162 continuous light exposurePECO:0007189 chemical exposureDMSOPECO:0007162 continuous light exposurePECO:0007319 warm/hot air temperature exposurePECO:0007189 chemical exposureDMSOPECO:0007196 light exposurePECO:0007200 short day length exposurePECO:0007200 short day length exposureflagellin 22 exposurePECO:0007200 short day length exposurePECO:0007319 warm/hot air temperature exposurePECO:0007200 short day length exposurePECO:0007332 cold air temperature exposurePECO:0007207 red light exposurePECO:0007246 bacterial plant exposurePECO:0007397 phosphorus nutrient exposurePECO:0007270 continuous dark (no light) exposurePECO:0007319 warm/hot air temperature exposurePECO:0007332 cold air temperature exposurePECO:0007397 phosphorus nutrient exposurePECO:0007404 drought environment exposure Figure A25: Legend for PCA before and after SVA, shoot data. 163 TreatmentPECO:0001046 limited phosphate exposurePECO:0001050 limited nitrogen exposurePECO:0001062 control exposurePECO:0007075 high light intensity exposurePECO:0007162 continuous light exposurePECO:0007165 growth hormone exposureACCPECO:0007200 short day length exposurePECO:0007200 short day length exposurePECO:0007319 warm/hot air temperature exposurePECO:0007200 short day length exposurePECO:0007332 cold air temperature exposurePECO:0007203 far red light exposurePECO:0007270 continuous dark (no light) exposurePECO:0007284 nitrogen macronutrient exposurePECO:0007332 cold air temperature exposurePECO:0007397 phosphorus nutrient exposurePECO:0007407 methyl jasmonate exposure Figure A26: Legend for PCA before and after SVA, seed data. 164 TreatmentPECO:0001062 control exposurePECO:0007162 continuous light exposure Figure A27: Legend for PCA before and after SVA, flower data. 165 TreatmentPECO:0001062 control exposurePECO:0007162 continuous light exposurePECO:0007248 greenhouse studyPECO:0007319 warm/hot air temperature exposurePECO:0007332 cold air temperature exposure Partial correlations, balanced subset after SVA Figure A28: Partial correlation analysis of leaf tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only leaf tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 166 -0.40.00.4Partialspearmancorrelationcoefficient0.160.0700.040.050.43-0.280.02-0.160.07-0.040.08-0.14-0.03-0.13-0.17-0.17-0.18-0.15-0.26-0.1dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 12917A0.530.09-0.040.07-0.010.46-0.260.15-0.20.06-0.03-0.02-0.11-0.01-0.13-0.10.06-0.19-0.21-0.23-0.13-0.030.080.040.090.050.020.11πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 21746B0.020.030.47-0.01-0.230.05-0.03-0.11-0.01-0.130.01-0.2-0.22-0.22-0.120.050.040.090.060.020.11Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 21752C0.0400.44-0.07-0.180.05-0.05-0.14-0.02-0.13-0.04-0.2-0.18-0.22-0.11Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 12876D Figure A29: Partial correlation analysis of root tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only root tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 167 -0.40.00.4Partialspearmancorrelationcoefficient0.170.120.010.030.040.29-0.250.02-0.110.07-0.050.08-0.11-0.08-0.21-0.2-0.15-0.19-0.23-0.36-0.16dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 13795A0.530.14-0.060.04-0.010.32-0.260.12-0.150.05-0.04-0.02-0.09-0.06-0.18-0.140.07-0.21-0.28-0.33-0.18-0.020.080.080.050.010.020.08πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 23515B0.030.010.33-0.04-0.20.04-0.03-0.1-0.06-0.17-0.01-0.24-0.28-0.31-0.170.050.080.050.010.020.08Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 23517C0.050.010.29-0.09-0.140.06-0.06-0.12-0.08-0.21-0.06-0.22-0.25-0.33-0.17Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13749D Figure A30: Partial correlation analysis of whole plant tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only whole plant tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 168 -0.40.00.4Partialspearmancorrelationcoefficient0.160.070.010.020.030.22-0.270.01-0.180.07-0.030.08-0.160.02-0.11-0.18-0.18-0.25-0.1-0.26-0.08dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 12608A0.530.11-0.040.04-0.010.27-0.260.13-0.220.06-0.03-0.02-0.140.03-0.1-0.110.05-0.28-0.14-0.24-0.11-0.040.080.070.040.0600.09πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 20944B0.030.020.28-0.02-0.260.05-0.03-0.140.03-0.10-0.29-0.14-0.22-0.110.050.070.040.0700.1Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 20947C0.040.010.22-0.06-0.210.06-0.05-0.160.02-0.11-0.04-0.28-0.12-0.22-0.1Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 12566D Figure A31: Partial correlation analysis of shoot tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only shoot tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 169 -0.40.00.4Partialspearmancorrelationcoefficient0.160.09-0.01-0.020.060.39-0.270.01-0.160.07-0.040.08-0.11-0.1-0.11-0.18-0.16-0.16-0.26-0.24-0.11dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 12906A0.520.12-0.050.010.020.43-0.260.14-0.20.07-0.03-0.02-0.09-0.08-0.1-0.10.06-0.16-0.28-0.2-0.12-0.040.080.050.080.050.010.1πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 21405B0.020.020.44-0.02-0.240.07-0.02-0.09-0.08-0.090.01-0.17-0.28-0.18-0.120.050.040.090.060.010.11Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 21411C0.04-0.010.39-0.07-0.190.07-0.05-0.11-0.1-0.11-0.04-0.18-0.27-0.2-0.12Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 12866D Figure A32: Partial correlation analysis of seed tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only seed tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 170 -0.40.00.4Partialspearmancorrelationcoefficient0.160.170-0.050.020.53-0.14-0.010.04-0.16-0.030.07-0.1-0.04-0.2-0.18-0.17-0.14-0.15-0.37-0.13dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 14028A0.520.18-0.090.010.010.55-0.150.040-0.13-0.03-0.02-0.08-0.03-0.19-0.130.06-0.19-0.14-0.35-0.1600.070.060.08-0.010.010.06πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 24617B0.020.020.56-0.04-0.02-0.13-0.03-0.09-0.03-0.18-0.01-0.22-0.15-0.34-0.160.050.060.08-0.010.010.06Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 24621C0.0600.53-0.050.02-0.16-0.05-0.1-0.04-0.2-0.06-0.18-0.15-0.36-0.15Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13978D Figure A33: Partial correlation analysis of fruit or flower tissue on balanced data subset after SVA. Partial correlations for (A) dN , (B) πN , (C) Tajima’s D, and (D) direction of selection (DoS) after applying SVA. Average expression and tissue specificity are calculated using only fruit and flower tissue data. The number of genes included in each partial correlation analysis (n) is listed at the top of each heatmap. 171 -0.40.00.4Partialspearmancorrelationcoefficient0.150.110.010.1-0.010.24-0.290.01-0.060.1-0.050.07-0.140.02-0.14-0.21-0.17-0.260.03-0.3-0.09dSTissue specificityTreatment specificityAverage expressionGClengthdNdSTissue specificityTreatment specificityAverage expressionGCn = 13331A0.530.15-0.070.07-0.040.28-0.290.15-0.110.07-0.05-0.01-0.130.04-0.14-0.130.07-0.31-0.03-0.25-0.1-0.030.080.080.030.050.010.07πSTissue specificityTreatment specificityAverage expressionGClengthFamily sizeπNπSTissue specificityTreatment specificityAverage expressionGClengthn = 22346B0.0300.3-0.04-0.160.05-0.03-0.140.04-0.130-0.34-0.04-0.22-0.10.040.080.030.0600.08Tissue specificityTreatment specificityAverage expressionGClengthFamily sizeTajima's DTissue specificityTreatment specificityAverage expressionGClengthn = 22350C0.030.040.25-0.09-0.090.08-0.06-0.140.02-0.14-0.06-0.310.01-0.25-0.11Tissue specificityTreatment specificityAverage expressionGClengthDoSTissue specificityTreatment specificityAverage expressionGCn = 13287D Effect of omitting low-expression values on treatment-specificity vs expression level correlation Figure A34: Treatment specificity vs average expression correlation in root tissue. Correlation between average expression in transcripts per million (TPM) and treatment specificity of genes when low expression values (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from root tissue samples. Line is a smoothing line with 95 % confidence intervals and values in parentheses give spearman correlation. 172 −10−505100.40.60.81.0treatment specificity of expressionlog(average expression in TPM)root (−0.51)A2.55.07.510.012.50.40.60.81.0treatment specificity of expressionlog(average expression in TPM)root (0.1)B Figure A35: Treatment specificity vs average expression correlation in whole plant tissue. Correlation between average expression in transcripts per million (TPM) and treatment specificity of genes when low expression values (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from whole plant tissue samples. Line is a smoothing line with 95 % confidence intervals and values in parentheses give spearman correlation. 173 −10−505100.40.60.81.0treatment specificity of expressionlog(average expression in TPM)whole plant (−0.56)A2.55.07.510.00.40.60.81.0treatment specificity of expressionlog(average expression in TPM)whole plant (−0.01)B Figure A36: Treatment specificity vs average expression correlation in shoot tissue. Correla- tion between average expression in transcripts per million (TPM) and treatment specificity of genes when low expression values (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from shoot tissue samples. Line is a smooth- ing line with 95 % confidence intervals and values in parentheses give spearman correlation. 174 −10−505100.40.60.81.0treatment specificity of expressionlog(average expression in TPM)shoot (−0.62)A3690.40.60.81.0treatment specificity of expressionlog(average expression in TPM)shoot (−0.01)B Figure A37: Treatment specificity vs average expression correlation in seed tissue. Correlation between average expression in transcripts per million (TPM) and treatment specificity of genes when low expression values (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from seed tissue samples. Line is a smoothing line with 95 % confidence intervals and values in parentheses give spearman correlation. 175 −10−505100.250.500.751.00treatment specificity of expressionlog(average expression in TPM)seed (−0.53)A2.55.07.510.00.250.500.751.00treatment specificity of expressionlog(average expression in TPM)seed (0.01)B Figure A38: Treatment specificity vs average expression correlation in fruit or flower tissue. Correlation between average expression in transcripts per million (TPM) and treatment specificity of genes when low expression values (< 5 TPM) are included (A) vs excluded (B). Expression level and treatment specificity were calculated using only data from fruit and flower tissue samples. Line is a smoothing line with 95 % confidence intervals and values in parentheses give spearman correlation. 176 −10−505100.250.500.751.00treatment specificity of expressionlog(average expression in TPM)fruit and flower (−0.39)A3690.250.500.751.00treatment specificity of expressionlog(average expression in TPM)flower (0.03)B Simulating level-specificity correlation Figure A39: Violin plot of correlations between expression specificity and expression level across 1000 simulated expression matrices. In the two different simulations shown here, expres- sion level was quantified either as average expression across all samples or average expression across samples with expression values > 0. Each expression matrix was simulated by sampling from a zero-inflated negative binomial distribution. 177 −0.4−0.20.0mean expressionacross all samplesmean expressionin samples where geneis expressedspecificity−expression level correlation APPENDIX B: SUPPLEMENTAL METHODS FOR CHAPTER 2 The records that we used to estimate range size from GBIF occurrence data could not have any issue codes except for the following: • “AMBIGUOUS_COLLECTION” • “AMBIGUOUS_INSTITUTION” • “COLLECTION_MATCH_FUZZY” • “COLLECTION_MATCH_NONE” • “CONTINENT_DERIVED_FROM_COORDINATES” • “COORDINATE_ROUNDED” • “COUNTRY_DERIVED_FROM_COORDINATES” • “COUNTRY_MISMATCH” • “DEPTH_MIN_MAX_SWAPPED” • “DEPTH_NON_NUMERIC” • “DEPTH_NOT_METRIC” • “DEPTH_UNLIKELY” • “DIFFERENT_OWNER_INSTITUTION” • “ELEVATION_MIN_MAX_SWAPPED” • “ELEVATION_NON_NUMERIC” • “ELEVATION_NOT_METRIC” • “ELEVATION_UNLIKELY” • “GEODETIC_DATUM_ASSUMED_WGS84” • “INSTITUTION_COLLECTION_MISMATCH” • “INSTITUTION_MATCH_FUZZY” • “INSTITUTION_MATCH_NONE” • “OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD” • “OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT” 178 APPENDIX C: SUPPLEMENTAL FIGURES FOR CHAPTER 2 Exploring relationships in data before outlier removal Figure C1: Tests of independence between life-history traits included in this study. Values across the top of each plot give the results of a χ2 test of independence between each pair of life- history traits. 179 Figure C2: Phylogenetic tree of species included in this study plotted alongside technical se- quencing variables. The phylogenetic tree scale is in millions of years. We limited the number of individuals sequenced per species to no more than 1200. CV stands for coefficient of variation and bp for base pairs. 180 Figure C3: Correlations between technical sequencing variables and diversity. Each point is a species. Values across the top of each plot give the Spearman’s correlation coefficient and p-value (testing whether the correlation differs from zero) for each pairwise relationship. Each pairwise plot includes one of three measures of diversity (Nucleotide diversity: A-C, Jaccard Dissimilarity: D-F, Bray-Curtis dissimilarity: G-I) and one of three technical sequencing variables (mean coverage: A, D, G; Coefficient of variation in bp sequenced: B, E, H; Number of individuals sequenced: C, F, I). Purple line is a loess smoothing line with 95% confidence intervals shaded in gray. All logarithms are base 10. Three species with nucleotide diversity values of 0 are omitted from plots involving nucleotide diversity. 181 Figure C4: Effect of coverage on relationship between π and Jaccard dissimilarity. (A) shows the relationship between k-mer diversity (Jaccard dissimilarity) and nucleotide diversity without omitting species with ≤ 0.5x coverage or ≤ 1000 SNP calls. (B) shows the same relationship, except these species with low coverage or low numbers of SNP calls are omitted. Each data point is a species. All species’ points are colored by the log of average genome-wide coverage per in- dividual (base 10) for that species. Purple lines are loess smoothing curves with 95% confidence intervals shaded in gray. Values across the top of each plot are Spearman correlation coefficients (ρ) and p-values that test whether each correlation coefficient differs from zero. Three species with nucleotide diversity values of 0 are omitted from these plots. 182 Figure C5: Relationship between k-mer diversity, nucleotide diversity and variation in se- quencing coverage. Each point is a species. Values across the top of each plot give the Spearman’s correlation coefficient and p-value (testing whether the correlation differs from zero) for each pair- wise relationship. Purple line is a loess smoothing line with 95% confidence intervals shaded in gray. Points are colored by the coefficient of variation (mean/standard deviation) in bp sequenced for each species. Three species with nucleotide diversity values of 0 are omitted from the plots. 183 Figure C6: Relationship between k-mer diversity, nucleotide diversity, and number of indi- viduals sequenced. Each point is a species. Values across the top of each plot give the Spearman’s correlation coefficient and p-value (testing whether correlation differs from zero) for each pairwise relationship. Purple line is a loess smoothing line with 95% confidence intervals shaded in gray. Points are colored by the logarithm of the number of individuals sequenced within each species (base 10). Three species with nucleotide diversity values of 0 are omitted from the plots. 184 Figure C7: Independent methods of range size estimation correlate with each other. (A) Com- pares the size of total size of native and invaded ranges from GBIF occurrence data and WCVP range maps. (B) Compares native range estimates only (invaded ranges excluded) from GBIF oc- currence data and WCVP range maps. (C) and (D) compare the full and native range estimates from WCVP range maps and GBIF occurrence data, respectively.Every data point is a species. The solid black lines are 1:1 reference lines where the different measures of range size are equal. The purple lines are linear regression lines with 95 % confidence intervals in grey shading. Values across the top of each plot give the Spearman’s correlation coefficient (ρ) and p-value testing whether corre- lation differs from zero. One p-value is reported as zero because it was < 2.2 × 10−16, which is the limit of precision for doubles in R. All logarithms are base 10. 185 Figure C8: Relationship between Bray-Curtis and Jaccard dissimilarity. Each point is a species. Values across the top of each plot give the Spearman’s correlation coefficient and p-value (testing whether correlation differs from zero). The p-value is reported as zero because it was < 2.2 × 10−16, which is the limit of precision for doubles in R. Line is a linear regression with 95% confidence intervals shaded in gray. 186 Figure C9: Relationship between the range size-squared height ratio and diversity without correcting for evolutionary history, genome size, life cycle habit, mating system, or culti- vation status. Each point is a species. Values across the top of each plot give the Spearman’s correlation coefficient and p-value (testing whether correlation differs from zero) for each pair- wise relationship. Each pairwise relationship involves the native range size-squared height ratio, where native range size was estimated from GBIF occurrences (A, C, E), or WCVP range maps (B, D, F), and one of three diversity measures (Nucleotide diversity: A-B, Jaccard Dissimiarity: C-D, Bray-Curtis dissimilarity: E-F). Purple lines are loess smoothing lines with 95% confidence intervals shaded in gray. Three species with nucleotide diversity values of 0 are omitted from plots involving nucleotide diversity. All logarithms are base 10. 187 Population size proxy vs diversity relationships after controlling for phylogeny and life-history variables, but not controlling for genome size Figure C10: Partial phylogenetic regression between WCVP population size proxy and diver- sity. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size-squared height ratio. Range size was estimated from WCVP range maps. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phy- logenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of each plot give the slope of the partial regression ± one standard error and p-values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 188 Figure C11: Partial phylogenetic regression between GBIF population size proxy and diver- sity. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size-squared height ratio. Range size was estimated from GBIF occurrence data. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phy- logenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of each plot give the slope of the partial regression ± one standard error and p-values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 189 Figure C12: Partial phylogenetic regression between WCVP range size and diversity. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Range size was estimated from WCVP range maps. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of each plot give the slope of the partial regression ± one standard error and p-values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 190 Figure C13: Partial phylogenetic regression between GBIF range size and diversity. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Range size was estimated from GBIF occurrence data. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of each plot give the slope of the partial regression ± one standard error and p-values testing whether the slopes differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 191 Figure C14: Partial phylogenetic regression between height and diversity. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error), the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. All logarithms are base 10. 192 Population size proxy vs diversity relationships after controlling for phylogeny, life-history variables, and genome size Figure C15: Partial phylogenetic regression between WCVP full range size-squared height ratio and Jaccard dissimilarity, controlling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. WCVP full ratio gives the ratio of range size to squared plant height, where range size includes in- vaded ranges and is estimated from WCVP range maps. The partial regression controls for genome size, mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance- covariance matrix to correct for phylogenetic relationships. The values at the top of the plot give the slope of the partial regression ± one standard error and p-values testing whether the slope differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 193 Figure C16: Partial phylogenetic regression between WCVP native range size-squared height ratio and diversity, controlling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. WCVP native ratio gives the ratio of range size to squared plant height, where range size excludes invaded ranges and is estimated from WCVP range maps. The partial regression controls for genome size, mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of the plot give the slope of the partial regression ± one standard error and p-values testing whether the slope differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 194 Figure C17: Partial phylogenetic regression between GBIF range size-squared height ratio and diversity, controlling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size-squared height ratio. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, genome size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 195 Figure C18: Partial phylogenetic regression between WCVP range size and diversity, control- ling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, genome size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value test- ing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 196 Figure C19: Partial phylogenetic regression between GBIF range size and diversity, control- ling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, genome size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value test- ing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 197 Figure C20: Partial phylogenetic regression between height and diversity, controlling for genome size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. Lines give the relationship between the pairs of plot- ted variables after controlling for mating system, life cycle habit, cultivation status, genome size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Be- fore fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 198 Genome size vs diversity relationships after controlling for population size proxies, phylogeny, and life-history variables Figure C21: Partial phylogenetic regression between Jaccard dissimilarity and genome size, controlling for WCVP full range size-height ratio. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. WCVP full ratio gives the ratio of range size to squared plant height, where range size includes invaded ranges and is estimated from WCVP range maps. The partial regression controls for range size-squared height ratio, mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance- covariance matrix to correct for phylogenetic relationships. The values at the top of the plot give the slope of the partial regression ± one standard error and p-values testing whether the slope differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 199 Figure C22: Partial phylogenetic regression between diversity and genome size, controlling for WCVP native range size-height ratio. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. WCVP native ratio gives the ratio of range size to squared plant height, where range size excludes invaded ranges and is estimated from WCVP range maps. The partial regression controls for genome size, mating system, life cycle habit, cultivation status, and evolutionary history. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. The values at the top of the plot give the slope of the partial regression ± one standard error and p-values testing whether the slope differ from zero. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 200 Figure C23: Partial phylogenetic regression between diversity and genome size, controlling for GBIF range size-height ratio. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size-squared height ratio. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, range size-squared height ratio, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 201 Figure C24: Partial phylogenetic regression between diversity and genome size, controlling for WCVP range size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, range size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 202 Figure C25: Partial phylogenetic regression between diversity and genome size, controlling for GBIF range size. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size estimates. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, range size, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 203 Figure C26: Partial phylogenetic regression between diversity and genome size, controlling for height. Each point is a species and only species with > 0.5x mean coverage and > 1000 variant sites were included in the regression. The regressions are organized according to whether invaded ranges were excluded (A-C) or included (D-F) in the range size-squared height ratio. Lines give the relationship between the pairs of plotted variables after controlling for mating system, life cycle habit, cultivation status, height, and evolutionary history. The statistics across the top of each plot give the value of the slope of the lines (± the standard error) and the p-value testing whether the slope differs from zero. Before fitting the line, each response variable was scaled to a standard normal distribution (mean = 0, variance = 1), then multiplied by the inverse of the Cholesky decomposition of the phylogenetic variance-covariance matrix to correct for phylogenetic relationships. Dotted lines show the partial regression slope ± one standard error. All logarithms are base 10. 204 APPENDIX D: SUPPLEMENTAL FIGURES FOR CHAPTER 3 The effect of coverage on k-mer measures We observe in Figure D1 a complex interaction between coverage and the resulting Bray-Curtis score depending on the length of k. With 30-mers, there is very little difference in the Bray-Curtis score. However, 10-mers are more affected by changes in coverage. The overall trend and ranks of the scores remain, but the Bray-Curtis score is generally higher with 10x coverage. Figure D1: Effect of k and coverage on Bray-Curtis dissimilarity. Bray-Curtis scores calculated with 10-mers (left) and 30-mers (right). For each sample, reads with 10x coverage (gray/square) and 30x coverage (black/pentagon) were simulated. After simulating reads, the k-mers were counted and then the Bray-Curtis score is calculated. Each point represents one sample with a specific k and coverage. 205 Adjusting array size of Counting Bloom Filter Figure D2: Effect of bloom filter size on cosine dissimilarity. A comparison of cosine similarity scores of CBFs with an array size of 20 million (left) and 10,000 (right). The black line shows the 1-to-1 mapping of πt to the dissimilarity score. Note the difference in the scales of the scores on the y-axes, and that the right panel shows points that are closer to the 1-to-1 line. 206 APPENDIX E: SUPPLEMENTAL FIGURES FOR CHAPTER 4 Figure E1: Comparison of best (A) CNN, (B), DNN, and (C) ABC models at predicting times to fixation for the decay demography scenario. Figure E2: Comparison of best (A) CNN, (B), DNN, and (C) ABC models at predicting times to fixation for the cycle demography scenario. 207 Figure E3: Comparison of best (A) CNN, (B), DNN, and (C) ABC models at predicting times to fixation for the chaos demography scenario. Figure E4: tf vs ta for statistics based on the site frequency spectrum for sweeps in constant- sized populations. 208 Figure E5: tf vs ta for unphased H statistics and unphased Messer’s Hscan in constant-sized populations. Figure E6: tf vs ta for Kern’s gkl statistics in constant-sized populations. 209 Figure E7: tf vs ta for Rogers’ R2 and Kim’s ω. 210